← Back to Blog List

Git Internals: Fixing Pathspec Ambiguity in git add

Sometimes, the simplest Git commands hide surprising complexity. A recent dive into a peculiar `git add` behavior turned into a fascinating exploration of pathspec matching internals, collaboration on the Git mailing list, and ultimately, a satisfying fix.

This post walks through the journey of diagnosing and patching a bug where `git add` would disregard a wildcard if a filename matched the wildcard pattern literally.

The Unexpected Behavior

The issue was first reported by Piotr Siupa on the Git mailing list. He provided a crystal-clear minimal test case:

git init
touch 'foo' 'f*'  # Create two files: 'foo' and one literally named 'f*'
git add 'f*'      # Attempt to add files matching the wildcard

Intuitively, one would expect `git add 'f*'` to stage both `foo` and the literal `f*` file, as both match the wildcard pattern. However, the observed behavior was different: only the literal `f*` file was staged on the first run. Running the exact same `git add 'f*'` command a second time would then correctly stage `foo`.

This inconsistency pointed towards a subtle bug within Git's handling of pathspecs during the `add` operation, rather than shell expansion issues (since the wildcard was quoted).

Initial Investigation: The `MATCHED_EXACTLY` Hypothesis

My initial dive into the `git add` codebase (primarily `builtin/add.c`) suggested the problem might lie in an optimization within the pathspec matching logic. The process roughly goes like this:

  1. `cmd_add` gathers untracked files (`foo` and `f*`).
  2. `prune_directory` filters these files using the provided pathspec (`'f*'`).
  3. This involves calling `do_match_pathspec`, which iterates through files and pathspec patterns.

Inside `do_match_pathspec`, a `seen` array tracks which pathspecs have already found a match. My hypothesis centered on this check:

// Inside the loop within do_match_pathspec
if (seen && seen[i] == MATCHED_EXACTLY)
    continue; // Skip checking this pathspec item further

The theory was:

  • When processing the literal file `f*`, it matches the pathspec `'f*'` exactly.
  • The `seen` array marks the `'f*'` pathspec as `MATCHED_EXACTLY`.
  • When processing the `foo` file against the same `'f*'` pathspec, the code sees the `MATCHED_EXACTLY` flag and `continue`s, skipping the wildcard match for `foo`.

This explained why only `f*` was added initially. On the second run, `f*` was already tracked, so `foo` was processed normally against the wildcard.

Refining the Diagnosis: Pathspec Magic vs. Wildcards

While the `MATCHED_EXACTLY` optimization seemed involved, further debugging revealed it wasn't the root cause. Digging into `init_pathspec_item` (in `pathspec.c`) showed that Git distinguishes between pathspecs that *contain* wildcard characters and pathspecs that have explicit "magic" flags set (like `PATHSPEC_GLOB`).

Crucially, just providing `'f*'` doesn't automatically set `PATHSPEC_GLOB` unless you use prefixes like `:(glob)f*` or environment variables. The optimization check I initially focused on wasn't the primary issue because the pathspec wasn't necessarily being treated with full "glob magic" at that level in the first place, despite containing a `*`.

The Breakthrough: Guidance from Jeff King (`nowildcard_len`)

Thankfully, Jeff King (Peff) chimed in on the mailing list thread, confirming the initial track was close but needed refinement. He pointed out that Git *does* track the presence of wildcards internally, just not via the `magic` flags in this context.

The key is the `nowildcard_len` field in the `pathspec_item` struct. This field stores the length of the initial part of the pathspec string *before* any wildcard character (`*`, `?`, `[`). If `nowildcard_len` is equal to the total length (`len`) of the pathspec string, it means the pathspec contains no wildcards and is a literal.

Peff suggested modifying the optimization check to incorporate this knowledge:

This refined check correctly distinguishes:

  • A true literal pathspec (like `'foo'`) that gets an exact match. Here, `nowildcard_len == len`, so the optimization can safely `continue`.
  • A pathspec with wildcards (like `'f*'`) that happens to exactly match a literal filename (`f*`). Here, `nowildcard_len < len`, so the `continue` is *not* executed, allowing the same pathspec item to be evaluated against other files (like `foo`) for wildcard matches.

The Fix and Patch Submission

Based on this guidance, the fix became clear. The final patch incorporated this check into `dir.c`:

--- a/dir.c
+++ b/dir.c
@@ -519,7 +519,8 @@ static int do_match_pathspec(struct index_state *istate,
 		    ( exclude && !(ps->items[i].magic & PATHSPEC_EXCLUDE)))
 			continue;

-		if (seen && seen[i] == MATCHED_EXACTLY)
+		// Only skip if it was an exact match AND the pathspec item
+		// is a literal (has no wildcards).
+		if (seen && seen[i] == MATCHED_EXACTLY &&
+			ps->items[i].nowildcard_len == ps->items[i].len)
 			continue;
 		/*
 		 * Make exclude patterns optional and never report

This ensures that wildcard pathspecs are always fully evaluated against all candidate files, even if one of those files happens to share the exact name as the pattern itself.

The patch, along with explanations and credits, was formatted using `git format-patch` and submitted to the mailing list for review. You can view the final patch here: Patch 1/1 add: fix handling literal filenames and wildcards.

Key Takeaways

  • Pathspec matching in Git has subtle layers involving literal matching, wildcard expansion, and explicit "magic" flags.
  • Seemingly simple optimizations can have unintended consequences in edge cases.
  • The Git mailing list provides an invaluable platform for reporting bugs, discussing internals, and receiving guidance from experienced developers like Jeff King.
  • Understanding internal data structures (like `struct pathspec_item` and its `nowildcard_len` field) is often key to diagnosing complex behavior.

This bug hunt was a great learning experience, reinforcing the importance of clear test cases, persistent debugging, and leveraging the collective knowledge of the Git community.

Pro Tip

If you need to reliably match files using wildcards, especially in scripts or complex scenarios, consider using the explicit `:(glob)` prefix (e.g., `git add ':(glob)f*'`). If you need to add a file whose name literally contains a wildcard character, escape it: `git add 'f\*'`.