has_dir_name(): do not get confused by characters < '/'

There is a bug in directory/file ("D/F") conflict checking optimization:
It assumes that such a conflict cannot happen if a newly added entry's
path is lexicgraphically "greater than" the last already-existing index
entry _and_ contains a directory separator that comes strictly after the
common prefix (`len > len_eq_offset`).

This assumption is incorrect, though: `a-` sorts _between_ `a` and
`a/b`, their common prefix is `a`, the slash comes after the common
prefix, and there is still a file/directory conflict.

Let's re-design this logic, taking these facts into consideration:

- It is impossible for a file to sort after another file with whose
  directory it conflicts because the trailing NUL byte is always smaller
  than any other character.

- Since there are quite a number of ASCII characters that sort before
  the slash (e.g. `-`, `.`, the space character), looking at the last
  already-existing index entry is not enough to determine whether there
  is a D/F conflict when the first character different from the
  existing last index entry's path is a slash.

  If it is not a slash, there cannot be a file/directory conflict.

  And if the existing index entry's first different character is a
  slash, it also cannot be a file/directory conflict because the
  optimization requires the newly-added entry's path to sort _after_ the
  existing entry's, and the conflicting file's path would not.

So let's fall back to the regular binary search whenever the newly-added
item's path differs in a slash character. If it does not, and it sorts
after the last index entry, there is no D/F conflict and the new index
entry can be safely appended.

This fix also nicely simplifies the logic and makes it much easier to
reason about, while the impact on performance should be negligible:
After this fix, the optimization will be skipped only when index
entry's paths differ in a slash and a space, `!`,  `"`,  `#`,  `$`,
`%`, `&`,  `'`,  | (  `)`,  `*`,  `+`,  `,`,  `-`, or  `.`, which should
be a rare situation.

Signed-off-by: Filip Hejsek <filip.hejsek@gmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
This commit is contained in:
Filip Hejsek 2024-01-28 04:30:25 +01:00 committed by Johannes Schindelin
parent e69ac42fcc
commit c30a574a0b
2 changed files with 47 additions and 53 deletions

View file

@ -1186,19 +1186,32 @@ static int has_dir_name(struct index_state *istate,
istate->cache[istate->cache_nr - 1]->name,
&len_eq_last);
if (cmp_last > 0) {
if (len_eq_last == 0) {
if (name[len_eq_last] != '/') {
/*
* The entry sorts AFTER the last one in the
* index and their paths have no common prefix,
* so there cannot be a F/D conflict.
* index.
*
* If there were a conflict with "file", then our
* name would start with "file/" and the last index
* entry would start with "file" but not "file/".
*
* The next character after common prefix is
* not '/', so there can be no conflict.
*/
return retval;
} else {
/*
* The entry sorts AFTER the last one in the
* index, but has a common prefix. Fall through
* to the loop below to disect the entry's path
* and see where the difference is.
* index, and the next character after common
* prefix is '/'.
*
* Either the last index entry is a file in
* conflict with this entry, or it has a name
* which sorts between this entry and the
* potential conflicting file.
*
* In both cases, we fall through to the loop
* below and let the regular search code handle it.
*/
}
} else if (cmp_last == 0) {
@ -1222,53 +1235,6 @@ static int has_dir_name(struct index_state *istate,
}
len = slash - name;
if (cmp_last > 0) {
/*
* (len + 1) is a directory boundary (including
* the trailing slash). And since the loop is
* decrementing "slash", the first iteration is
* the longest directory prefix; subsequent
* iterations consider parent directories.
*/
if (len + 1 <= len_eq_last) {
/*
* The directory prefix (including the trailing
* slash) also appears as a prefix in the last
* entry, so the remainder cannot collide (because
* strcmp said the whole path was greater).
*
* EQ: last: xxx/A
* this: xxx/B
*
* LT: last: xxx/file_A
* this: xxx/file_B
*/
return retval;
}
if (len > len_eq_last) {
/*
* This part of the directory prefix (excluding
* the trailing slash) is longer than the known
* equal portions, so this sub-directory cannot
* collide with a file.
*
* GT: last: xxxA
* this: xxxB/file
*/
return retval;
}
/*
* This is a possible collision. Fall through and
* let the regular search code handle it.
*
* last: xxx
* this: xxx/file
*/
}
pos = index_name_stage_pos(istate, name, len, stage, EXPAND_SPARSE);
if (pos >= 0) {
/*

View file

@ -1200,6 +1200,34 @@ test_expect_success 'very long name in the index handled sanely' '
test $len = 4098
'
# D/F conflict checking uses an optimization when adding to the end.
# make sure it does not get confused by `a-` sorting _between_
# `a` and `a/`.
test_expect_success 'more update-index D/F conflicts' '
# empty the index to make sure our entry is last
git read-tree --empty &&
cacheinfo=100644,$(test_oid empty_blob) &&
git update-index --add --cacheinfo $cacheinfo,path5/a &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/file &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/b/file &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/b/c/file &&
# "a-" sorts between "a" and "a/"
git update-index --add --cacheinfo $cacheinfo,path5/a- &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/file &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/b/file &&
test_must_fail git update-index --add --cacheinfo $cacheinfo,path5/a/b/c/file &&
cat >expected <<-\EOF &&
path5/a
path5/a-
EOF
git ls-files >actual &&
test_cmp expected actual
'
test_expect_success 'test_must_fail on a failing git command' '
test_must_fail git notacommand
'