A future feature will want to load the sparse-checkout patterns into a
pattern_list, but the current mechanism to do so is a bit complicated.
This is made difficult due to needing to find the sparse-checkout file
in different ways throughout the codebase.
The logic implemented in the new get_sparse_checkout_patterns() was
duplicated in populate_from_existing_patterns() in unpack-trees.c. Use
the new method instead, keeping the logic around handling the struct
unpack_trees_options.
The callers to get_sparse_checkout_filename() in
builtin/sparse-checkout.c manipulate the sparse-checkout file directly,
so it is not appropriate to replace logic in that file with
get_sparse_checkout_patterns().
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A specialization of hashmap that uses a string as key has been
introduced. Hopefully it will see wider use over time.
* en/strmap:
shortlog: use strset from strmap.h
Use new HASHMAP_INIT macro to simplify hashmap initialization
strmap: take advantage of FLEXPTR_ALLOC_STR when relevant
strmap: enable allocations to come from a mem_pool
strmap: add a strset sub-type
strmap: split create_entry() out of strmap_put()
strmap: add functions facilitating use as a string->int map
strmap: enable faster clearing and reusing of strmaps
strmap: add more utility functions
strmap: new utility functions
hashmap: provide deallocation function names
hashmap: introduce a new hashmap_partial_clear()
hashmap: allow re-use after hashmap_free()
hashmap: adjust spacing to fix argument alignment
hashmap: add usage documentation explaining hashmap_free[_entries]()
hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed
for a while, but aren't necessarily the clearest names, especially with
hashmap_partial_clear() being added to the mix and lazy-initialization
now being supported. Peff suggested we adopt the following names[1]:
- hashmap_clear() - remove all entries and de-allocate any
hashmap-specific data, but be ready for reuse
- hashmap_clear_and_free() - ditto, but free the entries themselves
- hashmap_partial_clear() - remove all entries but don't deallocate
table
- hashmap_partial_clear_and_free() - ditto, but free the entries
This patch provides the new names and converts all existing callers over
to the new naming scheme.
[1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Alex Vandiver <alexmv@dropbox.com>
Signed-off-by: Nipunn Koorapati <nipunn@dropbox.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We don't use the untracked_cache_dir parameter that is passed in, but
instead look at the untracked_cache_dir inside the cached_dir struct we
are passed. It's been this way since the introduction of
treat_path_fast() in 91a2288b5f (untracked cache: record/validate dir
mtime and reuse cached output, 2015-03-08).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up.
* jk/leakfix:
submodule--helper: fix leak of core.worktree value
config: fix leak in git_config_get_expiry_in_days()
config: drop git_config_get_string_const()
config: fix leaks from git_config_get_string_const()
checkout: fix leak of non-existent branch names
submodule--helper: use strbuf_release() to free strbufs
clear_pattern_list(): clear embedded hashmaps
"ls-files -o" mishandled the top-level directory of another git
working tree that hangs in the current git working tree.
* en/dir-nonbare-embedded:
dir: avoid prematurely marking nonbare repositories as matches
t3000: fix some test description typos
The dir structure seemed to have a number of leaks and problems around
it. First I noticed that parent_hashmap and recursive_hashmap were
being leaked (though Peff noticed and submitted fixes before me). Then
I noticed in the previous commit that clear_directory() was only taking
responsibility for a subset of fields within dir_struct, despite the
fact that entries[] and ignored[] we allocated internally to dir.c.
That, of course, resulted in many callers either leaking or haphazardly
trying to free these arrays and their contents.
Digging further, I found that despite the pretty clear documentation
near the top of dir.h that folks were supposed to call clear_directory()
when the user no longer needed the dir_struct, there were four callers
that didn't bother doing that at all. However, two of them clearly
thought about leaks since they had an UNLEAK(dir) directive, which to me
suggests that the method to free the data was too unclear. I suspect
the non-obviousness of the API and its holes led folks to avoid it,
which then snowballed into further problems with the entries[],
ignored[], parent_hashmap, and recursive_hashmap problems.
Rename clear_directory() to dir_clear() to be more in line with other
data structures in git, and introduce a dir_init() to handle the
suggested memsetting of dir_struct to all zeroes. I hope that a name
like "dir_clear()" is more clear, and that the presence of dir_init()
will provide a hint to those looking at the code that they need to look
for either a dir_clear() or a dir_free() and lead them to find
dir_clear().
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The calling convention for the dir API is supposed to end with a call to
clear_directory() to free up no longer needed memory. However,
clear_directory() didn't free dir->entries or dir->ignored. I believe
this was an oversight, but a number of callers noticed memory leaks and
started free'ing these. Unfortunately, they did so somewhat haphazardly
(sometimes freeing the entries in the arrays, and sometimes only
free'ing the arrays themselves). This suggests the callers weren't
trying to make sure any possible memory used might be free'd, but just
the memory they noticed their usecase definitely had allocated.
Fix this mess by moving all the duplicated free'ing logic into
clear_directory(). End by resetting dir to a pristine state so it could
be reused if desired.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 96cc8ab531 (sparse-checkout: use hashmaps for cone patterns,
2019-11-21) added some auxiliary hashmaps to the pattern_list struct,
but they're leaked when clear_pattern_list() is called.
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Nonbare repositories are special directories. Unlike normal directories
that we might recurse into to list the files they contain, nonbare
repositories must themselves match and then we always report only on the
nonbare repository directory itself and not on any of its contents.
Separately, when traversing directories to try to find untracked or
excluded files, we often think in terms of paths either matching the
specified pathspec, or not matching them. However, there is a special
value that do_match_pathspec() uses named
MATCHED_RECURSIVELY_LEADING_PATHSPEC which means "this directory does
not match any pathspec BUT it is possible a file or directory underneath
it does." That special value prevents us from prematurely thinking that
some directory and everything under it is irrelevant, but also allows us
to differentiate from "this is a match".
The combination of these two special cases was previously uncovered.
Add a test to the testsuite to cover it, and make sure that we return a
nonbare repository as a non-match if the best match it got was
MATCHED_RECURSIVELY_LEADING_PATHSPEC.
Reported-by: christian w <usebees@gmail.com>
Simplified-testcase-and-bisection-by: Kyle Meyer <kyle@kyleam.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 95c11ecc73 ("Fix error-prone fill_directory() API; make it only
return matches", 2020-04-01), we taught `fill_directory()`, or more
specifically `treat_path()`, to check against any pathspecs so that we
could simplify the callers.
But in doing so, we added a slightly-too-early return for the "excluded"
case. We end up not checking the pathspecs, meaning we return
`path_excluded` when maybe we should return `path_none`. As a result,
`git status --ignored -- pathspec` might show paths that don't actually
match "pathspec".
Move the "excluded" check down to after we've checked any pathspecs.
Reported-by: Andreas Schwab <schwab@linux-m68k.org>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up of "git clean" resulted in a fix of recent
performance regression.
* en/clean-cleanups:
clean: optimize and document cases where we recurse into subdirectories
clean: consolidate handling of ignored parameters
dir, clean: avoid disallowed behavior
dir: fix a few confusing comments
Use of negative pathspec, while collecting paths including
untracked ones in the working tree, was broken.
* en/do-match-pathspec-fix:
dir: fix treatment of negated pathspecs
dir.h documented quite clearly that DIR_SHOW_IGNORED and
DIR_SHOW_IGNORED_TOO are mutually exclusive, with a big comment to this
effect by the definition of both enum values. However, a command like
git clean -fx $DIR
would set both values for dir.flags. I _think_ it happened to work
because:
* As dir.h points out, DIR_KEEP_UNTRACKED_CONTENTS only takes effect
if DIR_SHOW_IGNORED_TOO is set.
* As coded, I believe DIR_SHOW_IGNORED would just happen to take
precedence over DIR_SHOW_IGNORED_TOO in the code as currently
constructed.
Which is a long way of saying "we just got lucky".
Fix clean.c to avoid setting these mutually exclusive values at the same
time, and add a check to dir.c that will throw a BUG() to prevent anyone
else from making this mistake.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
do_match_pathspec() started life as match_pathspec_depth_1() and for
correctness was only supposed to be called from match_pathspec_depth().
match_pathspec_depth() was later renamed to match_pathspec(), so the
invariant we expect today is that do_match_pathspec() has no direct
callers outside of match_pathspec().
Unfortunately, this intention was lost with the renames of the two
functions, and additional calls to do_match_pathspec() were added in
commits 75a6315f74 ("ls-files: add pathspec matching for submodules",
2016-10-07) and 89a1f4aaf7 ("dir: if our pathspec might match files
under a dir, recurse into it", 2019-09-17). Of course,
do_match_pathspec() had an important advantge over match_pathspec() --
match_pathspec() would hardcode flags to one of two values, and these
new callers needed to pass some other value for flags. Also, although
calling do_match_pathspec() directly was incorrect, there likely wasn't
any difference in the observable end output, because the bug just meant
that fill_diretory() would recurse into unneeded directories. Since
subsequent does-this-path-match checks on individual paths under the
directory would cause those extra paths to be filtered out, the only
difference from using the wrong function was unnecessary computation.
The second of those bad calls to do_match_pathspec() was involved -- via
either direct movement or via copying+editing -- into a number of later
refactors. See commits 777b420347 ("dir: synchronize
treat_leading_path() and read_directory_recursive()", 2019-12-19),
8d92fb2927 ("dir: replace exponential algorithm with a linear one",
2020-04-01), and 95c11ecc73 ("Fix error-prone fill_directory() API; make
it only return matches", 2020-04-01). The last of those introduced the
usage of do_match_pathspec() on an individual file, and thus resulted in
individual paths being returned that shouldn't be.
The problem with calling do_match_pathspec() instead of match_pathspec()
is that any negated patterns such as ':!unwanted_path` will be ignored.
Add a new match_pathspec_with_flags() function to fulfill the needs of
specifying special flags while still correctly checking negated
patterns, add a big comment above do_match_pathspec() to prevent others
from misusing it, and correct current callers of do_match_pathspec() to
instead use either match_pathspec() or match_pathspec_with_flags().
One final note is that DO_MATCH_LEADING_PATHSPEC needs special
consideration when working with DO_MATCH_EXCLUDE. The point of
DO_MATCH_LEADING_PATHSPEC is that if we have a pathspec like
*/Makefile
and we are checking a directory path like
src/module/component
that we want to consider it a match so that we recurse into the
directory because it _might_ have a file named Makefile somewhere below.
However, when we are using an exclusion pattern, i.e. we have a pathspec
like
:(exclude)*/Makefile
we do NOT want to say that a directory path like
src/module/component
is a (negative) match. While there *might* be a file named 'Makefile'
somewhere below that directory, there could also be other files and we
cannot pre-emptively rule all the files under that directory out; we
need to recurse and then check individual files. Adjust the
DO_MATCH_LEADING_PATHSPEC logic to only get activated for positive
pathspecs.
Reported-by: John Millikin <jmillikin@stripe.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The directory traversal code had redundant recursive calls which
made its performance characteristics exponential with respect to
the depth of the tree, which was corrected.
* en/fill-directory-exponential:
completion: fix 'git add' on paths under an untracked directory
Fix error-prone fill_directory() API; make it only return matches
dir: replace double pathspec matching with single in treat_directory()
dir: include DIR_KEEP_UNTRACKED_CONTENTS handling in treat_directory()
dir: replace exponential algorithm with a linear one
dir: refactor treat_directory to clarify control flow
dir: fix confusion based on variable tense
dir: fix broken comment
dir: consolidate treat_path() and treat_one_path()
dir: fix simple typo in comment
t3000: add more testcases testing a variety of ls-files issues
t7063: more thorough status checking
Traditionally, the expected calling convention for the dir.c API was:
fill_directory(&dir, ..., pathspec)
foreach entry in dir->entries:
if (dir_path_match(entry, pathspec))
process_or_display(entry)
This may have made sense once upon a time, because the fill_directory() call
could use cheap checks to avoid doing full pathspec matching, and an external
caller may have wanted to do other post-processing of the results anyway.
However:
* this structure makes it easy for users of the API to get it wrong
* this structure actually makes it harder to understand
fill_directory() and the functions it uses internally. It has
tripped me up several times while trying to fix bugs and
restructure things.
* relying on post-filtering was already found to produce wrong
results; pathspec matching had to be added internally for multiple
cases in order to get the right results (see commits 404ebceda0
(dir: also check directories for matching pathspecs, 2019-09-17)
and 89a1f4aaf7 (dir: if our pathspec might match files under a
dir, recurse into it, 2019-09-17))
* it's bad for performance: fill_directory() already has to do lots
of checks and knows the subset of cases where it still needs to do
more checks. Forcing external callers to do full pathspec
matching means they must re-check _every_ path.
So, add the pathspec matching within the fill_directory() internals, and
remove it from external callers.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
treat_directory() had a call to both do_match_pathspec() and
match_pathspec(). These calls have migrated through the code somewhat
since their introduction, but we don't actually need both. Replace the
two calls with one, and while at it, move the check earlier in order to
reduce the need for callers of fill_directory() to do post-filtering of
results.
The next patch will address post-filtering more forcefully and provide
more relevant history and context.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Handling DIR_KEEP_UNTRACKED_CONTENTS within treat_directory() instead of
as a post-processing step in read_directory():
* allows us to directly access and remove the relevant entries instead
of needing to calculate which ones need to be removed
* keeps the logic for directory handling in one location (and puts it
closer the the logic for stripping out extra ignored entries, which
seems logical).
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
dir's read_directory_recursive() naturally operates recursively in order
to walk the directory tree. Treating of directories is sometimes weird
because there are so many different permutations about how to handle
directories. Some examples:
* 'git ls-files -o --directory' only needs to know that a directory
itself is untracked; it doesn't need to recurse into it to see what
is underneath.
* 'git status' needs to recurse into an untracked directory, but only
to determine whether or not it is empty. If there are no files
underneath, the directory itself will be omitted from the output.
If it is not empty, only the directory will be listed.
* 'git status --ignored' needs to recurse into untracked directories
and report all the ignored entries and then report the directory as
untracked -- UNLESS all the entries under the directory are
ignored, in which case we don't print any of the entries under the
directory and just report the directory itself as ignored. (Note
that although this forces us to walk all untracked files underneath
the directory as well, we strip them from the output, except for
users like 'git clean' who also set DIR_KEEP_TRACKED_CONTENTS.)
* For 'git clean', we may need to recurse into a directory that
doesn't match any specified pathspecs, if it's possible that there
is an entry underneath the directory that can match one of the
pathspecs. In such a case, we need to be careful to omit the
directory itself from the list of paths (see commit 404ebceda0
("dir: also check directories for matching pathspecs", 2019-09-17))
Part of the tension noted above is that the treatment of a directory can
change based on the files within it, and based on the various settings
in dir->flags. Trying to keep this in mind while reading over the code,
it is easy to think in terms of "treat_directory() tells us what to do
with a directory, and read_directory_recursive() is the thing that
recurses". Since we need to look into a directory to know how to treat
it, though, it is quite easy to decide to (also) recurse into the
directory from treat_directory() by adding a read_directory_recursive()
call. Adding such a call is actually fine, IF we make sure that
read_directory_recursive() does not also recurse into that same
directory.
Unfortunately, commit df5bcdf83a ("dir: recurse into untracked dirs
for ignored files", 2017-05-18), added exactly such a case to the code,
meaning we'd have two calls to read_directory_recursive() for an
untracked directory. So, if we had a file named
one/two/three/four/five/somefile.txt
and nothing in one/ was tracked, then 'git status --ignored' would
call read_directory_recursive() twice on the directory 'one/', and
each of those would call read_directory_recursive() twice on the
directory 'one/two/', and so on until read_directory_recursive() was
called 2^5 times for 'one/two/three/four/five/'.
Avoid calling read_directory_recursive() twice per level by moving a
lot of the special logic into treat_directory().
Since dir.c is somewhat complex, extra cruft built up around this over
time. While trying to unravel it, I noticed several instances where the
first call to read_directory_recursive() would return e.g.
path_untracked for some directory and a later one would return e.g.
path_none, despite the fact that the directory clearly should have been
considered untracked. The code happened to work due to the side-effect
from the first invocation of adding untracked entries to dir->entries;
this allowed it to get the correct output despite the supposed override
in return value by the later call.
I am somewhat concerned that there are still bugs and maybe even
testcases with the wrong expectation. I have tried to carefully
document treat_directory() since it becomes more complex after this
change (though much of this complexity came from elsewhere that probably
deserved better comments to begin with). However, much of my work felt
more like a game of whackamole while attempting to make the code match
the existing regression tests than an attempt to create an
implementation that matched some clear design. That seems wrong to me,
but the rules of existing behavior had so many special cases that I had
a hard time coming up with some overarching rules about what correct
behavior is for all cases, forcing me to hope that the regression tests
are correct and sufficient. Such a hope seems likely to be ill-founded,
given my experience with dir.c-related testcases in the last few months:
Examples where the documentation was hard to parse or even just wrong:
* 3aca58045f (git-clean.txt: do not claim we will delete files with
-n/--dry-run, 2019-09-17)
* 09487f2cba (clean: avoid removing untracked files in a nested git
repository, 2019-09-17)
* e86bbcf987 (clean: disambiguate the definition of -d, 2019-09-17)
Examples where testcases were declared wrong and changed:
* 09487f2cba (clean: avoid removing untracked files in a nested git
repository, 2019-09-17)
* e86bbcf987 (clean: disambiguate the definition of -d, 2019-09-17)
* a2b13367fe (Revert "dir.c: make 'git-status --ignored' work within
leading directories", 2019-12-10)
Examples where testcases were clearly inadequate:
* 502c386ff9 (t7300-clean: demonstrate deleting nested repo with an
ignored file breakage, 2019-08-25)
* 7541cc5302 (t7300: add testcases showing failure to clean specified
pathspecs, 2019-09-17)
* a5e916c745 (dir: fix off-by-one error in match_pathspec_item,
2019-09-17)
* 404ebceda0 (dir: also check directories for matching pathspecs,
2019-09-17)
* 09487f2cba (clean: avoid removing untracked files in a nested git
repository, 2019-09-17)
* e86bbcf987 (clean: disambiguate the definition of -d, 2019-09-17)
* 452efd11fb (t3011: demonstrate directory traversal failures,
2019-12-10)
* b9670c1f5e (dir: fix checks on common prefix directory, 2019-12-19)
Examples where "correct behavior" was unclear to everyone:
https://lore.kernel.org/git/20190905154735.29784-1-newren@gmail.com/
Other commits of note:
* 902b90cf42 (clean: fix theoretical path corruption, 2019-09-17)
However, on the positive side, it does make the code much faster. For
the following simple shell loop in an empty repository:
for depth in $(seq 10 25)
do
dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
rm -rf dir
mkdir -p $dirs
>$dirs/untracked-file
/usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
done
I saw the following timings, in seconds (note that the numbers are a
little noisy from run-to-run, but the trend is very clear with every
run):
10: 0.03
11: 0.05
12: 0.08
13: 0.19
14: 0.29
15: 0.50
16: 1.05
17: 2.11
18: 4.11
19: 8.60
20: 17.55
21: 33.87
22: 68.71
23: 140.05
24: 274.45
25: 551.15
For the above run, using strace I can look for the number of untracked
directories opened and can verify that it matches the expected
2^($depth+1)-2 (the sum of 2^1 + 2^2 + 2^3 + ... + 2^$depth).
After this fix, with strace I can verify that the number of untracked
directories that are opened drops to just $depth, and the timings all
drop to 0.00. In fact, it isn't until a depth of 190 nested directories
that it sometimes starts reporting a time of 0.01 seconds and doesn't
consistently report 0.01 seconds until there are 240 nested directories.
The previous code would have taken
17.55 * 2^220 / (60*60*24*365) = 9.4 * 10^59 YEARS
to have completed the 240 nested directories case. It's not often
that you get to speed something up by a factor of 3*10^69.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The logic in treat_directory() is handled by a multi-case
switch statement, but this switch is very asymmetrical, as
the first two cases are simple but the third is more
complicated than the rest of the method. In fact, the third
case includes a "break" statement that leads to the block
of code outside the switch statement. That is the only way
to reach that block, as the switch handles all possible
values from directory_exists_in_index();
Extract the switch statement into a series of "if" statements.
This simplifies the trivial cases, while clarifying how to
reach the "show_other_directories" case. This is particularly
important as the "show_other_directories" case will expand
in a later change.
Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Despite having contributed several fixes in this area, I have for months
(years?) assumed that the "exclude" variable was a directive; this
caused me to think of it as a different mode we operate in and left me
confused as I tried to build up a mental model around why we'd need such
a directive. I mostly tried to ignore it while focusing on the pieces I
was trying to understand.
Then I finally traced this variable all back to a call to is_excluded(),
meaning it was actually functioning as an adjective. In particular, it
was a checked property ("Does this path match a rule in .gitignore?"),
rather than a mode passed in from the caller. Change the variable name
to match the part of speech used by the function called to define it,
which will hopefully make these bits of code slightly clearer to the
next reader.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 16e2cfa909 ("read_directory(): further split treat_path()",
2010-01-08) split treat_one_path() out of treat_path(), because
treat_leading_path() would not have access to a dirent but wanted to
re-use as much of treat_path() as possible. Not re-using all of
treat_path() caused other bugs, as noted in commit b9670c1f5e ("dir:
fix checks on common prefix directory", 2019-12-19). Finally, in commit
ad6f2157f9 ("dir: restructure in a way to avoid passing around a
struct dirent", 2020-01-16), dirents were removed from treat_path() and
other functions entirely.
Since the only reason for splitting these functions was the lack of a
dirent -- which no longer applies to either function -- and since the
split caused problems in the past resulting in us not using
treat_one_path() separately anymore, just undo the split.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git sparse-checkout" learned a new "add" subcommand.
* ds/sparse-add:
sparse-checkout: allow one-character directories in cone mode
sparse-checkout: work with Windows paths
sparse-checkout: create 'add' subcommand
sparse-checkout: extract pattern update from 'set' subcommand
sparse-checkout: extract add_patterns_from_input()
In 9e6d3e64 (sparse-checkout: detect short patterns, 2020-01-24), a
condition on the minimum length of a cone-mode pattern was introduced.
However, this condition was off-by-one.
If we have a directory with a single character, say "b", then the
command
git sparse-checkout set b
will correctly add the pattern "/b/" to the sparse-checkout file. When
this is interpeted in dir.c, the pattern is "/b" with the
PATTERN_FLAG_MUSTBEDIR flag. This string has length two, which satisfies
our inclusive inequality (<= 2).
The reason for this inequality is that we will start to read the pattern
string character-by-character using three char pointers: prev, cur,
next. In particular, next is set to the current pattern plus two. The
mistake was that next will still be a valid pointer when the pattern
length is two, since the string is null-terminated.
Make this inequality strict so these patterns work.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some codepaths were given a repository instance as a parameter to
work in the repository, but passed the_repository instance to its
callees, which has been cleaned up (somewhat).
* mt/use-passed-repo-more-in-funcs:
sha1-file: allow check_object_signature() to handle any repo
sha1-file: pass git_hash_algo to hash_object_file()
sha1-file: pass git_hash_algo to write_object_file_prepare()
streaming: allow open_istream() to handle any repo
pack-check: use given repo's hash_algo at verify_packfile()
cache-tree: use given repo's hash_algo at verify_one()
diff: make diff_populate_filespec() honor its repo argument
Some rough edges in the sparse-checkout feature, especially around
the cone mode, have been cleaned up.
* ds/sparse-checkout-harden:
sparse-checkout: fix cone mode behavior mismatch
sparse-checkout: improve docs around 'set' in cone mode
sparse-checkout: escape all glob characters on write
sparse-checkout: use C-style quotes in 'list' subcommand
sparse-checkout: unquote C-style strings over --stdin
sparse-checkout: write escaped patterns in cone mode
sparse-checkout: properly match escaped characters
sparse-checkout: warn on globs in cone patterns
sparse-checkout: detect short patterns
sparse-checkout: cone mode does not recognize "**"
sparse-checkout: fix documentation typo for core.sparseCheckoutCone
clone: fix --sparse option with URLs
sparse-checkout: create leading directories
t1091: improve here-docs
t1091: use check_files to reduce boilerplate
In cone mode, the sparse-checkout feature uses hashset containment
queries to match paths. Make this algorithm respect escaped asterisk
(*) and backslash (\) characters.
Create dup_and_filter_pattern() method to convert a pattern by
removing escape characters and dropping an optional "/*" at the end.
This method is available in dir.h as we will use it in
builtin/sparse-checkout.c in a later change.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In cone mode, the sparse-checkout commmand will write patterns that
allow faster pattern matching. This matching only works if the patterns
in the sparse-checkout file are those written by that command. Users
can edit the sparse-checkout file and create patterns that cause the
cone mode matching to fail.
The cone mode patterns may end in "/*" but otherwise an un-escaped
asterisk or other glob character is invalid. Add checks to disable
cone mode when seeing these values.
A later change will properly handle escaped globs.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow hash_object_file() to work on arbitrary repos by introducing a
git_hash_algo parameter. Change callers which have a struct repository
pointer in their scope to pass on the git_hash_algo from the said repo.
For all other callers, pass on the_hash_algo, which was already being
used internally at hash_object_file(). This functionality will be used
in the following patch to make check_object_signature() be able to work
on arbitrary repos (which, in turn, will be used to fix an
inconsistency at object.c:parse_object()).
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In cone mode, the shortest pattern the sparse-checkout command will
write into the sparse-checkout file is "/*". This is handled carefully
in add_pattern_to_hashsets(), so warn if any other pattern is this
short. This will assist future pattern checks by allowing us to assume
there are at least three characters in the pattern.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When core.sparseCheckoutCone is enabled, the 'git sparse-checkout set'
command creates a restricted set of possible patterns that are used
by a custom algorithm to quickly match those patterns.
If a user manually edits the sparse-checkout file, then they could
create patterns that do not match these expectations. The cone-mode
matching algorithm can return incorrect results. The solution is to
detect these incorrect patterns, warn that we do not recognize them,
and revert to the standard algorithm.
Check each pattern for the "**" substring, and revert to the old
logic if seen. While technically a "/<dir>/**" pattern matches
the meaning of "/<dir>/", it is not one that would be written by
the sparse-checkout builtin in cone mode. Attempting to accept that
pattern change complicates the logic and instead we punt and do
not accept any instance of "**".
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 777b420347 (dir: synchronize treat_leading_path() and
read_directory_recursive(), 2019-12-19) tried to add two warning
comments in those functions, pointing at each other. But the one in
treat_leading_path() just points at itself.
Let's fix that. Since the comment also redirects the reader for more
details to "the commit that added this warning", and since we're now
modifying the warning (creating a new commit without those details),
let's mention the actual commit id.
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Restructure the code slightly to avoid passing around a struct dirent
anywhere, which also enables us to avoid trying to manufacture one.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
I was going to title this "dir: more synchronizing of
treat_leading_path() and read_directory_recursive()", a nod to commit
777b420347 ("dir: synchronize treat_leading_path() and
read_directory_recursive()", 2019-12-19), but the title was too long.
Anyway, first the backstory...
fill_directory() has always had a slightly error-prone interface: it
returns a subset of paths which *might* match the specified pathspec; it
was intended to prune away some paths which didn't match the specified
pathspec and keep at least all the ones that did match it. Given this
interface, callers were responsible to post-process the results and
check whether each actually matched the pathspec.
builtin/clean.c did this. It would first prune out duplicates (e.g. if
"dir" was returned as well as all files under "dir/", then it would
simplify this to just "dir"), and after pruning duplicates it would
compare the remaining paths to the specified pathspec(s). This
post-processing itself could run into problems, though, as noted in
commit 404ebceda0 ("dir: also check directories for matching
pathspecs", 2019-09-17):
For the case of git-clean and a set of pathspecs of "dir/file" and
"more", this caused a problem because we'd end up with dir entries
for both of
"dir"
"dir/file"
Then correct_untracked_entries() would try to helpfully prune
duplicates for us by removing "dir/file" since it's under "dir",
leaving us with
"dir"
Since the original pathspec only had "dir/file", the only entry left
doesn't match and leaves nothing to be removed. (Note that if only
one pathspec was specified, e.g. only "dir/file", then the
common_prefix_len optimizations in fill_directory would cause us to
bypass this problem, making it appear in simple tests that we could
correctly remove manually specified pathspecs.)
That commit fixed the issue -- when multiple pathspecs were specified --
by making sure fill_directory() wouldn't return both "dir" and
"dir/file" outside the common_prefix_len optimization path. This is
where it starts to get fun.
In commit b9670c1f5e ("dir: fix checks on common prefix directory",
2019-12-19), we noticed that the common_prefix_len wasn't doing
appropriate checks and letting all kinds of stuff through, resulting in
recursing into .git/ directories and other craziness. So it started
locking down and doing checks on pathnames within that code path. That
continued with commit 777b420347 ("dir: synchronize
treat_leading_path() and read_directory_recursive()", 2019-12-19), which
noted the following:
Our optimization to avoid calling into read_directory_recursive()
when all pathspecs have a common leading directory mean that we need
to match the logic that read_directory_recursive() would use if we
had just called it from the root. Since it does more than call
treat_path() we need to copy that same logic.
...and then it more forcefully addressed the issue with this wonderfully
ironic statement:
Needing to duplicate logic like this means it is guaranteed someone
will eventually need to make further changes and forget to update
both locations. It is tempting to just nuke the leading_directory
special casing to avoid such bugs and simplify the code, but
unpack_trees' verify_clean_subdirectory() also calls
read_directory() and does so with a non-empty leading path, so I'm
hesitant to try to restructure further. Add obnoxious warnings to
treat_leading_path() and read_directory_recursive() to try to warn
people of such problems.
You would think that with such a strongly worded description, that its
author would have actually ensured that the logic in
treat_leading_path() and read_directory_recursive() did actually match
and that *everything* that was needed had at least been copied over at
the time that this paragraph was written. But you'd be wrong, I messed
it up by missing part of the logic.
Copy the missing bits to fix the new final test in t7300.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Assorted fixes to the directory traversal API.
* en/fill-directory-fixes:
dir.c: use st_add3() for allocation size
dir: consolidate similar code in treat_directory()
dir: synchronize treat_leading_path() and read_directory_recursive()
dir: fix checks on common prefix directory
dir: break part of read_directory_recursive() out for reuse
dir: exit before wildcard fall-through if there is no wildcard
dir: remove stray quote character in comment
Revert "dir.c: make 'git-status --ignored' work within leading directories"
t3011: demonstrate directory traversal failures
Management of sparsely checked-out working tree has gained a
dedicated "sparse-checkout" command.
* ds/sparse-cone: (21 commits)
sparse-checkout: improve OS ls compatibility
sparse-checkout: respect core.ignoreCase in cone mode
sparse-checkout: check for dirty status
sparse-checkout: update working directory in-process for 'init'
sparse-checkout: cone mode should not interact with .gitignore
sparse-checkout: write using lockfile
sparse-checkout: use in-process update for disable subcommand
sparse-checkout: update working directory in-process
sparse-checkout: sanitize for nested folders
unpack-trees: add progress to clear_ce_flags()
unpack-trees: hash less in cone mode
sparse-checkout: init and set in cone mode
sparse-checkout: use hashmaps for cone patterns
sparse-checkout: add 'cone' mode
trace2: add region in clear_ce_flags
sparse-checkout: create 'disable' subcommand
sparse-checkout: add '--stdin' option to set subcommand
sparse-checkout: 'set' subcommand
clone: add --sparse mode
sparse-checkout: create 'init' subcommand
...
When preparing a manufactured dirent instance, we add a length of
path to the size of struct to decide how many bytes to allocate.
Make sure this addition does not wrap-around to cause us
underallocate.
Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both the DIR_SKIP_NESTED_GIT and DIR_NO_GITLINKS cases were checking for
whether a path was actually a nonbare repository. That code could be
shared, with just the result of how to act differing between the two
cases.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Our optimization to avoid calling into read_directory_recursive() when
all pathspecs have a common leading directory mean that we need to match
the logic that read_directory_recursive() would use if we had just
called it from the root. Since it does more than call treat_path() we
need to copy that same logic.
Alternatively, we could try to change treat_path to return path_recurse
for an untracked directory under the given special circumstances that
this logic checks for, but a simple switch results in many test failures
such as 'git clean -d' not wiping out untracked but empty directories.
To work around that, we'd need the caller of treat_path to check for
path_recurse and sometimes special case it into path_untracked. In
other words, we'd still have extra logic in both places.
Needing to duplicate logic like this means it is guaranteed someone will
eventually need to make further changes and forget to update both
locations. It is tempting to just nuke the leading_directory special
casing to avoid such bugs and simplify the code, but unpack_trees'
verify_clean_subdirectory() also calls read_directory() and does so with
a non-empty leading path, so I'm hesitant to try to restructure further.
Add obnoxious warnings to treat_leading_path() and
read_directory_recursive() to try to warn people of such problems.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many years ago, the directory traversing logic had an optimization that
would always recurse into any directory that was a common prefix of all
the pathspecs without walking the leading directories to get down to
the desired directory. Thus,
git ls-files -o .git/ # case A
would notice that .git/ was a common prefix of all pathspecs (since
it is the only pathspec listed), and then traverse into it and start
showing unknown files under that directory. Unfortunately, .git/ is not
a directory we should be traversing into, which made this optimization
problematic. This also affected cases like
git ls-files -o --exclude-standard t/ # case B
where t/ was in the .gitignore file and thus isn't interesting and
shouldn't be recursed into. It also affected cases like
git ls-files -o --directory untracked_dir/ # case C
where untracked_dir/ is indeed untracked and thus interesting, but the
--directory flag means we only want to show the directory itself, not
recurse into it and start listing untracked files below it.
The case B class of bugs were noted and fixed in commits 16e2cfa909
("read_directory(): further split treat_path()", 2010-01-08) and
48ffef966c ("ls-files: fix overeager pathspec optimization",
2010-01-08), with the idea being that we first wanted to check whether
the common prefix was interesting. The former patch noted that
treat_path() couldn't be used when checking the common prefix because
treat_path() requires a dir_entry() and we haven't read any directories
at the point we are checking the common prefix. So, that patch split
treat_one_path() out of treat_path(). The latter patch then created a
new treat_leading_path() which duplicated by hand the bits of
treat_path() that couldn't be broken out and then called
treat_one_path() for the remainder. There were three problems with this
approach:
* The duplicated logic in treat_leading_path() accidentally missed the
check for special paths (such as is_dot_or_dotdot and matching
".git"), causing case A types of bugs to continue to be an issue.
* The treat_leading_path() logic assumed we should traverse into
anything where path_treatment was not path_none, i.e. it perpetuated
class C types of bugs.
* It meant we had split logic that needed to kept in sync, running the
risk that people introduced new inconsistencies (such as in commit
be8a84c526, which we reverted earlier in this series, or in commit
df5bcdf83a which we'll fix in a subsequent commit)
Fix most these problems by making treat_leading_path() not only loop
over each leading path component, but calling treat_path() directly on
each. To do so, we have to create a synthetic dir_entry, but that only
takes a few lines. Then, pay attention to the path_treatment result we
get from treat_path() and don't treat path_excluded, path_untracked, and
path_recurse all the same as path_recurse.
This leaves one remaining problem, the new inconsistency from commit
df5bcdf83a. That will be addressed in a subsequent commit.
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* hw/doc-in-header:
trace2: move doc to trace2.h
submodule-config: move doc to submodule-config.h
tree-walk: move doc to tree-walk.h
trace: move doc to trace.h
run-command: move doc to run-command.h
parse-options: add link to doc file in parse-options.h
credential: move doc to credential.h
argv-array: move doc to argv-array.h
cache: move doc to cache.h
sigchain: move doc to sigchain.h
pathspec: move doc to pathspec.h
revision: move doc to revision.h
attr: move doc to attr.h
refs: move doc to refs.h
remote: move doc to remote.h and refspec.h
sha1-array: move doc to sha1-array.h
merge: move doc to ll-merge.h
graph: move doc to graph.h and graph.c
dir: move doc to dir.h
diff: move doc to diff.h and diffcore.h
When a user uses the sparse-checkout feature in cone mode, they
add patterns using "git sparse-checkout set <dir1> <dir2> ..."
or by using "--stdin" to provide the directories line-by-line over
stdin. This behaviour naturally looks a lot like the way a user
would type "git add <dir1> <dir2> ..."
If core.ignoreCase is enabled, then "git add" will match the input
using a case-insensitive match. Do the same for the sparse-checkout
feature.
Perform case-insensitive checks while updating the skip-worktree
bits during unpack_trees(). This is done by changing the hash
algorithm and hashmap comparison methods to optionally use case-
insensitive methods.
When this is enabled, there is a small performance cost in the
hashing algorithm. To tease out the worst possible case, the
following was run on a repo with a deep directory structure:
git ls-tree -d -r --name-only HEAD |
git sparse-checkout set --stdin
The 'set' command was timed with core.ignoreCase disabled or
enabled. For the repo with a deep history, the numbers were
core.ignoreCase=false: 62s
core.ignoreCase=true: 74s (+19.3%)
For reproducibility, the equivalent test on the Linux kernel
repository had these numbers:
core.ignoreCase=false: 3.1s
core.ignoreCase=true: 3.6s (+16%)
Now, this is not an entirely fair comparison, as most users
will define their sparse cone using more shallow directories,
and the performance improvement from eb42feca97 ("unpack-trees:
hash less in cone mode" 2019-11-21) can remove most of the
hash cost. For a more realistic test, drop the "-r" from the
ls-tree command to store only the first-level directories.
In that case, the Linux kernel repository takes 0.2-0.25s in
each case, and the deep repository takes one second, plus or
minus 0.05s, in each case.
Thus, we _can_ demonstrate a cost to this change, but it is
unlikely to matter to any reasonable sparse-checkout cone.
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>