development/git - git

mirror of https://github.com/git/git synced 2024-07-07 19:39:27 +00:00

Author	SHA1	Message	Date
Elijah Newren	baf889c2cd	sparse-index.h: move declarations for sparse-index.c from cache.h Note in particular that this reverses the decision made in `118a2e8bde` ("cache: move ensure_full_index() to cache.h", 2021-04-01). Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-06-21 13:39:53 -07:00
Derrick Stolee	cf9cd8b55c	fsck: use local repository In `0d30feef3c` (fsck: create scaffolding for rev-index checks, 2023-04-17) and later `5a6072f631` (fsck: validate .rev file header, 2023-04-17), the check_pack_rev_indexes() method was created with a 'struct repository *r' parameter. However, this parameter was unused and instead 'the_repository' was used in its place. Fix this situation with the obvious replacement. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-05-02 08:48:23 -07:00
Derrick Stolee	756f1bcd29	fsck: verify checksums of all .bitmap files If a filesystem-level corruption occurs in a .bitmap file, Git can react poorly. This could take the form of a run-time error due to failing to parse an EWAH bitmap or be more subtle such as returning the wrong set of objects to a fetch or clone. A natural first response to either of these kinds of errors is to run 'git fsck' to see if any files are corrupt. This currently ignores all .bitmap files. Add checks to 'git fsck' for all .bitmap files that are currently associated with a multi-pack-index or pack file. Verify their checksums using the hashfile API. We iterate through all multi-pack-indexes and pack-files to be sure to check all .bitmap files, not just the one that would be read by the process. For example, a multi-pack-index bitmap overrules a pack-bitmap. However, if the multi-pack-index is removed, the pack-bitmap may be selected instead. Be thorough to include every file that could become active in such a way. This includes checking files in alternates. There is potential that we could extend this effort to check the structure of the reachability bitmaps themselves, but it is very expensive to do so. At minimum, it's as expensive as generating the bitmaps in the first place, and that's assuming that we don't use the trivial algorithm of verifying each bitmap individually. The trivial algorithm will result in quadratic behavior (number of objects times number of bitmapped commits) while the bitmap building operation constructs a lattice of commits to build bitmaps incrementally and then generate the final bitmaps from a subset of those commits. If we were to extend 'git fsck' to check .bitmap file contents more closely like this, then we would likely want to hide it behind an option that signals the user is more willing to do expensive operations such as this. For testing, set up a repository with a pack-bitmap _and_ a multi-pack-index bitmap. This requires some file movement to avoid deleting the pack-bitmap during the repack that creates the multi-pack-index bitmap. We can then verify that 'git fsck' is checking all files, not just the "active" bitmap. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-05-02 08:48:22 -07:00
Junio C Hamano	a02675ad90	Merge branch 'ds/fsck-pack-revindex' "git fsck" learned to validate the on-disk pack reverse index files. * ds/fsck-pack-revindex: fsck: validate .rev file header fsck: check rev-index position values fsck: check rev-index checksums fsck: create scaffolding for rev-index checks	2023-04-27 16:00:59 -07:00
Derrick Stolee	5a6072f631	fsck: validate .rev file header While parsing a .rev file, we check the header information to be sure it makes sense. This happens before doing any additional validation such as a checksum or value check. In order to differentiate between a bad header and a non-existent file, we need to update the API for loading a reverse index. Make load_pack_revindex_from_disk() non-static and specify that a positive value means "the file does not exist" while other errors during parsing are negative values. Since an invalid header prevents setting up the structures we would use for further validations, we can stop at that point. The place where we can distinguish between a missing file and a corrupt file is inside load_revindex_from_disk(), which is used both by pack rev-indexes and multi-pack-index rev-indexes. Some tests in t5326 demonstrate that it is critical to take some conditions to allow positive error signals. Add tests that check the three header values. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-17 14:39:05 -07:00
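The return-value convention described in the commit above (a positive value for "the file does not exist", negative values for other errors) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `load_rev_header_sketch` and `check_rev_file_sketch` are hypothetical and are not Git's code, the header check is reduced to a four-byte magic, and the sketch relies on POSIX behavior where a failed `fopen()` sets `errno`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical loader, NOT Git's load_pack_revindex_from_disk().
 * Returns  0 if the file exists and starts with the expected magic,
 *          1 if the file does not exist (a soft signal, not corruption),
 *         -1 for any other failure, including a bad header.
 */
static int load_rev_header_sketch(const char *path, const char *magic)
{
	char buf[16];
	size_t len = strlen(magic);
	FILE *fp;

	assert(len > 0 && len <= sizeof(buf));

	fp = fopen(path, "rb");
	if (!fp)
		return errno == ENOENT ? 1 : -1; /* assumes POSIX errno semantics */

	if (fread(buf, 1, len, fp) != len || memcmp(buf, magic, len)) {
		fclose(fp);
		return -1; /* short read or wrong magic: treat as corrupt */
	}
	fclose(fp);
	return 0;
}

/*
 * Caller in the style of a checker: only negative values are reported.
 * The "RIDX" magic is an assumption made for this illustration.
 */
static int check_rev_file_sketch(const char *path)
{
	int ret = load_rev_header_sketch(path, "RIDX");

	if (ret < 0) {
		fprintf(stderr, "error: invalid rev-index header in %s\n", path);
		return 1; /* one problem found */
	}
	return 0; /* missing (ret > 0) or valid (ret == 0): nothing to report */
}
```

Keeping "missing" positive lets a caller test `ret < 0` for real corruption while still being able to tell an absent file apart from a valid one when it needs to.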
Derrick Stolee	0d30feef3c	fsck: create scaffolding for rev-index checks The 'fsck' builtin checks many of Git's on-disk data structures, but does not currently validate the pack rev-index files (a .rev file to pair with a .pack and .idx file). Before doing a more-involved check process, create the scaffolding within builtin/fsck.c to have a new error type and add that error type when the API method verify_pack_revindex() returns an error. That method does nothing currently, but we will add checks to it in later changes. For now, check that 'git fsck' succeeds without any errors in the normal case. Future checks will be paired with tests that corrupt the .rev file appropriately. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-17 14:39:04 -07:00
Elijah Newren	87bed17907	object-file.h: move declarations for object-file.c functions from cache.h Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-11 08:52:10 -07:00
Elijah Newren	dabab1d6e6	object-name.h: move declarations for object-name.c functions from cache.h Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-04-11 08:52:09 -07:00
Junio C Hamano	6047b28eb7	Merge branch 'en/header-split-cleanup' Split key function and data structure definitions out of cache.h to new header files and adjust the users. * en/header-split-cleanup: csum-file.h: remove unnecessary inclusion of cache.h write-or-die.h: move declarations for write-or-die.c functions from cache.h treewide: remove cache.h inclusion due to setup.h changes setup.h: move declarations for setup.c functions from cache.h treewide: remove cache.h inclusion due to environment.h changes environment.h: move declarations for environment.c functions from cache.h treewide: remove unnecessary includes of cache.h wrapper.h: move declarations for wrapper.c functions from cache.h path.h: move function declarations for path.c functions from cache.h cache.h: remove expand_user_path() abspath.h: move absolute path functions from cache.h environment: move comment_line_char from cache.h treewide: remove unnecessary cache.h inclusion from several sources treewide: remove unnecessary inclusion of gettext.h treewide: be explicit about dependence on gettext.h treewide: remove unnecessary cache.h inclusion from a few headers	2023-04-06 13:38:31 -07:00
Junio C Hamano	72871b198f	Merge branch 'ab/remove-implicit-use-of-the-repository' Code clean-up around the use of the_repository. * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"	2023-04-06 13:38:30 -07:00
Junio C Hamano	e7dca80692	Merge branch 'ab/remove-implicit-use-of-the-repository' into en/header-split-cache-h * ab/remove-implicit-use-of-the-repository: libs: use "struct repository *" argument, not "the_repository" post-cocci: adjust comments for recent repo_* migration cocci: apply the "revision.h" part of "the_repository.pending" cocci: apply the "rerere.h" part of "the_repository.pending" cocci: apply the "refs.h" part of "the_repository.pending" cocci: apply the "promisor-remote.h" part of "the_repository.pending" cocci: apply the "packfile.h" part of "the_repository.pending" cocci: apply the "pretty.h" part of "the_repository.pending" cocci: apply the "object-store.h" part of "the_repository.pending" cocci: apply the "diff.h" part of "the_repository.pending" cocci: apply the "commit.h" part of "the_repository.pending" cocci: apply the "commit-reach.h" part of "the_repository.pending" cocci: apply the "cache.h" part of "the_repository.pending" cocci: add missing "the_repository" macros to "pending" cocci: sort "the_repository" rules by header cocci: fix incorrect & verbose "the_repository" rules cocci: remove dead rule from "the_repository.pending.cocci"	2023-04-04 08:25:52 -07:00
Ævar Arnfjörð Bjarmason	d850b7a545	cocci: apply the "cache.h" part of "the_repository.pending" Apply the part of "the_repository.pending.cocci" pertaining to "cache.h". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-28 07:36:36 -07:00
Elijah Newren	f394e093df	treewide: be explicit about dependence on gettext.h Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-21 10:56:51 -07:00
Junio C Hamano	d0732a8120	Merge branch 'jk/unused-post-2.39-part2' More work towards -Wunused. * jk/unused-post-2.39-part2: (21 commits) help: mark unused parameter in git_unknown_cmd_config() run_processes_parallel: mark unused callback parameters userformat_want_item(): mark unused parameter for_each_commit_graft(): mark unused callback parameter rewrite_parents(): mark unused callback parameter fetch-pack: mark unused parameter in callback function notes: mark unused callback parameters prio-queue: mark unused parameters in comparison functions for_each_object: mark unused callback parameters list-objects: mark unused callback parameters mark unused parameters in signal handlers run-command: mark error routine parameters as unused mark "pointless" data pointers in callbacks ref-filter: mark unused callback parameters http-backend: mark unused parameters in virtual functions http-backend: mark argc/argv unused object-name: mark unused parameters in disambiguate callbacks serve: mark unused parameters in virtual functions serve: use repository pointer to get config ls-refs: drop config caching ...	2023-03-17 14:03:09 -07:00
Junio C Hamano	88cc8ed8bc	Merge branch 'en/header-cleanup' Code clean-up to clarify the rule that "git-compat-util.h" must be the first to be included. * en/header-cleanup: diff.h: remove unnecessary include of object.h Remove unnecessary includes of builtin.h treewide: replace cache.h with more direct headers, where possible replace-object.h: move read_replace_refs declaration from cache.h to here object-store.h: move struct object_info from cache.h dir.h: refactor to no longer need to include cache.h object.h: stop depending on cache.h; make cache.h depend on object.h ident.h: move ident-related declarations out of cache.h pretty.h: move has_non_ascii() declaration from commit.h cache.h: remove dependence on hex.h; make other files include it explicitly hex.h: move some hex-related declarations from cache.h hash.h: move some oid-related declarations from cache.h alloc.h: move ALLOC_GROW() functions from cache.h treewide: remove unnecessary cache.h includes in source files treewide: remove unnecessary cache.h includes treewide: remove unnecessary git-compat-util.h includes in headers treewide: ensure one of the appropriate headers is sourced first	2023-03-17 14:03:09 -07:00
Jeff King	8d3e7eac52	fsck: check even zero-entry index files In `fb64ca526a` (fsck: check index files in all worktrees, 2023-02-24), we swapped out a call to vanilla repo_read_index() for a series of read_index_from() calls, one per worktree. The code for the latter was copied from add_index_objects_to_pending(), which checks for a positive return value from the index reading function, and we do the same here in fsck now. But this is probably the wrong thing. I had interpreted the check as "don't operate on the index struct if there was an error". But in reality, if there is an error then the index-reading code will simply die (which admittedly is not great for fsck, but that is not a new problem). The return value here is actually the number of entries read. So it makes sense for add_index_objects_to_pending() to ignore a zero-entry index (there is nothing to add). But for fsck, we would still want to check any extensions, etc (though presumably it is unlikely to have them in an empty index, I don't think it's impossible). So we should ignore the return value from read_index_from() entirely. This matches the behavior before `fb64ca526a`, when we ignored the return value from repo_read_index(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-27 07:36:36 -08:00
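The distinction drawn in the commit above, between a return value that counts entries and one that signals an error, can be illustrated with a small sketch. Everything here is hypothetical (`struct index_sketch`, `read_index_sketch`, and both checker functions are invented for illustration and do not mirror Git's data structures); the only point taken from the commit message is that gating on a positive count silently skips an index that has zero entries but may still carry extensions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an index: entry count plus extension count. */
struct index_sketch {
	size_t nr_entries;
	size_t nr_extensions;
};

/* Like the reader described above: returns the number of entries read. */
static int read_index_sketch(const struct index_sketch *idx)
{
	assert(idx);
	return (int)idx->nr_entries;
}

/* Counts how many items a checker would look at. */
static size_t things_checked(const struct index_sketch *idx)
{
	return idx->nr_entries + idx->nr_extensions;
}

/* Pattern borrowed from a caller that only adds entries: skips empty indexes. */
static size_t fsck_index_gated(const struct index_sketch *idx)
{
	if (read_index_sketch(idx) > 0)
		return things_checked(idx);
	return 0;
}

/* Pattern the commit argues for: ignore the count and always check. */
static size_t fsck_index_always(const struct index_sketch *idx)
{
	(void)read_index_sketch(idx);
	return things_checked(idx);
}
```

With an index of zero entries and two extensions, the gated variant checks nothing while the unconditional variant still visits both extensions.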
Jeff King	592ec63b38	fsck: mention file path for index errors If we encounter an error in an index file, we may say something like: error: 1234abcd: invalid sha1 pointer in resolve-undo But if you have multiple worktrees, each with its own index, it can be very helpful to know which file had the problem. So let's pass that path down through the various index-fsck functions and use it where appropriate. After this patch you should get something like: error: 1234abcd: invalid sha1 pointer in resolve-undo of .git/worktrees/wt/index That's a bit verbose, but since the point is that you shouldn't see this normally, we're better to err on the side of more details. I've also added the index filename to the name used by "fsck --name-objects", which will show up if we find the object to be missing, etc. This is bending the rules a little there, as the option claims to write names that can be fed to rev-parse. But there is no revision syntax to access the index of another worktree, so the best we can do is make up something that a human will probably understand. I did take care to retain the existing ":file" syntax for the current worktree. So the uglier output should kick in only when it's actually necessary. See the included tests for examples of both forms. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-24 09:32:23 -08:00
Jeff King	fb64ca526a	fsck: check index files in all worktrees We check the index file for the main worktree, but completely ignore the index files in other worktrees. These should be checked, too, as they are part of the repository state (and in particular, errors in those index files may cause repo-wide operations like "git gc" to complain). Reported-by: Johannes Sixt <j6t@kdbg.org> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-24 09:32:23 -08:00
Jeff King	8840069a37	fsck: factor out index fsck The code to fsck an index operates directly on the_index. Let's move it into its own function in preparation for handling the index files from other worktrees. Since we now have only a single reference to the_index, let's drop our USE_THE_INDEX_VARIABLE definition and just use the_repository.index directly. That's a minor cleanup, but also ensures that we didn't miss any references when moving the code into fsck_index(). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-24 09:30:58 -08:00
Jeff King	be252d3349	for_each_object: mark unused callback parameters The for_each_{loose,packed}_object interface uses callback functions, but not every callback needs all of the parameters. Mark the unused ones to satisfy -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-24 09:13:31 -08:00
Elijah Newren	cbeab74713	replace-object.h: move read_replace_refs declaration from cache.h to here Adjust several files to be more explicit about their dependency on replace-objects to accommodate this change. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-23 17:25:30 -08:00
Elijah Newren	41771fa435	cache.h: remove dependence on hex.h; make other files include it explicitly Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-23 17:25:29 -08:00
Ævar Arnfjörð Bjarmason	07047d6829	cocci: apply "pending" index-compatibility to some "builtin/*.c" Apply "index-compatibility.pending.cocci" rule to "builtin/*", but exclude those where we conflict with in-flight changes. As a result some of them end up using only "the_index", so let's have them use the more narrow "USE_THE_INDEX_VARIABLE" rather than "USE_THE_INDEX_COMPATIBILITY_MACROS". Manual changes not made by coccinelle, that were squashed in: * Whitespace-wrap argument lists for repo_hold_locked_index(), repo_read_index_preload() and repo_refresh_and_write_index(), in cases where the line became too long after the transformation. * Change "refresh_cache()" to "refresh_index()" in a comment in "builtin/update-index.c". * For those whose call was followed by perror("<macro-name>"), change it to perror("<function-name>"), referring to the new function. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-11-21 12:06:15 +09:00
Ævar Arnfjörð Bjarmason	dc594180d9	cocci & cache.h: apply variable section of "pending" index-compatibility Mostly apply the part of "index-compatibility.pending.cocci" that renames the global variables like "active_nr", which are a shorthand to referencing (in that case) a struct member as "the_index.cache_nr". In doing so move more of "index-compatibility.pending.cocci" to "index-compatibility.cocci". In the case of "active_nr" we'd have a textual conflict with "ab/various-leak-fixes" in "next"[1]. Let's exclude that specific case while moving the rule over from "pending". 1. `407b94280f` (commit: discard partial cache before (re-)reading it, 2022-11-08) Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-11-21 12:06:15 +09:00
Junio C Hamano	7b9b634ca5	Merge branch 'ab/doc-synopsis-and-cmd-usage' The short-help text shown by "git cmd -h" and the synopsis text shown at the beginning of "git help cmd" have been made more consistent. * ab/doc-synopsis-and-cmd-usage: (34 commits) tests: assert consistent whitespace in -h output tests: start asserting that *.txt SYNOPSIS matches -h output doc txt & -h consistency: make "worktree" consistent worktree: define subcommand -h in terms of command -h reflog doc: list real subcommands up-front doc txt & -h consistency: make "commit" consistent doc txt & -h consistency: make "diff-tree" consistent doc txt & -h consistency: use "[<label>...]" for "zero or more" doc txt & -h consistency: make "annotate" consistent doc txt & -h consistency: make "stash" consistent doc txt & -h consistency: add missing options doc txt & -h consistency: use "git foo" form, not "git-foo" doc txt & -h consistency: make "bundle" consistent doc txt & -h consistency: make "read-tree" consistent doc txt & -h consistency: make "rerere" consistent doc txt & -h consistency: add missing options and labels doc txt & -h consistency: make output order consistent doc txt & -h consistency: add or fix optional "--" syntax doc txt & -h consistency: fix mismatching labels doc SYNOPSIS & -h: use "-" to separate words in labels, not "_" ...	2022-10-28 11:26:54 -07:00
Ævar Arnfjörð Bjarmason	d9054a19ed	doc txt & -h consistency: add missing options Change those built-in commands that were attempting to exhaustively list the options in the "-h" output to actually do so, and always have .txt documentation know about the exhaustive list of options. Let's also fix the documentation and -h output for those built-in commands where the .txt and -h output was a mismatch of missing options on both sides. In the case of "interpret-trailers" fixing the missing options reveals that the *.txt version was implicitly claiming that the command had two operating modes, which a look at the -h version (and studying the documentation) will show is not the case. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-10-13 09:32:57 -07:00
Junio C Hamano	fdbfac60fd	Merge branch 'jk/fsck-on-diet' "git fsck" failed to release contents of tree objects already used from the memory, which has been fixed. * jk/fsck-on-diet: parse_object_buffer(): respect save_commit_buffer fsck: turn off save_commit_buffer fsck: free tree buffers after walking unreachable objects	2022-10-10 10:08:39 -07:00
Jeff King	51b27747e5	parse_object_buffer(): respect save_commit_buffer If the global variable "save_commit_buffer" is set to 0, then parse_commit() will throw away the commit object data after parsing it, rather than sticking it into a commit slab. This goes all the way back to `60ab26de99` ([PATCH] Avoid wasting memory in git-rev-list, 2005-09-15). But there's another code path which may similarly stash the buffer: parse_object_buffer(). This is where we end up if we parse a commit via parse_object(), and it's used directly in a few other code paths like git-fsck. The original goal of `60ab26de99` was avoiding extra memory usage for rev-list. And there it's not all that important to catch parse_object(). We use that function only for looking at the tips of the traversal, and the majority of the commits are parsed by following parent links, where we use parse_commit() directly. So we were wasting some memory, but only a small portion. It's much easier to see the effect with fsck. Since we now turn off save_commit_buffer by default there, we _should_ be able to drop the freeing of the commit buffer in fsck_obj(). But if we do so (taking the first hunk of this patch without the rest), then the peak heap of "git fsck" in a clone of git.git goes from 136MB to 194MB. Teaching parse_object_buffer() to respect save_commit_buffer brings that down to 134.5MB (it's hard to tell from massif's output, but I suspect the savings comes from avoiding the overhead of the mostly-empty commit slab). Other programs should see a small improvement. Both "rev-list --all" and "fsck --connectivity-only" improve by a few hundred kilobytes, as they'd avoid loading the tip objects of their traversals. Most importantly, no code should be hurt by doing this. Any program that turns off save_commit_buffer is already making the assumption that any commit it sees may need to have its object data loaded on demand, as it doesn't know which ones were parsed by parse_commit() versus parse_object(). Not to mention that anything parsed by the commit graph may be in the same boat, even if save_commit_buffer was not disabled. This should be the only spot that needs to be fixed. Grepping for set_commit_buffer() shows that this and parse_commit() are the only relevant calls. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-22 11:40:47 -07:00
Jeff King	069e445256	fsck: turn off save_commit_buffer When parsing a commit, the default behavior is to stuff the original buffer into a commit_slab (which takes ownership of it). But for a tool like fsck, this isn't useful. While we may look at the buffer further as part of fsck_commit(), we'll always do so through a separate pointer; attaching the buffer to the slab doesn't help. Worse, it means we have to remember to free the commit buffer in all call paths. We do so in fsck_obj(), which covers a regular "git fsck". But with "--connectivity-only", we forget to do so in both traverse_one_object(), which covers reachable objects, and mark_unreachable_referents(), which covers unreachable ones. As a result, that mode ends up storing an uncompressed copy of every commit on the heap at once. We could teach the code paths for --connectivity-only to also free commit buffers. But there's an even easier fix: we can just turn off the save_commit_buffer flag, and then we won't attach them to the commits in the first place. This reduces the peak heap of running "git fsck --connectivity-only" in a clone of linux.git from ~2GB to ~1GB. According to massif, the remaining memory goes where you'd expect: the object structs themselves, the obj_hash containing them, and the delta base cache. Note that we'll leave the call to free commit buffers in fsck_obj() for now; it's not quite redundant because of a related bug that we'll fix in a subsequent commit. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-22 11:40:11 -07:00
Jeff King	fbce4fa9ae	fsck: free tree buffers after walking unreachable objects After calling fsck_walk(), a tree object struct may be left in the parsed state, with the full tree contents available via tree->buffer. It's the responsibility of the caller to free these when it's done with the object to avoid having many trees allocated at once. In a regular "git fsck", we hit fsck_walk() only from fsck_obj(), which does call free_tree_buffer(). Likewise for "--connectivity-only", we see most objects via traverse_one_object(), which makes a similar call. The exception is in mark_unreachable_referents(). When using both "--connectivity-only" and "--dangling" (the latter of which is the default), we walk all of the unreachable objects, and there we forget to free. Most cases would not notice this, because they don't have a lot of unreachable objects, but you can make a pathological case like this: git clone --bare /path/to/linux.git repo.git cd repo.git rm packed-refs ;# now everything is unreachable! git fsck --connectivity-only That ends up with peak heap usage ~18GB, which is (not coincidentally) close to the size of all uncompressed trees in the repository. After this patch, the peak heap is only ~2GB. A few things to note: - it might seem like fsck_walk(), if it is parsing the trees, should be responsible for freeing them. But the situation is quite tricky. In the non-connectivity mode, after we call fsck_walk() we then proceed with fsck_object() which actually does the type-specific sanity checks on the object contents. We do pass our own separate buffer to fsck_object(), but there's a catch: our earlier call to parse_object_buffer() may have attached that buffer to the object struct! So by freeing it, we leave the rest of the code with a dangling pointer. Likewise, the call to fsck_walk() in index-pack is subtle. It attaches a buffer to the tree object that must not be freed! And so rather than calling free_tree_buffer(), it actually detaches it by setting tree->buffer to NULL. These cases would _probably_ be fixable by having fsck_walk() free the tree buffer only when it was the one who allocated it via parse_tree(). But that would still leave the callers responsible for freeing other cases, so they wouldn't be simplified. While the current semantics for fsck_walk() make it easy to accidentally leak in new callers, at least they are simple to explain, and it's not a function that's likely to get a lot of new call-sites. And in any case, it's probably sensible to fix the leak first with this simple patch, and try any more complicated refactoring separately. - a careful reader may notice that fsck_obj() also frees commit buffers, but neither the call in traverse_one_object() nor the one touched in this patch does so. And indeed, this is another problem for --connectivity-only (and accounts for most of the 2GB heap after this patch), but it's one we'll fix in a separate commit. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-22 11:30:06 -07:00
Ævar Arnfjörð Bjarmason	5cf88fd8b0	git-compat-util.h: use "UNUSED", not "UNUSED(var)" As reported in [1] the "UNUSED(var)" macro introduced in `2174b8c75d` (Merge branch 'jk/unused-annotation' into next, 2022-08-24) breaks coccinelle's parsing of our sources in files where it occurs. Let's instead partially go with the approach suggested in [2] of making this not take an argument. As noted in [1] "coccinelle" will ignore such tokens in argument lists that it doesn't know about, and it's less of a surprise to syntax highlighters. This undoes the "help us notice when a parameter marked as unused is actually use" part of `9b24034754` (git-compat-util: add UNUSED macro, 2022-08-19), a subsequent commit will further tweak the macro to implement a replacement for that functionality. 1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-01 10:49:48 -07:00
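A minimal sketch of the argument-less "UNUSED" style that the commit above describes, applied to a callback that does not need all of its parameters. The macro definition shown is an assumption for illustration (Git's real definition may differ and may carry additional attributes), and `each_item_fn`, `for_each_item_sketch`, and `count_items` are hypothetical names, not part of Git.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed definition for illustration only; not copied from Git. */
#if defined(__GNUC__)
#define UNUSED __attribute__((unused))
#else
#define UNUSED
#endif

/* Hypothetical callback interface: every callback gets a name and user data. */
typedef int (*each_item_fn)(const char *name, void *cb_data);

/* Calls fn once per name; stops early if a callback returns non-zero. */
static int for_each_item_sketch(const char **names, size_t nr,
				each_item_fn fn, void *cb_data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		int ret = fn(names[i], cb_data);
		if (ret)
			return ret;
	}
	return 0;
}

/*
 * This callback only counts items, so it never looks at "name".
 * The annotation follows the parameter and takes no argument, which
 * keeps -Wunused-parameter quiet without changing the signature.
 */
static int count_items(const char *name UNUSED, void *cb_data)
{
	assert(cb_data);
	(*(int *)cb_data)++;
	return 0;
}
```

Because the annotation is a bare token after the declarator rather than a macro wrapping the variable name, the parameter list still reads as ordinary C to tools that do not know the macro.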
Jeff King	c006e9fa59	refs: mark unused reflog callback parameters Functions used with for_each_reflog_ent() need to conform to a particular interface, but not every function needs all of the parameters. Mark the unused ones to make -Wunused-parameter happy. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-08-19 12:18:54 -07:00
Jeff King	63e14ee2d6	refs: mark unused each_ref_fn parameters Functions used with for_each_ref(), etc, need to conform to the each_ref_fn interface. But most of them don't need every parameter; let's annotate the unused ones to quiet -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-08-19 12:18:54 -07:00
Junio C Hamano	e0ad13977a	fsck: do not dereference NULL while checking resolve-undo data When we found an invalid object recorded in the resolve-undo data, we would have ended up dereferencing NULL while fsck. Reporting the problem and going on to the next object is the right thing to do here. Noticed by SZEDER Gábor. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-07-11 16:26:33 -07:00
Junio C Hamano	5a5ea141e7	revision: mark blobs needed for resolve-undo as reachable The resolve-undo extension was added to the index in `cfc5789a` (resolve-undo: record resolved conflicts in a new index extension section, 2009-12-25). This extension records the blob object names and their modes of conflicted paths when the path gets resolved (e.g. with "git add"), to allow "undoing" the resolution with "checkout -m path". These blob objects should be guarded from garbage-collection while we have the resolve-undo information in the index (otherwise unresolve operation may try to use a blob object that has already been pruned away). But the code called from mark_reachable_objects() for the index forgets to do so. Teach add_index_objects_to_pending() helper to also add objects referred to by the resolve-undo extension. Also make matching changes to "fsck", which has code that is fairly similar to the reachability stuff, but have parallel implementations for all these stuff, which may (or may not) someday want to be unified. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-06-09 16:45:07 -07:00
Ævar Arnfjörð Bjarmason	2b7098936c	run-command API users: use strvec_pushl(), not argv construction Change a pattern of hardcoding an "argv" array size, populating it and assigning to the "argv" member of "struct child_process" to instead use "strvec_pushl()" to add data to the "args" member. This implements the same behavior as before in fewer lines of code, and moves us further towards being able to remove the "argv" member in a subsequent commit. Since we've entirely removed the "argv" variable(s) we can be sure that no potential logic errors of the type discussed in a preceding commit are being introduced here, i.e. ones where the local "argv" was being modified after the assignment to "struct child_process"'s "argv". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-11-25 22:15:07 -08:00
Junio C Hamano	2c0fa66bc8	Merge branch 'ab/fsck-unexpected-type' Regression fix. * ab/fsck-unexpected-type: object-file: free(*contents) only in read_loose_object() caller object-file: fix SEGV on free() regression in v2.34.0-rc2	2021-11-12 15:29:25 -08:00
Ævar Arnfjörð Bjarmason	16235e3b14	object-file: free(*contents) only in read_loose_object() caller In the preceding commit a free() of uninitialized memory regression in `96e41f58fe` (fsck: report invalid object type-path combinations, 2021-10-01) was fixed, but we'd still have an issue with leaking memory from fsck_loose(). Let's fix that issue too. That issue was introduced in my `31deb28f5e` (fsck: don't hard die on invalid object types, 2021-10-01). It can be reproduced under SANITIZE=leak with the test I added in `093fffdfbe` (fsck tests: add test for fsck-ing an unknown type, 2021-10-01): ./t1450-fsck.sh --run=84 -vixd In some sense it's not a problem, we lost the same amount of memory in terms of things malloc'd and not free'd. It just moved from the "still reachable" to "definitely lost" column in valgrind(1) nomenclature[1], since we'd have die()'d before. But now that we don't hard die() anymore in the library let's properly free() it. Doing so makes this code much easier to follow, since we'll now have one function owning the freeing of the "contents" variable, not two. For context on that memory management pattern the read_loose_object() function was added in `f6371f9210` (sha1_file: add read_loose_object() function, 2017-01-13) and subsequently used in `c68b489e56` (fsck: parse loose object paths directly, 2017-01-13). The pattern of it being the task of both sides to free() the memory has been there in this form since its inception. 1. https://valgrind.org/docs/manual/mc-manual.html#mc-manual.leaks Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-11-11 13:40:43 -08:00
Junio C Hamano	7afb458e91	Merge branch 'gc/use-repo-settings' It is wrong to read some settings directly from the config subsystem, as things like feature.experimental can affect their default values. * gc/use-repo-settings: gc: perform incremental repack when implictly enabled fsck: verify multi-pack-index when implictly enabled fsck: verify commit graph when implicitly enabled	2021-11-01 13:48:08 -07:00
Junio C Hamano	061a21d36d	Merge branch 'ab/fsck-unexpected-type' "git fsck" has been taught to report mismatch between expected and actual types of an object better. * ab/fsck-unexpected-type: fsck: report invalid object type-path combinations fsck: don't hard die on invalid object types object-file.c: stop dying in parse_loose_header() object-file.c: return ULHR_TOO_LONG on "header too long" object-file.c: use "enum" return type for unpack_loose_header() object-file.c: simplify unpack_loose_short_header() object-file.c: make parse_loose_header_extended() public object-file.c: return -1, not "status" from unpack_loose_header() object-file.c: don't set "typep" when returning non-zero cat-file tests: test for current --allow-unknown-type behavior cat-file tests: add corrupt loose object test cat-file tests: test for missing/bogus object with -t, -s and -p cat-file tests: move bogus_* variable declarations earlier fsck tests: test for garbage appended to a loose object fsck tests: test current hash/type mismatch behavior fsck tests: refactor one test to use a sub-repo fsck tests: add test for fsck-ing an unknown type	2021-10-25 16:06:56 -07:00
Glen Choo	dc5570872f	fsck: verify multi-pack-index when implictly enabled Like the previous commit, change fsck to check the "core_multi_pack_index" variable set in "repo-settings.c" instead of reading the "core.multiPackIndex" config variable. This fixes a bug where we wouldn't verify midx if the config key was missing. This bug was introduced in `18e449f86b` (midx: enable core.multiPackIndex by default, 2020-09-25) where core.multiPackIndex was turned on by default. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Glen Choo <chooglen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-10-15 14:30:08 -07:00
Glen Choo	f30e4d854b	fsck: verify commit graph when implicitly enabled Change fsck to check the "core_commit_graph" variable set in "repo-settings.c" instead of reading the "core.commitGraph" variable. This fixes a bug where we wouldn't verify the commit-graph if the config key was missing. This bug was introduced in `31b1de6a09` (commit-graph: turn on commit-graph by default, 2019-08-13), where core.commitGraph was turned on by default. Add tests to "t5318-commit-graph.sh" to verify that fsck checks the commit-graph as expected for the 3 values of core.commitGraph. Also, disable GIT_TEST_COMMIT_GRAPH in t/t0410-partial-clone.sh because some test cases use fsck in ways that assume that commit-graph checking is disabled. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Glen Choo <chooglen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-10-15 14:30:07 -07:00
Ævar Arnfjörð Bjarmason	96e41f58fe	fsck: report invalid object type-path combinations Improve the error that's emitted in cases where we find a loose object we parse, but which isn't at the location we expect it to be. Before this change we'd prefix the error with a not-a-OID derived from the path at which the object was found, due to an emergent behavior in how we'd end up with an "OID" in these codepaths. Now we'll instead say what object we hashed, and what path it was found at. Before this patch series e.g.: $ git hash-object --stdin -w -t blob </dev/null `e69de29bb2` $ mv objects/e6/ objects/e7 Would emit ("[...]" used to abbreviate the OIDs): git fsck error: hash mismatch for ./objects/e7/9d[...] (expected e79d[...]) error: e79d[...]: object corrupt or missing: ./objects/e7/9d[...] Now we'll instead emit: error: e69d[...]: hash-path mismatch, found at: ./objects/e7/9d[...] Furthermore, we'll do the right thing when the object type and its location are bad. I.e. this case: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ mv objects/83 objects/84 As noted in an earlier commits we'd simply die early in those cases, until preceding commits fixed the hard die on invalid object type: $ git fsck fatal: invalid object type Now we'll instead emit sensible error messages: $ git fsck error: 8315[...]: hash-path mismatch, found at: ./objects/84/15[...] error: 8315[...]: object is of unknown type 'garbage': ./objects/84/15[...] In both fsck.c and object-file.c we're using null_oid as a sentinel value for checking whether we got far enough to be certain that the issue was indeed this OID mismatch. We need to add the "object corrupt or missing" special-case to deal with cases where read_loose_object() will return an error before completing check_object_signature(), e.g. 
if we have an error in unpack_loose_rest() because we find garbage after the valid gzip content: $ git hash-object --stdin -w -t blob </dev/null `e69de29bb2` $ chmod 755 objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ echo garbage >>objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 $ git fsck error: garbage at end of loose object 'e69d[...]' error: unable to unpack contents of ./objects/e6/9d[...] error: e69d[...]: object corrupt or missing: ./objects/e6/9d[...] There is currently some weird messaging in the edge case when the two are combined, i.e. because we're not explicitly passing along an error state about this specific scenario from check_stream_oid() via read_loose_object() we'll end up printing the null OID if an object is of an unknown type and it can't be unpacked by zlib, e.g.: $ git hash-object --stdin -w -t garbage --literally </dev/null 8315a83d2acc4c174aed59430f9a9c4ed926440f $ chmod 755 objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ echo garbage >>objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f $ /usr/bin/git fsck fatal: invalid object type $ ~/g/git/git fsck error: garbage at end of loose object '8315a83d2acc4c174aed59430f9a9c4ed926440f' error: unable to unpack contents of ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 8315a83d2acc4c174aed59430f9a9c4ed926440f: object corrupt or missing: ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f error: 0000000000000000000000000000000000000000: object is of unknown type 'garbage': ./objects/83/15a83d2acc4c174aed59430f9a9c4ed926440f [...] I think it's OK to leave that for future improvements, which would involve enum-ifying more error state as we've done with "enum unpack_loose_header_result" in preceding commits. In these increasingly more obscure cases the worst that can happen is that we'll get slightly nonsensical or inapplicable error messages. 
There's other such potential edge cases, all of which might produce some confusing messaging, but still be handled correctly as far as passing along errors goes. E.g. if check_object_signature() returns and oideq(real_oid, null_oid()) is true, which could happen if it returns -1 due to the read_istream() call having failed. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-10-01 15:06:01 -07:00
Ævar Arnfjörð Bjarmason	31deb28f5e	fsck: don't hard die on invalid object types Change the error fsck emits on invalid object types, such as: $ git hash-object --stdin -w -t garbage --literally </dev/null <OID> From the very ungraceful error of: $ git fsck fatal: invalid object type $ To: $ git fsck error: <OID>: object is of unknown type 'garbage': <OID_PATH> [ other fsck output ] We'll still exit with non-zero, but now we'll finish the rest of the traversal. The tests that's being added here asserts that we'll still complain about other fsck issues (e.g. an unrelated dangling blob). To do this we need to pass down the "OBJECT_INFO_ALLOW_UNKNOWN_TYPE" flag from read_loose_object() through to parse_loose_header(). Since the read_loose_object() function is only used in builtin/fsck.c we can simply change it to accept a "struct object_info" (which contains the OBJECT_INFO_ALLOW_UNKNOWN_TYPE in its flags). See `f6371f9210` (sha1_file: add read_loose_object() function, 2017-01-13) for the introduction of read_loose_object(). Since we'll need a "struct strbuf" to hold the "type_name" let's pass it to the for_each_loose_file_in_objdir() callback to avoid allocating a new one for each loose object in the iteration. It also makes the memory management simpler than sticking it in fsck_loose() itself, as we'll only need to strbuf_reset() it, with no need to do a strbuf_release() before each "return". Before this commit we'd never check the "type" if read_loose_object() failed, but now we do. We therefore need to initialize it to OBJ_NONE to be able to tell the difference between e.g. its unpack_loose_header() having failed, and us getting past that and into parse_loose_header(). Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-10-01 15:06:01 -07:00
Junio C Hamano	ed125c4f07	Merge branch 'ab/fsck-api-cleanup' Last minute compilation fix. * ab/fsck-api-cleanup: builtin/fsck.c: don't conflate "int" and "enum" in callback	2021-06-02 07:34:27 +09:00
Ævar Arnfjörð Bjarmason	28abf260a5	builtin/fsck.c: don't conflate "int" and "enum" in callback Fix a warning on AIX's xlc compiler that's been emitted since my `a1aad71601` (fsck.h: use "enum object_type" instead of "int", 2021-03-28): "builtin/fsck.c", line 805.32: 1506-068 (W) Operation between types "int(*)(struct object*,enum object_type,void*,struct fsck_options*)" and "int(*)(struct object*,int,void*,struct fsck_options*)" is not allowed. I.e. it complains about us assigning a function with a prototype "int" where we're expecting "enum object_type". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-06-02 05:59:15 +09:00
Junio C Hamano	8e97852919	Merge branch 'ds/sparse-index-protections' Builds on top of the sparse-index infrastructure to mark operations that are not ready to mark with the sparse index, causing them to fall back on fully-populated index that they always have worked with. * ds/sparse-index-protections: (47 commits) name-hash: use expand_to_path() sparse-index: expand_to_path() name-hash: don't add directories to name_hash revision: ensure full index resolve-undo: ensure full index read-cache: ensure full index pathspec: ensure full index merge-recursive: ensure full index entry: ensure full index dir: ensure full index update-index: ensure full index stash: ensure full index rm: ensure full index merge-index: ensure full index ls-files: ensure full index grep: ensure full index fsck: ensure full index difftool: ensure full index commit: ensure full index checkout: ensure full index ...	2021-04-30 13:50:26 +09:00
Derrick Stolee	2227ea175f	fsck: ensure full index When verifying all blobs reachable from the index, ensure that a sparse index has been expanded to a full one to avoid missing some blobs. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-14 13:47:11 -07:00
Jeff King	45a187cc34	lookup_unknown_object(): take a repository argument All of the other lookup_foo() functions take a repository argument, but lookup_unknown_object() was never converted, and it uses the_repository internally. Let's fix that. We could leave a wrapper that uses the_repository, but there aren't that many calls, so we'll just convert them all. I looked briefly at each site to see if we had a repository struct (besides the_repository) we could pass, but none of them do (so this conversion to pass the_repository is a pure noop in each case, though it does take us one step closer to eventually getting rid of the_repository). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-13 13:18:46 -07:00
Ævar Arnfjörð Bjarmason	394d5d31b0	fsck.c: pass along the fsck_msg_id in the fsck_error callback Change the fsck_error callback to also pass along the fsck_msg_id. Before this change the only way to get the message id was to parse it back out of the "message". Let's pass it down explicitly for the benefit of callers that might want to use it, as discussed in [1]. Passing the msg_type is now redundant, as you can always get it back from the msg_id, but I'm not changing that convention. It's really common to need the msg_type, and the report() function itself (which calls "fsck_error") needs to call fsck_msg_type() to discover it. Let's not needlessly re-do that work in the user callback. 1. https://lore.kernel.org/git/87blcja2ha.fsf@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-28 19:03:10 -07:00


228 Commits