development/git - HydraGit

mirror of https://github.com/git/git synced 2024-08-28 03:59:25 +00:00

Author	SHA1	Message	Date
Elijah Newren	d5ebb50dcb	wrapper.h: move declarations for wrapper.c functions from cache.h Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-21 10:56:53 -07:00
Elijah Newren	f394e093df	treewide: be explicit about dependence on gettext.h Dozens of files made use of gettext functions, without explicitly including gettext.h. This made it more difficult to find which files could remove a dependence on cache.h. Make C files explicitly include gettext.h if they are using it. However, while compat/fsmonitor/fsm-ipc-darwin.c should also gain an include of gettext.h, it was left out to avoid conflicting with an in-flight topic. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-21 10:56:51 -07:00
Taylor Blau	c08645b353	commit-graph: introduce `ensure_generations_valid()` Use the just-introduced compute_reachable_generation_numbers_1() to implement a function which dynamically computes topological levels (or corrected commit dates) for out-of-graph commits. This will be useful for the ahead-behind algorithm we are about to introduce, which needs accurate topological levels on _all_ commits reachable from the tips in order to avoid over-counting. Co-authored-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-20 12:17:33 -07:00
Derrick Stolee	2ee11f7261	commit-graph: return generation from memory The commit_graph_generation() method used to report a value of GENERATION_NUMBER_INFINITY if the commit_graph_data_slab had an instance for the given commit but the graph_pos indicated the commit was not in the commit-graph file. However, an upcoming change will introduce the ability to set generation values in-memory without writing the commit-graph file. Thus, we can no longer trust 'graph_pos' to indicate whether or not the generation member can be trusted. Instead, trust the 'generation' member if the commit has a value in the slab _and_ the 'generation' member is non-zero. Otherwise, treat it as GENERATION_NUMBER_INFINITY. This only makes a difference for a very old case for the commit-graph: the very first Git release to write commit-graph files wrote zeroes in the topological level positions. If we are parsing a commit-graph with all zeroes, those commits will now appear to have GENERATION_NUMBER_INFINITY (as if they were not parsed from the commit-graph). I attempted several variations to work around the need for providing an uninitialized 'generation' member, but this was the best one I found. It does require a change to a verification test in t5318 because it reports a different error than the one about non-zero generation numbers. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-20 12:17:33 -07:00
Derrick Stolee	80c928d947	commit-graph: simplify compute_generation_numbers() The previous change introduced the generic algorithm compute_reachable_generation_numbers() and used it as the core functionality of compute_topological_levels(). Now, use it as the core functionality of compute_generation_numbers(). The main difference here is that we use generation version 2, which is used in to toggle the logic in compute_generation_from_max() for computing the corrected commit date based on the corrected commit dates of the parent commits (and the commit date of the current commit). It also uses different methods for (get\|set)_generation in the vtable in order to store and access the value in the correct places. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-20 12:17:33 -07:00
Derrick Stolee	368d19b0b7	commit-graph: refactor compute_topological_levels() This patch extracts the common code used to compute topological levels and corrected committer dates into a common routine, compute_reachable_generation_numbers(). For ease of reading, it only modifies compute_topological_levels() to use this new routine, leaving compute_generation_numbers() to be modified in the next change. This new routine dispatches to call the necessary functions to get and set the generation number for a given commit through a vtable (the compute_generation_info struct). Computing the generation number itself is done in compute_generation_from_max(), which dispatches its implementation based on the generation version requested, or issuing a BUG() for unrecognized generation versions. This does not use a vtable because the logic depends only on the generation number version, not where the data is being loaded from or being stored to. This is a subtle point that will make more sense in a future change that modifies the in-memory generation values instead of just preparing values for writing to a commit-graph file. This change looks like it adds a lot of new code. However, two upcoming changes will be quite small due to the work being done in this change. Co-authored-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-03-20 12:17:33 -07:00
Elijah Newren	41771fa435	cache.h: remove dependence on hex.h; make other files include it explicitly Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-02-23 17:25:29 -08:00
René Scharfe	6e57841096	use DUP_ARRAY Add a semantic patch for replace ALLOC_ARRAY+COPY_ARRAY with DUP_ARRAY to reduce code duplication and apply its results. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2023-01-09 13:28:36 +09:00
Junio C Hamano	3239100b5a	Merge branch 'ml/commit-graph-expire-dir-leak-fix' A result from opendir() was leaking in the commit-graph expiration codepath, which has been plugged. * ml/commit-graph-expire-dir-leak-fix: commit-graph: Fix missing closedir in expire_commit_graphs	2022-09-21 14:23:14 -07:00
Miaoqian Lin	12f1ae5324	commit-graph: Fix missing closedir in expire_commit_graphs The function calls opendir() but missing the corresponding closedir() before exit the function. Add missing closedir() to fix it. Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-19 10:42:38 -07:00
Junio C Hamano	dd407f1c7c	Merge branch 'ab/unused-annotation' Undoes 'jk/unused-annotation' topic and redoes it to work around Coccinelle rules misfiring false positives in unrelated codepaths. * ab/unused-annotation: git-compat-util.h: use "deprecated" for UNUSED variables git-compat-util.h: use "UNUSED", not "UNUSED(var)"	2022-09-14 12:56:39 -07:00
Junio C Hamano	a6b42ec0c6	Merge branch 'jk/unused-annotation' Annotate function parameters that are not used (but cannot be removed for structural reasons), to prepare us to later compile with -Wunused warning turned on. * jk/unused-annotation: is_path_owned_by_current_uid(): mark "report" parameter as unused run-command: mark unused async callback parameters mark unused read_tree_recursive() callback parameters hashmap: mark unused callback parameters config: mark unused callback parameters streaming: mark unused virtual method parameters transport: mark bundle transport_options as unused refs: mark unused virtual method parameters refs: mark unused reflog callback parameters refs: mark unused each_ref_fn parameters git-compat-util: add UNUSED macro	2022-09-14 12:56:39 -07:00
Junio C Hamano	655e494047	Merge branch 'jk/rev-list-verify-objects-fix' "git rev-list --verify-objects" ought to inspect the contents of objects and notice corrupted ones, but it didn't when the commit graph is in use, which has been corrected. * jk/rev-list-verify-objects-fix: rev-list: disable commit graph with --verify-objects lookup_commit_in_graph(): use prepare_commit_graph() to check for graph	2022-09-13 11:38:24 -07:00
Jeff King	d6045294a9	lookup_commit_in_graph(): use prepare_commit_graph() to check for graph We exit early from lookup_commit_in_graph() if the commit_graph pointer is NULL, under the assumption that we don't have a graph to look at. But the graph pointer is lazy-loaded; if no other code happens to have called prepare_commit_graph(), we'll incorrectly assume that one isn't available at all. This has a pretty small performance impact in practice, because the fallback will generally be to call parse_object() instead. That ends up in parse_commit_buffer(), which loads the graph data itself. So the first commit we see won't use the graph, but subsequent ones will. Since using the graph is just an optimization there's generally no user-visible difference, but if you instrument rev-list like so: diff --git a/revision.c b/revision.c index ee702e498a..63c488ffb6 100644 --- a/revision.c +++ b/revision.c @@ -381,6 +381,9 @@ static struct object get_reference(struct rev_info revs, const char name, parsing commit data from disk. */ commit = lookup_commit_in_graph(revs->repo, oid); + warning("%s %s in commit graph", + commit ? "found" : "did not find", + name); if (commit) object = &commit->object; else and run (in git.git): git commit-graph write --reachable git rev-list origin/master origin/next >/dev/null you'll see that we fail to find the first one: warning: did not find origin/master in commit graph warning: found origin/next in commit graph After this patch, you'll see that we find both: warning: found origin/master in commit graph warning: found origin/next in commit graph Even though the performance implication is small here, there are two important reasons to do this: - it's downright confusing if you are hunting a bug triggered by the use of the commit graph. It may or may not trigger depending on the number and ordering of tips you ask for. - prepare_commit_graph() has other policy logic, too. In particular, if we've loaded a commit graph and then disabled the graph via disable_commit_graph(), that should take precedence. I'm not sure if this can trigger bad behavior in practice. The only caller there is upload-pack's deepen_by_rev_list(), which should be avoiding the commit graph for its traversal tips, but probably wasn't before this patch. Whether you could come up with a case where that mattered is unclear. Still, this is obviously the right thing to be doing. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-07 09:44:28 -07:00
Ævar Arnfjörð Bjarmason	5cf88fd8b0	git-compat-util.h: use "UNUSED", not "UNUSED(var)" As reported in [1] the "UNUSED(var)" macro introduced in `2174b8c75d` (Merge branch 'jk/unused-annotation' into next, 2022-08-24) breaks coccinelle's parsing of our sources in files where it occurs. Let's instead partially go with the approach suggested in [2] of making this not take an argument. As noted in [1] "coccinelle" will ignore such tokens in argument lists that it doesn't know about, and it's less of a surprise to syntax highlighters. This undoes the "help us notice when a parameter marked as unused is actually use" part of `9b24034754` (git-compat-util: add UNUSED macro, 2022-08-19), a subsequent commit will further tweak the macro to implement a replacement for that functionality. 1. https://lore.kernel.org/git/220825.86ilmg4mil.gmgdl@evledraar.gmail.com/ 2. https://lore.kernel.org/git/220819.868rnk54ju.gmgdl@evledraar.gmail.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-09-01 10:49:48 -07:00
Jeff King	63e14ee2d6	refs: mark unused each_ref_fn parameters Functions used with for_each_ref(), etc, need to conform to the each_ref_fn interface. But most of them don't need every parameter; let's annotate the unused ones to quiet -Wunused-parameter. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-08-19 12:18:54 -07:00
Junio C Hamano	042159a509	Merge branch 'tb/commit-graph-genv2-upgrade-fix' into maint There was a bug in the codepath to upgrade generation information in commit-graph from v1 to v2 format, which has been corrected. source: <cover.1657667404.git.me@ttaylorr.com> * tb/commit-graph-genv2-upgrade-fix: commit-graph: fix corrupt upgrade from generation v1 to v2 commit-graph: introduce `repo_find_commit_pos_in_graph()` t5318: demonstrate commit-graph generation v2 corruption	2022-08-10 21:52:35 -07:00
Junio C Hamano	312d5b7429	Merge branch 'hx/lookup-commit-in-graph-fix' into maint A corner case bug where lazily fetching objects from a promisor remote resulted in infinite recursion has been corrected. source: <cover.1656593279.git.hanxin.hx@bytedance.com> * hx/lookup-commit-in-graph-fix: t5330: remove run_with_limited_processses() commit-graph.c: no lazy fetch in lookup_commit_in_graph()	2022-08-10 21:52:32 -07:00
Junio C Hamano	37e4bdd5ee	Merge branch 'tb/commit-graph-genv2-upgrade-fix' There was a bug in the codepath to upgrade generation information in commit-graph from v1 to v2 format, which has been corrected. * tb/commit-graph-genv2-upgrade-fix: commit-graph: fix corrupt upgrade from generation v1 to v2 commit-graph: introduce `repo_find_commit_pos_in_graph()` t5318: demonstrate commit-graph generation v2 corruption	2022-08-03 13:36:08 -07:00
Junio C Hamano	36d7bd19cf	Merge branch 'js/commit-graph-parsing-without-repo-settings' API tweak to make it easier to run fuzz testing on commit-graph parser. * js/commit-graph-parsing-without-repo-settings: commit-graph: pass repo_settings instead of repository	2022-07-27 09:16:52 -07:00
Junio C Hamano	99c0d94eaa	Merge branch 'hx/lookup-commit-in-graph-fix' A corner case bug where lazily fetching objects from a promisor remote resulted in infinite recursion has been corrected. * hx/lookup-commit-in-graph-fix: t5330: remove run_with_limited_processses() commit-graph.c: no lazy fetch in lookup_commit_in_graph()	2022-07-19 16:40:16 -07:00
Taylor Blau	7805360b7a	commit-graph: introduce `repo_find_commit_pos_in_graph()` Low-level callers in systems that are adjacent to the commit-graph (like the changed-path Bloom filter code) could benefit from being able to call a function like `parse_commit_in_graph()` without modifying the corresponding commit slab data. This is useful in contexts where that slab data is being used to prepare for an upcoming commit-graph write, where Git must be careful to avoid clobbering any of that data during a read operation. Introduce a low-level variant of `parse_commit_in_graph()` which returns the graph position of a given commit only, without modifying any of the slab data. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-07-15 16:51:39 -07:00
Taylor Blau	a92d8523ce	commit-graph: pass repo_settings instead of repository The parse_commit_graph() function takes a 'struct repository ' pointer, but it only ever accesses config settings (either directly or through the .settings field of the repo struct). Move all relevant config settings into the repo_settings struct, and update parse_commit_graph() and its existing callers so that it takes 'struct repo_settings ' instead. Callers of parse_commit_graph() will now need to call prepare_repo_settings() themselves, or initialize a 'struct repo_settings' directly. Prior to `ab14d0676c` (commit-graph: pass a 'struct repository *' in more places, 2020-09-09), parsing a commit-graph was a pure function depending only on the contents of the commit-graph itself. Commit `ab14d0676c` introduced a dependency on a `struct repository` pointer, and later commits such as `b66d84756f` (commit-graph: respect 'commitGraph.readChangedPaths', 2020-09-09) added dependencies on config settings, which were accessed through the `settings` field of the repository pointer. This field was initialized via a call to `prepare_repo_settings()`. Additionally, this fixes an issue in fuzz-commit-graph: In `44c7e62` (2021-12-06, repo-settings:prepare_repo_settings only in git repos), prepare_repo_settings was changed to issue a BUG() if it is called by a process whose CWD is not a Git repository. The combination of commits mentioned above broke fuzz-commit-graph, which attempts to parse arbitrary fuzzing-engine-provided bytes as a commit graph file. Prior to this change, parse_commit_graph() called prepare_repo_settings(), but since we run the fuzz tests without a valid repository, we are hitting the BUG() from `44c7e62` for every test case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Josh Steadmon <steadmon@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-07-14 15:42:17 -07:00
Han Xin	3a1ea94a49	commit-graph.c: no lazy fetch in lookup_commit_in_graph() The commit-graph is used to opportunistically optimize accesses to certain pieces of information on commit objects, and lookup_commit_in_graph() tries to say "no" when the requested commit does not locally exist by returning NULL, in which case the caller can ask for (which may result in on-demand fetching from a promisor remote) and parse the commit object itself. However, it uses a wrong helper, repo_has_object_file(), to do so. This helper not only checks if an object is mmediately available in the local object store, but also tries to fetch from a promisor remote. But the fetch machinery calls lookup_commit_in_graph(), thus causing an infinite loop. We should make lookup_commit_in_graph() expect that a commit given to it can be legitimately missing from the local object store, by using the has_object_file() helper instead. Signed-off-by: Han Xin <hanxin.hx@bytedance.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-06-30 23:00:32 -07:00
Junio C Hamano	a50036da1a	Merge branch 'tb/cruft-packs' A mechanism to pack unreachable objects into a "cruft pack", instead of ejecting them into loose form to be reclaimed later, has been introduced. * tb/cruft-packs: sha1-file.c: don't freshen cruft packs builtin/gc.c: conditionally avoid pruning objects via loose builtin/repack.c: add cruft packs to MIDX during geometric repack builtin/repack.c: use named flags for existing_packs builtin/repack.c: allow configuring cruft pack generation builtin/repack.c: support generating a cruft pack builtin/pack-objects.c: --cruft with expiration reachable: report precise timestamps from objects in cruft packs reachable: add options to add_unseen_recent_objects_to_traversal builtin/pack-objects.c: --cruft without expiration builtin/pack-objects.c: return from create_object_entry() t/helper: add 'pack-mtimes' test-tool pack-mtimes: support writing pack .mtimes files chunk-format.h: extract oid_version() pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles' pack-mtimes: support reading .mtimes files Documentation/technical: add cruft-packs.txt	2022-06-03 14:30:37 -07:00
Taylor Blau	d9fef9d90d	chunk-format.h: extract oid_version() There are three definitions of an identical function which converts `the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a copy of this function for writing both the commit-graph and multi-pack-index file, and another inline definition used to write the .rev header. Consolidate these into a single definition in chunk-format.h. It's not clear that this is the best header to define this function in, but it should do for now. (Worth noting, the .rev caller expects a 4-byte unsigned, but the other two callers work with a single unsigned byte. The consolidated version uses the latter type, and lets the compiler widen it when required). Another caller will be added in a subsequent patch. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-05-26 15:48:26 -07:00
Junio C Hamano	1b8138fb08	Merge branch 'ab/valgrind-fixes' A bit of test framework fixes with a few fixes to issues found by valgrind. * ab/valgrind-fixes: commit-graph.c: don't assume that stat() succeeds object-file: fix a unpack_loose_header() regression in `3b6a8db3b0` log test: skip a failing mkstemp() test under valgrind tests: using custom GIT_EXEC_PATH breaks --valgrind tests	2022-05-23 14:39:54 -07:00
Junio C Hamano	538dc459a0	Merge branch 'ep/maint-equals-null-cocci' Introduce and apply coccinelle rule to discourage an explicit comparison between a pointer and NULL, and applies the clean-up to the maintenance track. * ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci	2022-05-20 15:26:59 -07:00
Ævar Arnfjörð Bjarmason	7c898554d7	commit-graph.c: don't assume that stat() succeeds Fix code added in `8d84097f96` (commit-graph: expire commit-graph files, 2019-06-18) to check the return value of the stat() system call. Not doing so caused us to use uninitialized memory in the "Bloom generation is limited by --max-new-filters" test in t4216-log-bloom.sh: + rm -f trace.event + pwd + GIT_TRACE2_EVENT=[...]/t/trash directory.t4216-log-bloom/limits/trace.event git commit-graph write --reachable --split=replace --changed-paths --max-new-filters=2 ==24835== Syscall param utimensat(times[0].tv_sec) points to uninitialised byte(s) ==24835== at 0x499E65A: __utimensat64_helper (utimensat.c:34) ==24835== by 0x4999142: utime (utime.c:36) ==24835== by 0x552BE0: mark_commit_graphs (commit-graph.c:2213) ==24835== by 0x550822: write_commit_graph (commit-graph.c:2424) ==24835== by 0x54E3A0: write_commit_graph_reachable (commit-graph.c:1681) ==24835== by 0x4374BB: graph_write (commit-graph.c:269) ==24835== by 0x436F7D: cmd_commit_graph (commit-graph.c:326) ==24835== by 0x407B9A: run_builtin (git.c:465) ==24835== by 0x406651: handle_builtin (git.c:719) ==24835== by 0x407575: run_argv (git.c:786) ==24835== by 0x406410: cmd_main (git.c:917) ==24835== by 0x511F09: main (common-main.c:56) ==24835== Address 0x1ffeffde70 is on thread 1's stack ==24835== in frame #1, created by utime (utime.c:25) ==24835== Uninitialised value was created by a stack allocation ==24835== at 0x552B50: mark_commit_graphs (commit-graph.c:2201) ==24835== [...] error: last command exited with $?=126 not ok 137 - Bloom generation is limited by --max-new-filters This would happen as we stat'd the non-existing ".git/objects/info/commit-graph" file. Let's fix mark_commit_graphs() to check the stat()'s return value, and while we're at it fix another case added in the same commit to do the same. The caller in expire_commit_graphs() would have been less likely to run into this, as it's operating on files it just got from readdir(), but it could still happen due to a race with e.g. a concurrent "rm -rf" of the commit-graph files. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-05-12 15:42:26 -07:00
Junio C Hamano	2b0a58d164	Merge branch 'ep/maint-equals-null-cocci' for maint-2.35 * ep/maint-equals-null-cocci: tree-wide: apply equals-null.cocci contrib/coccinnelle: add equals-null.cocci	2022-05-02 10:06:04 -07:00
Junio C Hamano	afe8a9070b	tree-wide: apply equals-null.cocci Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-05-02 09:50:37 -07:00
Kleber Tarcísio	c0befa0c03	commit-graph: close file before returning NULL There are two reasons that we could return NULL early within load_commit_graph_chain(): 1. The file does not exist, so the file pointer is NULL. 2. The file exists, but is too small to contain a single hash. These were grouped together when the function was first written in `5c84b3396` (commit-graph: load commit-graph chains, 2019-06-18) in order to simplify how the 'chain_name' string is freed. However, the current code leaves a narrow window where the file pointer is not closed when the file exists, but is rejected for being too small. Split out these cases separately to ensure we close the file in this case. Signed-off-by: Kleber Tarcísio <klebertarcisio@yahoo.com.br> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-04-20 13:56:24 -07:00
Junio C Hamano	eb804cd405	Merge branch 'ns/core-fsyncmethod' Replace core.fsyncObjectFiles with two new configuration variables, core.fsync and core.fsyncMethod. * ns/core-fsyncmethod: core.fsync: documentation and user-friendly aggregate options core.fsync: new option to harden the index core.fsync: add configuration parsing core.fsync: introduce granular fsync control infrastructure core.fsyncmethod: add writeout-only mode wrapper: make inclusion of Windows csprng header tightly scoped	2022-03-25 16:38:24 -07:00
Junio C Hamano	38bbb9e990	Merge branch 'ab/string-list-count-in-size-t' Count string_list items in size_t, not "unsigned int". * ab/string-list-count-in-size-t: string-list API: change "nr" and "alloc" to "size_t" gettext API users: don't explicitly cast ngettext()'s "n"	2022-03-16 17:53:09 -07:00
Junio C Hamano	a54cc523ad	Merge branch 'ds/commit-graph-gen-v2-fixes' Fixes to the way generation number v2 in the commit-graph files are (not) handled. * ds/commit-graph-gen-v2-fixes: commit-graph: declare bankruptcy on GDAT chunks commit-graph: fix generation number v2 overflow values commit-graph: start parsing generation v2 (again) commit-graph: fix ordering bug in generation numbers t5318: extract helpers to lib-commit-graph.sh test-read-graph: include extra post-parse info	2022-03-16 17:53:09 -07:00
Neeraj Singh	020406eaa5	core.fsync: introduce granular fsync control infrastructure This commit introduces the infrastructure for the core.fsync configuration knob. The repository components we want to sync are identified by flags so that we can turn on or off syncing for specific components. If core.fsyncObjectFiles is set and the core.fsync configuration also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any loose objects. This picks the strictest data integrity behavior if core.fsync and core.fsyncObjectFiles are set to conflicting values. This change introduces the currently unused fsync_component helper, which will be used by a later patch that adds fsyncing to the refs backend. Actual configuration and documentation of the fsync components list are in other patches in the series to separate review of the underlying mechanism from the policy of how it's configured. Helped-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Neeraj Singh <neerajsi@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-10 15:10:22 -08:00
Ævar Arnfjörð Bjarmason	99d60545f8	string-list API: change "nr" and "alloc" to "size_t" Change the "nr" and "alloc" members of "struct string_list" to use "size_t" instead of "nr". On some platforms the size of an "unsigned int" will be smaller than a "size_t", e.g. a 32 bit unsigned v.s. 64 bit unsigned. As "struct string_list" is a generic API we use in a lot of places this might cause overflows. As one example: code in "refs.c" keeps track of the number of refs with a "size_t", and auxiliary code in builtin/remote.c in get_ref_states() appends those to a "struct string_list". While we're at it split the "nr" and "alloc" in string-list.h across two lines, which is the case for most such struct member declarations (e.g. in "strbuf.h" and "strvec.h"). Changing e.g. "int i" to "size_t i" in run_and_feed_hook() isn't strictly necessary, and there are a lot more cases where we'll use a local "int", "unsigned int" etc. variable derived from the "nr" in the "struct string_list". But in that case as well as add_wrapped_shortlog_msg() in builtin/shortlog.c we need to adjust the printf format referring to "nr" anyway, so let's also change the other variables referring to it. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-07 12:02:04 -08:00
Derrick Stolee	6dbf4b8172	commit-graph: declare bankruptcy on GDAT chunks The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks store corrected commit date offsets, used for generation number v2. Recent changes have demonstrated that previous versions of Git were incorrectly parsing data from these chunks, but might have also been writing them incorrectly. I asserted [1] that the previous fixes were sufficient because the known reasons for incorrectly writing generation number v2 data relied on parsing the information incorrectly out of a commit-graph file, but the previous versions of Git were not reading the generation number v2 data. However, Patrick demonstrated [2] a case where in split commit-graphs across an alternate boundary (and possibly some other special conditions) it was possible to have a commit-graph that was generated by a previous version of Git have incorrect generation number v2 data which results in errors like the following: commit-graph generation for commit <oid> is 1623273624 < 1623273710 [1] https://lore.kernel.org/git/f50e74f0-9ffa-f4f2-4663-269801495ed3@github.com/ [2] https://lore.kernel.org/git/Yh93vOkt2DkrGPh2@ncase/ Clearly, there is something else going on. The situation is not completely understood, but the errors do not reproduce if the commit-graphs are all generated by a Git version including these recent fixes. If we cannot trust the existing data in the GDAT and GDOV chunks, then we can alter the format to change the chunk IDs for these chunks. This causes the new version of Git to silently ignore the older chunks (and disabling generation number v2 in the process) while writing new commit-graph files with correct data in the GDA2 and GDO2 chunks. Update commit-graph-format.txt including a historical note about these deprecated chunks. Reported-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-07 09:17:03 -08:00
Ævar Arnfjörð Bjarmason	ef3fe21448	lockfile API users: simplify and don't leak "path" Fix a memory leak in code added in `6c622f9f0b` (commit-graph: write commit-graph chains, 2019-06-18). We needed to free the "lock_name" if we encounter errors, and the "graph_name" after we'd run unlink() on it. For the case of write_commit_graph_file() refactoring the code to free the "lock_name" after we were done using the "struct lock_file lk" would have made the control flow more complex. Luckily we can free the "lock_file" right after the hold_lock_file_for_update() call, if it makes use of "path" at all it'll have copied its contents to a "struct strbuf" of its own. While I'm at it let's fix code added in `fb10ca5b54` (sparse-checkout: write using lockfile, 2019-11-21) in write_patterns_and_update() to avoid the same complexity that I thought I needed when I wrote the initial fix for write_commit_graph_file(). We can free the "sparse_filename" right after calling hold_lock_file_for_update(), we don't need to wait until we're exiting the function to do so. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-04 13:24:19 -08:00
Ævar Arnfjörð Bjarmason	51a94d8ffe	commit-graph: stop fill_oids_from_packs() progress on error and free() Fix a bug in fill_oids_from_packs(), we should always stop_progress(), but did not do so if we returned an error here. This also plugs a memory leak in those cases by releasing the two "struct strbuf" variables the function uses. While I'm at it stop hardcoding "-1" here and just use the return value of error() instead, which happens to be "-1". Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-04 13:24:19 -08:00
Ævar Arnfjörð Bjarmason	4a0479086a	commit-graph: fix memory leak in misused string_list API When this code was migrated to the string_list API in `d88b14b3fd` (commit-graph: use string-list API for input, 2018-06-27) it was made to use used both STRING_LIST_INIT_NODUP and a strbuf_detach() pattern. Those should not be used together if string_list_clear() is expected to free the memory, instead we need to either use STRING_LIST_INIT_DUP with a string_list_append_nodup(), or a STRING_LIST_INIT_NODUP and manually fiddle with the "strdup_strings" member before calling string_list_clear(). Let's do the former. Since "strdup_strings = 1" is set now other code might be broken by relying on "pack_indexes" not to duplicate it strings, but that doesn't happen. When we pass this down to write_commit_graph() that code uses the "struct string_list" without modifying it. Let's add a "const" to the variable to have the compiler enforce that assumption. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-04 13:24:18 -08:00
Derrick Stolee	c8d67b9a68	commit-graph: fix generation number v2 overflow values The Generation Data Chunk was implemented and tested in `e8b63005c` (commit-graph: implement generation data chunk, 2021-01-16), but the test was carefully constructed to work on systems with 32-bit dates. Since the corrected commit date offsets still required more than 31 bits, this triggered writing the generation_data_overflow chunk. However, upon closer look, the write_graph_chunk_generation_data_overflow() method writes the offsets to the chunk (as dictated by the format) but fill_commit_graph_info() treats the value in the chunk as if it is the full corrected commit date (not an offset). For some reason, this does not cause an issue when using the FUTURE_DATE specified in t5318-commit-graph.sh, but it does show up as a failure in 'git commit-graph verify' if we increase that FUTURE_DATE to be above four billion. Fix this error and create a 64-bit timestamp version of the test so we can test these larger values. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-01 12:15:06 -08:00
Derrick Stolee	3b0199d4c3	commit-graph: start parsing generation v2 (again) The 'read_generation_data' member of 'struct commit_graph' was introduced by `1fdc383c5` (commit-graph: use generation v2 only if entire chain does, 2021-01-16). The intention was to avoid using corrected commit dates if not all layers of a commit-graph had that data stored. The logic in validate_mixed_generation_chain() at that point incorrectly initialized read_generation_data to 1 if and only if the tip commit-graph contained the Corrected Commit Date chunk. This was "fixed" in `448a39e65` (commit-graph: validate layers for generation data, 2021-02-02) to validate that read_generation_data was either non-zero for all layers, or it would set read_generation_data to zero for all layers. The problem here is that read_generation_data is not initialized to be non-zero anywhere! This change initializes read_generation_data immediately after the chunk is parsed, so each layer will have its value present as soon as possible. The read_generation_data member is used in fill_commit_graph_info() to determine if we should use the corrected commit date or the topological levels stored in the Commit Data chunk. Due to this bug, all previous versions of Git were defaulting to topological levels in all cases! This can be measured with some performance tests. Using the Linux kernel as a testbed, I generated a complete commit-graph containing corrected commit dates and tested the 'new' version against the previous, 'old' version. First, rev-list with --topo-order demonstrates a 26% improvement using corrected commit dates: hyperfine \ -n "old" "$OLD_GIT rev-list --topo-order -1000 v3.6" \ -n "new" "$NEW_GIT rev-list --topo-order -1000 v3.6" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 57.1 ms ± 3.1 ms Range (min … max): 52.9 ms … 62.0 ms 55 runs Benchmark 2: new Time (mean ± σ): 45.5 ms ± 3.3 ms Range (min … max): 39.9 ms … 51.7 ms 59 runs Summary 'new' ran 1.26 ± 0.11 times faster than 'old' These performance improvements are due to the algorithmic improvements given by walking fewer commits due to the higher cutoffs from corrected commit dates. However, this comes at a cost. The additional I/O cost of parsing the corrected commit dates is visible in case of merge-base commands that do not reduce the overall number of walked commits. hyperfine \ -n "old" "$OLD_GIT merge-base v4.8 v4.9" \ -n "new" "$NEW_GIT merge-base v4.8 v4.9" \ --warmup=10 Benchmark 1: old Time (mean ± σ): 110.4 ms ± 6.4 ms Range (min … max): 96.0 ms … 118.3 ms 25 runs Benchmark 2: new Time (mean ± σ): 150.7 ms ± 1.1 ms Range (min … max): 149.3 ms … 153.4 ms 19 runs Summary 'old' ran 1.36 ± 0.08 times faster than 'new' Performance issues like this are what motivated `702110aac` (commit-graph: use config to specify generation type, 2021-02-25). In the future, we could fix this performance problem by inserting the corrected commit date offsets into the Commit Date chunk instead of having that data in an extra chunk. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-01 12:15:06 -08:00
Derrick Stolee	75979d9460	commit-graph: fix ordering bug in generation numbers When computing the generation numbers for a commit-graph, we compute the corrected commit dates and then check if their offsets from the actual dates is too large to fit in the 32-bit Generation Data chunk. However, there is a problem with this approach: if we have parsed the generation data from the previous commit-graph, then we continue the loop because the corrected commit date is already computed. This causes an under-count in the number of overflow values. It is incorrect to add an increment to num_generation_data_overflows next to this 'continue' statement, because we might start double-counting commits that are computed because of the depth-first search walk from a commit with an earlier OID. Instead, iterate over the full commit list at the end, checking the offsets to see how many grow beyond the maximum value. Create a new t5328-commit-graph-64-bit-time.sh test script to handle special cases of testing 64-bit timestamps. This helps demonstrate this bug in more cases. It still won't hit all potential cases until the next change, which reenables reading generation numbers. Use the skip_all trick from `0a2bfccb9c` (t0051: use "skip_all" under !MINGW in single-test file, 2022-02-04) to make the output clean when run on a 32-bit system. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-03-01 12:14:57 -08:00
Lessley Dennington	0803f9c7cd	commit-graph: return if there is no git directory Return early if git directory does not exist. This will protect against test failures in the upcoming change to BUG in prepare_repo_settings if no git directory exists. Signed-off-by: Lessley Dennington <lessleydennington@gmail.com> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-12-06 09:55:05 -08:00
Junio C Hamano	c042ad5ad5	Merge branch 'js/run-command-close-packs' The run-command API has been updated so that the callers can easily ask the file descriptors open for packfiles to be closed immediately before spawning commands that may trigger auto-gc. * js/run-command-close-packs: Close object store closer to spawning child processes run_auto_maintenance(): implicitly close the object store run-command: offer to close the object store before running run-command: prettify the `RUN_COMMAND_*` flags pull: release packs before fetching commit-graph: when closing the graph, also release the slab	2021-09-20 15:20:45 -07:00
Junio C Hamano	df0c308c1a	Merge branch 'ab/progress-users-adjust-counters' The code to show progress indicator in a few code paths did not cover between 0-100%, which has been corrected. * ab/progress-users-adjust-counters: entry: show finer-grained counter in "Filtering content" progress line commit-graph: fix bogus counter in "Scanning merged commits" progress line	2021-09-20 15:20:41 -07:00
SZEDER Gábor	4011224944	commit-graph: fix bogus counter in "Scanning merged commits" progress line The final value of the counter of the "Scanning merged commits" progress line is always one less than its expected total, e.g.: Scanning merged commits: 83% (5/6), done. This happens because while iterating over an array the loop variable is passed to display_progress() as-is, but while C arrays (and thus the loop variable) start at 0 and end at N-1, the progress counter must end at N. Fix this by passing 'i + 1' to display_progress(), like most other callsites do. There's an RFC series to add a GIT_TEST_CHECK_PROGRESS=1 mode[1] which catches this issue in the 'fetch.writeCommitGraph' and 'fetch.writeCommitGraph with submodules' tests in 't5510-fetch.sh'. The GIT_TEST_CHECK_PROGRESS=1 mode is not part of this series, but future changes to progress.c may add it or similar assertions to catch this and similar bugs elsewhere. 1. https://lore.kernel.org/git/20210620200303.2328957-1-szeder.dev@gmail.com/ Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-09-09 09:58:19 -07:00
Johannes Schindelin	957ba814bf	commit-graph: when closing the graph, also release the slab The slab has information about the commit graph. That means that it is meaningless (and even misleading) when the commit graph was closed. This seems not to matter currently, but we're about to fix a Windows-specific bug where `git pull` does not close the object store before fetching (risking that an implicit auto-gc fails to remove the now-obsolete pack file(s)), and once we have that bug fix in place, it does matter: after that bug fix, we will open the object store, do some stuff with it, then close it, fetch, and then open it again, and do more stuff. If we close the commit graph without releasing the corresponding slab, we're hit by a symptom like this in t5520.19: BUG: commit-reach.c:85: bad generation skip 9223372036854775807 > 3 at 5cd378271655d43a3b4477520014f02213ad1546 Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-09-08 12:17:14 -07:00
Patrick Steinhardt	f559d6d45e	revision: avoid hitting packfiles when commits are in commit-graph When queueing references in git-rev-list(1), we try to optimize parsing of commits via the commit-graph. To do so, we first look up the object's type, and if it is a commit we call `repo_parse_commit()` instead of `parse_object()`. This is quite inefficient though given that we're always uncompressing the object header in order to determine the type. Instead, we can opportunistically search the commit-graph for the object ID: in case it's found, we know it's a commit and can directly fill in the commit object without having to uncompress the object header. Expose a new function `lookup_commit_in_graph()`, which tries to find a commit in the commit-graph by ID, and convert `get_reference()` to use this function. This provides a big performance win in cases where we load references in a repository with lots of references pointing to commits. The following has been executed in a real-world repository with about 2.2 million refs: Benchmark #1: HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 4.458 s ± 0.044 s [User: 4.115 s, System: 0.342 s] Range (min … max): 4.409 s … 4.534 s 10 runs Benchmark #2: HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev Time (mean ± σ): 3.089 s ± 0.015 s [User: 2.768 s, System: 0.321 s] Range (min … max): 3.061 s … 3.105 s 10 runs Summary 'HEAD: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' ran 1.44 ± 0.02 times faster than 'HEAD~: rev-list --unsorted-input --objects --quiet --not --all --not $newrev' Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-08-09 09:51:12 -07:00
Patrick Steinhardt	809ea28f80	commit-graph: split out function to search commit position The function `find_commit_in_graph()` assumes that the caller has passed an object which was already determined to be a commit given that it will access the commit's graph position, which is stored in a commit slab. In a subsequent patch, we want to search for an object ID though without knowing whether it is a commit or not, which is not currently possible. Split out the logic to search the commit graph for a given object ID to prepare for this change. This commit also renames the function to `find_commit_pos_in_graph()`, which more accurately reflects what this function does. Furthermore, in order to allow for the searched object ID to be const, we need to adjust `bsearch_graph()`'s signature to accept a constant object ID as input, too. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-08-09 09:51:12 -07:00
Junio C Hamano	dd6d3c90ee	Merge branch 'ab/attribute-format' Many "printf"-like helper functions we have have been annotated with __attribute__() to catch placeholder/parameter mismatches. * ab/attribute-format: advice.h: add missing __attribute__((format)) & fix usage .h: add a few missing __attribute__((format)) .c static functions: add missing __attribute__((format)) sequencer.c: move static function to avoid forward decl *.c static functions: don't forward-declare __attribute__	2021-07-28 13:17:59 -07:00
Ævar Arnfjörð Bjarmason	48ca53cac4	*.c static functions: add missing __attribute__((format)) Add missing __attribute__((format)) function attributes to various "static" functions that take printf arguments. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-07-13 15:20:20 -07:00
Taylor Blau	15316a4732	commit-graph: rewrite to use checksum_valid() Rewrite an existing caller in `git commit-graph verify` to take advantage of checksum_valid(). Note that the replacement isn't a verbatim cut-and-paste, since the new function avoids using hashfile at all and instead talks to the_hash_algo directly, but it is functionally equivalent. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-06-28 20:36:17 -07:00
brian m. carlson	72871b132c	commit-graph: don't store file hashes as struct object_id The idea behind struct object_id is that it is supposed to represent the identifier of a standard Git object or a special pseudo-object like the all-zeros object ID. In this case, we have file hashes, which, while similar, are distinct from the identifiers of objects. Switch these code paths to use an unsigned char array. This is both more logically consistent and it means that we need not set the algorithm identifier for the struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-27 16:31:39 +09:00
brian m. carlson	92e2cab96b	Always use oidread to read into struct object_id In the future, we'll want oidread to automatically set the hash algorithm member for an object ID we read into it, so ensure we use oidread instead of hashcpy everywhere we're copying a hash value into a struct object_id. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-27 16:31:38 +09:00
Junio C Hamano	d20fa3cf9d	Merge branch 'ds/commit-graph-generation-config' A new configuration variable has been introduced to allow choosing which version of the generation number gets used in the commit-graph file. * ds/commit-graph-generation-config: commit-graph: use config to specify generation type commit-graph: create local repository pointer	2021-03-22 14:00:23 -07:00
René Scharfe	ca56dadb4b	use CALLOC_ARRAY Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-13 16:00:09 -08:00
Junio C Hamano	660dd97a62	Merge branch 'ds/chunked-file-api' The common code to deal with "chunked file format" that is shared by the multi-pack-index and commit-graph files have been factored out, to help codepaths for both filetypes to become more robust. * ds/chunked-file-api: commit-graph.c: display correct number of chunks when writing chunk-format: add technical docs chunk-format: restore duplicate chunk checks midx: use 64-bit multiplication for chunk sizes midx: use chunk-format read API commit-graph: use chunk-format read API chunk-format: create read chunk API midx: use chunk-format API in write_midx_internal() midx: drop chunk progress during write midx: return success/failure in chunk write methods midx: add num_large_offsets to write_midx_context midx: add pack_perm to write_midx_context midx: add entries to write_midx_context midx: use context in write_midx_pack_names() midx: rename pack_info to write_midx_context commit-graph: use chunk-format write API chunk-format: create chunk format write API commit-graph: anonymize data in chunk_write_fn	2021-03-01 14:02:57 -08:00
Junio C Hamano	c0b27e3964	Merge branch 'js/commit-graph-warning' * js/commit-graph-warning: Revert "commit-graph: when incompatible with graphs, indicate why"	2021-03-01 09:21:24 -08:00
Junio C Hamano	cdc986a7c2	Revert "commit-graph: when incompatible with graphs, indicate why" This reverts commit `c85eec7fc3`, as it is a bit overzealous, we are in prerelease freeze, and we want to have enough time to get this right and cook in 'next'. cf. <8735xgkvuo.fsf@evledraar.gmail.com>	2021-03-01 09:19:37 -08:00
Derrick Stolee	702110aac6	commit-graph: use config to specify generation type We have two established generation number versions: 1: topological levels 2: corrected commit dates The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks. Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries. Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version. If this value is less than 2, then the GDAT chunk is never written _or read_ from an existing file. This can replace our use of the GIT_TEST_COMMIT_GRAPH_NO_GDAT environment variable in the test suite. Remove it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-25 15:10:41 -08:00
Derrick Stolee	c7ef8fe608	commit-graph: create local repository pointer The write_commit_graph() method uses 'the_repository' in a few places. A new need for a repository pointer is coming in the following change, so group these instances into a local variable 'r' that could eventually become part of the method signature, if so desired. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-25 15:10:40 -08:00
Taylor Blau	c4ff24bbb3	commit-graph.c: display correct number of chunks when writing When writing a commit-graph, a progress meter is shown which indicates the number of pieces of data to write (one per commit in each chunk). In `47410aa837` (commit-graph: use chunk-format write API, 2021-02-18), the number of chunks became tracked by the new chunk-format API. But a stray local variable was left behind from when write_commit_graph_file() used to keep track of the same. Since this was no longer updated after `47410aa837`, the progress meter appeared broken: $ git commit-graph write --reachable Expanding reachable commits in commit graph: 837569, done. Writing out commit graph in 3 passes: 166% (4187845/2512707), done. Drop the local variable and rely instead on the chunk-format API to tell us the correct number of chunks. Reported-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-24 11:44:34 -08:00
Andrzej Hunt	bf4bb9f9f5	commit-graph: avoid leaking topo_levels slab in write_commit_graph() write_commit_graph initialises topo_levels using init_topo_level_slab(), next it calls compute_topological_levels() which can cause the slab to grow, we therefore need to clear the slab again using clear_topo_level_slab() when we're done. First introduced in `72a2bfca` (commit-graph: add a slab to store topological levels, 2021-01-16). LeakSanitizer output: ==1026==ERROR: LeakSanitizer: detected memory leaks Direct leak of 8 byte(s) in 1 object(s) allocated from: #0 0x498ae9 in realloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0xafbed8 in xrealloc /src/git/wrapper.c:126:8 #2 0x7966d1 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1 #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1 #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12 #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2 #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6 #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11 #8 0x4cddb1 in run_builtin /src/git/git.c:453:11 #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3 #10 0x4cd084 in run_argv /src/git/git.c:771:4 #11 0x4ca424 in cmd_main /src/git/git.c:902:19 #12 0x707fb6 in main /src/git/common-main.c:52:11 #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f) Indirect leak of 524256 byte(s) in 1 object(s) allocated from: #0 0x498942 in calloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0xafc088 in xcalloc /src/git/wrapper.c:140:8 #2 0x796870 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1 #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1 #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12 #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2 #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6 #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11 #8 0x4cddb1 in run_builtin /src/git/git.c:453:11 #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3 #10 0x4cd084 in run_argv /src/git/git.c:771:4 #11 0x4ca424 in cmd_main /src/git/git.c:902:19 #12 0x707fb6 in main /src/git/common-main.c:52:11 #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f) SUMMARY: AddressSanitizer: 524264 byte(s) leaked in 2 allocation(s). Signed-off-by: Andrzej Hunt <ajrhunt@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-22 13:45:01 -08:00
Derrick Stolee	2692c2f6fd	commit-graph: use chunk-format read API Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). While the current implementation loses the duplicate-chunk detection, that will be added in a future change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-18 13:38:16 -08:00
Derrick Stolee	47410aa837	commit-graph: use chunk-format write API The commit-graph write logic is ready to make use of the chunk-format write API. Each chunk write method is already in the correct prototype. We only need to use the 'struct chunkfile' pointer and the correct API calls. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-18 13:38:16 -08:00
Junio C Hamano	726b11d68a	Merge branch 'js/commit-graph-warning' When certain features (e.g. grafts) used in the repository are incompatible with the use of the commit-graph, we used to silently turned commit-graph off; we now tell the user what we are doing. * js/commit-graph-warning: commit-graph: when incompatible with graphs, indicate why	2021-02-17 17:21:42 -08:00
Junio C Hamano	5bd0b21bf7	Merge branch 'ds/commit-graph-genno-fix' Fix incremental update of commit-graph file around corrected commit date data. * ds/commit-graph-genno-fix: commit-graph: prepare commit graph commit-graph: be extra careful about mixed generations commit-graph: compute generations separately commit-graph: validate layers for generation data commit-graph: always parse before commit_graph_data_at() commit-graph: use repo_parse_commit	2021-02-17 17:21:40 -08:00
Junio C Hamano	8b4701ae4f	Merge branch 'ak/corrected-commit-date' The commit-graph learned to use corrected commit dates instead of the generation number to help topological revision traversal. * ak/corrected-commit-date: doc: add corrected commit date info commit-reach: use corrected commit dates in paint_down_to_common() commit-graph: use generation v2 only if entire chain does commit-graph: implement generation data chunk commit-graph: implement corrected commit date commit-graph: return 64-bit generation number commit-graph: add a slab to store topological levels t6600-test-reach: generalize *_three_modes commit-graph: consolidate fill_commit_graph_info revision: parse parent in indegree_walk_step() commit-graph: fix regression when computing Bloom filters	2021-02-17 17:21:40 -08:00
Johannes Schindelin	c85eec7fc3	commit-graph: when incompatible with graphs, indicate why When `gc.writeCommitGraph = true`, it is possible that the commit-graph is _still_ not written: replace objects, grafts and shallow repositories are incompatible with the commit-graph feature. Under such circumstances, we need to indicate to the user why the commit-graph was not written instead of staying silent about it. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Acked-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-11 09:33:01 -08:00
Junio C Hamano	9d5b1c06ac	Merge branch 'jk/use-oid-pos' Code clean-up to ensure our use of hashtables using object names as keys use the "struct object_id" objects, not the raw hash values. * jk/use-oid-pos: oid_pos(): access table through const pointers hash_pos(): convert to oid_pos() rerere: use strmap to store rerere directories rerere: tighten rr-cache dirname check rerere: check dirname format while iterating rr_cache directory commit_graft_pos(): take an oid instead of a bare hash	2021-02-10 14:48:31 -08:00
Derrick Stolee	eb9071912f	commit-graph: anonymize data in chunk_write_fn In preparation for creating an API around file formats using chunks and tables of contents, prepare the commit-graph write code to use prototypes that will match this new API. Specifically, convert chunk_write_fn to take a "void *data" parameter instead of the commit-graph-specific "struct write_commit_graph_context" pointer. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-05 15:40:41 -08:00
Junio C Hamano	973e20b83f	Merge branch 'jk/peel-iterated-oid' The peel_ref() API has been replaced with peel_iterated_oid(). * jk/peel-iterated-oid: refs: switch peel_ref() to peel_iterated_oid()	2021-02-03 15:04:49 -08:00
Derrick Stolee	bc50d6c91f	commit-graph: prepare commit graph Before checking if the repository has a commit-graph loaded, be sure to run prepare_commit_graph(). This is necessary because otherwise the topo_levels slab is not initialized. As we compute topo_levels for the new commits, we iterate further into the lower layers since the first visit to each commit looks as though the topo_level is not populated. By properly initializing the topo_slab, we fix the previously broken case of a split commit graph where a base layer has the generation_data_overflow chunk. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:36 -08:00
Derrick Stolee	fde55b0906	commit-graph: be extra careful about mixed generations When upgrading to a commit-graph with corrected commit dates from one without, there are a few things that need to be considered. When computing generation numbers for the new commit-graph file that expects to add the generation_data chunk with corrected commit dates, we need to ensure that the 'generation' member of the commit_graph_data struct is set to zero for these commits. Unfortunately, the fallback to use topological level for generation number when corrected commit dates are not available are causing us harm here: parsing commits notices that read_generation_data is false and populates 'generation' with the topological level. The solution is to iterate through the commits, parse the commits to populate initial values, then reset the generation values to zero to trigger recalculation. This loop only occurs when the existing commit-graph data has no corrected commit dates. While this improves our situation somewhat, we have not completely solved the issue for correctly computing generation numbers for mixed layers. That follows in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:36 -08:00
Derrick Stolee	9c2c0a8256	commit-graph: compute generations separately The compute_generation_numbers() method was introduced by `3258c663` (commit-graph: compute generation numbers, 2018-05-01) to compute what is now known as "topological levels". These are still stored in the commit-graph file for compatibility sake while `c1a09119` (commit-graph: implement corrected commit date, 2021-01-16) updated the method to also compute the new version of generation numbers: corrected commit date. It makes sense why these are grouped. They perform very similar walks of the necessary commits and compute similar maximums over each parent. However, having these two together conflates them in subtle ways that is hard to separate. In particular, the topo_level slab is used to store the topological levels in all cases, but the commit_graph_data_at(c)->generation member stores different values depending on the state of the existing commit-graph file. * If the existing commit-graph file has a "GDAT" chunk, then these values represent corrected commit dates. * If the existing commit-graph file doesn't have a "GDAT" chunk, then these values are actually the topological levels. This issue only occurs only when upgrading an existing commit-graph file into one that has the "GDAT" chunk. The current change does not resolve this upgrade problem, but splitting the implementation into two pieces here helps with that process, which will follow in the next change. The important thing this helps with is the case where the num_generation_data_overflows was being incremented incorrectly, triggering a write of the overflow chunk. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:36 -08:00
Derrick Stolee	448a39e65d	commit-graph: validate layers for generation data We need to be extra careful that we don't use corrected commit dates from any layer of a commit-graph chain if there is a single commit-graph file that is missing the generation_data chunk. Update validate_mixed_generation_chain() to correctly update each layer to ignore the generation_data chunk in this case. It now also returns 1 if all layers have a generation_data chunk. This return value will be used in the next change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:36 -08:00
Derrick Stolee	90cb1c47c7	commit-graph: always parse before commit_graph_data_at() There is a subtle failure happening when computing corrected commit dates with --split enabled. It requires a base layer needing the generation_data_overflow chunk. Then, the next layer on top erroneously thinks it needs an overflow chunk due to a bug leading to recalculating all reachable generation numbers. The output of the failure is BUG: commit-graph.c:1912: expected to write 8 bytes to chunk 47444f56, but wrote 0 instead These "expected" 8 bytes are due to re-computing the corrected commit date for the lower layer but the new layer does not need any overflow. Add a test to t5318-commit-graph.sh that demonstrates this bug. However, it does not trigger consistently with the existing code. The generation number data is stored in a slab and accessed by commit_graph_data_at(). This data is initialized when parsing a commit, but is otherwise used assuming it has been populated. The loop in compute_generation_numbers() did not enforce that all reachable commits were parsed and had correct values. This could lead to some problems when writing a commit-graph with corrected commit dates based on a commit-graph without them. It has been difficult to identify the issue here because it was so hard to reproduce. It relies on this uninitialized data having a non-zero value, but also on specifically in a way that overwrites the existing data. This patch adds the extra parse to ensure the data is filled before we compute the generation number of a commit. This triggers the new test to fail because the generation number overflow count does not match between this computation and the write for that chunk. The actual fix will follow as the next few changes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:36 -08:00
Derrick Stolee	c4cc083169	commit-graph: use repo_parse_commit The write_commit_graph_context has a repository pointer, so use it. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-02-01 21:03:35 -08:00
Jeff King	8380dcd700	oid_pos(): access table through const pointers When we are looking up an oid in an array, we obviously don't need to write to the array. Let's mark it as const in the function interfaces, as well as in the local variables we use to derference the void pointer (note a few cases use pointers-to-pointers, so we mark everything const). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-28 12:03:26 -08:00
Jeff King	45ee13b942	hash_pos(): convert to oid_pos() All of our callers are actually looking up an object_id, not a bare hash. Likewise, the arrays they are looking in are actual arrays of object_id (not just raw bytes of hashes, as we might find in a pack .idx; those are handled by bsearch_hash()). Using an object_id gives us more type safety, and makes the callers slightly shorter. It also gets rid of the word "sha1" from several access functions, though we could obviously also rename those with s/sha1/hash/. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-28 12:02:39 -08:00
Junio C Hamano	58e2ce9112	Merge branch 'ma/more-opaque-lock-file' Code clean-up. * ma/more-opaque-lock-file: read-cache: try not to peek into `struct {lock_,temp}file` refs/files-backend: don't peek into `struct lock_file` midx: don't peek into `struct lock_file` commit-graph: don't peek into `struct lock_file` builtin/gc: don't peek into `struct lock_file`	2021-01-25 14:19:17 -08:00
Jeff King	36a317929b	refs: switch peel_ref() to peel_iterated_oid() The peel_ref() interface is confusing and error-prone: - it's typically used by ref iteration callbacks that have both a refname and oid. But since they pass only the refname, we may load the ref value from the filesystem again. This is inefficient, but also means we are open to a race if somebody simultaneously updates the ref. E.g., this: int some_ref_cb(const char refname, const struct object_id oid, ...) { if (!peel_ref(refname, &peeled)) printf("%s peels to %s", oid_to_hex(oid), oid_to_hex(&peeled); } could print nonsense. It is correct to say "refname peels to..." (you may see the "before" value or the "after" value, either of which is consistent), but mentioning both oids may be mixing before/after values. Worse, whether this is possible depends on whether the optimization to read from the current iterator value kicks in. So it is actually not possible with: for_each_ref(some_ref_cb); but it _is_ possible with: head_ref(some_ref_cb); which does not use the iterator mechanism (though in practice, HEAD should never peel to anything, so this may not be triggerable). - it must take a fully-qualified refname for the read_ref_full() code path to work. Yet we routinely pass it partial refnames from callbacks to for_each_tag_ref(), etc. This happens to work when iterating because there we do not call read_ref_full() at all, and only use the passed refname to check if it is the same as the iterator. But the requirements for the function parameters are quite unclear. Instead of taking a refname, let's instead take an oid. That fixes both problems. It's a little funny for a "ref" function not to involve refs at all. The key thing is that it's optimizing under the hood based on having access to the ref iterator. So let's change the name to make it clear why you'd want this function versus just peel_object(). There are two other directions I considered but rejected: - we could pass the peel information into the each_ref_fn callback. However, we don't know if the caller actually wants it or not. For packed-refs, providing it is essentially free. But for loose refs, we actually have to peel the object, which would be wasteful in most cases. We could likewise pass in a flag to the callback indicating whether the peeled information is known, but that complicates those callbacks, as they then have to decide whether to manually peel themselves. Plus it requires changing the interface of every callback, whether they care about peeling or not, and there are many of them. - we could make a function to return the peeled value of the current iterated ref (computing it if necessary), and BUG() otherwise. I.e.: int peel_current_iterated_ref(struct object_id *out); Each of the current callers is an each_ref_fn callback, so they'd mostly be happy. But: - we use those callbacks with functions like head_ref(), which do not use the iteration code. So we'd need to handle the fallback case there, anyway. - it's possible that a caller would want to call into generic code that sometimes is used during iteration and sometimes not. This encapsulates the logic to do the fast thing when possible, and fallback when necessary. The implementation is mostly obvious, but I want to call out a few things in the patch: - the test-tool coverage for peel_ref() is now meaningless, as it all collapses to a single peel_object() call (arguably they were pretty uninteresting before; the tricky part of that function is the fast-path we see during iteration, but these calls didn't trigger that). I've just dropped it entirely, though note that some other tests relied on the tags we created; I've moved that creation to the tests where it matters. - we no longer need to take a ref_store parameter, since we'd never look up a ref now. We do still rely on a global "current iterator" variable which _could_ be kept per-ref-store. But in practice this is only useful if there are multiple recursive iterations, at which point the more appropriate solution is probably a stack of iterators. No caller used the actual ref-store parameter anyway (they all call the wrapper that passes the_repository). - the original only kicked in the optimization when the "refname" pointer matched (i.e., not string comparison). We do likewise with the "oid" parameter here, but fall back to doing an actual oideq() call. This in theory lets us kick in the optimization more often, though in practice no current caller cares. It should never be wrong, though (peeling is a property of an object, so two refs pointing to the same object would peel identically). - the original took care not to touch the peeled out-parameter unless we found something to put in it. But no caller cares about this, and anyway, it is enforced by peel_object() itself (and even in the optimized iterator case, that's where we eventually end up). We can shorten the code and avoid an extra copy by just passing the out-parameter through the stack. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-21 15:51:31 -08:00
Abhishek Kumar	8d00d7c3df	commit-reach: use corrected commit dates in paint_down_to_common() `091f4cf` (commit: don't use generation numbers if not needed, 2018-08-30) changed paint_down_to_common() to use commit dates instead of generation numbers v1 (topological levels) as the performance regressed on certain topologies. With generation number v2 (corrected commit dates) implemented, we no longer have to rely on commit dates and can use generation numbers. For example, the command `git merge-base v4.8 v4.9` on the Linux repository walks 167468 commits, taking 0.135s for committer date and 167496 commits, taking 0.157s for corrected committer date respectively. While using corrected commit dates, Git walks nearly the same number of commits as commit date, the process is slower as for each comparision we have to access a commit-slab (for corrected committer date) instead of accessing struct member (for committer date). This change incidentally broke the fragile t6404-recursive-merge test. t6404-recursive-merge sets up a unique repository where all commits have the same committer date without a well-defined merge-base. While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer date as a heuristic in paint_down_to_common(). 6404.1 'combined merge conflicts' merges commits in the order: - Merge C with B to form an intermediate commit. - Merge the intermediate commit with A. With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently use the corrected committer date, which changes the order in which commits are merged: - Merge A with B to form an intermediate commit. - Merge the intermediate commit with C. While resulting repositories are equivalent, 6404.4 'virtual trees were processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting different merge-bases and thus have different object ids for the intermediate commits. As this has already causes problems (as noted in `859fdc0` (commit-graph: define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph within t6404-recursive-merge. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	1fdc383c5c	commit-graph: use generation v2 only if entire chain does Since there are released versions of Git that understand generation numbers in the commit-graph's CDAT chunk but do not understand the GDAT chunk, the following scenario is possible: 1. "New" Git writes a commit-graph with the GDAT chunk. 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. If each layer of split commit-graph is treated independently, as it was the case before this commit, with Git inspecting only the current layer for chunk_generation_data pointer, commits in the lower layer (one with GDAT) whould have corrected commit date as their generation number, while commits in the upper layer would have topological levels as their generation. Corrected commit dates usually have much larger values than topological levels. This means that if we take two commits, one from the upper layer, and one reachable from it in the lower layer, then the expectation that the generation of a parent is smaller than the generation of a child would be violated. It is difficult to expose this issue in a test. Since we _start_ with artificially low generation numbers, any commit walk that prioritizes generation numbers will walk all of the commits with high generation number before walking the commits with low generation number. In all the cases I tried, the commit-graph layers themselves "protect" any incorrect behavior since none of the commits in the lower layer can reach the commits in the upper layer. This issue would manifest itself as a performance problem in this case, especially with something like "git log --graph" since the low generation numbers would cause the in-degree queue to walk all of the commits in the lower layer before allowing the topo-order queue to write anything to output (depending on the size of the upper layer). Therefore, When writing the new layer in split commit-graph, we write a GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees that if a layer has GDAT chunk, all lower layers must have a GDAT chunk as well. Rewriting layers follows similar approach: if the topmost layer below the set of layers being rewritten (in the split commit-graph chain) exists, and it does not contain GDAT chunk, then the result of rewrite does not have GDAT chunks either. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	e8b63005c4	commit-graph: implement generation data chunk As discovered by Ævar, we cannot increment graph version to distinguish between generation numbers v1 and v2 [1]. Thus, one of pre-requistes before implementing generation number v2 was to distinguish between graph versions in a backwards compatible manner. We are going to introduce a new chunk called Generation DATa chunk (or GDAT). GDAT will store corrected committer date offsets whereas CDAT will still store topological level. Old Git does not understand GDAT chunk and would ignore it, reading topological levels from CDAT. New Git can parse GDAT and take advantage of newer generation numbers, falling back to topological levels when GDAT chunk is missing (as it would happen with a commit-graph written by old Git). We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT' which forces commit-graph file to be written without generation data chunk to emulate a commit-graph file written by old Git. To minimize the space required to store corrrected commit date, Git stores corrected commit date offsets into the commit-graph file, instea of corrected commit dates. This saves us 4 bytes per commit, decreasing the GDAT chunk size by half, but it's possible for the offset to overflow the 4-bytes allocated for storage. As such overflows are and should be exceedingly rare, we use the following overflow management scheme: We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV') to store corrected commit dates for commits with offsets greater than GENERATION_NUMBER_V2_OFFSET_MAX. If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set the MSB of the offset and the other bits store the position of corrected commit date in GDOV chunk, similar to how Extra Edge List is maintained. We test the overflow-related code with the following repo history: F - N - U / \ U - N - U N \ / N - F - N Where the commits denoted by U have committer date of zero seconds since Unix epoch, the commits denoted by N have committer date of 1112354055 (default committer date for the test suite) seconds since Unix epoch and the commits denoted by F have committer date of (2 ^ 31 - 2) seconds since Unix epoch. The largest offset observed is 2 ^ 31, just large enough to overflow. [1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	c1a09119f6	commit-graph: implement corrected commit date With most of preparations done, let's implement corrected commit date. The corrected commit date for a commit is defined as: * A commit with no parents (a root commit) has corrected commit date equal to its committer date. * A commit with at least one parent has corrected commit date equal to the maximum of its commit date and one more than the largest corrected commit date among its parents. As a special case, a root commit with timestamp of zero (01.01.1970 00:00:00Z) has corrected commit date of one, to be able to distinguish from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date). To minimize the space required to store corrected commit date, Git stores corrected commit date offsets into the commit-graph file. The corrected commit date offset for a commit is defined as the difference between its corrected commit date and actual commit date. Storing corrected commit date requires sizeof(timestamp_t) bytes, which in most cases is 64 bits (uintmax_t). However, corrected commit date offsets can be safely stored using only 32-bits. This halves the size of GDAT chunk, which is a reduction of around 6% in the size of commit-graph file. However, using offsets be problematic if a commit is malformed but valid and has committer date of 0 Unix time, as the offset would be the same as corrected commit date and thus require 64-bits to be stored properly. While Git does not write out offsets at this stage, Git stores the corrected commit dates in member generation of struct commit_graph_data. It will begin writing commit date offsets with the introduction of generation data chunk. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	d7f92784c6	commit-graph: return 64-bit generation number In a preparatory step for introducing corrected commit dates, let's return timestamp_t values from commit_graph_generation(), use timestamp_t for local variables and define GENERATION_NUMBER_INFINITY as (2 ^ 63 - 1) instead. We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to represent the largest topological level we can store in the commit data chunk. With corrected commit dates implemented, we will have two such *_MAX variables to denote the largest offset and largest topological level that can be stored. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	72a2bfcaf0	commit-graph: add a slab to store topological levels In a later commit we will introduce corrected commit date as the generation number v2. Corrected commit dates will be stored in the new seperate Generation Data chunk. However, to ensure backwards compatibility with "Old" Git we need to continue to write generation number v1 (topological levels) to the commit data chunk. Thus, we need to compute and store both versions of generation numbers to write the commit-graph file. Therefore, let's introduce a commit-slab `topo_level_slab` to store topological levels; corrected commit date will be stored in the member `generation` of struct commit_graph_data. The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO` mark commits not in the commit-graph file and commits written by a version of Git that did not compute generation numbers respectively. Generation numbers are computed identically for both kinds of commits. A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit is not in the commit-graph file. However, since the slab is zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`). Thus, we no longer need to check if the topological level of a commit is `GENERATION_NUMBER_INFINITY`. We will add a pointer to the slab in `struct write_commit_graph_context` and `struct commit_graph` to populate the slab in `fill_commit_graph_info` if the commit has a pre-computed topological level as in case of split commit-graphs. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	f90fca638e	commit-graph: consolidate fill_commit_graph_info Both fill_commit_graph_info() and fill_commit_in_graph() parse information present in commit data chunk. Let's simplify the implementation by calling fill_commit_graph_info() within fill_commit_in_graph(). fill_commit_graph_info() used to not load committer data from commit data chunk. However, with the upcoming switch to using corrected committer date as generation number v2, we will have to load committer date to compute generation number value anyway. `e51217e15` (t5000: test tar files that overflow ustar headers, 30-06-2016) introduced a test 'generate tar with future mtime' that creates a commit with committer date of (2^36 + 1) seconds since EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus committer time overflows into generation number (within CDAT chunk) and has undefined behavior. The test used to pass as fill_commit_graph_info() would not set struct member `date` of struct commit and load committer date from the object database, generating a tar file with the expected mtime. However, with corrected commit date, we will load the committer date from CDAT chunk (truncated to lower 34-bits to populate the generation number. Thus, Git sets date and generates tar file with the truncated mtime. The ustar format (the header format used by most modern tar programs) only has room for 11 (or 12, depending on some implementations) octal digits for the size and mtime of each file. As the CDAT chunk is overflow by 12-octal digits but not 11-octal digits, we split the existing tests to test both implementations separately and add a new explicit test for 11-digit implementation. To test the 11-octal digit implementation, we create a future commit with committer date of 2^34 - 1, which overflows 11-octal digits without overflowing 34-bits of the Commit Date chunks. To test the 12-octal digit implementation, the smallest committer date possible is 2^36 + 1, which overflows the CDAT chunk and thus commit-graph must be disabled for the test. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:18 -08:00
Abhishek Kumar	e30c5ee76c	commit-graph: fix regression when computing Bloom filters Before computing Bloom filters, the commit-graph machinery uses commit_gen_cmp to sort commits by generation order for improved diff performance. `3d11275505` (commit-graph: examine commits by generation number, 2020-03-30) claims that this sort can reduce the time spent to compute Bloom filters by nearly half. But since `c49c82aa4c` (commit: move members graph_pos, generation to a slab, 2020-06-17), this optimization is broken, since asking for a 'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY while writing. Not all hope is lost, though: 'commit_gen_cmp()' falls back to comparing commits by their date when they have equal generation number, and so since `c49c82aa4c` is purely a date comparison function. This heuristic is good enough that we don't seem to loose appreciable performance while computing Bloom filters. Applying this patch (compared with v2.30.0) speeds up computing Bloom filters by factors ranging from 0.40% to 5.19% on various repositories [1]. So, avoid the useless 'commit_graph_generation()' while writing by instead accessing the slab directly. This returns the newly-computed generation numbers, and allows us to avoid the heuristic by directly comparing generation numbers. [1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/ Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Reviewed-by: Taylor Blau <me@ttaylorr.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-18 16:21:17 -08:00
Martin Ågren	a52cdce936	commit-graph: don't peek into `struct lock_file` Similar to the previous commit, avoid peeking into the `struct lock_file`. Use the lock file API instead. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-06 13:53:32 -08:00
Martin Ågren	bc62692757	hash-lookup: rename from sha1-lookup Change all remnants of "sha1" in hash-lookup.c and .h and rename them to reflect that we're not just able to handle SHA-1 these days. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-04 13:01:55 -08:00
Martin Ågren	7a7d992d0d	sha1-lookup: rename `sha1_pos()` as `hash_pos()` Rename this function to reflect that we're not just able to handle SHA-1 these days. There are a few instances of "sha1" left in sha1-lookup.[ch] after this, but those will be addressed in the next commit. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-01-04 13:01:55 -08:00
Jeff King	3361390cbe	commit-graph: use size_t for array allocation and indexing Our packed_commit_list is an array of pointers to commit structs. We use "int" for the allocation, which is 32-bit even on 64-bit platforms. This isn't likely to overflow in practice (we're writing commit graphs, so you'd need to actually have billions of unique commits in the repository). But it's good practice to use size_t for allocations. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-12-07 12:32:04 -08:00
Jeff King	a5f1c44899	commit-graph: replace packed_oid_list with oid_array Our custom packed_oid_list data structure is really just an oid_array in disguise. Let's switch to using the generic structure, which shortens and simplifies the code slightly. There's one slightly awkward part: in the old code we copied a hash straight from the mmap'd on-disk data into the final object_id. And now we'll copy to a temporary oid, which we'll then pass to oid_array_append(). But this is an operation we have to do all over the commit-graph code already, since it mostly uses object_id structs internally. I also measured "git commit-graph --append", which triggers this code path, and it showed no difference. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-12-07 12:32:04 -08:00
Jeff King	1cbdbf3bef	commit-graph: drop count_distinct_commits() function When writing a commit graph, we collect a list of object ids in an array, which we'll eventually copy into an array of "struct commit" pointers. Before we do that, though, we count the number of distinct commit entries. There's a subtle bug in this step, though. We eliminate not only duplicate oids, but also in split mode, any oids which are not commits or which are already in a graph file. However, the loop starts at index 1, always counting index 0 as distinct. And indeed it can't be a duplicate, since we check for those by comparing against the previous entry, and there isn't one for index 0. But it could be a commit that's already in a graph file, and we'd overcount the number of commits by 1 in that case. That turns out not to be a problem, though. The only things we do with the count are: - check if our count will overflow our data structures. But the limit there is 2^31 commits, so while this is a useful check, the off-by-one is not likely to matter. - pre-allocate the array of commit pointers. But over-allocating by one isn't a problem; we'll just waste a few extra bytes. The bug would be easy enough to fix, but we can observe that neither of those steps is necessary. After building the actual commit array, we'll likewise check its count for overflow. So the extra check of the distinct commit count here is redundant. And likewise we use ALLOC_GROW() when building the commit array, so there's no need to preallocate it (it's possible that doing so is slightly more efficient, but if we care we can just optimistically allocate one slot for each oid; I didn't bother here). So count_distinct_commits() isn't doing anything useful. Let's just get rid of that step. Note that a side effect of the function was that we sorted the list of oids, which we do rely on in copy_oids_to_commits(), since it must also skip the duplicates. So we'll move the qsort there. I didn't copy the "TODO" about adding more progress meters. It's actually quite hard to make a repository large enough for this qsort would take an appreciable amount of time, so this doesn't seem like a useful note. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-12-07 12:32:04 -08:00
Junio C Hamano	307a53dd99	Merge branch 'ds/commit-graph-merging-fix' When "git commit-graph" detects the same commit recorded more than once while it is merging the layers, it used to die. The code now ignores all but one of them and continues. * ds/commit-graph-merging-fix: commit-graph: don't write commit-graph when disabled commit-graph: ignore duplicates when merging layers	2020-11-02 13:17:39 -08:00
Derrick Stolee	85102ac71b	commit-graph: don't write commit-graph when disabled The core.commitGraph config setting can be set to 'false' to prevent parsing commits from the commit-graph file(s). This causes an issue when trying to write with "--split" which needs to distinguish between commits that are in the existing commit-graph layers and commits that are not. The existing mechanism uses parse_commit() and follows by checking if there is a 'graph_pos' that shows the commit was parsed from the commit-graph file. When core.commitGraph=false, we do not parse the commits from the commit-graph and 'graph_pos' indicates that no commits are in the existing file. The --split logic moves forward creating a new layer on top that holds all reachable commits, then possibly merges down into those layers, resulting in duplicate commits. The previous change makes that merging process more robust to such a situation in case it happens in the written commit-graph data. The easy answer here is to avoid writing a commit-graph if reading the commit-graph is disabled. Since the resulting commit-graph will would not be read by subsequent Git processes. This is more natural than forcing core.commitGraph to be true for the 'write' process. Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de> Helped-by: Jeff King <peff@peff.net> Helped-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-10-09 14:16:32 -07:00

1 2 3 4 5 ...

386 commits