Commit graph

423 commits

Author SHA1 Message Date
Taylor Blau c4ff24bbb3 commit-graph.c: display correct number of chunks when writing
When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).

In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.

Since this was no longer updated after 47410aa837, the progress meter
appeared broken:

    $ git commit-graph write --reachable
    Expanding reachable commits in commit graph: 837569, done.
    Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-24 11:44:34 -08:00
Andrzej Hunt bf4bb9f9f5 commit-graph: avoid leaking topo_levels slab in write_commit_graph()
write_commit_graph initialises topo_levels using init_topo_level_slab(),
next it calls compute_topological_levels() which can cause the slab to
grow, we therefore need to clear the slab again using
clear_topo_level_slab() when we're done.

First introduced in 72a2bfca (commit-graph: add a slab to store
topological levels, 2021-01-16).

LeakSanitizer output:

==1026==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 8 byte(s) in 1 object(s) allocated from:
    #0 0x498ae9 in realloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3
    #1 0xafbed8 in xrealloc /src/git/wrapper.c:126:8
    #2 0x7966d1 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1
    #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1
    #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12
    #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2
    #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6
    #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11
    #8 0x4cddb1 in run_builtin /src/git/git.c:453:11
    #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3
    #10 0x4cd084 in run_argv /src/git/git.c:771:4
    #11 0x4ca424 in cmd_main /src/git/git.c:902:19
    #12 0x707fb6 in main /src/git/common-main.c:52:11
    #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)

Indirect leak of 524256 byte(s) in 1 object(s) allocated from:
    #0 0x498942 in calloc /src/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0xafc088 in xcalloc /src/git/wrapper.c:140:8
    #2 0x796870 in topo_level_slab_at_peek /src/git/commit-graph.c:71:1
    #3 0x7965e0 in topo_level_slab_at /src/git/commit-graph.c:71:1
    #4 0x78fbf5 in compute_topological_levels /src/git/commit-graph.c:1472:12
    #5 0x78c5c3 in write_commit_graph /src/git/commit-graph.c:2456:2
    #6 0x535c5f in graph_write /src/git/builtin/commit-graph.c:299:6
    #7 0x5350ca in cmd_commit_graph /src/git/builtin/commit-graph.c:337:11
    #8 0x4cddb1 in run_builtin /src/git/git.c:453:11
    #9 0x4cabe2 in handle_builtin /src/git/git.c:704:3
    #10 0x4cd084 in run_argv /src/git/git.c:771:4
    #11 0x4ca424 in cmd_main /src/git/git.c:902:19
    #12 0x707fb6 in main /src/git/common-main.c:52:11
    #13 0x7fee4249383f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)

SUMMARY: AddressSanitizer: 524264 byte(s) leaked in 2 allocation(s).

Signed-off-by: Andrzej Hunt <ajrhunt@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-22 13:45:01 -08:00
Derrick Stolee 2692c2f6fd commit-graph: use chunk-format read API
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee 47410aa837 commit-graph: use chunk-format write API
The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method is already in the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Junio C Hamano 726b11d68a Merge branch 'js/commit-graph-warning'
When certain features (e.g. grafts) used in the repository are
incompatible with the use of the commit-graph, we used to silently
turned commit-graph off; we now tell the user what we are doing.

* js/commit-graph-warning:
  commit-graph: when incompatible with graphs, indicate why
2021-02-17 17:21:42 -08:00
Junio C Hamano 5bd0b21bf7 Merge branch 'ds/commit-graph-genno-fix'
Fix incremental update of commit-graph file around corrected commit
date data.

* ds/commit-graph-genno-fix:
  commit-graph: prepare commit graph
  commit-graph: be extra careful about mixed generations
  commit-graph: compute generations separately
  commit-graph: validate layers for generation data
  commit-graph: always parse before commit_graph_data_at()
  commit-graph: use repo_parse_commit
2021-02-17 17:21:40 -08:00
Junio C Hamano 8b4701ae4f Merge branch 'ak/corrected-commit-date'
The commit-graph learned to use corrected commit dates instead of
the generation number to help topological revision traversal.

* ak/corrected-commit-date:
  doc: add corrected commit date info
  commit-reach: use corrected commit dates in paint_down_to_common()
  commit-graph: use generation v2 only if entire chain does
  commit-graph: implement generation data chunk
  commit-graph: implement corrected commit date
  commit-graph: return 64-bit generation number
  commit-graph: add a slab to store topological levels
  t6600-test-reach: generalize *_three_modes
  commit-graph: consolidate fill_commit_graph_info
  revision: parse parent in indegree_walk_step()
  commit-graph: fix regression when computing Bloom filters
2021-02-17 17:21:40 -08:00
Johannes Schindelin c85eec7fc3 commit-graph: when incompatible with graphs, indicate why
When `gc.writeCommitGraph = true`, it is possible that the commit-graph
is _still_ not written: replace objects, grafts and shallow repositories
are incompatible with the commit-graph feature.

Under such circumstances, we need to indicate to the user why the
commit-graph was not written instead of staying silent about it.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-11 09:33:01 -08:00
Junio C Hamano 9d5b1c06ac Merge branch 'jk/use-oid-pos'
Code clean-up to ensure our use of hashtables using object names as
keys use the "struct object_id" objects, not the raw hash values.

* jk/use-oid-pos:
  oid_pos(): access table through const pointers
  hash_pos(): convert to oid_pos()
  rerere: use strmap to store rerere directories
  rerere: tighten rr-cache dirname check
  rerere: check dirname format while iterating rr_cache directory
  commit_graft_pos(): take an oid instead of a bare hash
2021-02-10 14:48:31 -08:00
Derrick Stolee eb9071912f commit-graph: anonymize data in chunk_write_fn
In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-05 15:40:41 -08:00
Junio C Hamano 973e20b83f Merge branch 'jk/peel-iterated-oid'
The peel_ref() API has been replaced with peel_iterated_oid().

* jk/peel-iterated-oid:
  refs: switch peel_ref() to peel_iterated_oid()
2021-02-03 15:04:49 -08:00
Derrick Stolee bc50d6c91f commit-graph: prepare commit graph
Before checking if the repository has a commit-graph loaded, be sure
to run prepare_commit_graph(). This is necessary because otherwise
the topo_levels slab is not initialized. As we compute topo_levels for
the new commits, we iterate further into the lower layers since the
first visit to each commit looks as though the topo_level is not
populated.

By properly initializing the topo_slab, we fix the previously broken
case of a split commit graph where a base layer has the
generation_data_overflow chunk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee fde55b0906 commit-graph: be extra careful about mixed generations
When upgrading to a commit-graph with corrected commit dates from
one without, there are a few things that need to be considered.

When computing generation numbers for the new commit-graph file that
expects to add the generation_data chunk with corrected commit
dates, we need to ensure that the 'generation' member of the
commit_graph_data struct is set to zero for these commits.

Unfortunately, the fallback to use topological level for generation
number when corrected commit dates are not available are causing us
harm here: parsing commits notices that read_generation_data is
false and populates 'generation' with the topological level.

The solution is to iterate through the commits, parse the commits
to populate initial values, then reset the generation values to
zero to trigger recalculation. This loop only occurs when the
existing commit-graph data has no corrected commit dates.

While this improves our situation somewhat, we have not completely
solved the issue for correctly computing generation numbers for mixed
layers. That follows in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 9c2c0a8256 commit-graph: compute generations separately
The compute_generation_numbers() method was introduced by 3258c663
(commit-graph: compute generation numbers, 2018-05-01) to compute what
is now known as "topological levels". These are still stored in the
commit-graph file for compatibility sake while c1a09119 (commit-graph:
implement corrected commit date, 2021-01-16) updated the method to also
compute the new version of generation numbers: corrected commit date.

It makes sense why these are grouped. They perform very similar walks of
the necessary commits and compute similar maximums over each parent.
However, having these two together conflates them in subtle ways that is
hard to separate.

In particular, the topo_level slab is used to store the topological
levels in all cases, but the commit_graph_data_at(c)->generation member
stores different values depending on the state of the existing
commit-graph file.

* If the existing commit-graph file has a "GDAT" chunk, then these
  values represent corrected commit dates.

* If the existing commit-graph file doesn't have a "GDAT" chunk, then
  these values are actually the topological levels.

This issue only occurs only when upgrading an existing commit-graph file
into one that has the "GDAT" chunk. The current change does not resolve
this upgrade problem, but splitting the implementation into two pieces
here helps with that process, which will follow in the next change.

The important thing this helps with is the case where the
num_generation_data_overflows was being incremented incorrectly,
triggering a write of the overflow chunk.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 448a39e65d commit-graph: validate layers for generation data
We need to be extra careful that we don't use corrected
commit dates from any layer of a commit-graph chain if there is a
single commit-graph file that is missing the generation_data chunk.
Update validate_mixed_generation_chain() to correctly update each
layer to ignore the generation_data chunk in this case. It now also
returns 1 if all layers have a generation_data chunk. This return
value will be used in the next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee 90cb1c47c7 commit-graph: always parse before commit_graph_data_at()
There is a subtle failure happening when computing corrected commit
dates with --split enabled. It requires a base layer needing the
generation_data_overflow chunk. Then, the next layer on top
erroneously thinks it needs an overflow chunk due to a bug leading
to recalculating all reachable generation numbers. The output of
the failure is

  BUG: commit-graph.c:1912: expected to write 8 bytes to
  chunk 47444f56, but wrote 0 instead

These "expected" 8 bytes are due to re-computing the corrected
commit date for the lower layer but the new layer does not need
any overflow.

Add a test to t5318-commit-graph.sh that demonstrates this bug. However,
it does not trigger consistently with the existing code.

The generation number data is stored in a slab and accessed by
commit_graph_data_at(). This data is initialized when parsing a commit,
but is otherwise used assuming it has been populated. The loop in
compute_generation_numbers() did not enforce that all reachable
commits were parsed and had correct values. This could lead to some
problems when writing a commit-graph with corrected commit dates based
on a commit-graph without them.

It has been difficult to identify the issue here because it was so hard
to reproduce. It relies on this uninitialized data having a non-zero
value, but also on specifically in a way that overwrites the existing
data.

This patch adds the extra parse to ensure the data is filled before we
compute the generation number of a commit. This triggers the new test
to fail because the generation number overflow count does not match
between this computation and the write for that chunk.

The actual fix will follow as the next few changes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:36 -08:00
Derrick Stolee c4cc083169 commit-graph: use repo_parse_commit
The write_commit_graph_context has a repository pointer, so use it.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-01 21:03:35 -08:00
Jeff King 8380dcd700 oid_pos(): access table through const pointers
When we are looking up an oid in an array, we obviously don't need to
write to the array. Let's mark it as const in the function interfaces,
as well as in the local variables we use to derference the void pointer
(note a few cases use pointers-to-pointers, so we mark everything
const).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28 12:03:26 -08:00
Jeff King 45ee13b942 hash_pos(): convert to oid_pos()
All of our callers are actually looking up an object_id, not a bare
hash. Likewise, the arrays they are looking in are actual arrays of
object_id (not just raw bytes of hashes, as we might find in a pack
.idx; those are handled by bsearch_hash()).

Using an object_id gives us more type safety, and makes the callers
slightly shorter. It also gets rid of the word "sha1" from several
access functions, though we could obviously also rename those with
s/sha1/hash/.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-28 12:02:39 -08:00
Junio C Hamano 58e2ce9112 Merge branch 'ma/more-opaque-lock-file'
Code clean-up.

* ma/more-opaque-lock-file:
  read-cache: try not to peek into `struct {lock_,temp}file`
  refs/files-backend: don't peek into `struct lock_file`
  midx: don't peek into `struct lock_file`
  commit-graph: don't peek into `struct lock_file`
  builtin/gc: don't peek into `struct lock_file`
2021-01-25 14:19:17 -08:00
Jeff King 36a317929b refs: switch peel_ref() to peel_iterated_oid()
The peel_ref() interface is confusing and error-prone:

  - it's typically used by ref iteration callbacks that have both a
    refname and oid. But since they pass only the refname, we may load
    the ref value from the filesystem again. This is inefficient, but
    also means we are open to a race if somebody simultaneously updates
    the ref. E.g., this:

      int some_ref_cb(const char *refname, const struct object_id *oid, ...)
      {
              if (!peel_ref(refname, &peeled))
                      printf("%s peels to %s",
                             oid_to_hex(oid), oid_to_hex(&peeled);
      }

    could print nonsense. It is correct to say "refname peels to..."
    (you may see the "before" value or the "after" value, either of
    which is consistent), but mentioning both oids may be mixing
    before/after values.

    Worse, whether this is possible depends on whether the optimization
    to read from the current iterator value kicks in. So it is actually
    not possible with:

      for_each_ref(some_ref_cb);

    but it _is_ possible with:

      head_ref(some_ref_cb);

    which does not use the iterator mechanism (though in practice, HEAD
    should never peel to anything, so this may not be triggerable).

  - it must take a fully-qualified refname for the read_ref_full() code
    path to work. Yet we routinely pass it partial refnames from
    callbacks to for_each_tag_ref(), etc. This happens to work when
    iterating because there we do not call read_ref_full() at all, and
    only use the passed refname to check if it is the same as the
    iterator. But the requirements for the function parameters are quite
    unclear.

Instead of taking a refname, let's instead take an oid. That fixes both
problems. It's a little funny for a "ref" function not to involve refs
at all. The key thing is that it's optimizing under the hood based on
having access to the ref iterator. So let's change the name to make it
clear why you'd want this function versus just peel_object().

There are two other directions I considered but rejected:

  - we could pass the peel information into the each_ref_fn callback.
    However, we don't know if the caller actually wants it or not. For
    packed-refs, providing it is essentially free. But for loose refs,
    we actually have to peel the object, which would be wasteful in most
    cases. We could likewise pass in a flag to the callback indicating
    whether the peeled information is known, but that complicates those
    callbacks, as they then have to decide whether to manually peel
    themselves. Plus it requires changing the interface of every
    callback, whether they care about peeling or not, and there are many
    of them.

  - we could make a function to return the peeled value of the current
    iterated ref (computing it if necessary), and BUG() otherwise. I.e.:

      int peel_current_iterated_ref(struct object_id *out);

    Each of the current callers is an each_ref_fn callback, so they'd
    mostly be happy. But:

      - we use those callbacks with functions like head_ref(), which do
        not use the iteration code. So we'd need to handle the fallback
        case there, anyway.

      - it's possible that a caller would want to call into generic code
        that sometimes is used during iteration and sometimes not. This
        encapsulates the logic to do the fast thing when possible, and
        fallback when necessary.

The implementation is mostly obvious, but I want to call out a few
things in the patch:

  - the test-tool coverage for peel_ref() is now meaningless, as it all
    collapses to a single peel_object() call (arguably they were pretty
    uninteresting before; the tricky part of that function is the
    fast-path we see during iteration, but these calls didn't trigger
    that). I've just dropped it entirely, though note that some other
    tests relied on the tags we created; I've moved that creation to the
    tests where it matters.

  - we no longer need to take a ref_store parameter, since we'd never
    look up a ref now. We do still rely on a global "current iterator"
    variable which _could_ be kept per-ref-store. But in practice this
    is only useful if there are multiple recursive iterations, at which
    point the more appropriate solution is probably a stack of
    iterators. No caller used the actual ref-store parameter anyway
    (they all call the wrapper that passes the_repository).

  - the original only kicked in the optimization when the "refname"
    pointer matched (i.e., not string comparison). We do likewise with
    the "oid" parameter here, but fall back to doing an actual oideq()
    call. This in theory lets us kick in the optimization more often,
    though in practice no current caller cares. It should never be
    wrong, though (peeling is a property of an object, so two refs
    pointing to the same object would peel identically).

  - the original took care not to touch the peeled out-parameter unless
    we found something to put in it. But no caller cares about this, and
    anyway, it is enforced by peel_object() itself (and even in the
    optimized iterator case, that's where we eventually end up). We can
    shorten the code and avoid an extra copy by just passing the
    out-parameter through the stack.

Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-21 15:51:31 -08:00
Abhishek Kumar 8d00d7c3df commit-reach: use corrected commit dates in paint_down_to_common()
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 1fdc383c5c commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar e8b63005c4 commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar c1a09119f6 commit-graph: implement corrected commit date
With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar d7f92784c6 commit-graph: return 64-bit generation number
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 72a2bfcaf0 commit-graph: add a slab to store topological levels
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar f90fca638e commit-graph: consolidate fill_commit_graph_info
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar e30c5ee76c commit-graph: fix regression when computing Bloom filters
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.

But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.

Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters.

Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].

So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.

[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:17 -08:00
Martin Ågren a52cdce936 commit-graph: don't peek into struct lock_file
Similar to the previous commit, avoid peeking into the `struct
lock_file`. Use the lock file API instead.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-06 13:53:32 -08:00
Martin Ågren bc62692757 hash-lookup: rename from sha1-lookup
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to
reflect that we're not just able to handle SHA-1 these days.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04 13:01:55 -08:00
Martin Ågren 7a7d992d0d sha1-lookup: rename sha1_pos() as hash_pos()
Rename this function to reflect that we're not just able to handle SHA-1
these days. There are a few instances of "sha1" left in sha1-lookup.[ch]
after this, but those will be addressed in the next commit.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04 13:01:55 -08:00
Jeff King 3361390cbe commit-graph: use size_t for array allocation and indexing
Our packed_commit_list is an array of pointers to commit structs. We use
"int" for the allocation, which is 32-bit even on 64-bit platforms. This
isn't likely to overflow in practice (we're writing commit graphs, so
you'd need to actually have billions of unique commits in the
repository). But it's good practice to use size_t for allocations.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Jeff King a5f1c44899 commit-graph: replace packed_oid_list with oid_array
Our custom packed_oid_list data structure is really just an oid_array in
disguise. Let's switch to using the generic structure, which shortens
and simplifies the code slightly.

There's one slightly awkward part: in the old code we copied a hash
straight from the mmap'd on-disk data into the final object_id. And now
we'll copy to a temporary oid, which we'll then pass to
oid_array_append(). But this is an operation we have to do all over the
commit-graph code already, since it mostly uses object_id structs
internally. I also measured "git commit-graph --append", which triggers
this code path, and it showed no difference.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Jeff King 1cbdbf3bef commit-graph: drop count_distinct_commits() function
When writing a commit graph, we collect a list of object ids in an
array, which we'll eventually copy into an array of "struct commit"
pointers. Before we do that, though, we count the number of distinct
commit entries. There's a subtle bug in this step, though.

We eliminate not only duplicate oids, but also in split mode, any oids
which are not commits or which are already in a graph file. However, the
loop starts at index 1, always counting index 0 as distinct. And indeed
it can't be a duplicate, since we check for those by comparing against
the previous entry, and there isn't one for index 0. But it could be a
commit that's already in a graph file, and we'd overcount the number of
commits by 1 in that case.

That turns out not to be a problem, though. The only things we do with
the count are:

  - check if our count will overflow our data structures. But the limit
    there is 2^31 commits, so while this is a useful check, the
    off-by-one is not likely to matter.

  - pre-allocate the array of commit pointers. But over-allocating by
    one isn't a problem; we'll just waste a few extra bytes.

The bug would be easy enough to fix, but we can observe that neither of
those steps is necessary.

After building the actual commit array, we'll likewise check its count
for overflow. So the extra check of the distinct commit count here is
redundant.

And likewise we use ALLOC_GROW() when building the commit array, so
there's no need to preallocate it (it's possible that doing so is
slightly more efficient, but if we care we can just optimistically
allocate one slot for each oid; I didn't bother here).

So count_distinct_commits() isn't doing anything useful. Let's just get
rid of that step.

Note that a side effect of the function was that we sorted the list of
oids, which we do rely on in copy_oids_to_commits(), since it must also
skip the duplicates. So we'll move the qsort there. I didn't copy the
"TODO" about adding more progress meters. It's actually quite hard to
make a repository large enough for this qsort would take an appreciable
amount of time, so this doesn't seem like a useful note.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Junio C Hamano 307a53dd99 Merge branch 'ds/commit-graph-merging-fix'
When "git commit-graph" detects the same commit recorded more than
once while it is merging the layers, it used to die.  The code now
ignores all but one of them and continues.

* ds/commit-graph-merging-fix:
  commit-graph: don't write commit-graph when disabled
  commit-graph: ignore duplicates when merging layers
2020-11-02 13:17:39 -08:00
Derrick Stolee 85102ac71b commit-graph: don't write commit-graph when disabled
The core.commitGraph config setting can be set to 'false' to prevent
parsing commits from the commit-graph file(s). This causes an issue when
trying to write with "--split" which needs to distinguish between
commits that are in the existing commit-graph layers and commits that
are not. The existing mechanism uses parse_commit() and follows by
checking if there is a 'graph_pos' that shows the commit was parsed from
the commit-graph file.

When core.commitGraph=false, we do not parse the commits from the
commit-graph and 'graph_pos' indicates that no commits are in the
existing file. The --split logic moves forward creating a new layer on
top that holds all reachable commits, then possibly merges down into
those layers, resulting in duplicate commits. The previous change makes
that merging process more robust to such a situation in case it happens
in the written commit-graph data.

The easy answer here is to avoid writing a commit-graph if reading the
commit-graph is disabled. Since the resulting commit-graph will would not
be read by subsequent Git processes. This is more natural than forcing
core.commitGraph to be true for the 'write' process.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09 14:16:32 -07:00
Derrick Stolee 150f11574b commit-graph: ignore duplicates when merging layers
Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09 14:16:23 -07:00
Junio C Hamano 288ed98bf7 Merge branch 'tb/bloom-improvements'
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.

* tb/bloom-improvements:
  commit-graph: introduce 'commitGraph.maxNewFilters'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  commit-graph: rename 'split_commit_graph_opts'
  bloom: encode out-of-bounds filters as non-empty
  bloom/diff: properly short-circuit on max_changes
  bloom: use provided 'struct bloom_filter_settings'
  bloom: split 'get_bloom_filter()' in two
  commit-graph.c: store maximum changed paths
  commit-graph: respect 'commitGraph.readChangedPaths'
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&&'-chain
  commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-29 14:01:20 -07:00
Junio C Hamano 48794acc50 Merge branch 'ds/maintenance-part-1'
A "git gc"'s big brother has been introduced to take care of more
repository maintenance tasks, not limited to the object database
cleaning.

* ds/maintenance-part-1:
  maintenance: add trace2 regions for task execution
  maintenance: add auto condition for commit-graph task
  maintenance: use pointers to check --auto
  maintenance: create maintenance.<task>.enabled config
  maintenance: take a lock on the objects directory
  maintenance: add --task option
  maintenance: add commit-graph task
  maintenance: initialize task array
  maintenance: replace run_auto_gc()
  maintenance: add --quiet option
  maintenance: create basic maintenance runner
2020-09-25 15:25:38 -07:00
Taylor Blau 809e0327f5 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18 10:35:39 -07:00
Taylor Blau 98bb796191 commit-graph: rename 'split_commit_graph_opts'
In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 21:55:50 -07:00
Taylor Blau 59f0d5073f bloom: encode out-of-bounds filters as non-empty
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):

                  |     Percentage of     |    commit-graph   |           |
                  |   commits modifying   |     file size     |           |
                  ├────────┬──────────────┼───────────────────┤    pct.   |
                  | 0 path | >= 512 paths | before  |  after  |   change  |
 ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
 | android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
 | cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
 | cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
 | elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
 | gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
 | gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
 | git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
 | glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
 | go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
 | homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
 | homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
 | jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
 | linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
 | llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
 | rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
 | rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
 | tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
 | webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |

(where the above increase is determined by computing a non-split
commit-graph before and after this patch).

Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.

Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 21:55:50 -07:00
Derrick Stolee 663b2b1b90 maintenance: add commit-graph task
The first new task in the 'git maintenance' builtin is the
'commit-graph' task. This updates the commit-graph file
incrementally with the command

	git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will verly likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Taylor Blau 9a7a9ed10d bloom: use provided 'struct bloom_filter_settings'
When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:31:25 -07:00
Taylor Blau 312cff5207 bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:31:25 -07:00
Taylor Blau 97ffa4fab5 commit-graph.c: store maximum changed paths
For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:29:22 -07:00
Taylor Blau b66d84756f commit-graph: respect 'commitGraph.readChangedPaths'
Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Taylor Blau ab14d0676c commit-graph: pass a 'struct repository *' in more places
In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Taylor Blau 4f3644056a commit-graph: introduce 'get_bloom_filter_settings()'
Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Derrick Stolee 665d70ad03 commit-graph: use the "hash version" byte
The commit-graph format reserved a byte among the header of the file to
store a "hash version". During the SHA-256 work, this was not modified
because file formats are not necessarily intended to work across hash
versions. If a repository has SHA-256 as its hash algorithm, it
automatically up-shifts the lengths of object names in all necessary
formats.

However, since we have this byte available for adjusting the version, we
can make the file formats more obviously incompatible instead of relying
on other context from the repository.

Update the oid_version() method in commit-graph.c to add a new value, 2,
for sha-256. This automatically writes the new value in a SHA-256
repository _and_ verifies the value is correct. This is a breaking
change relative to the current 'master' branch since 092b677 (Merge
branch 'bc/sha-256-cvs-svn-updates', 2020-08-13) but it is not breaking
relative to any released version of Git.

The test impact is relatively minor: the output of 'test-tool
read-graph' lists the header information, so those instances of '1' need
to be replaced with a variable determined by GIT_TEST_DEFAULT_HASH. A
more careful test is added that specifically creates a repository of
each type then swaps the commit-graph files. The important value here is
that the "git log" command succeeds while writing a message to stderr.

Helped-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 16:45:14 -07:00
Junio C Hamano 70cdbbe3a7 Merge branch 'ds/commit-graph-bloom-updates' into master
Updates to the changed-paths bloom filter.

* ds/commit-graph-bloom-updates:
  commit-graph: check all leading directories in changed path Bloom filters
  revision: empty pathspecs should not use Bloom filters
  revision.c: fix whitespace
  commit-graph: check chunk sizes after writing
  commit-graph: simplify chunk writes into loop
  commit-graph: unify the signatures of all write_graph_chunk_*() functions
  commit-graph: persist existence of changed-paths
  bloom: fix logic in get_bloom_filter()
  commit-graph: change test to die on parse, not load
  commit-graph: place bloom_settings in context
2020-07-30 13:20:31 -07:00
Junio C Hamano de6dda0dc3 Merge branch 'sg/commit-graph-cleanups' into master
The changed-path Bloom filter is improved using ideas from an
independent implementation.

* sg/commit-graph-cleanups:
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: clean up #includes
  diff.h: drop diff_tree_oid() & friends' return value
  commit-slab: add a function to deep free entries on the slab
  commit-graph-format.txt: all multi-byte numbers are in network byte order
  commit-graph: fix parsing the Chunk Lookup table
  tree-walk.c: don't match submodule entries for 'submod/anything'
2020-07-30 13:20:30 -07:00
Junio C Hamano 12f5eb9f08 Merge branch 'sg/commit-graph-progress-fix' into master
The code to produce progress output from "git commit-graph --write"
had a few breakages, which have been fixed.

* sg/commit-graph-progress-fix:
  commit-graph: fix "Writing out commit graph" progress counter
  commit-graph: fix progress of reachable commits
2020-07-15 16:29:43 -07:00
Junio C Hamano 24ecfdf206 Merge branch 'tb/fix-persistent-shallow' into master
When "fetch.writeCommitGraph" configuration is set in a shallow
repository and a fetch moves the shallow boundary, we wrote out
broken commit-graph files that do not match the reality, which has
been corrected.

* tb/fix-persistent-shallow:
  commit.c: don't persist substituted parents when unshallowing
2020-07-09 14:00:44 -07:00
SZEDER Gábor 150cd3b61d commit-graph: fix "Writing out commit graph" progress counter
76ffbca71a (commit-graph: write Bloom filters to commit graph file,
2020-04-06) added two delayed progress lines to writing the Bloom
filter index and data chunk.  This is wrong, because a single common
progress is used while writing all chunks, which is not updated while
writing these two new chunks, resulting in incomplete-looking "done"
lines:

  Expanding reachable commits in commit graph: 888679, done.
  Computing commit changed paths Bloom filters: 100% (888678/888678), done.
  Writing out commit graph in 6 passes:  66% (3554712/5332068), done.

Use the common 'struct progress' instance while writing the Bloom
filter chunks as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09 10:28:49 -07:00
SZEDER Gábor 6f9d5f2fda commit-graph: fix progress of reachable commits
To display a progress line while iterating over all refs,
d335ce8f24 (commit-graph.c: show progress of finding reachable
commits, 2020-05-13) should have added a pair of
start_delayed_progress() and stop_progress() calls around a
for_each_ref() invocation.  Alas, the stop_progress() call ended up at
the wrong place, after write_commit_graph(), which does all the
commit-graph computation and writing, and has several progress lines
of its own.  Consequently, that new

  Collecting referenced commits: 123

progress line is overwritten by the first progress line shown by
write_commit_graph(), and its final "done" line is shown last, after
everything is finished:

  Expanding reachable commits in commit graph: 344786, done.
  Computing commit changed paths Bloom filters: 100% (344786/344786), done.
  Collecting referenced commits: 154, done.

Move that stop_progress() call to the right place.

While at it, drop the unnecessary 'if (data.progress)' condition
protecting the stop_progress() call, because that function is prepared
to handle a NULL progress struct.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09 10:27:23 -07:00
Taylor Blau ce16364e89 commit.c: don't persist substituted parents when unshallowing
Since 37b9dcabfc (shallow.c: use '{commit,rollback}_shallow_file',
2020-04-22), Git knows how to reset stat-validity checks for the
$GIT_DIR/shallow file, allowing it to change between a shallow and
non-shallow state in the same process (e.g., in the case of 'git fetch
--unshallow').

However, when $GIT_DIR/shallow changes, Git does not alter or remove any
grafts (nor substituted parents) in memory.

This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph
set to true. Ordinarily in a shallow repository (and before 37b9dcabfc,
even in this case), commit_graph_compatible() would return false,
indicating that the repository should not be used to write a
commit-graphs (since commit-graph files cannot represent a shallow
history). But since 37b9dcabfc, in an --unshallow operation that check
succeeds.

Thus even though the repository isn't shallow any longer (that is, we
have all of the objects), the in-core representation of those objects
still has munged parents at the shallow boundaries.  When the
commit-graph write proceeds, we use the incorrect parentage, producing
wrong results.

There are two ways for a user to work around this: either (1) set
'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after
unshallowing.

One way to fix this would be to reset the parsed object pool entirely
(flushing the cache and thus preventing subsequent reads from modifying
their parents) after unshallowing. That would produce a problem when
callers have a now-stale reference to the old pool, and so this patch
implements a different approach. Instead, attach a new bit to the pool,
'substituted_parent', which indicates if the repository *ever* stored a
commit which had its parents modified (i.e., the shallow boundary
prior to unshallowing).

This bit needs to be sticky because all reads subsequent to modifying a
commit's parents are unreliable when unshallowing. Modify the check in
'commit_graph_compatible' to take this bit into account, and correctly
avoid generating commit-graphs in this case, thus solving the bug.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Jay Conrod <jayconrod@google.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-08 16:13:46 -07:00
SZEDER Gábor 2dd4fed927 commit-graph: check chunk sizes after writing
In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
SZEDER Gábor 17e6275fc9 commit-graph: simplify chunk writes into loop
In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and an other block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List an Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
SZEDER Gábor 9bab081dfa commit-graph: unify the signatures of all write_graph_chunk_*() functions
Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 0087a87ba8 commit-graph: persist existence of changed-paths
The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt-in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write"
but the next write will drop the filters unless that option is
specified.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
features.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user that opts-in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.

In the process, we need to be extremely careful to match the Bloom
filter settings as specified by the commit-graph. This will allow future
versions of Git to customize these settings, and the version with this
change will persist those settings as commit-graphs are rewritten on
top.

Use the trace2 API to signal the settings used during the write, and
check that output in a test after manually adjusting the correct bytes
in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 949197420e bloom: fix logic in get_bloom_filter()
The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

One side-effect of returning a NULL filter is that the filters that are
reported as "too large" will now be reported as NULL insead of length
zero. This case was not properly covered before, so add a test. Further,
remote the counting of the zero-length filters from revision.c and the
trace2 logs.

Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 7b671f8c2b commit-graph: change test to die on parse, not load
43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, instead die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-23 17:12:08 -07:00
Derrick Stolee 98037f2bf2 commit-graph: place bloom_settings in context
Place an instance of struct bloom_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-23 17:12:08 -07:00
Abhishek Kumar c752ad09c4 commit-graph: minimize commit_graph_data_slab access
In an earlier patch, multiple struct acccesses to `graph_pos` and
`generation` were auto-converted to multiple method calls.

Since the values are fixed and commit-slab access costly, we would be
better off with storing the values as a local variable and reusing it.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:52 -07:00
Abhishek Kumar c49c82aa4c commit: move members graph_pos, generation to a slab
We remove members `graph_pos` and `generation` from the struct commit.
The default assignments in init_commit_node() are no longer valid,
which is fine as the slab helpers return appropriate default values and
the assignments are removed.

We will replace existing use of commit->generation and commit->graph_pos
by commit_graph_data_slab helpers using
`contrib/coccinelle/commit.cocci'.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:30 -07:00
Abhishek Kumar 4844812b9e commit-graph: introduce commit_graph_data_slab
The struct commit is used in many contexts. However, members
`generation` and `graph_pos` are only used for commit-graph related
operations and otherwise waste memory.

This wastage would have been more pronounced as we transition to
generation number v2, which uses 64-bit generation number instead of
current 32-bits.

As they are often accessed together, let's introduce struct
commit_graph_data and move them to a commit_graph_data slab.

While the overall test suite runs just as fast as master,
(series: 26m48s, master: 27m34s, faster by 2.87%), certain commands
like `git merge-base --is-ancestor` were slowed by 40% as discovered
by Szeder Gábor [1]. After minimizing commit-slab access, the slow down
persists but is closer to 20%.

Derrick Stolee believes the slow down is attributable to the underlying
algorithm rather than the slowness of commit-slab access [2] and we will
follow-up in a later series.

[1]: https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
[2]: https://lore.kernel.org/git/13db757a-9412-7f1e-805c-8a028c4ab2b1@gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:23 -07:00
Junio C Hamano dc57a9be5e Merge branch 'tb/commit-graph-no-check-oids'
Clean-up the commit-graph codepath.

* tb/commit-graph-no-check-oids:
  commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flag
  t5318: reorder test below 'graph_read_expect'
  commit-graph.c: simplify 'fill_oids_from_commits'
  builtin/commit-graph.c: dereference tags in builtin
  builtin/commit-graph.c: extract 'read_one_commit()'
  commit-graph.c: peel refs in 'add_ref_to_set'
  commit-graph.c: show progress of finding reachable commits
  commit-graph.c: extract 'refs_cb_data'
2020-06-08 18:06:27 -07:00
SZEDER Gábor 7fbfe07ab4 commit-graph: simplify write_commit_graph_file() #2
Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of
'struct chunk_info'.  This will allow more cleanups in the following
patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor bb4d60e5d5 commit-graph: simplify write_commit_graph_file() #1
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor 5cfa438a76 commit-graph: simplify parse_commit_graph() #2
The Chunk Lookup table stores the chunks' starting offset in the
commit-graph file, not their sizes.  Consequently, the size of a chunk
can only be calculated by subtracting its offset from the offset of
the subsequent chunk (or that of the terminating label).  This is
currenly implemented in a bit complicated way: as we iterate over the
entries of the Chunk Lookup table, we check the id of each chunk and
store its starting offset, then we check the id of the last seen chunk
and calculate its size using its previously saved offset.  At the
moment there is only one chunk for which we calculate its size, but
this patch series will add more, and the repeated chunk id checks are
not that pretty.

Instead let's read ahead the offset of the next chunk on each
iteration, so we can calculate the size of each chunk right away,
right where we store its starting offset.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor 2ad4f1a7c4 commit-graph: simplify parse_commit_graph() #1
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor fa7965309e commit-graph: clean up #includes
Our CodingGuidelines says that it's sufficient to include one of
'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and
'commit-graph.h' include both.  Let's include only 'git-compat-util.h'
to loose a bunch of unnecessary dependencies; but include 'hash.h',
because 'commit-graph.h' does require the definition of 'struct
object_id'.

'commit-graph.h' explicitly includes 'repository.h' and
'string-list.h', but only needs the declaration of a few structs from
them.  Drop these includes and forward-declare the necessary structs
instead.

'commit-graph.c' includes 'dir.h', but doesn't actually use anything
from there, so let's drop that #include as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor cb9daf16db commit-graph: fix parsing the Chunk Lookup table
The commit-graph file format specifies that the chunks may be in any
order.  However, if the OID Lookup chunk happens to be the last one in
the file, then any command attempting to access the commit-graph data
will fail with:

  fatal: invalid commit position. commit-graph is likely corrupt

In this case the error is wrong, the commit-graph file does conform to
the specification, but the parsing of the Chunk Lookup table is a bit
buggy, and leaves the field holding the number of commits in the
commit-graph zero-initialized.

The number of commits in the commit-graph is determined while parsing
the Chunk Lookup table, by dividing the size of the OID Lookup chunk
with the hash size.  However, the Chunk Lookup table doesn't actually
store the size of the chunks, but it stores their starting offset.
Consequently, the size of a chunk can only be calculated by
subtracting the starting offsets of that chunk from the offset of the
subsequent chunk, or in case of the last chunk from the offset
recorded in the terminating label.  This is currenly implemented in a
bit complicated way: as we iterate over the entries of the Chunk
Lookup table, we check the ID of each chunk and store its starting
offset, then we check the ID of the last seen chunk and calculate its
size using its previously saved offset if necessary (at the moment
it's only necessary for the OID Lookup chunk).  Alas, while parsing
the Chunk Lookup table we only interate through the "real" chunks, but
never look at the terminating label, thus don't even check whether
it's necessary to calulate the size of the last chunk.  Consequently,
if the OID Lookup chunk is the last one, then we don't calculate its
size and turn don't run the piece of code determining the number of
commits in the commit graph, leaving the field holding that number
unchanged (i.e. zero-initialized), eventually triggering the sanity
check in load_oid_from_graph().

Fix this by iterating through all entries in the Chunk Lookup table,
including the terminating label.

Note that this is the minimal fix, suitable for the maintenance track.
A better fix would be to simplify how the chunk sizes are calculated,
but that is a more invasive change, less suitable for 'maint', so that
will be done in later patches.

This additional flexibility of scanning more chunks breaks a test for
"git commit-graph verify" so alter that test to mutate the commit-graph
to have an even lower chunk count.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:48 -07:00
Taylor Blau 2f00c355cb commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flag
Since 7c5c9b9c57 (commit-graph: error out on invalid commit oids in
'write --stdin-commits', 2019-08-05), the commit-graph builtin dies on
receiving non-commit OIDs as input to '--stdin-commits'.

This behavior can be cumbersome to work around in, say, the case of
piping 'git for-each-ref' to 'git commit-graph write --stdin-commits' if
the caller does not want to cull out non-commits themselves. In this
situation, it would be ideal if 'git commit-graph write' wrote the graph
containing the inputs that did pertain to commits, and silently ignored
the remainder of the input.

Some options have been proposed to the effect of '--[no-]check-oids'
which would allow callers to have the commit-graph builtin do just that.
After some discussion, it is difficult to imagine a caller who wouldn't
want to pass '--no-check-oids', suggesting that we should get rid of the
behavior of complaining about non-commit inputs altogether.

If callers do wish to retain this behavior, they can easily work around
this change by doing the following:

     git for-each-ref --format='%(objectname) %(objecttype) %(*objecttype)' |
     awk '
       !/commit/ { print "not-a-commit:"$1 }
        /commit/ { print $1 }
     ' |
     git commit-graph write --stdin-commits

To make it so that valid OIDs that refer to non-existent objects are
indeed an error after loosening the error handling, perform an extra
lookup to make sure that object indeed exists before sending it to the
commit-graph internals.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-18 12:51:11 -07:00
Taylor Blau 0ec2d0ff07 commit-graph.c: simplify 'fill_oids_from_commits'
In the previous handful of commits, both 'git commit-graph write
--reachable' and '--stdin-commits' learned to peel tags down to the
commits which they refer to before passing them into the commit-graph
internals.

This makes the call to 'lookup_commit_reference_gently()' inside of
'fill_oids_from_commits()' a noop, since all OIDs are commits by that
point.

As such, remove the call entirely, as well as the progress meter, which
has been split and moved out to the callers in the aforementioned
earlier commits.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-18 12:51:11 -07:00
Taylor Blau 630cd5194e commit-graph.c: peel refs in 'add_ref_to_set'
While iterating references (to discover the set of commits to write to
the commit-graph with 'git commit-graph write --reachable'),
'add_ref_to_set' can save 'fill_oids_from_commits()' some time by
peeling the references beforehand.

Move peeling out of 'fill_oids_from_commits()' and into
'add_ref_to_set()' to use 'peel_ref()' instead of 'deref_tag()'. Doing
so allows the commit-graph machinery to use the peeled value from
'$GIT_DIR/packed-refs' instead of having to load and parse tags.

While we're at it, discard non-commit objects reachable from ref tips.
This would be done automatically by 'fill_oids_from_commits()', but such
functionality will be removed in a subsequent patch after the call to
'lookup_commit_reference_gently' is dropped (at which point a non-commit
object in the commits oidset will become an error).

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-13 15:20:45 -07:00
Taylor Blau d335ce8f24 commit-graph.c: show progress of finding reachable commits
When 'git commit-graph write --reachable' is invoked, the commit-graph
machinery calls 'for_each_ref()' to discover the set of reachable
commits.

Right now the 'add_ref_to_set' callback is not doing anything other than
adding an OID to the set of known-reachable OIDs. In a subsequent
commit, 'add_ref_to_set' will presumptively peel references. This
operation should be fast for repositories with an up-to-date
'$GIT_DIR/packed-refs', but may be slow in the general case.

So that it doesn't appear that 'git commit-graph write' is idling with
'--reachable' in the slow case, add a progress meter to provide some
output in the meantime.

In general, we don't expect a progress meter to appear at all, since
peeling references with a 'packed-refs' file is quick. If it's slow and
we do show a progress meter, the subsequent 'fill_oids_from_commits()'
will be fast, since all of the calls to
'lookup_commit_reference_gently()' will be no-ops.

Both progress meters are delayed, so it is unlikely that more than one
will appear. In either case, this intermediate state will go away in a
handful of patches, at which point there will be at most one progress
meter.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-13 15:20:45 -07:00
Junio C Hamano 896833b268 Merge branch 'tb/shallow-cleanup'
Code cleanup.

* tb/shallow-cleanup:
  shallow: use struct 'shallow_lock' for additional safety
  shallow.h: document '{commit,rollback}_shallow_file'
  shallow: extract a header file for shallow-related functions
  commit: make 'commit_graft_pos' non-static
2020-05-13 12:19:18 -07:00
Junio C Hamano 95875e0356 Merge branch 'jt/commit-graph-plug-memleak'
Fix a leak noticed by fuzzer.

* jt/commit-graph-plug-memleak:
  commit-graph: avoid memory leaks
2020-05-08 14:25:05 -07:00
Junio C Hamano 1d7e9c4c4e Merge branch 'tb/commit-graph-perm-bits'
Some of the files commit-graph subsystem keeps on disk did not
correctly honor the core.sharedRepository settings and some were
left read-write.

* tb/commit-graph-perm-bits:
  commit-graph.c: make 'commit-graph-chain's read-only
  commit-graph.c: ensure graph layers respect core.sharedRepository
  commit-graph.c: write non-split graphs as read-only
  lockfile.c: introduce 'hold_lock_file_for_update_mode'
  tempfile.c: introduce 'create_tempfile_mode'
2020-05-05 14:54:28 -07:00
Taylor Blau 1fe10844ca commit-graph.c: extract 'refs_cb_data'
In subsequent patches, we are going to update a progress meter when
'add_ref_to_set()' is called, and need a convenient way to pass a
'struct progress *' in from the caller.

Introduce 'refs_cb_data' as a catch-all for parameters that
'add_ref_to_set' may need, and wrap the existing single parameter in
that struct.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-04 23:20:24 -07:00
Jonathan Tan fbda77c6c0 commit-graph: avoid memory leaks
A fuzzer running on the entry point provided by fuzz-commit-graph.c
revealed a memory leak when parse_commit_graph() creates a struct
bloom_filter_settings and then returns early due to error. Fix that
error by always freeing that struct first (if it exists) before
returning early due to error.

While making that change, I also noticed another possible memory leak -
when the BLOOMDATA chunk is provided but not BLOOMINDEXES. Also fix that
error.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-04 14:08:38 -07:00
Junio C Hamano 6d56d4c7dc Merge branch 'ds/blame-on-bloom'
"git blame" learns to take advantage of the "changed-paths" Bloom
filter stored in the commit-graph file.

* ds/blame-on-bloom:
  test-bloom: check that we have expected arguments
  test-bloom: fix some whitespace issues
  blame: drop unused parameter from maybe_changed_path
  blame: use changed-path Bloom filters
  tests: write commit-graph with Bloom filters
  revision: complicated pathspecs disable filters
2020-05-01 13:39:54 -07:00
Junio C Hamano 9b6606f43d Merge branch 'gs/commit-graph-path-filter'
Introduce an extension to the commit-graph to make it efficient to
check for the paths that were modified at each commit using Bloom
filters.

* gs/commit-graph-path-filter:
  bloom: ignore renames when computing changed paths
  commit-graph: add GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS test flag
  t4216: add end to end tests for git log with Bloom filters
  revision.c: add trace2 stats around Bloom filter usage
  revision.c: use Bloom filters to speed up path based revision walks
  commit-graph: add --changed-paths option to write subcommand
  commit-graph: reuse existing Bloom filters during write
  commit-graph: write Bloom filters to commit graph file
  commit-graph: examine commits by generation number
  commit-graph: examine changed-path objects in pack order
  commit-graph: compute Bloom filters for changed paths
  diff: halt tree-diff early after max_changes
  bloom.c: core Bloom filter implementation for changed paths.
  bloom.c: introduce core Bloom filter constructs
  bloom.c: add the murmur3 hash implementation
  commit-graph: define and use MAX_NUM_CHUNKS
2020-05-01 13:39:53 -07:00
Junio C Hamano cf054f817a Merge branch 'tb/commit-graph-fd-exhaustion-fix'
The commit-graph code exhausted file descriptors easily when it
does not have to.

* tb/commit-graph-fd-exhaustion-fix:
  commit-graph: close descriptors after mmap
  commit-graph.c: gracefully handle file descriptor exhaustion
  t/test-lib.sh: make ULIMIT_FILE_DESCRIPTORS available to tests
  commit-graph.c: don't use discarded graph_name in error
2020-05-01 13:39:53 -07:00
Junio C Hamano 6a1c17d05b Merge branch 'tb/commit-graph-split-strategy'
"git commit-graph write" learned different ways to write out split
files.

* tb/commit-graph-split-strategy:
  Revert "commit-graph.c: introduce '--[no-]check-oids'"
  commit-graph.c: introduce '--[no-]check-oids'
  commit-graph.h: replace 'commit_hex' with 'commits'
  oidset: introduce 'oidset_size'
  builtin/commit-graph.c: introduce split strategy 'replace'
  builtin/commit-graph.c: introduce split strategy 'no-merge'
  builtin/commit-graph.c: support for '--split[=<strategy>]'
  t/helper/test-read-graph.c: support commit-graph chains
2020-05-01 13:39:52 -07:00
Taylor Blau 120ad2b0f1 shallow: extract a header file for shallow-related functions
There are many functions in commit.h that are more related to shallow
repositories than they are to any sort of generic commit machinery.
Likely this began when there were only a few shallow-related functions,
and commit.h seemed a reasonable enough place to put them.

But, now there are a good number of shallow-related functions, and
placing them all in 'commit.h' doesn't make sense.

This patch extracts a 'shallow.h', which takes all of the declarations
from 'commit.h' for functions which already exist in 'shallow.c'. We
will bring the remaining shallow-related functions defined in 'commit.c'
in a subsequent patch.

For now, move only the ones that already are implemented in 'shallow.c',
and update the necessary includes.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-30 14:19:13 -07:00
Junio C Hamano dbd5e0a186 Revert "commit-graph.c: introduce '--[no-]check-oids'"
This reverts commit 7a9ce0269b,
which has not yet gained consensus.
2020-04-29 12:44:40 -07:00
Taylor Blau 45a4365cb6 commit-graph.c: make 'commit-graph-chain's read-only
In a previous commit, we made incremental graph layers read-only by
using 'git_mkstemp_mode' with permissions '0444'.

There is no reason that 'commit-graph-chain's should be modifiable by
the user, since they are generated at a temporary location and then
atomically renamed into place.

To ensure that these files are read-only, too, use
'hold_lock_file_for_update_mode' with the same read-only permission
bits, and let the umask and 'adjust_shared_perm' take care of the rest.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-29 12:35:30 -07:00
Taylor Blau f4d62847a4 commit-graph.c: ensure graph layers respect core.sharedRepository
Non-layered commit-graphs use 'adjust_shared_perm' to make the
commit-graph file readable (or not) to a combination of the user, group,
and others.

Call 'adjust_shared_perm' for split-graph layers to make sure that these
also respect 'core.sharedRepository'. The 'commit-graph-chain' file
already respects this configuration since it uses
'hold_lock_file_for_update' (which calls 'adjust_shared_perm' eventually
in 'create_tempfile_mode').

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-29 12:35:30 -07:00
Taylor Blau 1f9becaedc commit-graph.c: write non-split graphs as read-only
In the previous commit, Git learned 'hold_lock_file_for_update_mode' to
allow the caller to specify the permission bits (prior to further
adjustment by the umask and shared repository permissions) used when
acquiring a temporary file.

Use this in the commit-graph machinery for writing a non-split graph to
acquire an opened temporary file with permissions read-only permissions
to match the split behavior. (In the split case, Git uses
git_mkstemp_mode' for each of the commit-graph layers with permission
bits '0444').

One can notice this discrepancy when moving a non-split graph to be part
of a new chain. This causes a commit-graph chain where all layers have
read-only permission bits, except for the base layer, which is writable
for the current user.

Resolve this discrepancy by using the new
'hold_lock_file_for_update_mode' and passing the desired permission
bits.

Doing so causes some test fallout in t5318 and t6600. In t5318, this
occurs in tests that corrupt a commit-graph file by writing into it. For
these, 'chmod u+w'-ing the file beforehand resolves the issue. The
additional spot in 'corrupt_graph_verify' is necessary because of the
extra 'git commit-graph write' beforehand (which *does* rewrite the
commit-graph file). In t6600, this is caused by copying a read-only
commit-graph file into place and then trying to replace it. For these,
make these files writable.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-29 12:35:30 -07:00
Junio C Hamano 25b336421f Merge branch 'ds/commit-graph-expiry-fix'
"git commit-graph write --expire-time=<timestamp>" did not use the
given timestamp correctly, which has been corrected.

* ds/commit-graph-expiry-fix:
  commit-graph: fix buggy --expire-time option
2020-04-28 15:50:02 -07:00
Jeff King c8828530b7 commit-graph: close descriptors after mmap
We don't ever refer to the descriptor after mmap-ing it. And keeping it
open means we can run out of descriptors in degenerate cases (e.g.,
thousands of split chain files). Let's close it as soon as possible.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-24 22:25:50 -07:00
Taylor Blau b78a556a6a commit-graph.c: gracefully handle file descriptor exhaustion
When writing a layered commit-graph, the commit-graph machinery uses
'commit_graph_filenames_after' and 'commit_graph_hash_after' to keep
track of the layers in the chain that we are in the process of writing.

When the number of commit-graph layers shrinks, we initialize all
entries in the aforementioned arrays, because we know the structure of
the new commit-graph chain immediately (since there are no new layers,
there are no unknown hash values).

But when the number of commit-graph layers grows (i.e., that
'num_commit_graphs_after > num_commit_graphs_before'), then we leave
some entries in the filenames and hashes arrays as uninitialized,
because we will fill them in later as those values become available.

For instance, we rely on 'write_commit_graph_file's to store the
filename and hash of the last layer in the new chain, which is the one
that it is responsible for writing. But, it's possible that
'write_commit_graph_file' may fail, e.g., from file descriptor
exhaustion. In this case it is possible that 'git_mkstemp_mode' will
fail, and that function will return early *before* setting the values
for the last commit-graph layer's filename and hash.

This causes a number of upleasant side-effects. For instance, trying to
'free()' each entry in 'ctx->commit_graph_filenames_after' (and
similarly for the hashes array) causes us to 'free()' uninitialized
memory, since the area is allocated with 'malloc()' and is therefore
subject to contain garbage (which is left alone when
'write_commit_graph_file' returns early).

This can manifest in other issues, like a general protection fault,
and/or leaving a stray 'commit-graph-chain.lock' around after the
process dies. (The reasoning for this is still a mystery to me, since
we'd otherwise usually expect the kernel to run tempfile.c's 'atexit()'
handlers in the case of a normal death...)

To resolve this, initialize the memory with 'CALLOC_ARRAY' so that
uninitialized entries are filled with zeros, and can thus be 'free()'d
as a noop instead of causing a fault.

Helped-by: Jeff King <peff@peff.net>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-23 14:58:52 -07:00
Taylor Blau a2d57e2280 commit-graph.c: don't use discarded graph_name in error
When writing a commit-graph layer, we do so in a temporary file which is
renamed into place. If we fail to create a temporary file, for e.g.,
because we have too many open files, then 'git_mkstemp_mode' sets the
pattern to the empty string, in which case we get an error something
along the lines of:

  error: unable to create ''

It's not useful to show the pattern here at all, since we (1) know the
pattern is well-formed, and (2) would have already shown the dirname
when trying to create the leading directories. So, replace this error
with something friendlier.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-23 14:58:52 -07:00
Derrick Stolee b23ea9790d tests: write commit-graph with Bloom filters
The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
graph file whenever "git commit" is run, ensuring that we always
have an updated commit-graph throughout the test suite. The
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
introduced to write the changed-path Bloom filters whenever "git
commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
trick doesn't launch a separate process and instead writes it
directly.

To expand the number of tests that have commits in the commit-graph
file, add a helper method that computes the commit-graph and place
that helper inside "git commit" and "git merge".

In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
to ensure we are writing changed-path Bloom filters whenever
possible.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-16 15:38:04 -07:00
Taylor Blau 7a9ce0269b commit-graph.c: introduce '--[no-]check-oids'
When operating on a stream of commit OIDs on stdin, 'git commit-graph
write' checks that each OID refers to an object that is indeed a commit.
This is convenient to make sure that the given input is well-formed, but
can sometimes be undesirable.

For example, server operators may wish to feed the refnames that were
updated during a push to 'git commit-graph write --input=stdin-commits',
and silently discard refs that don't point at commits. This can be done
by combing the output of 'git for-each-ref' with '--format
%(*objecttype)', but this requires opening up a potentially large number
of objects.  Instead, it is more convenient to feed the updated refs to
the commit-graph machinery, and let it throw out refs that don't point
to commits.

Introduce '--[no-]check-oids' to make such a behavior possible. With
'--check-oids' (the default behavior to retain backwards compatibility),
'git commit-graph write' will barf on a non-commit line in its input.
With 'no-check-oids', such lines will be silently ignored, making the
above possible by specifying this option.

No matter which is supplied, 'git commit-graph write' retains the
behavior from the previous commit of rejecting non-OID inputs like
"HEAD" and "refs/heads/foo" as before.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15 09:20:34 -07:00
Taylor Blau 6830c36077 commit-graph.h: replace 'commit_hex' with 'commits'
The 'write_commit_graph()' function takes in either a string list of
pack indices, or a string list of hexadecimal commit OIDs. These
correspond to the '--stdin-packs' and '--stdin-commits' mode(s) from
'git commit-graph write'.

Using a string_list of hexadecimal commit IDs is not the most efficient
use of memory, since we can instead use the 'struct oidset', which is
more well-suited for this case.

This has another benefit which will become apparent in the following
commit. This is that we are about to disambiguate the kinds of errors we
produce with '--stdin-commits' into "non-hex input" and "hex-input, but
referring to a non-commit object". By having 'write_commit_graph' take
in a 'struct oidset *' of commits, we place the burden on the caller (in
this case, the builtin) to handle the first case, and the commit-graph
machinery can handle the second case.

Suggested-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15 09:20:30 -07:00
Taylor Blau 8a6ac287b2 builtin/commit-graph.c: introduce split strategy 'replace'
When using split commit-graphs, it is sometimes useful to completely
replace the commit-graph chain with a new base.

For example, consider a scenario in which a repository builds a new
commit-graph incremental for each push. Occasionally (say, after some
fixed number of pushes), they may wish to rebuild the commit-graph chain
with all reachable commits.

They can do so with

  $ git commit-graph write --reachable

but this removes the chain entirely and replaces it with a single
commit-graph in 'objects/info/commit-graph'. Unfortunately, this means
that the next push will have to move this commit-graph into the first
layer of a new chain, and then write its new commits on top.

Avoid such copying entirely by allowing the caller to specify that they
wish to replace the entirety of their commit-graph chain, while also
specifying that the new commit-graph should become the basis of a fresh,
length-one chain.

This addresses the above situation by making it possible for the caller
to instead write:

  $ git commit-graph write --reachable --split=replace

which writes a new length-one chain to 'objects/info/commit-graphs',
making the commit-graph incremental generated by the subsequent push
relatively cheap by avoiding the aforementioned copy.

In order to do this, remove an assumption in 'write_commit_graph_file'
that chains are always at least two incrementals long.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15 09:20:28 -07:00
Taylor Blau fdbde82fe5 builtin/commit-graph.c: introduce split strategy 'no-merge'
In the previous commit, we laid the groundwork for supporting different
splitting strategies. In this commit, we introduce the first splitting
strategy: 'no-merge'.

Passing '--split=no-merge' is useful for callers which wish to write a
new incremental commit-graph, but do not want to spend effort condensing
the incremental chain [1]. Previously, this was possible by passing
'--size-multiple=0', but this no longer the case following 63020f175f
(commit-graph: prefer default size_mult when given zero, 2020-01-02).

When '--split=no-merge' is given, the commit-graph machinery will never
condense an existing chain, and it will always write a new incremental.

[1]: This might occur when, for example, a server administrator running
some program after each push may want to ensure that each job runs
proportional in time to the size of the push, and does not "jump" when
the commit-graph machinery decides to trigger a merge.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-15 09:20:27 -07:00
Garima Singh 1217c03e7b commit-graph: reuse existing Bloom filters during write
Add logic to
a) parse Bloom filter information from the commit graph file and,
b) re-use existing Bloom filters.

See Documentation/technical/commit-graph-format for the format in which
the Bloom filter information is written to the commit graph file.

To read Bloom filter for a given commit with lexicographic position
'i' we need to:
1. Read BIDX[i] which essentially gives us the starting index in BDAT for
   filter of commit i+1. It is essentially the index past the end
   of the filter of commit i. It is called end_index in the code.

2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
   for filter of commit i. It is called the start_index in the code.
   For the first commit, where i = 0, Bloom filter data starts at the
   beginning, just past the header in the BDAT chunk. Hence, start_index
   will be 0.

3. The length of the filter will be end_index - start_index, because
   BIDX[i] gives the cumulative 8-byte words including the ith
   commit's filter.

We toggle whether Bloom filters should be recomputed based on the
compute_if_not_present flag.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-06 11:08:37 -07:00
Garima Singh 76ffbca71a commit-graph: write Bloom filters to commit graph file
Update the technical documentation for commit-graph-format with
the formats for the Bloom filter index (BIDX) and Bloom filter
data (BDAT) chunks. Write the computed Bloom filters information
to the commit graph file using this format.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-06 11:08:37 -07:00
Derrick Stolee b09b785c78 commit-graph: fix buggy --expire-time option
The commit-graph builtin has an --expire-time option that takes a
datetime using OPT_EXPIRY_DATE(). However, the implementation inside
expire_commit_graphs() was treating a non-zero value as a number of
seconds to subtract from "now".

Update t5323-split-commit-graph.sh to demonstrate the correct value
of the --expire-time option by actually creating a crud .graph file
with mtime earlier than the expire time. Instead of using a super-
early time (1980) we use an explicit, and recent, time. Using
test-tool chmtime to create two files on either end of an exact
second, we create a test that catches this failure no matter the
current time. Using a fixed date is more portable than trying to
format a relative date string into the --expiry-date input.

I noticed this when inspecting some Scalar repos that had an excess
number of commit-graph files. In Scalar, we were using this second
interpretation by using "--expire-time=3600" to mean "delete graphs
older than one hour ago" to avoid deleting a commit-graph that a
foreground process may be trying to load.

Also I noticed that the help text was copied from the --max-commits
option. Fix that help text.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-01 14:36:26 -07:00
Garima Singh 3d11275505 commit-graph: examine commits by generation number
When running 'git commit-graph write --changed-paths', we sort the
commits by pack-order to save time when computing the changed-paths
bloom filters. This does not help when finding the commits via the
'--reachable' flag.

If not using pack-order, then sort by generation number before
examining the diff. Commits with similar generation are more likely
to have many trees in common, making the diff faster.

On the Linux kernel repository, this change reduced the computation
time for 'git commit-graph write --reachable --changed-paths' from
3m00s to 1m37s.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30 09:59:53 -07:00
Jeff King d21ee7d111 commit-graph: examine changed-path objects in pack order
Looking at the diff of commit objects in pack order is much faster than
in sha1 order, as it gives locality to the access of tree deltas
(whereas sha1 order is effectively random). Unfortunately the
commit-graph code sorts the commits (several times, sometimes as an oid
and sometimes a pointer-to-commit), and we ultimately traverse in sha1
order.

Instead, let's remember the position at which we see each commit, and
traverse in that order when looking at bloom filters. This drops my time
for "git commit-graph write --changed-paths" in linux.git from ~4
minutes to ~1.5 minutes.

Probably the "--reachable" code path would want something similar.

Or alternatively, we could use a different data structure (either a
hash, or maybe even just a bit in "struct commit") to keep track of
which oids we've seen, etc instead of sorting. And then we could keep
the original order.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30 09:59:53 -07:00
Garima Singh f97b9325f6 commit-graph: compute Bloom filters for changed paths
Add new COMMIT_GRAPH_WRITE_CHANGED_PATHS flag that makes Git compute
Bloom filters for the paths that changed between a commit and it's
first parent, for each commit in the commit-graph.  This computation
is done on a commit-by-commit basis.

We will write these Bloom filters to the commit-graph file, to store
this data on disk, in the next change in this series.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30 09:59:53 -07:00
Garima Singh 3be7efcafc commit-graph: define and use MAX_NUM_CHUNKS
This is a minor cleanup to make it easier to change
the number of chunks being written to the commit
graph.

Reviewed-by: Jakub Narębski <jnareb@gmail.com>
Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-30 09:59:52 -07:00
Junio C Hamano 5da7329e29 Merge branch 'rs/commit-graph-code-simplification'
Code simplfication.

* rs/commit-graph-code-simplification:
  commit-graph: use progress title directly
2020-03-05 10:43:04 -08:00
René Scharfe d68ce906c7 commit-graph: use progress title directly
merge_commit_graphs() copies the (translated) progress message into a
strbuf and passes the copy to start_delayed_progress() at each loop
iteration.  The latter function takes a string pointer, so let's avoid
the detour and hand the string to it directly.  That's shorter, simpler
and slightly more efficient.

Signed-off-by: René Scharfe <l.s.r@web.de>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-27 09:36:22 -08:00
Taylor Blau a7df60cac8 commit-graph.h: use odb in 'load_commit_graph_one_fd_st'
Apply a similar treatment as in the previous patch to pass a 'struct
object_directory *' through the 'load_commit_graph_one_fd_st'
initializer, too.

This prevents a potential bug where a pointer comparison is made to a
NULL 'g->odb', which would cause the commit-graph machinery to think
that a pair of commit-graphs belonged to different alternates when in
fact they do not (i.e., in the case of no '--object-dir').

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 11:36:51 -08:00
Taylor Blau ad2dd5bb63 commit-graph.c: remove path normalization, comparison
As of the previous patch, all calls to 'commit-graph.c' functions which
perform path normalization (for e.g., 'get_commit_graph_filename()') are
of the form 'ctx->odb->path', which is always in normalized form.

Now that there are no callers passing non-normalized paths to these
functions, ensure that future callers are bound by the same restrictions
by making these functions take a 'struct object_directory *' instead of
a 'const char *'. To match, replace all calls with arguments of the form
'ctx->odb->path' with 'ctx->odb' To recover the path, functions that
perform path manipulation simply use 'odb->path'.

Further, avoid string comparisons with arguments of the form
'odb->path', and instead prefer raw pointer comparisons, which
accomplish the same effect, but are far less brittle.

This has a pleasant side-effect of making these functions much more
robust to paths that cannot be normalized by 'normalize_path_copy()',
i.e., because they are outside of the current working directory.

For example, prior to this patch, Valgrind reports that the following
uninitialized memory read [1]:

  $ ( cd t && GIT_DIR=../.git valgrind git rev-parse HEAD^ )

because 'normalize_path_copy()' can't normalize '../.git' (since it's
relative to but above of the current working directory) [2].

By using a 'struct object_directory *' directly,
'get_commit_graph_filename()' does not need to normalize, because all
paths are relative to the current working directory since they are
always read from the '->path' of an object directory.

[1]: https://lore.kernel.org/git/20191027042116.GA5801@sigill.intra.peff.net.
[2]: The bug here is that 'get_commit_graph_filename()' returns the
     result of 'normalize_path_copy()' without checking the return
     value.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 11:36:51 -08:00
Taylor Blau 13c2499249 commit-graph.h: store object directory in 'struct commit_graph'
In a previous patch, the 'char *object_dir' in 'struct commit_graph' was
replaced with a 'struct object_directory'. This patch applies the same
treatment to 'struct commit_graph', which is another intermediate step
towards getting rid of all path normalization in 'commit-graph.c'.

Instead of taking a 'char *object_dir', functions that construct a
'struct commit_graph' now take a 'struct object_directory *'. Any code
that needs an object directory path use '->path' instead.

This ensures that all calls to functions that perform path normalization
are given arguments which do not themselves require normalization. This
prepares those functions to drop their normalization entirely, which
will occur in the subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 11:36:51 -08:00
Taylor Blau 0bd52e27e3 commit-graph.h: store an odb in 'struct write_commit_graph_context'
There are lots of places in 'commit-graph.h' where a function either has
(or almost has) a full 'struct object_directory *', accesses '->path',
and then throws away the rest of the struct.

This can cause headaches when comparing the locations of object
directories across alternates (e.g., in the case of deciding if two
commit-graph layers can be merged). These paths are normalized with
'normalize_path_copy()' which mitigates some comparison issues, but not
all [1].

Replace usage of 'char *object_dir' with 'odb->path' by storing a
'struct object_directory *' in the 'write_commit_graph_context'
structure. This is an intermediate step towards getting rid of all path
normalization in 'commit-graph.c'.

Resolving a user-provided '--object-dir' argument now requires that we
compare it to the known alternates for equality.  Prior to this patch,
an unknown '--object-dir' argument would silently exit with status zero.

This can clearly lead to unintended behavior, such as verifying
commit-graphs that aren't in a repository's own object store (or one of
its alternates), or causing a typo to mask a legitimate commit-graph
verification failure. Make this error non-silent by 'die()'-ing when the
given '--object-dir' does not match any known alternate object store.

[1]: In my testing, for example, I can get one side of the commit-graph
code to fill object_dir with "./objects" and the other with just
"objects".

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-04 11:36:37 -08:00
Junio C Hamano 037f067587 Merge branch 'ds/commit-graph-set-size-mult'
The code to write split commit-graph file(s) upon fetching computed
bogus value for the parameter used in splitting the resulting
files, which has been corrected.

* ds/commit-graph-set-size-mult:
  commit-graph: prefer default size_mult when given zero
2020-01-06 14:17:51 -08:00
Derrick Stolee 63020f175f commit-graph: prefer default size_mult when given zero
In 50f26bd ("fetch: add fetch.writeCommitGraph config setting",
2019-09-02), the fetch builtin added the capability to write a
commit-graph using the "--split" feature. This feature creates
multiple commit-graph files, and those can merge based on a set
of "split options" including a size multiple. The default size
multiple is 2, which intends to provide a log_2 N depth of the
commit-graph chain where N is the number of commits.

However, I noticed during dogfooding that my commit-graph chains
were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the
size_mult variable to 2 except we override it with the context's
split_opts if they exist. In builtin/fetch.c, we create such a
split_opts, but do not populate it with values.

This problem is due to two failures:

 1. It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT
    with a NULL split_opts.
 2. If we have a non-NULL split_opts, then we override the default
    values even if a zero value is given.

Correct both of these issues. First, do not override size_mult when
the options provide a zero value. Second, stop creating a split_opts
in the fetch builtin.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-02 13:46:34 -08:00
Junio C Hamano 41dac79c2f Merge branch 'ds/commit-graph-delay-gen-progress'
One kind of progress messages were always given during commit-graph
generation, instead of following the "if it takes more than two
seconds, show progress" pattern, which has been corrected.

* ds/commit-graph-delay-gen-progress:
  commit-graph: use start_delayed_progress()
  progress: create GIT_PROGRESS_DELAY
2019-12-10 13:11:43 -08:00
Junio C Hamano d3096d2ba6 Merge branch 'en/doc-typofix'
Docfix.

* en/doc-typofix:
  Fix spelling errors in no-longer-updated-from-upstream modules
  multimail: fix a few simple spelling errors
  sha1dc: fix trivial comment spelling error
  Fix spelling errors in test commands
  Fix spelling errors in messages shown to users
  Fix spelling errors in names of tests
  Fix spelling errors in comments of testcases
  Fix spelling errors in code comments
  Fix spelling errors in documentation outside of Documentation/
  Documentation: fix a bunch of typos, both old and new
2019-12-01 09:04:35 -08:00
Junio C Hamano 0e07c1cd83 Merge branch 'jk/cleanup-object-parsing-and-fsck'
Crufty code and logic accumulated over time around the object
parsing and low-level object access used in "git fsck" have been
cleaned up.

* jk/cleanup-object-parsing-and-fsck: (23 commits)
  fsck: accept an oid instead of a "struct tree" for fsck_tree()
  fsck: accept an oid instead of a "struct commit" for fsck_commit()
  fsck: accept an oid instead of a "struct tag" for fsck_tag()
  fsck: rename vague "oid" local variables
  fsck: don't require an object struct in verify_headers()
  fsck: don't require an object struct for fsck_ident()
  fsck: drop blob struct from fsck_finish()
  fsck: accept an oid instead of a "struct blob" for fsck_blob()
  fsck: don't require an object struct for report()
  fsck: only require an oid for skiplist functions
  fsck: only provide oid/type in fsck_error callback
  fsck: don't require object structs for display functions
  fsck: use oids rather than objects for object_name API
  fsck_describe_object(): build on our get_object_name() primitive
  fsck: unify object-name code
  fsck: require an actual buffer for non-blobs
  fsck: stop checking tag->tagged
  fsck: stop checking commit->parent counts
  fsck: stop checking commit->tree value
  commit, tag: don't set parsed bit for parse failures
  ...
2019-12-01 09:04:28 -08:00
Derrick Stolee ecc0869080 commit-graph: use start_delayed_progress()
When writing a commit-graph, we show progress along several commit
walks. When we use start_delayed_progress(), the progress line will
only appear if that step takes a decent amount of time.

However, one place was missed: computing generation numbers. This is
normally a very fast operation as all commits have been parsed in a
previous step. But, this is showing up for all users no matter how few
commits are being added.

The tests that check for the progress output have already been updated
to use GIT_PROGRESS_DELAY=0 to force the expected output.

Helped-by: Jeff King <peff@peff.net>
Reported-by: ryenus <ryenus@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-27 10:57:10 +09:00
Elijah Newren 15beaaa3d1 Fix spelling errors in code comments
Reported-by: Jens Schleusener <Jens.Schleusener@fossies.org>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-10 16:00:54 +09:00
Junio C Hamano dac1d83c91 Merge branch 'ds/commit-graph-on-fetch'
Regression fix.

* ds/commit-graph-on-fetch:
  commit-graph: fix writing first commit-graph during fetch
  t5510-fetch.sh: demonstrate fetch.writeCommitGraph bug
2019-11-04 13:33:06 +09:00
Jeff King 228c78fbd4 commit, tag: don't set parsed bit for parse failures
If we can't parse a commit, then parse_commit() will return an error
code. But it _also_ sets the "parsed" flag, which tells us not to bother
trying to re-parse the object. That means that subsequent parses have no
idea that the information in the struct may be bogus.  I.e., doing this:

  parse_commit(commit);
  ...
  if (parse_commit(commit) < 0)
          die("commit is broken");

will never trigger the die(). The second parse_commit() will see the
"parsed" flag and quietly return success.

There are two obvious ways to fix this:

  1. Stop setting "parsed" until we've successfully parsed.

  2. Keep a second "corrupt" flag to indicate that we saw an error (and
     when the parsed flag is set, return 0/-1 depending on the corrupt
     flag).

This patch does option 1. The obvious downside versus option 2 is that
we might continually re-parse a broken object. But in practice,
corruption like this is rare, and we typically die() or return an error
in the caller. So it's OK not to worry about optimizing for corruption.
And it's much simpler: we don't need to use an extra bit in the object
struct, and callers which check the "parsed" flag don't need to learn
about the corrupt bit, too.

There's no new test here, because this case is already covered in t5318.
Note that we do need to update the expected message there, because we
now detect the problem in the return from "parse_commit()", and not with
a separate check for a NULL tree. In fact, we can now ditch that
explicit tree check entirely, as we're covered robustly by this change
(and the previous recent change to treat a NULL tree as a parse error).

We'll also give tags the same treatment. I don't know offhand of any
cases where the problem can be triggered (it implies somebody ignoring a
parse error earlier in the process), but consistently returning an error
should cause the least surprise.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-28 14:04:49 +09:00
Derrick Stolee cb99a34e23 commit-graph: fix writing first commit-graph during fetch
The previous commit includes a failing test for an issue around
fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we
fix that bug and set the test to "test_expect_success".

The problem arises with this set of commands when the remote repo at
<url> has a submodule. Note that --recurse-submodules is not needed to
demonstrate the bug.

	$ git clone <url> test
	$ cd test
	$ git -c fetch.writeCommitGraph=true fetch origin
	Computing commit graph generation numbers: 100% (12/12), done.
	BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
	Aborted (core dumped)

As an initial fix, I converted the code in builtin/fetch.c that calls
write_commit_graph_reachable() to instead launch a "git commit-graph
write --reachable --split" process. That code worked, but is not how we
want the feature to work long-term.

That test did demonstrate that the issue must be something to do with
internal state of the 'git fetch' process.

The write_commit_graph() method in commit-graph.c ensures the commits we
plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING
flag to mark which commits have already been visited. This allows the
walk to take O(N) time, where N is the number of commits, instead of
O(P) time, where P is the number of paths. (The number of paths can be
exponential in the number of commits.)

However, the UNINTERESTING flag is used in lots of places in the
codebase. This flag usually means some barrier to stop a commit walk,
such as in revision-walking to compare histories. It is not often
cleared after the walk completes because the starting points of those
walks do not have the UNINTERESTING flag, and clear_commit_marks() would
stop immediately.

This is happening during a 'git fetch' call with a remote. The fetch
negotiation is comparing the remote refs with the local refs and marking
some commits as UNINTERESTING.

I tested running clear_commit_marks_many() to clear the UNINTERESTING
flag inside close_reachable(), but the tips did not have the flag, so
that did nothing.

It turns out that the calculate_changed_submodule_paths() method is at
fault. Thanks, Peff, for pointing out this detail! More specifically,
for each submodule, the collect_changed_submodules() runs a revision
walk to essentially do file-history on the list of submodules. That
revision walk marks commits UNININTERESTING if they are simplified away
by not changing the submodule.

Instead, I finally arrived on the conclusion that I should use a flag
that is not used in any other part of the code. In commit-reach.c, a
number of flags were defined for commit walk algorithms. The REACHABLE
flag seemed like it made the most sense, and it seems it was not
actually used in the file. The REACHABLE flag was used in early versions
of commit-reach.c, but was removed by 4fbcca4 (commit-reach: make
can_all_from_reach... linear, 2018-07-20).

Add the REACHABLE flag to commit-graph.c and use it instead of
UNINTERESTING in close_reachable(). This fixes the bug in manual
testing.

Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Szeder Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-25 11:19:16 +09:00
Junio C Hamano 6e12570822 Merge branch 'ah/cleanups'
Miscellaneous code clean-ups.

* ah/cleanups:
  git_mkstemps_mode(): replace magic numbers with computed value
  wrapper: use a loop instead of repetitive statements
  diffcore-break: use a goto instead of a redundant if statement
  commit-graph: remove a duplicate assignment
2019-10-09 14:01:00 +09:00
Junio C Hamano 80693e3f09 Merge branch 'tb/commit-graph-harden'
The code to parse and use the commit-graph file has been made more
robust against corrupted input.

* tb/commit-graph-harden:
  commit-graph.c: handle corrupt/missing trees
  commit-graph.c: handle commit parsing errors
  t/t5318: introduce failing 'git commit-graph write' tests
2019-10-07 11:32:58 +09:00
Junio C Hamano caf150ce7d Merge branch 'gs/commit-graph-progress'
* gs/commit-graph-progress:
  commit-graph: add --[no-]progress to write and verify
2019-10-07 11:32:57 +09:00
Junio C Hamano 1398171378 Merge branch 'rs/commit-graph-use-list-count'
Code cleanup.

* rs/commit-graph-use-list-count:
  commit-graph: use commit_list_count()
2019-10-07 11:32:57 +09:00
Junio C Hamano 098e8c6716 Merge branch 'jk/disable-commit-graph-during-upload-pack'
The "upload-pack" (the counterpart of "git fetch") needs to disable
commit-graph when responding to a shallow clone/fetch request, but
the way this was done made Git panic, which has been corrected.

* jk/disable-commit-graph-during-upload-pack:
  upload-pack: disable commit graph more gently for shallow traversal
  commit-graph: bump DIE_ON_LOAD check to actual load-time
2019-10-07 11:32:55 +09:00
Junio C Hamano cda8faa37e Merge branch 'jk/commit-graph-cleanup'
A pair of small fixups to "git commit-graph" have been applied.

* jk/commit-graph-cleanup:
  commit-graph: turn off save_commit_buffer
  commit-graph: don't show progress percentages while expanding reachable commits
2019-10-07 11:32:55 +09:00
Alex Henrie 8da02ce62a commit-graph: remove a duplicate assignment
Leave the variable 'g' uninitialized before it is set just before its
first use in front of a loop, which is a much more appropriate place to
indicate what it is used for.

Also initialize the variable 'num_commits' just before the loop instead
of at the beginning of the function for the same reason.

Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Alex Henrie <alexhenrie24@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-02 15:04:16 +09:00
Garima Singh 7371612255 commit-graph: add --[no-]progress to write and verify
Add --[no-]progress to git commit-graph write and verify.
The progress feature was introduced in 7b0f229
("commit-graph write: add progress output", 2018-09-17) but
the ability to opt-out was overlooked.

Signed-off-by: Garima Singh <garima.singh@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-18 14:23:09 -07:00
René Scharfe 689a146c91 commit-graph: use commit_list_count()
Let commit_list_count() count the number of parents instead of
duplicating it.  Also store the result in an unsigned int, as that's
what the function returns, and the count is never negative.

Signed-off-by: René Scharfe <l.s.r@web.de>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-16 13:00:50 -07:00
Jeff King 6abada1880 upload-pack: disable commit graph more gently for shallow traversal
When the client has asked for certain shallow options like
"deepen-since", we do a custom rev-list walk that pretends to be
shallow. Before doing so, we have to disable the commit-graph, since it
is not compatible with the shallow view of the repository. That's
handled by 829a321569 (commit-graph: close_commit_graph before shallow
walk, 2018-08-20). That commit literally closes and frees our
repo->objects->commit_graph struct.

That creates an interesting problem for commits that have _already_ been
parsed using the commit graph. Their commit->object.parsed flag is set,
their commit->graph_pos is set, but their commit->maybe_tree may still
be NULL. When somebody later calls repo_get_commit_tree(), we see that
we haven't loaded the tree oid yet and try to get it from the commit
graph. But since it has been freed, we segfault!

So the root of the issue is a data dependency between the commit's
lazy-load of the tree oid and the fact that the commit graph can go
away mid-process. How can we resolve it?

There are a couple of general approaches:

  1. The obvious answer is to avoid loading the tree from the graph when
     we see that it's NULL. But then what do we return for the tree oid?
     If we return NULL, our caller in do_traverse() will rightly
     complain that we have no tree. We'd have to fallback to loading the
     actual commit object and re-parsing it. That requires teaching
     parse_commit_buffer() to understand re-parsing (i.e., not starting
     from a clean slate and not leaking any allocated bits like parent
     list pointers).

  2. When we close the commit graph, walk through the set of in-memory
     objects and clear any graph_pos pointers. But this means we also
     have to "unparse" any such commits so that we know they still need
     to open the commit object to fill in their trees. So it's no less
     complicated than (1), and is more expensive (since we clear objects
     we might not later need).

  3. Stop freeing the commit-graph struct. Continue to let it be used
     for lazy-loads of tree oids, but let upload-pack specify that it
     shouldn't be used for further commit parsing.

  4. Push the whole shallow rev-list out to its own sub-process, with
     the commit-graph disabled from the start, giving it a clean memory
     space to work from.

I've chosen (3) here. Options (1) and (2) would work, but are
non-trivial to implement. Option (4) is more expensive, and I'm not sure
how complicated it is (shelling out for the actual rev-list part is
easy, but we do then parse the resulting commits internally, and I'm not
clear which parts need to be handling shallow-ness).

The new test in t5500 triggers this segfault, but see the comments there
for how horribly intimate it has to be with how both upload-pack and
commit graphs work.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 12:30:08 -07:00
Jeff King fbab552a53 commit-graph: bump DIE_ON_LOAD check to actual load-time
Commit 43d3561805 (commit-graph write: don't die if the existing graph
is corrupt, 2019-03-25) added an environment variable we use only in the
test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD. But it put the check for
this variable at the very top of prepare_commit_graph(), which is called
every time we want to use the commit graph. Most importantly, it comes
_before_ we check the fast-path "did we already try to load?", meaning
we end up calling getenv() for every single use of the commit graph,
rather than just when we load.

getenv() is allowed to have unexpected side effects, but that shouldn't
be a problem here; we're lazy-loading the graph so it's clear that at
least _one_ invocation of this function is going to call it.

But it is inefficient. getenv() typically has to do a linear search
through the environment space.

We could memoize the call, but it's simpler still to just bump the check
down to the actual loading step. That's fine for our sole user in t5318,
and produces this minor real-world speedup:

  [before]
  Benchmark #1: git -C linux rev-list HEAD >/dev/null
    Time (mean ± σ):      1.460 s ±  0.017 s    [User: 1.174 s, System: 0.285 s]
    Range (min … max):    1.440 s …  1.491 s    10 runs

  [after]
  Benchmark #1: git -C linux rev-list HEAD >/dev/null
    Time (mean ± σ):      1.391 s ±  0.005 s    [User: 1.118 s, System: 0.273 s]
    Range (min … max):    1.385 s …  1.399 s    10 runs

Of course that actual speedup depends on how big your environment is. We
can game it like this:

  for i in $(seq 10000); do
    export dummy$i=$i
  done

in which case I get:

  [before]
  Benchmark #1: git -C linux rev-list HEAD >/dev/null
    Time (mean ± σ):      6.257 s ±  0.061 s    [User: 6.005 s, System: 0.250 s]
    Range (min … max):    6.174 s …  6.337 s    10 runs

  [after]
  Benchmark #1: git -C linux rev-list HEAD >/dev/null
  Time (mean ± σ):      1.403 s ±  0.005 s    [User: 1.146 s, System: 0.256 s]
  Range (min … max):    1.396 s …  1.412 s    10 runs

So this is really more about avoiding the pathological case than
providing a big real-world speedup.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-12 12:29:58 -07:00
Junio C Hamano f4f8dfe127 Merge branch 'ds/feature-macros'
A mechanism to affect the default setting for a (related) group of
configuration variables is introduced.

* ds/feature-macros:
  repo-settings: create feature.experimental setting
  repo-settings: create feature.manyFiles setting
  repo-settings: parse core.untrackedCache
  commit-graph: turn on commit-graph by default
  t6501: use 'git gc' in quiet mode
  repo-settings: consolidate some config settings
2019-09-09 12:26:36 -07:00
SZEDER Gábor 67fa6aac5a commit-graph: don't show progress percentages while expanding reachable commits
Commit 49bbc57a57 (commit-graph write: emit a percentage for all
progress, 2019-01-19) was a bit overeager when it added progress
percentages to the "Expanding reachable commits in commit graph" phase
as well, because most of the time the number of commits that phase has
to iterate over is not known in advance and grows significantly, and,
consequently, we end up with nonsensical numbers:

  $ git commit-graph write --reachable
  Expanding reachable commits in commit graph: 138606% (824706/595), done.
  [...]

  $ git rev-parse v5.0 | git commit-graph write --stdin-commits
  Expanding reachable commits in commit graph: 81264400% (812644/1), done.
  [...]

Even worse, because the percentage grows so quickly, the progress code
outputs much more often than it should (because it ticks every second,
or every 1%), slowing the whole process down. My time for "git
commit-graph write --reachable" on linux.git went from 13.463s to
12.521s with this patch, ~7% savings.

Therefore, don't show progress percentages in the "Expanding reachable
commits in commit graph" phase.

Note that the current code does sometimes do the right thing, if we
picked up all commits initially (e.g., omitting "--reachable" in a
fully-packed repository would get the correct count without any parent
traversal). So it may be possible to come up with a way to tell when we
could use a percentage here. But in the meantime, let's make sure we
robustly avoid printing nonsense.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-09 10:56:43 -07:00
Taylor Blau 806278dead commit-graph.c: handle corrupt/missing trees
Apply similar treatment as in the previous commit to handle an unchecked
call to 'get_commit_tree_oid()'. Previously, a NULL return value from
this function would be immediately dereferenced with '->hash', and then
cause a segfault.

Before dereferencing to access the 'hash' member, check the return value
of 'get_commit_tree_oid()' to make sure that it is not NULL.

To make this check correct, a related change is also needed in
'commit.c', which is to check the return value of 'get_commit_tree'
before taking its address. If 'get_commit_tree' returns NULL, we
encounter an undefined behavior when taking the address of the return
value of 'get_commit_tree' and then taking '->object.oid'. (On my system,
this is memory address 0x8, which is obviously wrong).

Fix this by making sure that 'get_commit_tree' returns something
non-NULL before digging through a structure that is not there, thus
preventing a segfault down the line in the commit graph code.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-09 10:55:59 -07:00
Taylor Blau 16749b8dd2 commit-graph.c: handle commit parsing errors
To write a commit graph chunk, 'write_graph_chunk_data()' takes a list
of commits to write and parses each one before writing the necessary
data, and continuing on to the next commit in the list.

Since the majority of these commits are not parsed ahead of time (an
exception is made for the *last* commit in the list, which is parsed
early within 'copy_oids_to_commits'), it is possible that calling
'parse_commit_no_graph()' on them may return an error. Failing to catch
these errors before de-referencing later calls can result in a undefined
memory access and a SIGSEGV.

One such example of this is 'get_commit_tree_oid()', which expects a
parsed object as its input (in this case, the commit-graph code passes
'*list'). If '*list' causes a parse error, the subsequent call will
fail.

Prevent such an issue by checking the return value of
'parse_commit_no_graph()' to avoid passing an unparsed object to a
function which expects a parsed object, thus preventing a segfault.

It is worth noting that this fix is really skirting around the issue in
object.c's 'parse_object()', which makes it difficult to tell how
corrupt an object is without digging into it. Presumably one could
change the meaning of 'parse_object' returns, but this would require
adjusting each callsite accordingly. Instead of that, add an additional
check to the object parsed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-09 10:55:58 -07:00
Junio C Hamano 6ba06b582b Merge branch 'sg/commit-graph-validate'
The code to write commit-graph over given commit object names has
been made a bit more robust.

* sg/commit-graph-validate:
  commit-graph: error out on invalid commit oids in 'write --stdin-commits'
  commit-graph: turn a group of write-related macro flags into an enum
  t5318-commit-graph: use 'test_expect_code'
2019-08-22 12:34:11 -07:00
Derrick Stolee 7211b9e753 repo-settings: consolidate some config settings
There are a few important config settings that are not loaded
during git_default_config. These are instead loaded on-demand.

Centralize these config options to a single scan, and store
all of the values in a repo_settings struct. The values for
each setting are initialized as negative to indicate "unset".

This centralization will be particularly important in a later
change to introduce "meta" config settings that change the
defaults for these config settings.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-13 13:33:54 -07:00
Junio C Hamano 203cf46fac Merge branch 'ds/commit-graph-incremental'
Leakfix.

* ds/commit-graph-incremental:
  commit-graph: release strbufs after use
2019-08-09 10:13:13 -07:00
René Scharfe 0aa6bce736 commit-graph: release strbufs after use
Signed-off-by: René Scharfe <l.s.r@web.de>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-07 12:24:01 -07:00
Derrick Stolee a35bea40b6 commit-graph: fix bug around octopus merges
In 1771be90 "commit-graph: merge commit-graph chains" (2019-06-18),
the method sort_and_scan_merged_commits() was added to merge the
commit lists of two commit-graph files in the incremental format.
Unfortunately, there was an off-by-one error in that method around
incrementing num_extra_edges, which leads to an incorrect offset
for the base graph chunk.

When we store an octopus merge in the commit-graph file, we store
the first parent in the normal place, but use the second parent
position to point into the "extra edges" chunk where the remaining
parents exist. This means we should be adding "num_parents - 1"
edges to this list, not "num_parents - 2". That is the basic error.

The reason this was not caught in the test suite is more subtle.
In 5324-split-commit-graph.sh, we test creating an octopus merge
and adding it to the tip of a commit-graph chain, then verify the
result. This _should_ have caught the problem, except that when
we load the commit-graph files we were overly careful to not fail
when the commit-graph chain does not match. This care was on
purpose to avoid race conditions as one process reads the chain
and another process modifies it. In such a case, the reading
process outputs the following message to stderr:

	warning: commit-graph chain does not match

These warnings are output in the test suite, but ignored. By
checking the stderr of `git commit-graph verify` to include
the expected progress output, it will now catch this error.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-05 14:59:50 -07:00
SZEDER Gábor 7c5c9b9c57 commit-graph: error out on invalid commit oids in 'write --stdin-commits'
While 'git commit-graph write --stdin-commits' expects commit object
ids as input, it accepts and silently skips over any invalid commit
object ids, and still exits with success:

  # nonsense
  $ echo not-a-commit-oid | git commit-graph write --stdin-commits
  $ echo $?
  0
  # sometimes I forgot that refs are not good...
  $ echo HEAD | git commit-graph write --stdin-commits
  $ echo $?
  0
  # valid tree OID, but not a commit OID
  $ git rev-parse HEAD^{tree} | git commit-graph write --stdin-commits
  $ echo $?
  0
  $ ls -l .git/objects/info/commit-graph
  ls: cannot access '.git/objects/info/commit-graph': No such file or directory

Check that all input records are indeed valid commit object ids and
return with error otherwise, the same way '--stdin-packs' handles
invalid input; see e103f7276f (commit-graph: return with errors during
write, 2019-06-12).

Note that it should only return with error when encountering an
invalid commit object id coming from standard input.  However,
'--reachable' uses the same code path to process object ids pointed to
by all refs, and that includes tag object ids as well, which should
still be skipped over.  Therefore add a new flag to 'enum
commit_graph_write_flags' and a corresponding field to 'struct
write_commit_graph_context', so we can differentiate between those two
cases.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-05 14:33:39 -07:00
SZEDER Gábor 39d8831856 commit-graph: turn a group of write-related macro flags into an enum
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Acked-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-05 14:31:24 -07:00
Junio C Hamano 92b1ea66b9 Merge branch 'ds/commit-graph-incremental'
The commits in a repository can be described by multiple
commit-graph files now, which allows the commit-graph files to be
updated incrementally.

* ds/commit-graph-incremental:
  commit-graph: test verify across alternates
  commit-graph: normalize commit-graph filenames
  commit-graph: test --split across alternate without --split
  commit-graph: test octopus merges with --split
  commit-graph: clean up chains after flattened write
  commit-graph: verify chains with --shallow mode
  commit-graph: create options for split files
  commit-graph: expire commit-graph files
  commit-graph: allow cross-alternate chains
  commit-graph: merge commit-graph chains
  commit-graph: add --split option to builtin
  commit-graph: write commit-graph chains
  commit-graph: rearrange chunk count logic
  commit-graph: add base graphs chunk
  commit-graph: load commit-graph chains
  commit-graph: rename commit_compare to oid_compare
  commit-graph: prepare for commit-graph chains
  commit-graph: document commit-graph chains
2019-07-19 11:30:20 -07:00
Junio C Hamano a7db4c193d Merge branch 'jk/oidhash'
Code clean-up to remove hardcoded SHA-1 hash from many places.

* jk/oidhash:
  hashmap: convert sha1hash() to oidhash()
  hash.h: move object_id definition from cache.h
  khash: rename oid helper functions
  khash: drop sha1-specific map types
  pack-bitmap: convert khash_sha1 maps into kh_oid_map
  delta-islands: convert island_marks khash to use oids
  khash: rename kh_oid_t to kh_oid_set
  khash: drop broken oid_map typedef
  object: convert create_object() to use object_id
  object: convert internal hash_obj() to object_id
  object: convert lookup_object() to use object_id
  object: convert lookup_unknown_object() to use object_id
  pack-objects: convert locate_object_entry_hash() to object_id
  pack-objects: convert packlist_find() to use object_id
  pack-bitmap-write: convert some helpers to use object_id
  upload-pack: rename a "sha1" variable to "oid"
  describe: fix accidental oid/hash type-punning
2019-07-09 15:25:43 -07:00
Junio C Hamano 5cb7c73589 Merge branch 'ds/close-object-store'
The commit-graph file is now part of the "files that the runtime
may keep open file descriptors on, all of which would need to be
closed when done with the object store", and the file descriptor to
an existing commit-graph file now is closed before "gc" finalizes a
new instance to replace it.

* ds/close-object-store:
  packfile: rename close_all_packs to close_object_store
  packfile: close commit-graph in close_all_packs
  commit-graph: use raw_object_store when closing
2019-07-09 15:25:37 -07:00
Junio C Hamano e1168940ce Merge branch 'ds/commit-graph-write-refactor'
Renamed from commit-graph-format-v2 and changed scope.

* ds/commit-graph-write-refactor:
  commit-graph: extract write_commit_graph_file()
  commit-graph: extract copy_oids_to_commits()
  commit-graph: extract count_distinct_commits()
  commit-graph: extract fill_oids_from_all_packs()
  commit-graph: extract fill_oids_from_commit_hex()
  commit-graph: extract fill_oids_from_packs()
  commit-graph: create write_commit_graph_context
  commit-graph: remove Future Work section
  commit-graph: collapse parameters into flags
  commit-graph: return with errors during write
  commit-graph: fix the_repository reference
2019-07-09 15:25:36 -07:00
Jeff King a378509e1c object: convert create_object() to use object_id
There are no callers left of create_object() that aren't just passing us
the "hash" member of a "struct object_id". Let's take the whole struct,
which gets us closer to removing all raw sha1 variables.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20 10:20:51 -07:00
Derrick Stolee 16110c9348 commit-graph: normalize commit-graph filenames
When writing commit-graph files, we append path data to an
object directory, which may be specified by the user via the
'--object-dir' option. If the user supplies a trailing slash,
or some other alternative path format, the resulting path may
be usable for writing to the correct location. However, when
expiring graph files from the <obj-dir>/info/commit-graphs
directory during a write, we need to compare paths with exact
string matches.

Normalize the commit-graph filenames to avoid ambiguity. This
creates extra allocations, but this is a constant multiple of
the number of commit-graph files, which should be a number in
the single digits.

Further normalize the object directory in the context. Due to
a comparison between g->obj_dir and ctx->obj_dir in
split_graph_merge_strategy(), a trailing slash would prevent
any merging of layers within the same object directory. The
check is there to ensure we do not merge across alternates.
Update the tests to include a case with this trailing slash
problem.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:27 -07:00
Derrick Stolee 1771be90c8 commit-graph: merge commit-graph chains
When searching for a commit in a commit-graph chain of G graphs with N
commits, the search takes O(G log N) time. If we always add a new tip
graph with every write, the linear G term will start to dominate and
slow the lookup process.

To keep lookups fast, but also keep most incremental writes fast, create
a strategy for merging levels of the commit-graph chain. The strategy is
detailed in the commit-graph design document, but is summarized by these
two conditions:

  1. If the number of commits we are adding is more than half the number
     of commits in the graph below, then merge with that graph.

  2. If we are writing more than 64,000 commits into a single graph,
     then merge with all lower graphs.

The numeric values in the conditions above are currently constant, but
can become config options in a future update.

As we merge levels of the commit-graph chain, check that the commits
still exist in the repository. A garbage-collection operation may have
removed those commits from the object store and we do not want to
persist them in the commit-graph chain. This is a non-issue if the
'git gc' process wrote a new, single-level commit-graph file.

After we merge levels, the old graph-{hash}.graph files are no longer
referenced by the commit-graph-chain file. We will expire these files in
a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 135a712375 commit-graph: add --split option to builtin
Add a new "--split" option to the 'git commit-graph write' subcommand. This
option allows the optional behavior of writing a commit-graph chain.

The current behavior will add a tip commit-graph containing any commits that
are not in the existing commit-graph or commit-graph chain. Later changes
will allow merging the chain and expiring out-dated files.

Add a new test script (t5324-split-commit-graph.sh) that demonstrates this
behavior.

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 6c622f9f0b commit-graph: write commit-graph chains
Extend write_commit_graph() to write a commit-graph chain when given the
COMMIT_GRAPH_SPLIT flag.

This implementation is purposefully simplistic in how it creates a new
chain. The commits not already in the chain are added to a new tip
commit-graph file.

Much of the logic around writing a graph-{hash}.graph file and updating
the commit-graph-chain file is the same as the commit-graph file case.
However, there are several places where we need to do some extra logic
in the split case.

Track the list of graph filenames before and after the planned write.
This will be more important when we start merging graph files, but it
also allows us to upgrade our commit-graph file to the appropriate
graph-{hash}.graph file when we upgrade to a chain of commit-graphs.

Note that we use the eighth byte of the commit-graph header to store the
number of base graph files. This determines the length of the base
graphs chunk.

A subtle change of behavior with the new logic is that we do not write a
commit-graph if we our commit list is empty. This extends to the typical
case, which is reflected in t5318-commit-graph.sh.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee ba41112a63 commit-graph: clean up chains after flattened write
If we write a commit-graph file without the split option, then
we write to $OBJDIR/info/commit-graph and start to ignore
the chains in $OBJDIR/info/commit-graphs/.

Unlink the commit-graph-chain file and expire the graph-{hash}.graph
files in $OBJDIR/info/commit-graphs/ during every write.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 144354b054 commit-graph: rearrange chunk count logic
The number of chunks in a commit-graph file can change depending on
whether we need the Extra Edges Chunk. We are going to add more optional
chunks, and it will be helpful to rearrange this logic around the chunk
count before doing so.

Specifically, we need to finalize the number of chunks before writing
the commit-graph header. Further, we also need to fill out the chunk
lookup table dynamically and using "num_chunks" as we add optional
chunks is useful for adding optional chunks in the future.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 3da4b609bb commit-graph: verify chains with --shallow mode
If we wrote a commit-graph chain, we only modified the tip file in
the chain. It is valuable to verify what we wrote, but not waste
time checking files we did not write.

Add a '--shallow' option to the 'git commit-graph verify' subcommand
and check that it does not read the base graph in a two-file chain.

Making the verify subcommand read from a chain of commit-graphs takes
some rearranging of the builtin code.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 118bd57002 commit-graph: add base graphs chunk
To quickly verify a commit-graph chain is valid on load, we will
read from the new "Base Graphs Chunk" of each file in the chain.
This will prevent accidentally loading incorrect data from manually
editing the commit-graph-chain file or renaming graph-{hash}.graph
files.

The commit_graph struct already had an object_id struct "oid", but
it was never initialized or used. Add a line to read the hash from
the end of the commit-graph file and into the oid member.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee c2bc6e6ab0 commit-graph: create options for split files
The split commit-graph feature is now fully implemented, but needs
some more run-time configurability. Allow direct callers to 'git
commit-graph write --split' to specify the values used in the
merge strategy and the expire time.

Update the documentation to specify these values.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 5c84b3396c commit-graph: load commit-graph chains
Prepare the logic for reading a chain of commit-graphs.

First, look for a file at $OBJDIR/info/commit-graph. If it exists,
then use that file and stop.

Next, look for the chain file at $OBJDIR/info/commit-graphs/commit-graph-chain.
If this file exists, then load the hash values as line-separated values in that
file and load $OBJDIR/info/commit-graphs/graph-{hash[i]}.graph for each hash[i]
in that file. The file is given in order, so the first hash corresponds to the
"base" file and the final hash corresponds to the "tip" file.

This implementation assumes that all of the graph-{hash}.graph files are in
the same object directory as the commit-graph-chain file. This will be updated
in a future change. This change is purposefully simple so we can isolate the
different concerns.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 8d84097f96 commit-graph: expire commit-graph files
As we merge commit-graph files in a commit-graph chain, we should clean
up the files that are no longer used.

This change introduces an 'expiry_window' value to the context, which is
always zero (for now). We then check the modified time of each
graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and
unlink the files that are older than the expiry_window.

Since this is always zero, this immediately clears all unused graph
files. We will update the value to match a config setting in a future
change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee 3cbc6ed3ee commit-graph: rename commit_compare to oid_compare
The helper function commit_compare() actually compares object_id
structs, not commits. A future change to commit-graph.c will need
to sort commit structs, so rename this function in advance.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee c523035cbd commit-graph: allow cross-alternate chains
In an environment like a fork network, it is helpful to have a
commit-graph chain that spans both the base repo and the fork repo. The
fork is usually a small set of data on top of the large repo, but
sometimes the fork is much larger. For example, git-for-windows/git has
almost double the number of commits as git/git because it rebases its
commits on every major version update.

To allow cross-alternate commit-graph chains, we need a few pieces:

1. When looking for a graph-{hash}.graph file, check all alternates.

2. When merging commit-graph chains, do not merge across alternates.

3. When writing a new commit-graph chain based on a commit-graph file
   in another object directory, do not allow success if the base file
   has of the name "commit-graph" instead of
   "commit-graphs/graph-{hash}.graph".

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:26 -07:00
Derrick Stolee d4f4d60f6d commit-graph: prepare for commit-graph chains
To prepare for a chain of commit-graph files, augment the
commit_graph struct to point to a base commit_graph. As we load
commits from the graph, we may actually want to read from a base
file according to the graph position.

The "graph position" of a commit is given by concatenating the
lexicographic commit orders from each of the commit-graph files in
the chain. This means that we must distinguish two values:

 * lexicographic index : the position within the lexicographic
   order in a single commit-graph file.

 * graph position: the position within the concatenated order
   of multiple commit-graph files

Given the lexicographic index of a commit in a graph, we can
compute the graph position by adding the number of commits in
the lower-level graphs. To find the lexicographic index of
a commit, we subtract the number of commits in lower-level graphs.

While here, change insert_parent_or_die() to take a uint32_t
position, as that is the type used by its only caller and that
makes more sense with the limits in the commit-graph format.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-19 20:46:25 -07:00
Derrick Stolee c3a3a964b2 commit-graph: use raw_object_store when closing
The close_commit_graph() method took a repository struct, but then
only uses the raw_object_store within. Change the function prototype
to make the method more flexible.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:33:54 -07:00
Derrick Stolee 238def57fe commit-graph: extract write_commit_graph_file()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

Extract write_commit_graph_file() that takes all of the information
in the context struct and writes the data to a commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:54 -07:00
Derrick Stolee f998d54226 commit-graph: extract copy_oids_to_commits()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

Extract copy_oids_to_commits(), which fills the commits list
with the distinct commits from the oids list. During this loop,
it also counts the number of "extra" edges from octopus merges.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:54 -07:00
Derrick Stolee 014e3440f5 commit-graph: extract count_distinct_commits()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

Extract count_distinct_commits(), which sorts the oids list, then
iterates through to find duplicates.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:54 -07:00
Derrick Stolee b2c8306052 commit-graph: extract fill_oids_from_all_packs()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

Extract fill_oids_from_all_packs() that reads all pack-files
for commits and fills the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:54 -07:00
Derrick Stolee 4c9efe850d commit-graph: extract fill_oids_from_commit_hex()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

Extract fill_oids_from_commit_hex() that reads the given commit
id list and fille the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:54 -07:00
Derrick Stolee ef5b83f2cf commit-graph: extract fill_oids_from_packs()
The write_commit_graph() method is too complex, so we are
extracting helper functions one by one.

This extracts fill_oids_from_packs() that reads the given
pack-file list and fills the oid list in the context.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:53 -07:00
Derrick Stolee c9905beade commit-graph: create write_commit_graph_context
The write_commit_graph() method is too large and complex. To simplify
it, we should extract several helper functions. However, we will risk
repeating a lot of declarations related to progress incidators and
object id or commit lists.

Create a new write_commit_graph_context struct that contains the
core data structures used in this process. Replace the other local
variables with the values inside the context object. Following this
change, we will start to lift code segments wholesale out of the
write_commit_graph() method and into helper functions.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:53 -07:00
Derrick Stolee 5af8039452 commit-graph: collapse parameters into flags
The write_commit_graph() and write_commit_graph_reachable() methods
currently take two boolean parameters: 'append' and 'report_progress'.
As we update these methods, adding more parameters this way becomes
cluttered and hard to maintain.

Collapse these parameters into a 'flags' parameter, and adjust the
callers to provide flags as necessary.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:53 -07:00
Derrick Stolee e103f7276f commit-graph: return with errors during write
The write_commit_graph() method uses die() to report failure and
exit when confronted with an unexpected condition. This use of
die() in a library function is incorrect and is now replaced by
error() statements and an int return type. Return zero on success
and a negative value on failure.

Now that we use 'goto cleanup' to jump to the terminal condition
on an error, we have new paths that could lead to uninitialized
values. New initializers are added to correct for this.

The builtins 'commit-graph', 'gc', and 'commit' call these methods,
so update them to check the return value. Test that 'git commit-graph
write' returns a proper error code when hitting a failure condition
in write_commit_graph().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-12 11:20:53 -07:00
Junio C Hamano 3c9b393ca8 Merge branch 'js/commit-graph-parse-leakfix'
Leakfix.

* js/commit-graph-parse-leakfix:
  commit-graph: fix memory leak
2019-05-19 16:45:28 +09:00
Junio C Hamano 0b179f3175 Merge branch 'nd/sha1-name-c-wo-the-repository'
Further code clean-up to allow the lowest level of name-to-object
mapping layer to work with a passed-in repository other than the
default one.

* nd/sha1-name-c-wo-the-repository: (34 commits)
  sha1-name.c: remove the_repo from get_oid_mb()
  sha1-name.c: remove the_repo from other get_oid_*
  sha1-name.c: remove the_repo from maybe_die_on_misspelt_object_name
  submodule-config.c: use repo_get_oid for reading .gitmodules
  sha1-name.c: add repo_get_oid()
  sha1-name.c: remove the_repo from get_oid_with_context_1()
  sha1-name.c: remove the_repo from resolve_relative_path()
  sha1-name.c: remove the_repo from diagnose_invalid_index_path()
  sha1-name.c: remove the_repo from handle_one_ref()
  sha1-name.c: remove the_repo from get_oid_1()
  sha1-name.c: remove the_repo from get_oid_basic()
  sha1-name.c: remove the_repo from get_describe_name()
  sha1-name.c: remove the_repo from get_oid_oneline()
  sha1-name.c: add repo_interpret_branch_name()
  sha1-name.c: remove the_repo from interpret_branch_mark()
  sha1-name.c: remove the_repo from interpret_nth_prior_checkout()
  sha1-name.c: remove the_repo from get_short_oid()
  sha1-name.c: add repo_for_each_abbrev()
  sha1-name.c: store and use repo in struct disambiguate_state
  sha1-name.c: add repo_find_unique_abbrev_r()
  ...
2019-05-09 00:37:25 +09:00
Josh Steadmon 98552f252a commit-graph: fix memory leak
Free the commit graph when verify_commit_graph_lite() reports an error.
Credit to OSS-Fuzz for finding this leak.

Signed-off-by: Josh Steadmon <steadmon@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-07 13:49:28 +09:00
Nguyễn Thái Ngọc Duy a133c40b23 commit.cocci: refactor code, avoid double rewrite
"maybe" pointer in 'struct commit' is tricky because it can be lazily
initialized to take advantage of commit-graph if available. This makes
it not safe to access directly.

This leads to a rule in commit.cocci to rewrite 'x->maybe_tree' to
'get_commit_tree(x)'. But that rule alone could lead to incorrectly
rewrite assignments, e.g. from

    x->maybe_tree = yes

to

    get_commit_tree(x) = yes

Because of this we have a second rule to revert this effect. Szeder
found out that we could do better by performing the assignment rewrite
rule first, then the remaining is read-only access and handled by the
current first rule.

For this to work, we need to transform "x->maybe_tree = y" to something
that does NOT contain "x->maybe_tree" to avoid the original first
rule. This is where set_commit_tree() comes in.

Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-16 18:56:51 +09:00
Ævar Arnfjörð Bjarmason 93b4405ffe commit-graph: improve & i18n error messages
Change the error emitted when a commit-graph file is corrupt so that
we actually mention the commit-graph, e.g. change errors like:

    error: improper chunk offset 0000000000385e0c

To:

    error: commit-graph improper chunk offset 0000000000385e0c

As discussed in the commits leading up to this one the commit-graph
machinery is now used by common commands like "status". If the graph
was corrupt we'd often emit some error that gave no indication what
was wrong. Now some of them are still cryptic, but they'll at least
mention "commit-graph" to give the user a hint as to where to look.

While I'm at it mark some of the strings that hadn't been marked for
translation. It's clear from the commit history and the code that this
was merely forgotten at the time, and wasn't intentional.p5

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01 12:14:50 +09:00
Ævar Arnfjörð Bjarmason 43d3561805 commit-graph write: don't die if the existing graph is corrupt
When the commit-graph is written we end up calling
parse_commit(). This will in turn invoke code that'll consult the
existing commit-graph about the commit, if the graph is corrupted we
die.

We thus get into a state where a failing "commit-graph verify" can't
be followed-up with a "commit-graph write" if core.commitGraph=true is
set, the graph either needs to be manually removed to proceed, or
core.commitGraph needs to be set to "false".

Change the "commit-graph write" codepath to use a new
parse_commit_no_graph() helper instead of parse_commit() to avoid
this. The latter will call repo_parse_commit_internal() with
use_commit_graph=1 as seen in 177722b344 ("commit: integrate commit
graph with commit parsing", 2018-04-10).

Not using the old graph at all slows down the writing of the new graph
by some small amount, but is a sensible way to prevent an error in the
existing commit-graph from spreading.

Just fixing the current issue would be likely to result in code that's
inadvertently broken in the future. New code might use the
commit-graph at a distance. To detect such cases introduce a
"GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD" setting used when we do our
corruption tests, and test that a "write/verify" combo works after
every one of our current test cases where we now detect commit-graph
corruption.

Some of the code changes here might be strictly unnecessary, e.g. I
was unable to find cases where the parse_commit() called from
write_graph_chunk_data() didn't exit early due to
"item->object.parsed" being true in
repo_parse_commit_internal() (before the use_commit_graph=1 has any
effect). But let's also convert those cases for good measure, we do
not have exhaustive tests for all possible types of commit-graph
corruption.

This might need to be re-visited if we learn to write the commit-graph
incrementally, but probably not. Hopefully we'll just start by finding
out what commits we have in total, then read the old graph(s) to see
what they cover, and finally write a new graph file with everything
that's missing. In that case the new graph writing code just needs to
continue to use e.g. a parse_commit() that doesn't consult the
existing commit-graphs.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01 12:14:50 +09:00
Ævar Arnfjörð Bjarmason 67a530fab3 commit-graph: don't pass filename to load_commit_graph_one_fd_st()
An earlier change implemented load_commit_graph_one_fd_st() in a way
that was bug-compatible with earlier code in terms of the "graph file
%s is too small" error message printing out the path to the
commit-graph (".git/objects/info/commit-graph").

But change that, because:

 * A function that takes an already-open file descriptor also needing
   the filename isn't very intuitive.

 * The vast majority of errors we might emit when loading the graph
   come from parse_commit_graph(), which doesn't report the
   filename. Let's not do that either in this case for consistency.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01 12:14:50 +09:00
Ævar Arnfjörð Bjarmason 61df89c8e5 commit-graph: don't early exit(1) on e.g. "git status"
Make the commit-graph loading code work as a library that returns an
error code instead of calling exit(1) when the commit-graph is
corrupt. This means that e.g. "status" will now report commit-graph
corruption as an "error: [...]" at the top of its output, but then
proceed to work normally.

This required splitting up the load_commit_graph_one() function so
that the code that deals with open()-ing and stat()-ing the graph can
now be called independently as open_commit_graph().

This is needed because "commit-graph verify" where the graph doesn't
exist isn't an error. See the third paragraph in
283e68c72f ("commit-graph: add 'verify' subcommand",
2018-06-27). There's a bug in that logic where we conflate the
intended ENOENT with other errno values (e.g. EACCES), but this change
doesn't address that. That'll be addressed in a follow-up change.

I'm then splitting most of the logic out of load_commit_graph_one()
into load_commit_graph_one_fd_st(), which allows for providing an
existing file descriptor and stat information to the loading
code. This isn't strictly needed, but it would be redundant and
confusing to open() and stat() the file twice for some of the
codepaths, this allows for calling open_commit_graph() followed by
load_commit_graph_one_fd_st(). The "graph_file" still needs to be
passed to that function for the the "graph file %s is too small" error
message.

This leaves load_commit_graph_one() unused by everything except the
internal prepare_commit_graph_one() function, so let's mark it as
"static". If someone needs it in the future we can remove the "static"
attribute. I could also rewrite its sole remaining
user ("prepare_commit_graph_one()") to use
load_commit_graph_one_fd_st() instead, but let's leave it at this.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01 12:14:50 +09:00
Ævar Arnfjörð Bjarmason 2ac138d568 commit-graph: fix segfault on e.g. "git status"
When core.commitGraph=true is set, various common commands now consult
the commit graph. Because the commit-graph code is very trusting of
its input data, it's possibly to construct a graph that'll cause an
immediate segfault on e.g. "status" (and e.g. "log", "blame", ...). In
some other cases where git immediately exits with a cryptic error
about the graph being broken.

The root cause of this is that while the "commit-graph verify"
sub-command exhaustively verifies the graph, other users of the graph
simply trust the graph, and will e.g. deference data found at certain
offsets as pointers, causing segfaults.

This change does the bare minimum to ensure that we don't segfault in
the common fill_commit_in_graph() codepath called by
e.g. setup_revisions(), to do this instrument the "commit-graph
verify" tests to always check if "status" would subsequently
segfault. This fixes the following tests which would previously
segfault:

    not ok 50 - detect low chunk count
    not ok 51 - detect missing OID fanout chunk
    not ok 52 - detect missing OID lookup chunk
    not ok 53 - detect missing commit data chunk

Those happened because with the commit-graph enabled setup_revisions()
would eventually call fill_commit_in_graph(), where e.g.
g->chunk_commit_data is used early as an offset (and will be
0x0). With this change we get far enough to detect that the graph is
broken, and show an error instead. E.g.:

    $ git status; echo $?
    error: commit-graph is missing the Commit Data chunk
    1

That also sucks, we should *warn* and not hard-fail "status" just
because the commit-graph is corrupt, but fixing is left to a follow-up
change.

A side-effect of changing the reporting from graph_report() to error()
is that we now have an "error: " prefix for these even for
"commit-graph verify". Pseudo-diff before/after:

    $ git commit-graph verify
    -commit-graph is missing the Commit Data chunk
    +error: commit-graph is missing the Commit Data chunk

Changing that is OK. Various errors it emits now early on are prefixed
with "error: ", moving these over and changing the output doesn't
break anything.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01 12:14:49 +09:00
Junio C Hamano e5eac57356 Merge branch 'ab/commit-graph-write-progress'
The codepath to show progress meter while writing out commit-graph
file has been improved.

* ab/commit-graph-write-progress:
  commit-graph write: emit a percentage for all progress
  commit-graph write: add itermediate progress
  commit-graph write: remove empty line for readability
  commit-graph write: add more descriptive progress output
  commit-graph write: show progress for object search
  commit-graph write: more descriptive "writing out" output
  commit-graph write: add "Writing out" progress output
  commit-graph: don't call write_graph_chunk_extra_edges() unnecessarily
  commit-graph: rename "large edges" to "extra edges"
2019-02-05 14:26:14 -08:00
Junio C Hamano 04d67b6ab2 Merge branch 'ab/commit-graph-write-optim'
The codepath to write out commit-graph has been optimized by
following the usual pattern of visiting objects in in-pack order.

* ab/commit-graph-write-optim:
  commit-graph write: use pack order when finding commits
2019-02-05 14:26:14 -08:00
Junio C Hamano 19a504d92b Merge branch 'js/commit-graph-chunk-table-fix'
The codepath to read from the commit-graph file attempted to read
past the end of it when the file's table-of-contents was corrupt.

* js/commit-graph-chunk-table-fix:
  Makefile: correct example fuzz build
  commit-graph: fix buffer read-overflow
  commit-graph, fuzz: add fuzzer for commit-graph
2019-02-05 14:26:11 -08:00
Junio C Hamano b99a579f8e Merge branch 'sb/more-repo-in-api'
The in-core repository instances are passed through more codepaths.

* sb/more-repo-in-api: (23 commits)
  t/helper/test-repository: celebrate independence from the_repository
  path.h: make REPO_GIT_PATH_FUNC repository agnostic
  commit: prepare free_commit_buffer and release_commit_memory for any repo
  commit-graph: convert remaining functions to handle any repo
  submodule: don't add submodule as odb for push
  submodule: use submodule repos for object lookup
  pretty: prepare format_commit_message to handle arbitrary repositories
  commit: prepare logmsg_reencode to handle arbitrary repositories
  commit: prepare repo_unuse_commit_buffer to handle any repo
  commit: prepare get_commit_buffer to handle any repo
  commit-reach: prepare in_merge_bases[_many] to handle any repo
  commit-reach: prepare get_merge_bases to handle any repo
  commit-reach.c: allow get_merge_bases_many_0 to handle any repo
  commit-reach.c: allow remove_redundant to handle any repo
  commit-reach.c: allow merge_bases_many to handle any repo
  commit-reach.c: allow paint_down_to_common to handle any repo
  commit: allow parse_commit* to handle any repo
  object: parse_object to honor its repository argument
  object-store: prepare has_{sha1, object}_file to handle any repo
  object-store: prepare read_object_file to deal with any repo
  ...
2019-02-05 14:26:09 -08:00
Junio C Hamano 33e4ae9c50 Merge branch 'bc/sha-256'
Add sha-256 hash and plug it through the code to allow building Git
with the "NewHash".

* bc/sha-256:
  hash: add an SHA-256 implementation using OpenSSL
  sha256: add an SHA-256 implementation using libgcrypt
  Add a base implementation of SHA-256 support
  commit-graph: convert to using the_hash_algo
  t/helper: add a test helper to compute hash speed
  sha1-file: add a constant for hash block size
  t: make the sha1 test-tool helper generic
  t: add basic tests for our SHA-1 implementation
  cache: make hashcmp and hasheq work with larger hashes
  hex: introduce functions to print arbitrary hashes
  sha1-file: provide functions to look up hash algorithms
  sha1-file: rename algorithm to "sha1"
2019-01-29 12:47:55 -08:00
Ævar Arnfjörð Bjarmason 49bbc57a57 commit-graph write: emit a percentage for all progress
Follow-up 01ca387774 ("commit-graph: split up close_reachable()
progress output", 2018-11-19) by making the progress bars in
close_reachable() report a completion percentage. This fixes the last
occurrence where in the commit graph writing where we didn't report
that.

The change in 01ca387774 split up the 1x progress bar in
close_reachable() into 3x, but left them as dumb counters without a
percentage completion. Fixing that is easy, and the only reason it
wasn't done already is because that commit was rushed in during the
v2.20.0 RC period to fix the unrelated issue of over-reporting commit
numbers. See [1] and follow-ups for ML activity at the time and [2]
for an alternative approach where the progress bars weren't split up.

Now for e.g. linux.git we'll emit:

    $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
    Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
    Expanding reachable commits in commit graph: 100% (815990/815980), done.
    Computing commit graph generation numbers: 100% (815983/815983), done.
    Writing out commit graph in 4 passes: 100% (3263932/3263932), done.

1. https://public-inbox.org/git/20181119202300.18670-1-avarab@gmail.com/
2. https://public-inbox.org/git/20181122153922.16912-11-avarab@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason 890226ccb5 commit-graph write: add itermediate progress
Add progress output to sections of code between "Annotating[...]" and
"Computing[...]generation numbers". This can collectively take 5-10
seconds on a large enough repository.

On a test repository with I have with ~7 million commits and ~50
million objects we'll now emit:

    $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
    Finding commits for commit graph among packed objects: 100% (124763727/124763727), done.
    Loading known commits in commit graph: 100% (18989461/18989461), done.
    Expanding reachable commits in commit graph: 100% (18989507/18989461), done.
    Clearing commit marks in commit graph: 100% (18989507/18989507), done.
    Counting distinct commits in commit graph: 100% (18989507/18989507), done.
    Finding extra edges in commit graph: 100% (18989507/18989507), done.
    Computing commit graph generation numbers: 100% (7250302/7250302), done.
    Writing out commit graph in 4 passes: 100% (29001208/29001208), done.

Whereas on a medium-sized repository such as linux.git these new
progress bars won't have time to kick in and as before and we'll still
emit output like:

    $ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
    Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
    Expanding reachable commits in commit graph: 815990, done.
    Computing commit graph generation numbers: 100% (815983/815983), done.
    Writing out commit graph in 4 passes: 100% (3263932/3263932), done.

The "Counting distinct commits in commit graph" phase will spend most
of its time paused at "0/*" as we QSORT(...) the list. That's not
optimal, but at least we don't seem to be stalling anymore most of the
time.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason e59c615e3c commit-graph write: remove empty line for readability
Remove the empty line between a QSORT(...) and the subsequent oideq()
for-loop. This makes it clearer that the QSORT(...) is being done so
that we can run the oideq() loop on adjacent OIDs. Amends code added
in 08fd81c9b6 ("commit-graph: implement write_commit_graph()",
2018-04-02).

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason 7c7b8a7fc7 commit-graph write: add more descriptive progress output
Make the progress output shown when we're searching for commits to
include in the graph more descriptive. This amends code I added in
7b0f229222 ("commit-graph write: add progress output", 2018-09-17).

Now, on linux.git, we'll emit this sort of output in the various modes
we support:

    $ git commit-graph write
    Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
    [...]

    # Actually we don't emit this since this takes almost no time at
    # all. But if we did (s/_delayed//) we'd show:
    $ git for-each-ref --format='%(objectname)' | git commit-graph write --stdin-commits
    Finding commits for commit graph from 630 refs: 100% (630/630), done.
    [...]

    $ (cd .git/objects/pack/ && ls *idx) | git commit-graph write --stdin-pack
    Finding commits for commit graph in 3 packs: 6529159, done.
    [...]

The middle on of those is going to be the output users might see in
practice, since it'll be emitted when they get the commit graph via
gc.writeCommitGraph=true. But as noted above you need a really large
number of refs for this message to show. It'll show up on a test
repository I have with ~165k refs:

    Finding commits for commit graph from 165203 refs: 100% (165203/165203), done.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason d9b1b309cf commit-graph write: show progress for object search
Show the percentage progress for the "Finding commits for commit
graph" phase for the common case where we're operating on all packs in
the repository, as "commit-graph write" or "gc" will do.

Before we'd emit on e.g. linux.git with "commit-graph write":

    Finding commits for commit graph: 6529159, done.
    [...]

And now:

    Finding commits for commit graph: 100% (6529159/6529159), done.
    [...]

Since the commit graph only includes those commits that are packed
(via for_each_packed_object(...)) the approximate_object_count()
returns the actual number of objects we're going to process.

Still, it is possible due to a race with "gc" or another process
maintaining packs that the number of objects we're going to process is
lower than what approximate_object_count() reported. In that case we
don't want to stop the progress bar short of 100%. So let's make sure
it snaps to 100% at the end.

The inverse case is also possible and more likely. I.e. that a new
pack has been added between approximate_object_count() and
for_each_packed_object(). In that case the percentage will go beyond
100%, and we'll do nothing to snap it back to 100% at the end.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason 289447397c commit-graph write: more descriptive "writing out" output
Make the "Writing out" part of the progress output more
descriptive. Depending on the shape of the graph we either make 3 or 4
passes over it.

Let's present this information to the user in case they're wondering
what this number, which is much larger than their number of commits,
has to do with writing out the commit graph. Now e.g. on linux.git we
emit:

    $ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/linux commit-graph write
    Finding commits for commit graph: 6529159, done.
    Expanding reachable commits in commit graph: 815990, done.
    Computing commit graph generation numbers: 100% (815983/815983), done.
    Writing out commit graph in 4 passes: 100% (3263932/3263932), done.

A note on i18n: Why are we using the Q_() function and passing a
number & English text for a singular which'll never be used? Because
the plural rules of translated languages may not match those of
English, and to use the plural function we need to use this format.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
Ævar Arnfjörð Bjarmason 53035c4f0b commit-graph write: add "Writing out" progress output
Add progress output to be shown when we're writing out the
commit-graph, this adds to the output already added in 7b0f229222
("commit-graph write: add progress output", 2018-09-17).

As noted in that commit most of the progress output isn't displayed on
small repositories, but before this change we'd noticeably hang for
2-3 seconds at the end on medium sized repositories such as linux.git.

Now we'll instead show output like this, and reduce the
human-observable times at which we're not producing progress output:

    $ ~/g/git/git --exec-path=$HOME/g/git -C ~/g/2015-04-03-1M-git commit-graph write
    Finding commits for commit graph: 13064614, done.
    Expanding reachable commits in commit graph: 1000447, done.
    Computing commit graph generation numbers: 100% (1000447/1000447), done.
    Writing out commit graph: 100% (3001341/3001341), done.

This "Writing out" number is 3x or 4x the number of commits, depending
on the graph we're processing. A later change will make this explicit
to the user.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:14:08 -08:00
SZEDER Gábor 857ba928a4 commit-graph: don't call write_graph_chunk_extra_edges() unnecessarily
The optional 'Extra Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents.  Since the
chunk is optional, write_commit_graph() looks through all commits to
find those with more than two parents, and then writes the commit
graph file header accordingly, i.e. if there are no such commits, then
there won't be a 'Extra Edge List' chunk written, only the three
mandatory chunks.

However, when it later comes to writing actual chunk data,
write_commit_graph() unconditionally invokes
write_graph_chunk_extra_edges(), even when it was decided earlier that
that chunk won't be written.  Strictly speaking there is no bug here,
because write_graph_chunk_extra_edges() won't write anything if it
doesn't find any commits with more than two parents, but then it
unnecessarily and in vain looks through all commits once again in
search for such commits.

Don't call write_graph_chunk_extra_edges() when that chunk won't be
written to spare an unnecessary iteration over all commits.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-23 13:12:56 -08:00
SZEDER Gábor 5af7417bd8 commit-graph: rename "large edges" to "extra edges"
The optional 'Large Edge List' chunk of the commit graph file stores
parent information for commits with more than two parents, and the
names of most of the macros, variables, struct fields, and functions
related to this chunk contain the term "large edges", e.g.
write_graph_chunk_large_edges().  However, it's not a really great
term, as the edges to the second and subsequent parents stored in this
chunk are not any larger than the edges to the first and second
parents stored in the "main" 'Commit Data' chunk.  It's the number of
edges, IOW number of parents, that is larger compared to non-merge and
"regular" two-parent merge commits.  And indeed, two functions in
'commit-graph.c' have a local variable called 'num_extra_edges' that
refer to the same thing, and this "extra edges" term is much better at
describing these edges.

So let's rename all these references to "large edges" in macro,
variable, function, etc. names to "extra edges".  There is a
GRAPH_OCTOPUS_EDGES_NEEDED macro as well; for the sake of consistency
rename it to GRAPH_EXTRA_EDGES_NEEDED.

We can do so safely without causing any incompatibility issues,
because the term "large edges" doesn't come up in the file format
itself in any form (the chunk's magic is {'E', 'D', 'G', 'E'}, there
is no 'L' in there), but only in the specification text.  The string
"large edges", however, does come up in the output of 'git
commit-graph read' and in tests looking at its input, but that command
is explicitly documented as debugging aid, so we can change its output
and the affected tests safely.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-22 11:33:46 -08:00
Ævar Arnfjörð Bjarmason d7574c95bb commit-graph write: use pack order when finding commits
Slightly optimize the "commit-graph write" step by using
FOR_EACH_OBJECT_PACK_ORDER with for_each_object_in_pack(). See commit
[1] and [2] for the facility and a similar optimization for "cat-file".

On Linux it is around 5% slower to run:

    echo 1 >/proc/sys/vm/drop_caches &&
    cat .git/objects/pack/* >/dev/null &&
    git cat-file --batch-all-objects --batch-check --unordered

Than the same thing with the "cat" omitted. This is as expected, since
we're iterating in pack order and the "cat" is extra work.

Before this change the opposite was true of "commit-graph write". We
were 6% faster if we first ran "cat" to efficiently populate the FS
cache for our sole big pack on linux.git, than if we had populated it
via for_each_object_in_pack(). Now we're 3% faster without the "cat"
instead.

My tests were done on an unloaded Linux 3.10 system with 10 runs for
each. Derrick Stolee did his own tests on Windows[3] showing a 2%
improvement with a high degree of accuracy.

1. 736eb88fdc ("for_each_packed_object: support iterating in
   pack-order", 2018-08-10)

2. 0750bb5b51 ("cat-file: support "unordered" output for
   --batch-all-objects", 2018-08-10)

3. https://public-inbox.org/git/f71fa868-25e8-a9c9-46a6-611b987f1a8f@gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-01-22 11:32:56 -08:00