Commit graph

249 commits

Author SHA1 Message Date
Abhishek Kumar 8d00d7c3df commit-reach: use corrected commit dates in paint_down_to_common()
091f4cf (commit: don't use generation numbers if not needed,
2018-08-30) changed paint_down_to_common() to use commit dates instead
of generation numbers v1 (topological levels) as the performance
regressed on certain topologies. With generation number v2 (corrected
commit dates) implemented, we no longer have to rely on commit dates and
can use generation numbers.

For example, the command `git merge-base v4.8 v4.9` on the Linux
repository walks 167468 commits, taking 0.135s for committer date and
167496 commits, taking 0.157s for corrected committer date respectively.

While using corrected commit dates, Git walks nearly the same number of
commits as commit date, the process is slower as for each comparision we
have to access a commit-slab (for corrected committer date) instead of
accessing struct member (for committer date).

This change incidentally broke the fragile t6404-recursive-merge test.
t6404-recursive-merge sets up a unique repository where all commits have
the same committer date without a well-defined merge-base.

While running tests with GIT_TEST_COMMIT_GRAPH unset, we use committer
date as a heuristic in paint_down_to_common(). 6404.1 'combined merge
conflicts' merges commits in the order:
- Merge C with B to form an intermediate commit.
- Merge the intermediate commit with A.

With GIT_TEST_COMMIT_GRAPH=1, we write a commit-graph and subsequently
use the corrected committer date, which changes the order in which
commits are merged:
- Merge A with B to form an intermediate commit.
- Merge the intermediate commit with C.

While resulting repositories are equivalent, 6404.4 'virtual trees were
processed' fails with GIT_TEST_COMMIT_GRAPH=1 as we are selecting
different merge-bases and thus have different object ids for the
intermediate commits.

As this has already causes problems (as noted in 859fdc0 (commit-graph:
define GIT_TEST_COMMIT_GRAPH, 2018-08-29)), we disable commit graph
within t6404-recursive-merge.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 1fdc383c5c commit-graph: use generation v2 only if entire chain does
Since there are released versions of Git that understand generation
numbers in the commit-graph's CDAT chunk but do not understand the GDAT
chunk, the following scenario is possible:

1. "New" Git writes a commit-graph with the GDAT chunk.
2. "Old" Git writes a split commit-graph on top without a GDAT chunk.

If each layer of split commit-graph is treated independently, as it was
the case before this commit, with Git inspecting only the current layer
for chunk_generation_data pointer, commits in the lower layer (one with
GDAT) whould have corrected commit date as their generation number,
while commits in the upper layer would have topological levels as their
generation. Corrected commit dates usually have much larger values than
topological levels. This means that if we take two commits, one from the
upper layer, and one reachable from it in the lower layer, then the
expectation that the generation of a parent is smaller than the
generation of a child would be violated.

It is difficult to expose this issue in a test. Since we _start_ with
artificially low generation numbers, any commit walk that prioritizes
generation numbers will walk all of the commits with high generation
number before walking the commits with low generation number. In all the
cases I tried, the commit-graph layers themselves "protect" any
incorrect behavior since none of the commits in the lower layer can
reach the commits in the upper layer.

This issue would manifest itself as a performance problem in this case,
especially with something like "git log --graph" since the low
generation numbers would cause the in-degree queue to walk all of the
commits in the lower layer before allowing the topo-order queue to write
anything to output (depending on the size of the upper layer).

Therefore, When writing the new layer in split commit-graph, we write a
GDAT chunk only if the topmost layer has a GDAT chunk. This guarantees
that if a layer has GDAT chunk, all lower layers must have a GDAT chunk
as well.

Rewriting layers follows similar approach: if the topmost layer below
the set of layers being rewritten (in the split commit-graph chain)
exists, and it does not contain GDAT chunk, then the result of rewrite
does not have GDAT chunks either.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar e8b63005c4 commit-graph: implement generation data chunk
As discovered by Ævar, we cannot increment graph version to
distinguish between generation numbers v1 and v2 [1]. Thus, one of
pre-requistes before implementing generation number v2 was to
distinguish between graph versions in a backwards compatible manner.

We are going to introduce a new chunk called Generation DATa chunk (or
GDAT). GDAT will store corrected committer date offsets whereas CDAT
will still store topological level.

Old Git does not understand GDAT chunk and would ignore it, reading
topological levels from CDAT. New Git can parse GDAT and take advantage
of newer generation numbers, falling back to topological levels when
GDAT chunk is missing (as it would happen with a commit-graph written
by old Git).

We introduce a test environment variable 'GIT_TEST_COMMIT_GRAPH_NO_GDAT'
which forces commit-graph file to be written without generation data
chunk to emulate a commit-graph file written by old Git.

To minimize the space required to store corrrected commit date, Git
stores corrected commit date offsets into the commit-graph file, instea
of corrected commit dates. This saves us 4 bytes per commit, decreasing
the GDAT chunk size by half, but it's possible for the offset to
overflow the 4-bytes allocated for storage. As such overflows are and
should be exceedingly rare, we use the following overflow management
scheme:

We introduce a new commit-graph chunk, Generation Data OVerflow ('GDOV')
to store corrected commit dates for commits with offsets greater than
GENERATION_NUMBER_V2_OFFSET_MAX.

If the offset is greater than GENERATION_NUMBER_V2_OFFSET_MAX, we set
the MSB of the offset and the other bits store the position of corrected
commit date in GDOV chunk, similar to how Extra Edge List is maintained.

We test the overflow-related code with the following repo history:

           F - N - U
          /         \
U - N - U            N
         \          /
	  N - F - N

Where the commits denoted by U have committer date of zero seconds
since Unix epoch, the commits denoted by N have committer date of
1112354055 (default committer date for the test suite) seconds since
Unix epoch and the commits denoted by F have committer date of
(2 ^ 31 - 2) seconds since Unix epoch.

The largest offset observed is 2 ^ 31, just large enough to overflow.

[1]: https://lore.kernel.org/git/87a7gdspo4.fsf@evledraar.gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar c1a09119f6 commit-graph: implement corrected commit date
With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if a commit is malformed but valid
and has committer date of 0 Unix time, as the offset would be the same
as corrected commit date and thus require 64-bits to be stored properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar d7f92784c6 commit-graph: return 64-bit generation number
In a preparatory step for introducing corrected commit dates, let's
return timestamp_t values from commit_graph_generation(), use
timestamp_t for local variables and define GENERATION_NUMBER_INFINITY
as (2 ^ 63 - 1) instead.

We rename GENERATION_NUMBER_MAX to GENERATION_NUMBER_V1_MAX to
represent the largest topological level we can store in the commit data
chunk.

With corrected commit dates implemented, we will have two such *_MAX
variables to denote the largest offset and largest topological level
that can be stored.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar 72a2bfcaf0 commit-graph: add a slab to store topological levels
In a later commit we will introduce corrected commit date as the
generation number v2. Corrected commit dates will be stored in the new
seperate Generation Data chunk. However, to ensure backwards
compatibility with "Old" Git we need to continue to write generation
number v1 (topological levels) to the commit data chunk. Thus, we need
to compute and store both versions of generation numbers to write the
commit-graph file.

Therefore, let's introduce a commit-slab `topo_level_slab` to store
topological levels; corrected commit date will be stored in the member
`generation` of struct commit_graph_data.

The macros `GENERATION_NUMBER_INFINITY` and `GENERATION_NUMBER_ZERO`
mark commits not in the commit-graph file and commits written by a
version of Git that did not compute generation numbers respectively.
Generation numbers are computed identically for both kinds of commits.

A "slab-miss" should return `GENERATION_NUMBER_INFINITY` as the commit
is not in the commit-graph file. However, since the slab is
zero-initialized, it returns 0 (or rather `GENERATION_NUMBER_ZERO`).
Thus, we no longer need to check if the topological level of a commit is
`GENERATION_NUMBER_INFINITY`.

We will add a pointer to the slab in `struct write_commit_graph_context`
and `struct commit_graph` to populate the slab in
`fill_commit_graph_info` if the commit has a pre-computed topological
level as in case of split commit-graphs.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar f90fca638e commit-graph: consolidate fill_commit_graph_info
Both fill_commit_graph_info() and fill_commit_in_graph() parse
information present in commit data chunk. Let's simplify the
implementation by calling fill_commit_graph_info() within
fill_commit_in_graph().

fill_commit_graph_info() used to not load committer data from commit data
chunk. However, with the upcoming switch to using corrected committer
date as generation number v2, we will have to load committer date to
compute generation number value anyway.

e51217e15 (t5000: test tar files that overflow ustar headers,
30-06-2016) introduced a test 'generate tar with future mtime' that
creates a commit with committer date of (2^36 + 1) seconds since
EPOCH. The CDAT chunk provides 34-bits for storing committer date, thus
committer time overflows into generation number (within CDAT chunk) and
has undefined behavior.

The test used to pass as fill_commit_graph_info() would not set struct
member `date` of struct commit and load committer date from the object
database, generating a tar file with the expected mtime.

However, with corrected commit date, we will load the committer date
from CDAT chunk (truncated to lower 34-bits to populate the generation
number. Thus, Git sets date and generates tar file with the truncated
mtime.

The ustar format (the header format used by most modern tar programs)
only has room for 11 (or 12, depending on some implementations) octal
digits for the size and mtime of each file.

As the CDAT chunk is overflow by 12-octal digits but not 11-octal
digits, we split the existing tests to test both implementations
separately and add a new explicit test for 11-digit implementation.

To test the 11-octal digit implementation, we create a future commit
with committer date of 2^34 - 1, which overflows 11-octal digits without
overflowing 34-bits of the Commit Date chunks.

To test the 12-octal digit implementation, the smallest committer date
possible is 2^36 + 1, which overflows the CDAT chunk and thus
commit-graph must be disabled for the test.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:18 -08:00
Abhishek Kumar e30c5ee76c commit-graph: fix regression when computing Bloom filters
Before computing Bloom filters, the commit-graph machinery uses
commit_gen_cmp to sort commits by generation order for improved diff
performance. 3d11275505 (commit-graph: examine commits by generation
number, 2020-03-30) claims that this sort can reduce the time spent to
compute Bloom filters by nearly half.

But since c49c82aa4c (commit: move members graph_pos, generation to a
slab, 2020-06-17), this optimization is broken, since asking for a
'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
while writing.

Not all hope is lost, though: 'commit_gen_cmp()' falls back to
comparing commits by their date when they have equal generation number,
and so since c49c82aa4c is purely a date comparison function. This
heuristic is good enough that we don't seem to loose appreciable
performance while computing Bloom filters.

Applying this patch (compared with v2.30.0) speeds up computing Bloom
filters by factors ranging from 0.40% to 5.19% on various repositories [1].

So, avoid the useless 'commit_graph_generation()' while writing by
instead accessing the slab directly. This returns the newly-computed
generation numbers, and allows us to avoid the heuristic by directly
comparing generation numbers.

[1]: https://lore.kernel.org/git/20210105094535.GN8396@szeder.dev/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-18 16:21:17 -08:00
Jeff King 3361390cbe commit-graph: use size_t for array allocation and indexing
Our packed_commit_list is an array of pointers to commit structs. We use
"int" for the allocation, which is 32-bit even on 64-bit platforms. This
isn't likely to overflow in practice (we're writing commit graphs, so
you'd need to actually have billions of unique commits in the
repository). But it's good practice to use size_t for allocations.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Jeff King a5f1c44899 commit-graph: replace packed_oid_list with oid_array
Our custom packed_oid_list data structure is really just an oid_array in
disguise. Let's switch to using the generic structure, which shortens
and simplifies the code slightly.

There's one slightly awkward part: in the old code we copied a hash
straight from the mmap'd on-disk data into the final object_id. And now
we'll copy to a temporary oid, which we'll then pass to
oid_array_append(). But this is an operation we have to do all over the
commit-graph code already, since it mostly uses object_id structs
internally. I also measured "git commit-graph --append", which triggers
this code path, and it showed no difference.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Jeff King 1cbdbf3bef commit-graph: drop count_distinct_commits() function
When writing a commit graph, we collect a list of object ids in an
array, which we'll eventually copy into an array of "struct commit"
pointers. Before we do that, though, we count the number of distinct
commit entries. There's a subtle bug in this step, though.

We eliminate not only duplicate oids, but also in split mode, any oids
which are not commits or which are already in a graph file. However, the
loop starts at index 1, always counting index 0 as distinct. And indeed
it can't be a duplicate, since we check for those by comparing against
the previous entry, and there isn't one for index 0. But it could be a
commit that's already in a graph file, and we'd overcount the number of
commits by 1 in that case.

That turns out not to be a problem, though. The only things we do with
the count are:

  - check if our count will overflow our data structures. But the limit
    there is 2^31 commits, so while this is a useful check, the
    off-by-one is not likely to matter.

  - pre-allocate the array of commit pointers. But over-allocating by
    one isn't a problem; we'll just waste a few extra bytes.

The bug would be easy enough to fix, but we can observe that neither of
those steps is necessary.

After building the actual commit array, we'll likewise check its count
for overflow. So the extra check of the distinct commit count here is
redundant.

And likewise we use ALLOC_GROW() when building the commit array, so
there's no need to preallocate it (it's possible that doing so is
slightly more efficient, but if we care we can just optimistically
allocate one slot for each oid; I didn't bother here).

So count_distinct_commits() isn't doing anything useful. Let's just get
rid of that step.

Note that a side effect of the function was that we sorted the list of
oids, which we do rely on in copy_oids_to_commits(), since it must also
skip the duplicates. So we'll move the qsort there. I didn't copy the
"TODO" about adding more progress meters. It's actually quite hard to
make a repository large enough for this qsort would take an appreciable
amount of time, so this doesn't seem like a useful note.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-12-07 12:32:04 -08:00
Junio C Hamano 307a53dd99 Merge branch 'ds/commit-graph-merging-fix'
When "git commit-graph" detects the same commit recorded more than
once while it is merging the layers, it used to die.  The code now
ignores all but one of them and continues.

* ds/commit-graph-merging-fix:
  commit-graph: don't write commit-graph when disabled
  commit-graph: ignore duplicates when merging layers
2020-11-02 13:17:39 -08:00
Derrick Stolee 85102ac71b commit-graph: don't write commit-graph when disabled
The core.commitGraph config setting can be set to 'false' to prevent
parsing commits from the commit-graph file(s). This causes an issue when
trying to write with "--split" which needs to distinguish between
commits that are in the existing commit-graph layers and commits that
are not. The existing mechanism uses parse_commit() and follows by
checking if there is a 'graph_pos' that shows the commit was parsed from
the commit-graph file.

When core.commitGraph=false, we do not parse the commits from the
commit-graph and 'graph_pos' indicates that no commits are in the
existing file. The --split logic moves forward creating a new layer on
top that holds all reachable commits, then possibly merges down into
those layers, resulting in duplicate commits. The previous change makes
that merging process more robust to such a situation in case it happens
in the written commit-graph data.

The easy answer here is to avoid writing a commit-graph if reading the
commit-graph is disabled. Since the resulting commit-graph will would not
be read by subsequent Git processes. This is more natural than forcing
core.commitGraph to be true for the 'write' process.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09 14:16:32 -07:00
Derrick Stolee 150f11574b commit-graph: ignore duplicates when merging layers
Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and minimal CPU time.

The root cause is due to disabling core.commitGraph, which prevents
parsing commits from the lower layers during a 'git commit-graph write
--split' command. Since we use the 'graph_pos' value to determine
whether a commit is in a lower layer, we never discover that those
commits are already in the commit-graph chain and add them to the top
layer. This layer is then merged down, creating duplicates.

The test added in t5324-split-commit-graph.sh fails without this change.
However, we still have not completely removed the need for this
duplicate check. That will come in a follow-up change.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-10-09 14:16:23 -07:00
Junio C Hamano 288ed98bf7 Merge branch 'tb/bloom-improvements'
"git commit-graph write" learned to limit the number of bloom
filters that are computed from scratch with the --max-new-filters
option.

* tb/bloom-improvements:
  commit-graph: introduce 'commitGraph.maxNewFilters'
  builtin/commit-graph.c: introduce '--max-new-filters=<n>'
  commit-graph: rename 'split_commit_graph_opts'
  bloom: encode out-of-bounds filters as non-empty
  bloom/diff: properly short-circuit on max_changes
  bloom: use provided 'struct bloom_filter_settings'
  bloom: split 'get_bloom_filter()' in two
  commit-graph.c: store maximum changed paths
  commit-graph: respect 'commitGraph.readChangedPaths'
  t/helper/test-read-graph.c: prepare repo settings
  commit-graph: pass a 'struct repository *' in more places
  t4216: use an '&&'-chain
  commit-graph: introduce 'get_bloom_filter_settings()'
2020-09-29 14:01:20 -07:00
Junio C Hamano 48794acc50 Merge branch 'ds/maintenance-part-1'
A "git gc"'s big brother has been introduced to take care of more
repository maintenance tasks, not limited to the object database
cleaning.

* ds/maintenance-part-1:
  maintenance: add trace2 regions for task execution
  maintenance: add auto condition for commit-graph task
  maintenance: use pointers to check --auto
  maintenance: create maintenance.<task>.enabled config
  maintenance: take a lock on the objects directory
  maintenance: add --task option
  maintenance: add commit-graph task
  maintenance: initialize task array
  maintenance: replace run_auto_gc()
  maintenance: add --quiet option
  maintenance: create basic maintenance runner
2020-09-25 15:25:38 -07:00
Taylor Blau 809e0327f5 builtin/commit-graph.c: introduce '--max-new-filters=<n>'
Introduce a command-line flag to specify the maximum number of new Bloom
filters that a 'git commit-graph write' is willing to compute from
scratch.

Prior to this patch, a commit-graph write with '--changed-paths' would
compute Bloom filters for all selected commits which haven't already
been computed (i.e., by a previous commit-graph write with '--split'
such that a roll-up or replacement is performed).

This behavior can cause prohibitively-long commit-graph writes for a
variety of reasons:

  * There may be lots of filters whose diffs take a long time to
    generate (for example, they have close to the maximum number of
    changes, diffing itself takes a long time, etc).

  * Old-style commit-graphs (which encode filters with too many entries
    as not having been computed at all) cause us to waste time
    recomputing filters that appear to have not been computed only to
    discover that they are too-large.

This can make the upper-bound of the time it takes for 'git commit-graph
write --changed-paths' to be rather unpredictable.

To make this command behave more predictably, introduce
'--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
from scratch. This lets "computing" already-known filters proceed
quickly, while bounding the number of slow tasks that Git is willing to
do.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-18 10:35:39 -07:00
Taylor Blau 98bb796191 commit-graph: rename 'split_commit_graph_opts'
In the subsequent commit, additional options will be added to the
commit-graph API which have nothing to do with splitting.

Rename the 'split_commit_graph_opts' structure to the more-generic
'commit_graph_opts' to encompass both. Likewise, rename the 'flags'
member to instead be 'split_flags' to clarify that it only has to do
with the behavior implied by '--split'.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 21:55:50 -07:00
Taylor Blau 59f0d5073f bloom: encode out-of-bounds filters as non-empty
When a changed-path Bloom filter has either zero, or more than a
certain number (commonly 512) of entries, the commit-graph machinery
encodes it as "missing". More specifically, it sets the indices adjacent
in the BIDX chunk as equal to each other to indicate a "length 0"
filter; that is, that the filter occupies zero bytes on disk.

This has heretofore been fine, since the commit-graph machinery has no
need to care about these filters with too few or too many changed paths.
Both cases act like no filter has been generated at all, and so there is
no need to store them.

In a subsequent commit, however, the commit-graph machinery will learn
to only compute Bloom filters for some commits in the current
commit-graph layer. This is a change from the current implementation
which computes Bloom filters for all commits that are in the layer being
written. Critically for this patch, only computing some of the Bloom
filters means adding a third state for length 0 Bloom filters: zero
entries, too many entries, or "hasn't been computed".

It will be important for that future patch to distinguish between "not
representable" (i.e., zero or too-many changed paths), and "hasn't been
computed". In particular, we don't want to waste time recomputing
filters that have already been computed.

To that end, change how we store Bloom filters in the "computed but not
representable" category:

  - Bloom filters with no entries are stored as a single byte with all
    bits low (i.e., all queries to that Bloom filter will return
    "definitely not")

  - Bloom filters with too many entries are stored as a single byte with
    all bits set high (i.e., all queries to that Bloom filter will
    return "maybe").

These rules are sufficient to not incur a behavior change by changing
the on-disk representation of these two classes. Likewise, no
specification changes are necessary for the commit-graph format, either:

  - Filters that were previously empty will be recomputed and stored
    according to the new rules, and

  - old clients reading filters generated by new clients will interpret
    the filters correctly and be none the wiser to how they were
    generated.

Clients will invoke the Bloom machinery in more cases than before, but
this can be addressed by returning a NULL filter when all bits are set
high. This can be addressed in a future patch.

Note that this does increase the size of on-disk commit-graphs, but far
less than other proposals. In particular, this is generally more
efficient than storing a bitmap for which commits haven't computed their
Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
whereas storing explicit filters as above incurs a penalty of one byte
per too-large or empty commit.

In practice, these boundary commits likely occupy a small proportion of
the overall number of commits, and so the size penalty is likely smaller
than storing a bitmap for all commits.

See, for example, these relative proportions of such boundary commits
(collected by SZEDER Gábor):

                  |     Percentage of     |    commit-graph   |           |
                  |   commits modifying   |     file size     |           |
                  ├────────┬──────────────┼───────────────────┤    pct.   |
                  | 0 path | >= 512 paths | before  |  after  |   change  |
 ┌────────────────┼────────┼──────────────┼─────────┼─────────┼───────────┤
 | android-base   | 13.20% |        0.13% | 37.468M | 37.534M | +0.1741 % |
 | cmssw          |  0.15% |        0.23% | 17.118M | 17.119M | +0.0091 % |
 | cpython        |  3.07% |        0.01% |  7.967M |  7.971M | +0.0423 % |
 | elasticsearch  |  0.70% |        1.00% |  8.833M |  8.835M | +0.0128 % |
 | gcc            |  0.00% |        0.08% | 16.073M | 16.074M | +0.0030 % |
 | gecko-dev      |  0.14% |        0.64% | 59.868M | 59.874M | +0.0105 % |
 | git            |  0.11% |        0.02% |  3.895M |  3.895M | +0.0020 % |
 | glibc          |  0.02% |        0.10% |  3.555M |  3.555M | +0.0021 % |
 | go             |  0.00% |        0.07% |  3.186M |  3.186M | +0.0018 % |
 | homebrew-cask  |  0.40% |        0.02% |  7.035M |  7.035M | +0.0065 % |
 | homebrew-core  |  0.01% |        0.01% | 11.611M | 11.611M | +0.0002 % |
 | jdk            |  0.26% |        5.64% |  5.537M |  5.540M | +0.0590 % |
 | linux          |  0.01% |        0.51% | 63.735M | 63.740M | +0.0073 % |
 | llvm-project   |  0.12% |        0.03% | 25.515M | 25.516M | +0.0050 % |
 | rails          |  0.10% |        0.10% |  6.252M |  6.252M | +0.0027 % |
 | rust           |  0.07% |        0.17% |  9.364M |  9.364M | +0.0033 % |
 | tensorflow     |  0.09% |        1.02% |  7.009M |  7.010M | +0.0158 % |
 | webkit         |  0.05% |        0.31% | 17.405M | 17.406M | +0.0047 % |

(where the above increase is determined by computing a non-split
commit-graph before and after this patch).

Given that these projects are all "large" by commit count, the storage
cost by writing these filters explicitly is negligible. In the most
extreme example, android-base (which has 494,848 commits at the time of
writing) would have its commit-graph increase by a modest 68.4 KB.

Finally, a test to exercise filters which contain too many changed path
entries will be introduced in a subsequent patch.

Suggested-by: SZEDER Gábor <szeder.dev@gmail.com>
Suggested-by: Jakub Narębski <jnareb@gmail.com>
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 21:55:50 -07:00
Derrick Stolee 663b2b1b90 maintenance: add commit-graph task
The first new task in the 'git maintenance' builtin is the
'commit-graph' task. This updates the commit-graph file
incrementally with the command

	git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes most
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will verly likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 11:30:05 -07:00
Taylor Blau 9a7a9ed10d bloom: use provided 'struct bloom_filter_settings'
When 'get_or_compute_bloom_filter()' needs to compute a Bloom filter
from scratch, it looks to the default 'struct bloom_filter_settings' in
order to determine the maximum number of changed paths, number of bits
per entry, and so on.

All of these values have so far been constant, and so there was no need
to pass in a pointer from the caller (eg., the one that is stored in the
'struct write_commit_graph_context').

Start passing in a 'struct bloom_filter_settings *' instead of using the
default values to respect graph-specific settings (eg., in the case of
setting 'GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS').

In order to have an initialized value for these settings, move its
initialization to earlier in the commit-graph write.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:31:25 -07:00
Taylor Blau 312cff5207 bloom: split 'get_bloom_filter()' in two
'get_bloom_filter' takes a flag to control whether it will compute a
Bloom filter if the requested one is missing. In the next patch, we'll
add yet another parameter to this method, which would force all but one
caller to specify an extra 'NULL' parameter at the end.

Instead of doing this, split 'get_bloom_filter' into two functions:
'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only
looks up a Bloom filter (and does not compute one if it's missing,
thus dropping the 'compute_if_not_present' flag). The latter does
compute missing Bloom filters, with an additional parameter to store
whether or not it needed to do so.

This simplifies many call-sites, since the majority of existing callers
to 'get_bloom_filter' do not want missing Bloom filters to be computed
(so they can drop the parameter entirely and use the simpler version of
the function).

While we're at it, instrument the new 'get_or_compute_bloom_filter()'
with counters in the 'write_commit_graph_context' struct which store
the number of filters that we did and didn't compute, as well as filters
that were truncated.

It would be nice to drop the 'compute_if_not_present' flag entirely,
since all remaining callers of 'get_or_compute_bloom_filter' pass it as
'1', but this will change in a future patch and hence cannot be removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:31:25 -07:00
Taylor Blau 97ffa4fab5 commit-graph.c: store maximum changed paths
For now, we assume that there is a fixed constant describing the
maximum number of changed paths we are willing to store in a Bloom
filter.

Prepare for that to (at least partially) not be the case by making it a
member of the 'struct bloom_filter_settings'. This will be helpful in
the subsequent patches by reducing the size of test cases that exercise
storing too many changed paths, as well as preparing for an eventual
future in which this value might change.

This patch alone does not cause newly generated Bloom filters to use
a custom upper-bound on the maximum number of changed paths a single
Bloom filter can hold, that will occur in a later patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-17 09:29:22 -07:00
Taylor Blau b66d84756f commit-graph: respect 'commitGraph.readChangedPaths'
Git uses the 'core.commitGraph' configuration value to control whether
or not the commit graph is used when parsing commits or performing a
traversal.

Now that commit-graphs can also contain a section for changed-path Bloom
filters, administrators that already have commit-graphs may find it
convenient to use those graphs without relying on their changed-path
Bloom filters. This can happen, for example, during a staged roll-out,
or in the event of an incident.

Introduce 'commitGraph.readChangedPaths' to control whether or not Bloom
filters are read. Note that this configuration is independent from both:

  - 'core.commitGraph', to allow flexibility in using all parts of a
    commit-graph _except_ for its Bloom filters.

  - The '--changed-paths' option for 'git commit-graph write', to allow
    reading and writing Bloom filters to be controlled independently.

When the variable is set, pretend as if no Bloom data was specified at
all. This avoids adding additional special-casing outside of the
commit-graph internals.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Taylor Blau ab14d0676c commit-graph: pass a 'struct repository *' in more places
In a future commit, some commit-graph internals will want access to
'r->settings', but we only have the 'struct object_directory *'
corresponding to that repository.

Add an additional parameter to pass the repository around in more
places.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Taylor Blau 4f3644056a commit-graph: introduce 'get_bloom_filter_settings()'
Many places in the code often need a pointer to the commit-graph's
'struct bloom_filter_settings', in which case they often take the value
from the top-most commit-graph.

In the non-split case, this works as expected. In the split case,
however, things get a little tricky. Not all layers in a chain of
incremental commit-graphs are required to themselves have Bloom data,
and so whether or not some part of the code uses Bloom filters depends
entirely on whether or not the top-most level of the commit-graph chain
has Bloom filters.

This has been the behavior since Bloom filters were introduced, and has
been codified into the tests since a759bfa9ee (t4216: add end to end
tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130
requires that Bloom filters are not used in exactly the case described
earlier.

There is no reason that this needs to be the case, since it is perfectly
valid for commits in an earlier layer to have Bloom filters when commits
in a newer layer do not.

Since Bloom settings are guaranteed in practice to be the same for any
layer in a chain that has Bloom data, it is sufficient to traverse the
'->base_graph' pointer until either (1) a non-null 'struct
bloom_filter_settings *' is found, or (2) until we are at the root of
the commit-graph chain.

Introduce a 'get_bloom_filter_settings()' function that does just this,
and use it instead of purely dereferencing the top-most graph's
'->bloom_filter_settings' pointer.

While we're at it, add an additional test in t5324 to guard against code
in the commit-graph writing machinery that doesn't correctly handle a
NULL 'struct bloom_filter *'.

Co-authored-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-09 12:51:48 -07:00
Derrick Stolee 665d70ad03 commit-graph: use the "hash version" byte
The commit-graph format reserved a byte among the header of the file to
store a "hash version". During the SHA-256 work, this was not modified
because file formats are not necessarily intended to work across hash
versions. If a repository has SHA-256 as its hash algorithm, it
automatically up-shifts the lengths of object names in all necessary
formats.

However, since we have this byte available for adjusting the version, we
can make the file formats more obviously incompatible instead of relying
on other context from the repository.

Update the oid_version() method in commit-graph.c to add a new value, 2,
for sha-256. This automatically writes the new value in a SHA-256
repository _and_ verifies the value is correct. This is a breaking
change relative to the current 'master' branch since 092b677 (Merge
branch 'bc/sha-256-cvs-svn-updates', 2020-08-13) but it is not breaking
relative to any released version of Git.

The test impact is relatively minor: the output of 'test-tool
read-graph' lists the header information, so those instances of '1' need
to be replaced with a variable determined by GIT_TEST_DEFAULT_HASH. A
more careful test is added that specifically creates a repository of
each type then swaps the commit-graph files. The important value here is
that the "git log" command succeeds while writing a message to stderr.

Helped-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 16:45:14 -07:00
Junio C Hamano 70cdbbe3a7 Merge branch 'ds/commit-graph-bloom-updates' into master
Updates to the changed-paths bloom filter.

* ds/commit-graph-bloom-updates:
  commit-graph: check all leading directories in changed path Bloom filters
  revision: empty pathspecs should not use Bloom filters
  revision.c: fix whitespace
  commit-graph: check chunk sizes after writing
  commit-graph: simplify chunk writes into loop
  commit-graph: unify the signatures of all write_graph_chunk_*() functions
  commit-graph: persist existence of changed-paths
  bloom: fix logic in get_bloom_filter()
  commit-graph: change test to die on parse, not load
  commit-graph: place bloom_settings in context
2020-07-30 13:20:31 -07:00
Junio C Hamano de6dda0dc3 Merge branch 'sg/commit-graph-cleanups' into master
The changed-path Bloom filter is improved using ideas from an
independent implementation.

* sg/commit-graph-cleanups:
  commit-graph: simplify write_commit_graph_file() #2
  commit-graph: simplify write_commit_graph_file() #1
  commit-graph: simplify parse_commit_graph() #2
  commit-graph: simplify parse_commit_graph() #1
  commit-graph: clean up #includes
  diff.h: drop diff_tree_oid() & friends' return value
  commit-slab: add a function to deep free entries on the slab
  commit-graph-format.txt: all multi-byte numbers are in network byte order
  commit-graph: fix parsing the Chunk Lookup table
  tree-walk.c: don't match submodule entries for 'submod/anything'
2020-07-30 13:20:30 -07:00
Junio C Hamano 12f5eb9f08 Merge branch 'sg/commit-graph-progress-fix' into master
The code to produce progress output from "git commit-graph --write"
had a few breakages, which have been fixed.

* sg/commit-graph-progress-fix:
  commit-graph: fix "Writing out commit graph" progress counter
  commit-graph: fix progress of reachable commits
2020-07-15 16:29:43 -07:00
Junio C Hamano 24ecfdf206 Merge branch 'tb/fix-persistent-shallow' into master
When "fetch.writeCommitGraph" configuration is set in a shallow
repository and a fetch moves the shallow boundary, we wrote out
broken commit-graph files that do not match the reality, which has
been corrected.

* tb/fix-persistent-shallow:
  commit.c: don't persist substituted parents when unshallowing
2020-07-09 14:00:44 -07:00
SZEDER Gábor 150cd3b61d commit-graph: fix "Writing out commit graph" progress counter
76ffbca71a (commit-graph: write Bloom filters to commit graph file,
2020-04-06) added two delayed progress lines to writing the Bloom
filter index and data chunk.  This is wrong, because a single common
progress is used while writing all chunks, which is not updated while
writing these two new chunks, resulting in incomplete-looking "done"
lines:

  Expanding reachable commits in commit graph: 888679, done.
  Computing commit changed paths Bloom filters: 100% (888678/888678), done.
  Writing out commit graph in 6 passes:  66% (3554712/5332068), done.

Use the common 'struct progress' instance while writing the Bloom
filter chunks as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09 10:28:49 -07:00
SZEDER Gábor 6f9d5f2fda commit-graph: fix progress of reachable commits
To display a progress line while iterating over all refs,
d335ce8f24 (commit-graph.c: show progress of finding reachable
commits, 2020-05-13) should have added a pair of
start_delayed_progress() and stop_progress() calls around a
for_each_ref() invocation.  Alas, the stop_progress() call ended up at
the wrong place, after write_commit_graph(), which does all the
commit-graph computation and writing, and has several progress lines
of its own.  Consequently, that new

  Collecting referenced commits: 123

progress line is overwritten by the first progress line shown by
write_commit_graph(), and its final "done" line is shown last, after
everything is finished:

  Expanding reachable commits in commit graph: 344786, done.
  Computing commit changed paths Bloom filters: 100% (344786/344786), done.
  Collecting referenced commits: 154, done.

Move that stop_progress() call to the right place.

While at it, drop the unnecessary 'if (data.progress)' condition
protecting the stop_progress() call, because that function is prepared
to handle a NULL progress struct.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-09 10:27:23 -07:00
Taylor Blau ce16364e89 commit.c: don't persist substituted parents when unshallowing
Since 37b9dcabfc (shallow.c: use '{commit,rollback}_shallow_file',
2020-04-22), Git knows how to reset stat-validity checks for the
$GIT_DIR/shallow file, allowing it to change between a shallow and
non-shallow state in the same process (e.g., in the case of 'git fetch
--unshallow').

However, when $GIT_DIR/shallow changes, Git does not alter or remove any
grafts (nor substituted parents) in memory.

This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph
set to true. Ordinarily in a shallow repository (and before 37b9dcabfc,
even in this case), commit_graph_compatible() would return false,
indicating that the repository should not be used to write a
commit-graphs (since commit-graph files cannot represent a shallow
history). But since 37b9dcabfc, in an --unshallow operation that check
succeeds.

Thus even though the repository isn't shallow any longer (that is, we
have all of the objects), the in-core representation of those objects
still has munged parents at the shallow boundaries.  When the
commit-graph write proceeds, we use the incorrect parentage, producing
wrong results.

There are two ways for a user to work around this: either (1) set
'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after
unshallowing.

One way to fix this would be to reset the parsed object pool entirely
(flushing the cache and thus preventing subsequent reads from modifying
their parents) after unshallowing. That would produce a problem when
callers have a now-stale reference to the old pool, and so this patch
implements a different approach. Instead, attach a new bit to the pool,
'substituted_parent', which indicates if the repository *ever* stored a
commit which had its parents modified (i.e., the shallow boundary
prior to unshallowing).

This bit needs to be sticky because all reads subsequent to modifying a
commit's parents are unreliable when unshallowing. Modify the check in
'commit_graph_compatible' to take this bit into account, and correctly
avoid generating commit-graphs in this case, thus solving the bug.

Helped-by: Derrick Stolee <dstolee@microsoft.com>
Helped-by: Jonathan Nieder <jrnieder@gmail.com>
Reported-by: Jay Conrod <jayconrod@google.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-08 16:13:46 -07:00
SZEDER Gábor 2dd4fed927 commit-graph: check chunk sizes after writing
In my experience while experimenting with new commit-graph chunks,
early versions of the corresponding new write_commit_graph_my_chunk()
functions are, sadly but not surprisingly, often buggy, and write more
or less data than they are supposed to, especially if the chunk size
is not directly proportional to the number of commits.  This then
causes all kinds of issues when reading such a bogus commit-graph
file, raising the question of whether the writing or the reading part
happens to be buggy this time.

Let's catch such issues early, already when writing the commit-graph
file, and check that each write_graph_chunk_*() function wrote the
amount of data that it was expected to, and what has been encoded in
the Chunk Lookup table.  Now that all commit-graph chunks are written
in a loop we can do this check in a single place for all chunks, and
any chunks added in the future will get checked as well.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
SZEDER Gábor 17e6275fc9 commit-graph: simplify chunk writes into loop
In write_commit_graph_file() we now have one block of code filling the
array of 'struct chunk_info' with the IDs and sizes of chunks to be
written, and an other block of code calling the functions responsible
for writing individual chunks.  In case of optional chunks like Extra
Edge List an Base Graphs List there is also a condition checking
whether that chunk is necessary/desired, and that same condition is
repeated in both blocks of code. Other, newer chunks have similar
optional conditions.

Eliminate these repeated conditions by storing the function pointers
responsible for writing individual chunks in the 'struct chunk_info'
array as well, and calling them in a loop to write the commit-graph
file.  This will open up the possibility for a bit of foolproofing in
the following patch.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
SZEDER Gábor 9bab081dfa commit-graph: unify the signatures of all write_graph_chunk_*() functions
Update the write_graph_chunk_*() helper functions to have the same
signature:

  - Return an int error code from all these functions.
    write_graph_chunk_base() already has an int error code, now the
    others will have one, too, but since they don't indicate any
    error, they will always return 0.

  - Drop the hash size parameter of write_graph_chunk_oids() and
    write_graph_chunk_data(); its value can be read directly from
    'the_hash_algo' inside these functions as well.

This opens up the possibility for further cleanups and foolproofing in
the following two patches.

Helped-by: René Scharfe <l.s.r@web.de>
Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 0087a87ba8 commit-graph: persist existence of changed-paths
The changed-path Bloom filters were released in v2.27.0, but have a
significant drawback. A user can opt-in to writing the changed-path
filters using the "--changed-paths" option to "git commit-graph write"
but the next write will drop the filters unless that option is
specified.

This becomes even more important when considering the interaction with
gc.writeCommitGraph (on by default) or fetch.writeCommitGraph (part of
features.experimental). These config options trigger commit-graph writes
that the user did not signal, and hence there is no --changed-paths
option available.

Allow a user that opts-in to the changed-path filters to persist the
property of "my commit-graph has changed-path filters" automatically. A
user can drop filters using the --no-changed-paths option.

In the process, we need to be extremely careful to match the Bloom
filter settings as specified by the commit-graph. This will allow future
versions of Git to customize these settings, and the version with this
change will persist those settings as commit-graphs are rewritten on
top.

Use the trace2 API to signal the settings used during the write, and
check that output in a test after manually adjusting the correct bytes
in the commit-graph file.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 949197420e bloom: fix logic in get_bloom_filter()
The get_bloom_filter() method is a bit complicated in some parts where
it does not need to be. In particular, it needs to return a NULL filter
only when compute_if_not_present is zero AND the filter data cannot be
loaded from a commit-graph file. This currently happens by accident
because the commit-graph does not load changed-path Bloom filters from
an existing commit-graph when writing a new one. This will change in a
later patch.

Also clean up some style issues while we are here.

One side-effect of returning a NULL filter is that the filters that are
reported as "too large" will now be reported as NULL insead of length
zero. This case was not properly covered before, so add a test. Further,
remote the counting of the zero-length filters from revision.c and the
trace2 logs.

Helped-by: René Scharfe <l.s.r@web.de>
Helped-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-01 14:17:43 -07:00
Derrick Stolee 7b671f8c2b commit-graph: change test to die on parse, not load
43d3561 (commit-graph write: don't die if the existing graph is corrupt,
2019-03-25) introduced the GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD environment
variable. This was created to verify that commit-graph was not loaded
when writing a new non-incremental commit-graph.

An upcoming change wants to load a commit-graph in some valuable cases,
but we want to maintain that we don't trust the commit-graph data when
writing our new file. Instead of dying on load, instead die if we ever
try to parse a commit from the commit-graph. This functionally verifies
the same intended behavior, but allows a more advanced feature in the
next change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-23 17:12:08 -07:00
Derrick Stolee 98037f2bf2 commit-graph: place bloom_settings in context
Place an instance of struct bloom_settings into the struct
write_commit_graph_context. This allows simplifying the function
prototype of write_graph_chunk_bloom_data(). This will allow us
to combine the function prototypes and use function pointers to
simplify write_commit_graph_file().

By using a pointer, we can later replace the settings to match those
that exist in the current commit-graph, in case a future Git version
allows customization of these parameters.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-23 17:12:08 -07:00
Abhishek Kumar c752ad09c4 commit-graph: minimize commit_graph_data_slab access
In an earlier patch, multiple struct acccesses to `graph_pos` and
`generation` were auto-converted to multiple method calls.

Since the values are fixed and commit-slab access costly, we would be
better off with storing the values as a local variable and reusing it.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:52 -07:00
Abhishek Kumar c49c82aa4c commit: move members graph_pos, generation to a slab
We remove members `graph_pos` and `generation` from the struct commit.
The default assignments in init_commit_node() are no longer valid,
which is fine as the slab helpers return appropriate default values and
the assignments are removed.

We will replace existing use of commit->generation and commit->graph_pos
by commit_graph_data_slab helpers using
`contrib/coccinelle/commit.cocci'.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:30 -07:00
Abhishek Kumar 4844812b9e commit-graph: introduce commit_graph_data_slab
The struct commit is used in many contexts. However, members
`generation` and `graph_pos` are only used for commit-graph related
operations and otherwise waste memory.

This wastage would have been more pronounced as we transition to
generation number v2, which uses 64-bit generation number instead of
current 32-bits.

As they are often accessed together, let's introduce struct
commit_graph_data and move them to a commit_graph_data slab.

While the overall test suite runs just as fast as master,
(series: 26m48s, master: 27m34s, faster by 2.87%), certain commands
like `git merge-base --is-ancestor` were slowed by 40% as discovered
by Szeder Gábor [1]. After minimizing commit-slab access, the slow down
persists but is closer to 20%.

Derrick Stolee believes the slow down is attributable to the underlying
algorithm rather than the slowness of commit-slab access [2] and we will
follow-up in a later series.

[1]: https://lore.kernel.org/git/20200607195347.GA8232@szeder.dev/
[2]: https://lore.kernel.org/git/13db757a-9412-7f1e-805c-8a028c4ab2b1@gmail.com/

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-17 14:37:23 -07:00
Junio C Hamano dc57a9be5e Merge branch 'tb/commit-graph-no-check-oids'
Clean-up the commit-graph codepath.

* tb/commit-graph-no-check-oids:
  commit-graph: drop COMMIT_GRAPH_WRITE_CHECK_OIDS flag
  t5318: reorder test below 'graph_read_expect'
  commit-graph.c: simplify 'fill_oids_from_commits'
  builtin/commit-graph.c: dereference tags in builtin
  builtin/commit-graph.c: extract 'read_one_commit()'
  commit-graph.c: peel refs in 'add_ref_to_set'
  commit-graph.c: show progress of finding reachable commits
  commit-graph.c: extract 'refs_cb_data'
2020-06-08 18:06:27 -07:00
SZEDER Gábor 7fbfe07ab4 commit-graph: simplify write_commit_graph_file() #2
Unify the 'chunk_ids' and 'chunk_sizes' arrays into an array of
'struct chunk_info'.  This will allow more cleanups in the following
patches.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor bb4d60e5d5 commit-graph: simplify write_commit_graph_file() #1
In write_commit_graph_file() one block of code fills the array of
chunk IDs, another block of code fills the array of chunk offsets,
then the chunk IDs and offsets are written to the Chunk Lookup table,
and finally a third block of code writes the actual chunks.  In case
of optional chunks like Extra Edge List and Base Graphs List there is
also a condition checking whether that chunk is necessary/desired, and
that same condition is repeated in all those three blocks of code.
This patch series is about to add more optional chunks, so there would
be even more repeated conditions.

Those chunk offsets are relative to the beginning of the file, so they
inherently depend on the size of the Chunk Lookup table, which in turn
depends on the number of chunks that are to be written to the
commit-graph file.  IOW at the time we set the first chunk's ID we
can't yet know its offset, because we don't yet know how many chunks
there are.

Simplify this by initially filling an array of chunk sizes, not
offsets, and calculate the offsets based on the chunk sizes only
later, while we are writing the Chunk Lookup table.  This way we can
fill the arrays of chunk IDs and sizes in one go, eliminating one set
of repeated conditions.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor 5cfa438a76 commit-graph: simplify parse_commit_graph() #2
The Chunk Lookup table stores the chunks' starting offset in the
commit-graph file, not their sizes.  Consequently, the size of a chunk
can only be calculated by subtracting its offset from the offset of
the subsequent chunk (or that of the terminating label).  This is
currenly implemented in a bit complicated way: as we iterate over the
entries of the Chunk Lookup table, we check the id of each chunk and
store its starting offset, then we check the id of the last seen chunk
and calculate its size using its previously saved offset.  At the
moment there is only one chunk for which we calculate its size, but
this patch series will add more, and the repeated chunk id checks are
not that pretty.

Instead let's read ahead the offset of the next chunk on each
iteration, so we can calculate the size of each chunk right away,
right where we store its starting offset.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor 2ad4f1a7c4 commit-graph: simplify parse_commit_graph() #1
While we iterate over all entries of the Chunk Lookup table we make
sure that we don't attempt to read past the end of the mmap-ed
commit-graph file, and check in each iteration that the chunk ID and
offset we are about to read is still within the mmap-ed memory region.
However, these checks in each iteration are not really necessary,
because the number of chunks in the commit-graph file is already known
before this loop from the just parsed commit-graph header.

So let's check that the commit-graph file is large enough for all
entries in the Chunk Lookup table before we start iterating over those
entries, and drop those per-iteration checks.  While at it, take into
account the size of everything that is necessary to have a valid
commit-graph file, i.e. the size of the header, the size of the
mandatory OID Fanout chunk, and the size of the signature in the
trailer as well.

Note that this necessitates the change of the error message as well,
and, consequently, have to update the 'detect incorrect chunk count'
test in 't5318-commit-graph.sh' as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00
SZEDER Gábor fa7965309e commit-graph: clean up #includes
Our CodingGuidelines says that it's sufficient to include one of
'git-compat-util.h' and 'cache.h', but both 'commit-graph.c' and
'commit-graph.h' include both.  Let's include only 'git-compat-util.h'
to loose a bunch of unnecessary dependencies; but include 'hash.h',
because 'commit-graph.h' does require the definition of 'struct
object_id'.

'commit-graph.h' explicitly includes 'repository.h' and
'string-list.h', but only needs the declaration of a few structs from
them.  Drop these includes and forward-declare the necessary structs
instead.

'commit-graph.c' includes 'dir.h', but doesn't actually use anything
from there, so let's drop that #include as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-06-08 12:28:49 -07:00