git/t/t5324-split-commit-graph.sh

#!/bin/sh
test_description='split commit graph'
TEST_PASSES_SANITIZE_LEAK=true
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-chunk.sh
GIT_TEST_COMMIT_GRAPH=0
GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS=0
test_expect_success 'setup repo' '
git init &&
git config core.commitGraph true &&
git config gc.writeCommitGraph false &&
infodir=".git/objects/info" &&
graphdir="$infodir/commit-graphs" &&
test_oid_cache <<-EOM
shallow sha1:2132
shallow sha256:2436
base sha1:1408
base sha256:1528
commit-graph: use the "hash version" byte The commit-graph format reserved a byte among the header of the file to store a "hash version". During the SHA-256 work, this was not modified because file formats are not necessarily intended to work across hash versions. If a repository has SHA-256 as its hash algorithm, it automatically up-shifts the lengths of object names in all necessary formats. However, since we have this byte available for adjusting the version, we can make the file formats more obviously incompatible instead of relying on other context from the repository. Update the oid_version() method in commit-graph.c to add a new value, 2, for sha-256. This automatically writes the new value in a SHA-256 repository _and_ verifies the value is correct. This is a breaking change relative to the current 'master' branch since 092b677 (Merge branch 'bc/sha-256-cvs-svn-updates', 2020-08-13) but it is not breaking relative to any released version of Git. The test impact is relatively minor: the output of 'test-tool read-graph' lists the header information, so those instances of '1' need to be replaced with a variable determined by GIT_TEST_DEFAULT_HASH. A more careful test is added that specifically creates a repository of each type then swaps the commit-graph files. The important value here is that the "git log" command succeeds while writing a message to stderr. Helped-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 14:04:47 +00:00
oid_version sha1:1
oid_version sha256:2
EOM
'
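# graph_read_expect <num-commits> [<num-base-graphs>] [<no-generation-data>]
# Build the expected 'test-tool read-graph' output for the current
# commit-graph and compare it against the actual output. When the third
# argument is given, drop "read_generation_data" from the expected options.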
graph_read_expect() {
NUM_BASE=0
if test ! -z $2
then
NUM_BASE=$2
fi
OPTIONS=
if test -z "$3"
then
OPTIONS=" read_generation_data"
fi
cat >expect <<- EOF
header: 43475048 1 $(test_oid oid_version) 4 $NUM_BASE
num_commits: $1
chunks: oid_fanout oid_lookup commit_metadata generation_data
options:$OPTIONS
EOF
test-tool read-graph >output &&
test_cmp expect output
}
test_expect_success POSIXPERM 'tweak umask for modebit tests' '
umask 022
'
test_expect_success 'create commits and write commit-graph' '
for i in $(test_seq 3)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git commit-graph write --reachable &&
test_path_is_file $infodir/commit-graph &&
graph_read_expect 3
'
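# Run the same git command twice, once with core.commitGraph enabled and
# once with it disabled (optionally inside directory $2), and check that
# the output is identical.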
graph_git_two_modes() {
git ${2:+ -C "$2"} -c core.commitGraph=true $1 >output &&
git ${2:+ -C "$2"} -c core.commitGraph=false $1 >expect &&
test_cmp expect output
}
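# graph_git_behavior <message> <branch> <compare-branch> [<dir>]
# Register a test that runs common history commands and verifies they
# behave the same with and without the commit-graph.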
graph_git_behavior() {
MSG=$1
BRANCH=$2
COMPARE=$3
DIR=$4
test_expect_success "check normal git operations: $MSG" '
graph_git_two_modes "log --oneline $BRANCH" "$DIR" &&
graph_git_two_modes "log --topo-order $BRANCH" "$DIR" &&
graph_git_two_modes "log --graph $COMPARE..$BRANCH" "$DIR" &&
graph_git_two_modes "branch -vv" "$DIR" &&
graph_git_two_modes "merge-base -a $BRANCH $COMPARE" "$DIR"
'
}
graph_git_behavior 'graph exists' commits/3 commits/1
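# Check that every graph file listed in <dir>/commit-graph-chain exists.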
verify_chain_files_exist() {
for hash in $(cat $1/commit-graph-chain)
do
test_path_is_file $1/graph-$hash.graph || return 1
done
}
test_expect_success 'add more commits, and write a new base graph' '
git reset --hard commits/1 &&
for i in $(test_seq 4 5)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git reset --hard commits/2 &&
for i in $(test_seq 6 10)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git reset --hard commits/2 &&
git merge commits/4 &&
git branch merge/1 &&
git reset --hard commits/4 &&
git merge commits/6 &&
git branch merge/2 &&
git commit-graph write --reachable &&
graph_read_expect 12
'
test_expect_success 'fork and fail to base a chain on a commit-graph file' '
test_when_finished rm -rf fork &&
git clone . fork &&
(
cd fork &&
rm .git/objects/info/commit-graph &&
echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
test_commit new-commit &&
git commit-graph write --reachable --split &&
test_path_is_file $graphdir/commit-graph-chain &&
test_line_count = 1 $graphdir/commit-graph-chain &&
verify_chain_files_exist $graphdir
)
'
test_expect_success 'add three more commits, write a tip graph' '
git reset --hard commits/3 &&
git merge merge/1 &&
git merge commits/5 &&
git merge merge/2 &&
git branch merge/3 &&
git commit-graph write --reachable --split &&
test_path_is_missing $infodir/commit-graph &&
test_path_is_file $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 2 graph-files &&
verify_chain_files_exist $graphdir
'
graph_git_behavior 'split commit-graph: merge 3 vs 2' merge/3 merge/2
test_expect_success 'add one commit, write a tip graph' '
test_commit 11 &&
git branch commits/11 &&
git commit-graph write --reachable --split &&
test_path_is_missing $infodir/commit-graph &&
test_path_is_file $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 3 graph-files &&
verify_chain_files_exist $graphdir
'
graph_git_behavior 'three-layer commit-graph: commit 11 vs 6' commits/11 commits/6
test_expect_success 'add one commit, write a merged graph' '
test_commit 12 &&
git branch commits/12 &&
git commit-graph write --reachable --split &&
test_path_is_file $graphdir/commit-graph-chain &&
test_line_count = 2 $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 2 graph-files &&
verify_chain_files_exist $graphdir
'
graph_git_behavior 'merged commit-graph: commit 12 vs 6' commits/12 commits/6
test_expect_success 'create fork and chain across alternate' '
git clone . fork &&
(
cd fork &&
git config core.commitGraph true &&
rm -rf $graphdir &&
echo "$(pwd)/../.git/objects" >.git/objects/info/alternates &&
test_commit 13 &&
git branch commits/13 &&
git commit-graph write --reachable --split &&
test_path_is_file $graphdir/commit-graph-chain &&
test_line_count = 3 $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files &&
git -c core.commitGraph=true rev-list HEAD >expect &&
git -c core.commitGraph=false rev-list HEAD >actual &&
test_cmp expect actual &&
test_commit 14 &&
git commit-graph write --reachable --split --object-dir=.git/objects/ &&
test_line_count = 3 $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files
)
'
if test -d fork
then
graph_git_behavior 'alternate: commit 13 vs 6' commits/13 origin/commits/6 "fork"
fi
test_expect_success 'test merge strategy constants' '
git clone . merge-2 &&
(
cd merge-2 &&
git config core.commitGraph true &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test_commit 14 &&
git commit-graph write --reachable --split --size-multiple=2 &&
test_line_count = 3 $graphdir/commit-graph-chain
) &&
git clone . merge-10 &&
(
cd merge-10 &&
git config core.commitGraph true &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test_commit 14 &&
git commit-graph write --reachable --split --size-multiple=10 &&
test_line_count = 1 $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files
) &&
git clone . merge-10-expire &&
(
cd merge-10-expire &&
git config core.commitGraph true &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test_commit 15 &&
touch $graphdir/to-delete.graph $graphdir/to-keep.graph &&
test-tool chmtime =1546362000 $graphdir/to-delete.graph &&
test-tool chmtime =1546362001 $graphdir/to-keep.graph &&
git commit-graph write --reachable --split --size-multiple=10 \
--expire-time="2019-01-01 12:00 -05:00" &&
test_line_count = 1 $graphdir/commit-graph-chain &&
test_path_is_missing $graphdir/to-delete.graph &&
test_path_is_file $graphdir/to-keep.graph &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 3 graph-files
) &&
git clone --no-hardlinks . max-commits &&
(
cd max-commits &&
git config core.commitGraph true &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test_commit 16 &&
test_commit 17 &&
git commit-graph write --reachable --split --max-commits=1 &&
test_line_count = 1 $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files
)
'
test_expect_success 'remove commit-graph-chain file after flattening' '
git clone . flatten &&
(
cd flatten &&
test_line_count = 2 $graphdir/commit-graph-chain &&
git commit-graph write --reachable &&
test_path_is_missing $graphdir/commit-graph-chain &&
ls $graphdir >graph-files &&
test_line_count = 0 graph-files
)
'
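# corrupt_file <file> <pos> [<data>]
# Overwrite the byte(s) at offset <pos> in <file> with <data> (NUL by default).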
corrupt_file() {
file=$1
pos=$2
data="${3:-\0}"
chmod a+w "$file" &&
printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
}
test_expect_success 'verify hashes along chain, even in shallow' '
git clone --no-hardlinks . verify &&
(
cd verify &&
git commit-graph verify &&
base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
corrupt_file "$base_file" $(test_oid shallow) "\01" &&
test_must_fail git commit-graph verify --shallow 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "incorrect checksum" err
)
'
test_expect_success 'verify notices chain slice which is bogus (base)' '
git clone --no-hardlinks . verify-chain-bogus-base &&
(
cd verify-chain-bogus-base &&
git commit-graph verify &&
base_file=$graphdir/graph-$(sed -n 1p $graphdir/commit-graph-chain).graph &&
echo "garbage" >$base_file &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
grep "commit-graph file is too small" err
)
'
test_expect_success 'verify notices chain slice which is bogus (tip)' '
git clone --no-hardlinks . verify-chain-bogus-tip &&
(
cd verify-chain-bogus-tip &&
git commit-graph verify &&
tip_file=$graphdir/graph-$(sed -n 2p $graphdir/commit-graph-chain).graph &&
echo "garbage" >$tip_file &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
grep "commit-graph file is too small" err
)
'
test_expect_success 'verify --shallow does not check base contents' '
git clone --no-hardlinks . verify-shallow &&
(
cd verify-shallow &&
git commit-graph verify &&
base_file=$graphdir/graph-$(head -n 1 $graphdir/commit-graph-chain).graph &&
corrupt_file "$base_file" 1500 "\01" &&
git commit-graph verify --shallow &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "incorrect checksum" err
)
'
test_expect_success 'warn on base graph chunk incorrect' '
git clone --no-hardlinks . base-chunk &&
(
cd base-chunk &&
git commit-graph verify &&
base_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
corrupt_file "$base_file" $(test_oid base) "\01" &&
test_must_fail git commit-graph verify --shallow 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "commit-graph chain does not match" err
)
'
test_expect_success 'verify after commit-graph-chain corruption (base)' '
git clone --no-hardlinks . verify-chain-base &&
(
cd verify-chain-base &&
corrupt_file "$graphdir/commit-graph-chain" 30 "G" &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "invalid commit-graph chain" err &&
corrupt_file "$graphdir/commit-graph-chain" 30 "A" &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "unable to find all commit-graph files" err
)
'
test_expect_success 'verify after commit-graph-chain corruption (tip)' '
git clone --no-hardlinks . verify-chain-tip &&
(
cd verify-chain-tip &&
corrupt_file "$graphdir/commit-graph-chain" 70 "G" &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "invalid commit-graph chain" err &&
corrupt_file "$graphdir/commit-graph-chain" 70 "A" &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "unable to find all commit-graph files" err
)
'
test_expect_success 'verify notices too-short chain file' '
git clone --no-hardlinks . verify-chain-short &&
(
cd verify-chain-short &&
git commit-graph verify &&
echo "garbage" >$graphdir/commit-graph-chain &&
test_must_fail git commit-graph verify 2>test_err &&
grep -v "^+" test_err >err &&
grep "commit-graph chain file too small" err
)
'
test_expect_success 'verify across alternates' '
git clone --no-hardlinks . verify-alt &&
(
cd verify-alt &&
rm -rf $graphdir &&
altdir="$(pwd)/../.git/objects" &&
echo "$altdir" >.git/objects/info/alternates &&
git commit-graph verify --object-dir="$altdir/" &&
test_commit extra &&
git commit-graph write --reachable --split &&
tip_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
corrupt_file "$tip_file" 1500 "\01" &&
test_must_fail git commit-graph verify --shallow 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "incorrect checksum" err
)
'
test_expect_success 'reader bounds-checks base-graph chunk' '
git clone --no-hardlinks . corrupt-base-chunk &&
(
cd corrupt-base-chunk &&
tip_file=$graphdir/graph-$(tail -n 1 $graphdir/commit-graph-chain).graph &&
corrupt_chunk_file "$tip_file" BASE clear 01020304 &&
git -c core.commitGraph=false log >expect.out &&
git -c core.commitGraph=true log >out 2>err &&
test_cmp expect.out out &&
grep "commit-graph base graphs chunk is too small" err
)
'
test_expect_success 'add octopus merge' '
git reset --hard commits/10 &&
git merge commits/3 commits/4 &&
git branch merge/octopus &&
git commit-graph write --reachable --split &&
git commit-graph verify --progress 2>err &&
test_line_count = 1 err &&
grep "Verifying commits in commit graph: 100% (18/18)" err &&
test_grep ! warning err &&
test_line_count = 3 $graphdir/commit-graph-chain
'
graph_git_behavior 'graph exists' merge/octopus commits/12
test_expect_success 'split across alternate where alternate is not split' '
git commit-graph write --reachable &&
test_path_is_file .git/objects/info/commit-graph &&
cp .git/objects/info/commit-graph . &&
git clone --no-hardlinks . alt-split &&
(
cd alt-split &&
rm -f .git/objects/info/commit-graph &&
echo "$(pwd)"/../.git/objects >.git/objects/info/alternates &&
test_commit 18 &&
git commit-graph write --reachable --split &&
test_line_count = 1 $graphdir/commit-graph-chain
) &&
test_cmp commit-graph .git/objects/info/commit-graph
'
test_expect_success '--split=no-merge always writes an incremental' '
test_when_finished rm -rf a b &&
rm -rf $graphdir $infodir/commit-graph &&
git reset --hard commits/2 &&
git rev-list HEAD~1 >a &&
git rev-list HEAD >b &&
git commit-graph write --split --stdin-commits <a &&
git commit-graph write --split=no-merge --stdin-commits <b &&
test_line_count = 2 $graphdir/commit-graph-chain
'
test_expect_success '--split=replace replaces the chain' '
rm -rf $graphdir $infodir/commit-graph &&
git reset --hard commits/3 &&
git rev-list -1 HEAD~2 >a &&
git rev-list -1 HEAD~1 >b &&
git rev-list -1 HEAD >c &&
git commit-graph write --split=no-merge --stdin-commits <a &&
git commit-graph write --split=no-merge --stdin-commits <b &&
git commit-graph write --split=no-merge --stdin-commits <c &&
test_line_count = 3 $graphdir/commit-graph-chain &&
git commit-graph write --stdin-commits --split=replace <b &&
test_path_is_missing $infodir/commit-graph &&
test_path_is_file $graphdir/commit-graph-chain &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files &&
verify_chain_files_exist $graphdir &&
graph_read_expect 2
'
test_expect_success ULIMIT_FILE_DESCRIPTORS 'handles file descriptor exhaustion' '
git init ulimit &&
(
cd ulimit &&
for i in $(test_seq 64)
do
test_commit $i &&
run_with_limited_open_files test_might_fail git commit-graph write \
--split=no-merge --reachable || return 1
done
)
'
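# Check that the files of a split commit-graph honor the permission modes
# implied by core.sharedrepository.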
while read mode modebits
do
test_expect_success POSIXPERM "split commit-graph respects core.sharedrepository $mode" '
rm -rf $graphdir $infodir/commit-graph &&
git reset --hard commits/1 &&
test_config core.sharedrepository "$mode" &&
git commit-graph write --split --reachable &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files &&
echo "$modebits" >expect &&
test_modebits $graphdir/graph-*.graph >actual &&
test_cmp expect actual &&
test_modebits $graphdir/commit-graph-chain >actual &&
test_cmp expect actual
'
done <<\EOF
0666 -r--r--r--
0600 -r--------
EOF
test_expect_success '--split=replace with partial Bloom data' '
rm -rf $graphdir $infodir/commit-graph &&
git reset --hard commits/3 &&
git rev-list -1 HEAD~2 >a &&
git rev-list -1 HEAD~1 >b &&
git commit-graph write --split=no-merge --stdin-commits --changed-paths <a &&
git commit-graph write --split=no-merge --stdin-commits <b &&
git commit-graph write --split=replace --stdin-commits --changed-paths <c &&
ls $graphdir/graph-*.graph >graph-files &&
test_line_count = 1 graph-files &&
verify_chain_files_exist $graphdir
'
test_expect_success 'prevent regression for duplicate commits across layers' '
git init dup &&
git -C dup commit --allow-empty -m one &&
git -C dup -c core.commitGraph=false commit-graph write --split=no-merge --reachable 2>err &&
test_grep "attempting to write a commit-graph" err &&
git -C dup commit-graph write --split=no-merge --reachable &&
git -C dup commit --allow-empty -m two &&
git -C dup commit-graph write --split=no-merge --reachable &&
git -C dup commit --allow-empty -m three &&
git -C dup commit-graph write --split --reachable &&
git -C dup commit-graph verify
'
NUM_FIRST_LAYER_COMMITS=64
NUM_SECOND_LAYER_COMMITS=16
NUM_THIRD_LAYER_COMMITS=7
NUM_FOURTH_LAYER_COMMITS=8
NUM_FIFTH_LAYER_COMMITS=16
SECOND_LAYER_SEQUENCE_START=$(($NUM_FIRST_LAYER_COMMITS + 1))
SECOND_LAYER_SEQUENCE_END=$(($SECOND_LAYER_SEQUENCE_START + $NUM_SECOND_LAYER_COMMITS - 1))
THIRD_LAYER_SEQUENCE_START=$(($SECOND_LAYER_SEQUENCE_END + 1))
THIRD_LAYER_SEQUENCE_END=$(($THIRD_LAYER_SEQUENCE_START + $NUM_THIRD_LAYER_COMMITS - 1))
FOURTH_LAYER_SEQUENCE_START=$(($THIRD_LAYER_SEQUENCE_END + 1))
FOURTH_LAYER_SEQUENCE_END=$(($FOURTH_LAYER_SEQUENCE_START + $NUM_FOURTH_LAYER_COMMITS - 1))
FIFTH_LAYER_SEQUENCE_START=$(($FOURTH_LAYER_SEQUENCE_END + 1))
FIFTH_LAYER_SEQUENCE_END=$(($FIFTH_LAYER_SEQUENCE_START + $NUM_FIFTH_LAYER_COMMITS - 1))
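# For reference, with the layer sizes above the per-layer commit sequences
# work out to:
#
#	first layer:   1 .. 64
#	second layer: 65 .. 80
#	third layer:  81 .. 87
#	fourth layer: 88 .. 95
#	fifth layer:  96 .. 111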
# Current split graph chain:
#
# 16 commits (No GDAT)
# ------------------------
# 64 commits (GDAT)
#
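# On disk the chain lives in $graphdir: a "commit-graph-chain" file listing one
# layer hash per line (bottom layer first) alongside the matching
# graph-<hash>.graph files. A sketch with hypothetical hashes:
#
#	$ cat .git/objects/info/commit-graphs/commit-graph-chain
#	f4c23e7de72ca8a01ca4f4d0fb85f15c2eaf351f
#	b7a9012e5e18a3d1b9b12e6b1a2c9a3e4d5f6a7b
#	$ ls .git/objects/info/commit-graphs/
#	commit-graph-chain
#	graph-f4c23e7de72ca8a01ca4f4d0fb85f15c2eaf351f.graph
#	graph-b7a9012e5e18a3d1b9b12e6b1a2c9a3e4d5f6a7b.graph
#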
test_expect_success 'setup repo for mixed generation commit-graph-chain' '
graphdir=".git/objects/info/commit-graphs" &&
test_oid_cache <<-EOF &&
oid_version sha1:1
oid_version sha256:2
EOF
git init mixed &&
(
cd mixed &&
git config core.commitGraph true &&
git config gc.writeCommitGraph false &&
for i in $(test_seq $NUM_FIRST_LAYER_COMMITS)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git -c commitGraph.generationVersion=2 commit-graph write --reachable --split &&
graph_read_expect $NUM_FIRST_LAYER_COMMITS &&
test_line_count = 1 $graphdir/commit-graph-chain &&
for i in $(test_seq $SECOND_LAYER_SEQUENCE_START $SECOND_LAYER_SEQUENCE_END)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git -c commitGraph.generationVersion=1 commit-graph write --reachable --split=no-merge &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test-tool read-graph >output &&
cat >expect <<-EOF &&
header: 43475048 1 $(test_oid oid_version) 4 1
num_commits: $NUM_SECOND_LAYER_COMMITS
chunks: oid_fanout oid_lookup commit_metadata
options:
EOF
test_cmp expect output &&
git commit-graph verify &&
cat $graphdir/commit-graph-chain
)
'
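# A note on the expected "header:" lines above and below, as printed by
# "test-tool read-graph": the fields are the 4-byte signature 43475048
# ("CGPH"), the file format version, the hash version (1 for SHA-1, 2 for
# SHA-256, matching the oid_version cache above), the number of chunks in the
# layer, and the number of base graphs beneath it. The chunk count is one
# higher than the names shown on the "chunks:" line because split layers also
# carry a base-graphs chunk that the tool does not list by name.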
# The new layer will be added without a generation data chunk, as one is not
# present in the layer underneath it.
#
# 7 commits (No GDAT)
# ------------------------
# 16 commits (No GDAT)
# ------------------------
# 64 commits (GDAT)
#
test_expect_success 'do not write generation data chunk if not present on existing tip' '
git clone mixed mixed-no-gdat &&
(
cd mixed-no-gdat &&
for i in $(test_seq $THIRD_LAYER_SEQUENCE_START $THIRD_LAYER_SEQUENCE_END)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git commit-graph write --reachable --split=no-merge &&
test_line_count = 3 $graphdir/commit-graph-chain &&
test-tool read-graph >output &&
cat >expect <<-EOF &&
header: 43475048 1 $(test_oid oid_version) 4 2
num_commits: $NUM_THIRD_LAYER_COMMITS
chunks: oid_fanout oid_lookup commit_metadata
options:
EOF
test_cmp expect output &&
git commit-graph verify
)
'
# Number of commits in each layer of the split commit-graph before merge:
#
# 8 commits (No GDAT)
# ------------------------
# 7 commits (No GDAT)
# ------------------------
# 16 commits (No GDAT)
# ------------------------
# 64 commits (GDAT)
#
# The top two layers are merged and do not have a generation data chunk, as the
# layer below them does not have a generation data chunk.
#
# 15 commits (No GDAT)
# ------------------------
# 16 commits (No GDAT)
# ------------------------
# 64 commits (GDAT)
#
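# A rough sketch of why the merge stops where it does under --size-multiple=1
# (not the exact definition of the heuristic): starting from the new commits,
# a lower layer is folded in while its commit count does not exceed
# size-multiple times the commits gathered so far. The 8 new commits absorb
# the 7-commit layer (7 <= 8), giving 8 + 7 = 15; the 16-commit layer below is
# larger than 15, so it stays separate.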
test_expect_success 'do not write generation data chunk if the topmost remaining layer does not have generation data chunk' '
git clone mixed-no-gdat mixed-merge-no-gdat &&
(
cd mixed-merge-no-gdat &&
for i in $(test_seq $FOURTH_LAYER_SEQUENCE_START $FOURTH_LAYER_SEQUENCE_END)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git commit-graph write --reachable --split --size-multiple 1 &&
test_line_count = 3 $graphdir/commit-graph-chain &&
test-tool read-graph >output &&
cat >expect <<-EOF &&
header: 43475048 1 $(test_oid oid_version) 4 2
num_commits: $(($NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS))
chunks: oid_fanout oid_lookup commit_metadata
options:
EOF
test_cmp expect output &&
git commit-graph verify
)
'
# Number of commits in each layer of the split commit-graph before merge:
#
# 16 commits (No GDAT)
# ------------------------
# 15 commits (No GDAT)
# ------------------------
# 16 commits (No GDAT)
# ------------------------
# 64 commits (GDAT)
#
# The top three layers are merged and have a generation data chunk, as the
# topmost remaining layer has a generation data chunk.
#
# 47 commits (GDAT)
# ------------------------
# 64 commits (GDAT)
#
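# Following the same sketch as above: the 16 new commits absorb the 15-commit
# layer (15 <= 16) and then the 16-commit layer (16 <= 31), giving
# 16 + 15 + 16 = 47 commits; the 64-commit base is larger than 47, so it stays
# separate. In the expected output the header's chunk count rises to 5 because
# the generation_data chunk returns, and the base-graph count drops to 1 since
# only the 64-commit layer remains below.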
test_expect_success 'write generation data chunk if topmost remaining layer has generation data chunk' '
git clone mixed-merge-no-gdat mixed-merge-gdat &&
(
cd mixed-merge-gdat &&
for i in $(test_seq $FIFTH_LAYER_SEQUENCE_START $FIFTH_LAYER_SEQUENCE_END)
do
test_commit $i &&
git branch commits/$i || return 1
done &&
git commit-graph write --reachable --split --size-multiple 1 &&
test_line_count = 2 $graphdir/commit-graph-chain &&
test-tool read-graph >output &&
cat >expect <<-EOF &&
header: 43475048 1 $(test_oid oid_version) 5 1
num_commits: $(($NUM_SECOND_LAYER_COMMITS + $NUM_THIRD_LAYER_COMMITS + $NUM_FOURTH_LAYER_COMMITS + $NUM_FIFTH_LAYER_COMMITS))
chunks: oid_fanout oid_lookup commit_metadata generation_data
options: read_generation_data
EOF
test_cmp expect output
)
'
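# With --split=replace the whole chain is rewritten as a single layer, so the
# mixed chain above collapses into one graph of 64 + 16 = 80 commits. Since no
# pre-existing tip layer remains to inherit from, the generation data chunk is
# written again, which graph_read_expect checks below.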
test_expect_success 'write generation data chunk when commit-graph chain is replaced' '
git clone mixed mixed-replace &&
(
cd mixed-replace &&
git commit-graph write --reachable --split=replace &&
test_path_is_file $graphdir/commit-graph-chain &&
test_line_count = 1 $graphdir/commit-graph-chain &&
verify_chain_files_exist $graphdir &&
graph_read_expect $(($NUM_FIRST_LAYER_COMMITS + $NUM_SECOND_LAYER_COMMITS)) &&
git commit-graph verify
)
'
test_done