git/t/t5319-multi-pack-index.sh

#!/bin/sh
test_description='multi-pack-indexes'
. ./test-lib.sh
. "$TEST_DIRECTORY"/lib-chunk.sh
GIT_TEST_MULTI_PACK_INDEX=0
objdir=.git/objects
HASH_LEN=$(test_oid rawsz)
midx_read_expect () {
NUM_PACKS=$1
NUM_OBJECTS=$2
NUM_CHUNKS=$3
OBJECT_DIR=$4
EXTRA_CHUNKS="$5"
{
cat <<-EOF &&
header: 4d494458 1 $HASH_LEN $NUM_CHUNKS $NUM_PACKS
chunks: pack-names oid-fanout oid-lookup object-offsets$EXTRA_CHUNKS
num_objects: $NUM_OBJECTS
packs:
EOF
if test $NUM_PACKS -ge 1
then
ls $OBJECT_DIR/pack/ | grep idx | sort
fi &&
printf "object-dir: $OBJECT_DIR\n"
} >expect &&
test-tool read-midx $OBJECT_DIR >actual &&
test_cmp expect actual
}
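# Illustrative aside, not part of the upstream test: the first field of the
# "header:" line expected by midx_read_expect, 4d494458, is the four ASCII
# bytes "MIDX" printed as hex. The helper name below is hypothetical and no
# test calls it; it only sketches how to reproduce that value with standard
# tools.
example_midx_signature_hex () {
	# od prints each input byte as two hex digits; tr strips the
	# separating spaces and the trailing newline.
	printf "MIDX" | od -An -tx1 | tr -d " \n"
}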
test_expect_success 'setup' '
test_oid_cache <<-EOF
idxoff sha1:2999
idxoff sha256:3739
packnameoff sha1:652
packnameoff sha256:940
fanoutoff sha1:1
fanoutoff sha256:3
EOF
'
test_expect_success "don't write midx with no packs" '
test_must_fail git multi-pack-index --object-dir=. write &&
test_path_is_missing pack/multi-pack-index
'
test_expect_success SHA1 'warn if a midx contains no oid' '
cp "$TEST_DIRECTORY"/t5319/no-objects.midx $objdir/pack/multi-pack-index &&
test_must_fail git multi-pack-index verify &&
rm $objdir/pack/multi-pack-index
'
generate_objects () {
i=$1
iii=$(printf '%03i' $i)
{
test-tool genrandom "bar" 200 &&
test-tool genrandom "baz $iii" 50
} >wide_delta_$iii &&
{
test-tool genrandom "foo"$i 100 &&
test-tool genrandom "foo"$(( $i + 1 )) 100 &&
test-tool genrandom "foo"$(( $i + 2 )) 100
} >deep_delta_$iii &&
{
echo $iii &&
test-tool genrandom "$iii" 8192
} >file_$iii &&
git update-index --add file_$iii deep_delta_$iii wide_delta_$iii
}
commit_and_list_objects () {
{
echo 101 &&
test-tool genrandom 100 8192;
} >file_101 &&
git update-index --add file_101 &&
tree=$(git write-tree) &&
commit=$(git commit-tree $tree -p HEAD</dev/null) &&
{
echo $tree &&
# The separator before the path in "git ls-tree" output is a TAB, not a space.
git ls-tree $tree | sed -e "s/.* \\([0-9a-f]*\\)	.*/\\1/"
} >obj-list &&
git reset --hard $commit
}
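# Illustrative aside, not part of the upstream test: "git ls-tree" prints
# "<mode> <type> <oid><TAB><path>", and the sed expression in
# commit_and_list_objects is meant to keep only the <oid> field. The
# hypothetical helper below (not called by any test) applies the same
# substitution to one hand-written sample line, so it runs without a
# repository. The sample mode, type, and path are assumptions made for the
# example.
example_extract_oid () {
	# Build the TAB explicitly so the separator is visible in the source.
	example_tab=$(printf "\t")
	printf "100644 blob %s\tfile_001\n" "$1" |
	sed -e "s/.* \\([0-9a-f]*\\)${example_tab}.*/\\1/"
}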
test_expect_success 'create objects' '
test_commit initial &&
for i in $(test_seq 1 5)
do
generate_objects $i || return 1
done &&
commit_and_list_objects
'
test_expect_success 'write midx with one v1 pack' '
pack=$(git pack-objects --index-version=1 $objdir/pack/test <obj-list) &&
test_when_finished rm $objdir/pack/test-$pack.pack \
$objdir/pack/test-$pack.idx $objdir/pack/multi-pack-index &&
git multi-pack-index --object-dir=$objdir write &&
midx_read_expect 1 18 4 $objdir
'
midx_git_two_modes () {
git -c core.multiPackIndex=false $1 >expect &&
git -c core.multiPackIndex=true $1 >actual &&
if [ "$2" = "sorted" ]
then
sort <expect >expect.sorted &&
mv expect.sorted expect &&
sort <actual >actual.sorted &&
mv actual.sorted actual
fi &&
test_cmp expect actual
}
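# Illustrative aside, not part of the upstream test: the "sorted" mode of
# midx_git_two_modes compares the two outputs as sets of lines, which is
# presumably why it is used for "cat-file --unordered", where line order is
# not guaranteed. The hypothetical helper below (not called by any test)
# sketches the same idea for two arbitrary files; it assumes mktemp(1) is
# available.
example_same_lines_any_order () {
	example_dir=$(mktemp -d) || return 1
	# Sort both inputs, then compare the sorted copies byte for byte.
	sort <"$1" >"$example_dir/a" &&
	sort <"$2" >"$example_dir/b" &&
	cmp -s "$example_dir/a" "$example_dir/b"
	example_status=$?
	rm -rf "$example_dir"
	return $example_status
}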
compare_results_with_midx () {
MSG=$1
test_expect_success "check normal git operations: $MSG" '
midx_git_two_modes "rev-list --objects --all" &&
midx_git_two_modes "log --raw" &&
midx_git_two_modes "count-objects --verbose" &&
midx_git_two_modes "cat-file --batch-all-objects --batch-check" &&
midx_git_two_modes "cat-file --batch-all-objects --batch-check --unordered" sorted
'
}
test_expect_success 'write midx with one v2 pack' '
git pack-objects --index-version=2,0x40 $objdir/pack/test <obj-list &&
git multi-pack-index --object-dir=$objdir write &&
midx_read_expect 1 18 4 $objdir
'
compare_results_with_midx "one v2 pack"
test_expect_success 'corrupt idx reports errors' '
idx=$(test-tool read-midx $objdir | grep "\.idx\$") &&
mv $objdir/pack/$idx backup-$idx &&
test_when_finished "mv backup-\$idx \$objdir/pack/\$idx" &&
# This is the minimum size for a sha-1 based .idx; this lets
# us pass perfunctory tests, but anything that actually opens and reads
# the idx file will complain.
test_copy_bytes 1064 <backup-$idx >$objdir/pack/$idx &&
git -c core.multiPackIndex=true rev-list --objects --all 2>err &&
grep "index unavailable" err
'
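# Illustrative aside, not part of the upstream test: one way to arrive at the
# 1064 bytes copied in the "corrupt idx reports errors" test, assuming the
# smallest possible SHA-1 pack index is a 256-entry fanout table of 4-byte
# counts followed by two 20-byte trailing hashes (pack checksum and index
# checksum). The helper name is hypothetical and no test calls it.
example_min_sha1_idx_size () {
	# 256 fanout entries * 4 bytes + 2 trailing hashes * 20 bytes
	echo $((256 * 4 + 2 * 20))
}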
test_expect_success 'add more objects' '
for i in $(test_seq 6 10)
do
generate_objects $i || return 1
done &&
commit_and_list_objects
'
test_expect_success 'write midx with two packs' '
git pack-objects --index-version=1 $objdir/pack/test-2 <obj-list &&
git multi-pack-index --object-dir=$objdir write &&
midx_read_expect 2 34 4 $objdir
'
compare_results_with_midx "two packs"
test_expect_success 'write midx with --stdin-packs' '
rm -fr $objdir/pack/multi-pack-index &&
idx="$(find $objdir/pack -name "test-2-*.idx")" &&
basename "$idx" >in &&
git multi-pack-index write --stdin-packs <in &&
test-tool read-midx $objdir | grep "\.idx$" >packs &&
test_cmp packs in
'
compare_results_with_midx "mixed mode (one pack + extra)"
test_expect_success 'write with no objects and preferred pack' '
test_when_finished "rm -rf empty" &&
git init empty &&
test_must_fail git -C empty multi-pack-index write \
--stdin-packs --preferred-pack=does-not-exist </dev/null 2>err &&
cat >expect <<-EOF &&
warning: unknown preferred pack: ${SQ}does-not-exist${SQ}
error: no pack files to index.
EOF
test_cmp expect err
'
test_expect_success 'write progress off for redirected stderr' '
git multi-pack-index --object-dir=$objdir write 2>err &&
test_line_count = 0 err
'
test_expect_success 'write force progress on for stderr' '
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir write --progress 2>err &&
test_file_not_empty err
'
test_expect_success 'write with the --no-progress option' '
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir write --no-progress 2>err &&
test_line_count = 0 err
'
test_expect_success 'add more packs' '
for j in $(test_seq 11 20)
do
generate_objects $j &&
commit_and_list_objects &&
git pack-objects --index-version=2 $objdir/pack/test-pack <obj-list || return 1
done
'
compare_results_with_midx "mixed mode (two packs + extra)"
test_expect_success 'write midx with twelve packs' '
git multi-pack-index --object-dir=$objdir write &&
midx_read_expect 12 74 4 $objdir
'
compare_results_with_midx "twelve packs"
test_expect_success 'multi-pack-index *.rev cleanup with --object-dir' '
git init repo &&
git clone -s repo alternate &&
test_when_finished "rm -rf repo alternate" &&
(
cd repo &&
test_commit base &&
git repack -d
) &&
ours="alternate/.git/objects/pack/multi-pack-index-123.rev" &&
theirs="repo/.git/objects/pack/multi-pack-index-abc.rev" &&
touch "$ours" "$theirs" &&
(
cd alternate &&
git multi-pack-index --object-dir ../repo/.git/objects write
) &&
# writing a midx in "repo" should not remove the .rev file in the
# alternate
test_path_is_file repo/.git/objects/pack/multi-pack-index &&
test_path_is_file $ours &&
test_path_is_missing $theirs
'
test_expect_success 'warn on improper hash version' '
git init --object-format=sha1 sha1 &&
(
cd sha1 &&
git config core.multiPackIndex true &&
test_commit 1 &&
git repack -a &&
git multi-pack-index write &&
mv .git/objects/pack/multi-pack-index ../mpi-sha1
) &&
git init --object-format=sha256 sha256 &&
(
cd sha256 &&
git config core.multiPackIndex true &&
test_commit 1 &&
git repack -a &&
git multi-pack-index write &&
mv .git/objects/pack/multi-pack-index ../mpi-sha256
) &&
(
cd sha1 &&
mv ../mpi-sha256 .git/objects/pack/multi-pack-index &&
git log -1 2>err &&
test_grep "multi-pack-index hash version 2 does not match version 1" err
) &&
(
cd sha256 &&
mv ../mpi-sha1 .git/objects/pack/multi-pack-index &&
git log -1 2>err &&
test_grep "multi-pack-index hash version 1 does not match version 2" err
)
'
test_expect_success 'midx picks objects from preferred pack' '
test_when_finished rm -rf preferred.git &&
git init --bare preferred.git &&
(
cd preferred.git &&
a=$(echo "a" | git hash-object -w --stdin) &&
b=$(echo "b" | git hash-object -w --stdin) &&
c=$(echo "c" | git hash-object -w --stdin) &&
# Set up two packs, duplicating the object "B" at different
# offsets.
#
# Note that the "BC" pack (the one we choose as preferred) sorts
# lexically after the "AB" pack, meaning that omitting the
# --preferred-pack argument would cause this test to fail (since
# the MIDX code would select the copy of "b" in the "AB" pack).
git pack-objects objects/pack/test-AB <<-EOF &&
$a
$b
EOF
bc=$(git pack-objects objects/pack/test-BC <<-EOF
$b
$c
EOF
) &&
git multi-pack-index --object-dir=objects \
write --preferred-pack=test-BC-$bc.idx 2>err &&
test_must_be_empty err &&
test-tool read-midx --show-objects objects >out &&
ofs=$(git show-index <objects/pack/test-BC-$bc.idx | grep $b |
cut -d" " -f1) &&
printf "%s %s\tobjects/pack/test-BC-%s.pack\n" \
"$b" "$ofs" "$bc" >expect &&
grep ^$b out >actual &&
test_cmp expect actual
)
'
test_expect_success 'preferred packs must be non-empty' '
test_when_finished rm -rf preferred.git &&
git init preferred.git &&
(
cd preferred.git &&
test_commit base &&
git repack -ad &&
empty="$(git pack-objects $objdir/pack/pack </dev/null)" &&
test_must_fail git multi-pack-index write \
--preferred-pack=pack-$empty.pack 2>err &&
grep "with no objects" err
)
'
test_expect_success 'verify multi-pack-index success' '
git multi-pack-index verify --object-dir=$objdir
'
test_expect_success 'verify progress off for redirected stderr' '
git multi-pack-index verify --object-dir=$objdir 2>err &&
test_line_count = 0 err
'
test_expect_success 'verify force progress on for stderr' '
git multi-pack-index verify --object-dir=$objdir --progress 2>err &&
test_file_not_empty err
'
test_expect_success 'verify with the --no-progress option' '
git multi-pack-index verify --object-dir=$objdir --no-progress 2>err &&
test_line_count = 0 err
'
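# usage: corrupt_midx_and_verify <pos> <data> <objdir> <string> [<command>]
# <data> defaults to a single NUL byte; <command> defaults to
# "git multi-pack-index verify --object-dir=<objdir>".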
corrupt_midx_and_verify() {
POS=$1 &&
DATA="${2:-\0}" &&
OBJDIR=$3 &&
GREPSTR="$4" &&
COMMAND="$5" &&
if test -z "$COMMAND"
then
COMMAND="git multi-pack-index verify --object-dir=$OBJDIR"
fi &&
FILE=$OBJDIR/pack/multi-pack-index &&
chmod a+w $FILE &&
test_when_finished mv midx-backup $FILE &&
cp $FILE midx-backup &&
printf "$DATA" | dd of="$FILE" bs=1 seek="$POS" conv=notrunc &&
test_must_fail $COMMAND 2>test_err &&
grep -v "^+" test_err >err &&
test_grep "$GREPSTR" err
}
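# A minimal sketch of the byte-patching technique used by
# corrupt_midx_and_verify, shown on an arbitrary scratch file. The helper
# name "example_patch_bytes" is hypothetical and is not called by any test
# in this script. "bs=1 seek=<pos>" positions dd at the byte offset, and
# "conv=notrunc" leaves the rest of the file in place.
#
# usage: example_patch_bytes <file> <pos> <data>
example_patch_bytes () {
printf "$3" | dd of="$1" bs=1 seek="$2" conv=notrunc 2>/dev/null
}
# For instance, "example_patch_bytes scratch 2 XY" on a file holding
# "abcdef" overwrites bytes 2 and 3 and keeps the other four bytes.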
test_expect_success 'verify bad signature' '
corrupt_midx_and_verify 0 "\00" $objdir \
"multi-pack-index signature"
'
NUM_OBJECTS=74
MIDX_BYTE_VERSION=4
MIDX_BYTE_OID_VERSION=5
MIDX_BYTE_CHUNK_COUNT=6
MIDX_HEADER_SIZE=12
MIDX_BYTE_CHUNK_ID=$MIDX_HEADER_SIZE
MIDX_BYTE_CHUNK_OFFSET=$(($MIDX_HEADER_SIZE + 4))
MIDX_NUM_CHUNKS=5
MIDX_CHUNK_LOOKUP_WIDTH=12
MIDX_OFFSET_PACKNAMES=$(($MIDX_HEADER_SIZE + \
$MIDX_NUM_CHUNKS * $MIDX_CHUNK_LOOKUP_WIDTH))
MIDX_BYTE_PACKNAME_ORDER=$(($MIDX_OFFSET_PACKNAMES + 2))
MIDX_OFFSET_OID_FANOUT=$(($MIDX_OFFSET_PACKNAMES + $(test_oid packnameoff)))
MIDX_OID_FANOUT_WIDTH=4
MIDX_BYTE_OID_FANOUT_ORDER=$((MIDX_OFFSET_OID_FANOUT + 250 * $MIDX_OID_FANOUT_WIDTH + $(test_oid fanoutoff)))
MIDX_OFFSET_OID_LOOKUP=$(($MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH))
MIDX_BYTE_OID_LOOKUP=$(($MIDX_OFFSET_OID_LOOKUP + 16 * $HASH_LEN))
MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN))
MIDX_OFFSET_WIDTH=8
MIDX_BYTE_PACK_INT_ID=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 2))
MIDX_BYTE_OFFSET=$(($MIDX_OFFSET_OBJECT_OFFSETS + 16 * $MIDX_OFFSET_WIDTH + 6))
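# A worked example of the offset arithmetic above, as a sketch only. The
# helper name "example_chunk_table_end" is hypothetical and is not called
# by any test. It mirrors how MIDX_OFFSET_PACKNAMES is derived: the header
# size plus the number of chunk lookup entries times the entry width.
#
# usage: example_chunk_table_end <header-size> <num-entries> <entry-width>
example_chunk_table_end () {
echo $(($1 + $2 * $3))
}
# With the values assigned above (12, 5 and 12) this prints 72, which
# matches the value computed for MIDX_OFFSET_PACKNAMES.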
test_expect_success 'verify bad version' '
corrupt_midx_and_verify $MIDX_BYTE_VERSION "\00" $objdir \
"multi-pack-index version"
'
test_expect_success 'verify bad OID version' '
corrupt_midx_and_verify $MIDX_BYTE_OID_VERSION "\03" $objdir \
"hash version"
'
test_expect_success 'verify truncated chunk count' '
corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\01" $objdir \
"final chunk has non-zero id"
'
test_expect_success 'verify extended chunk count' '
corrupt_midx_and_verify $MIDX_BYTE_CHUNK_COUNT "\07" $objdir \
"terminating chunk id appears earlier than expected"
'
test_expect_success 'verify missing required chunk' '
corrupt_midx_and_verify $MIDX_BYTE_CHUNK_ID "\01" $objdir \
"required pack-name chunk missing"
'
test_expect_success 'verify invalid chunk offset' '
corrupt_midx_and_verify $MIDX_BYTE_CHUNK_OFFSET "\01" $objdir \
"improper chunk offset(s)"
'
test_expect_success 'verify packnames out of order' '
corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "z" $objdir \
"pack names out of order"
'
test_expect_success 'verify packname pointing to missing pack' '
corrupt_midx_and_verify $MIDX_BYTE_PACKNAME_ORDER "a" $objdir \
"failed to load pack"
'
test_expect_success 'verify oid fanout out of order' '
corrupt_midx_and_verify $MIDX_BYTE_OID_FANOUT_ORDER "\01" $objdir \
"oid fanout out of order"
'
test_expect_success 'verify oid lookup out of order' '
corrupt_midx_and_verify $MIDX_BYTE_OID_LOOKUP "\00" $objdir \
"oid lookup out of order"
'
test_expect_success 'verify incorrect pack-int-id' '
corrupt_midx_and_verify $MIDX_BYTE_PACK_INT_ID "\07" $objdir \
"bad pack-int-id"
'
test_expect_success 'verify incorrect offset' '
corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\377" $objdir \
"incorrect object offset"
'
test_expect_success 'git-fsck incorrect offset' '
corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\377" $objdir \
"incorrect object offset" \
"git -c core.multiPackIndex=true fsck" &&
test_unconfig core.multiPackIndex &&
test_must_fail git fsck &&
git -c core.multiPackIndex=false fsck
'
test_expect_success 'git fsck shows MIDX output with --progress' '
git fsck --progress 2>err &&
grep "Verifying OID order in multi-pack-index" err &&
grep "Verifying object offsets" err
'
test_expect_success 'git fsck suppresses MIDX output with --no-progress' '
git fsck --no-progress 2>err &&
! grep "Verifying OID order in multi-pack-index" err &&
! grep "Verifying object offsets" err
'
test_expect_success 'corrupt MIDX is not reused' '
corrupt_midx_and_verify $MIDX_BYTE_OFFSET "\377" $objdir \
"incorrect object offset" &&
git multi-pack-index write 2>err &&
test_grep checksum.mismatch err &&
git multi-pack-index verify
'
test_expect_success 'verify incorrect checksum' '
pos=$(($(wc -c <$objdir/pack/multi-pack-index) - 10)) &&
corrupt_midx_and_verify $pos \
"\377\377\377\377\377\377\377\377\377\377" \
$objdir "incorrect checksum"
'
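# The checksum test above targets the last 10 bytes of the file by
# subtracting from its size. A sketch of that calculation follows; the
# helper name "example_tail_offset" is hypothetical and is not called by
# any test.
#
# usage: example_tail_offset <file> <count>
example_tail_offset () {
echo $(($(wc -c <"$1") - $2))
}
# For a 26-byte file and a count of 10, this prints 16, the offset at
# which the final 10 bytes begin.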
test_expect_success 'repack progress off for redirected stderr' '
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack 2>err &&
test_line_count = 0 err
'
test_expect_success 'repack force progress on for stderr' '
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack --progress 2>err &&
test_file_not_empty err
'
test_expect_success 'repack with the --no-progress option' '
GIT_PROGRESS_DELAY=0 git multi-pack-index --object-dir=$objdir repack --no-progress 2>err &&
test_line_count = 0 err
'
test_expect_success 'repack removes multi-pack-index when deleting packs' '
test_path_is_file $objdir/pack/multi-pack-index &&
# Set GIT_TEST_MULTI_PACK_INDEX to 0 to avoid writing a new
# multi-pack-index after repacking, but set "core.multiPackIndex" to
# true so that "git repack" can read the existing MIDX.
GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -adf &&
test_path_is_missing $objdir/pack/multi-pack-index
'
test_expect_success 'repack preserves multi-pack-index when creating packs' '
git init preserve &&
test_when_finished "rm -fr preserve" &&
(
cd preserve &&
packdir=.git/objects/pack &&
midx=$packdir/multi-pack-index &&
test_commit 1 &&
pack1=$(git pack-objects --all $packdir/pack) &&
touch $packdir/pack-$pack1.keep &&
test_commit 2 &&
pack2=$(git pack-objects --revs $packdir/pack) &&
touch $packdir/pack-$pack2.keep &&
git multi-pack-index write &&
cp $midx $midx.bak &&
cat >pack-input <<-EOF &&
HEAD
^HEAD~1
EOF
test_commit 3 &&
pack3=$(git pack-objects --revs $packdir/pack <pack-input) &&
test_commit 4 &&
pack4=$(git pack-objects --revs $packdir/pack <pack-input) &&
GIT_TEST_MULTI_PACK_INDEX=0 git -c core.multiPackIndex repack -ad &&
ls -la $packdir &&
test_path_is_file $packdir/pack-$pack1.pack &&
test_path_is_file $packdir/pack-$pack2.pack &&
test_path_is_missing $packdir/pack-$pack3.pack &&
test_path_is_missing $packdir/pack-$pack4.pack &&
test_cmp_bin $midx.bak $midx
)
'
compare_results_with_midx "after repack"
test_expect_success 'multi-pack-index and pack-bitmap' '
GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 \
git -c repack.writeBitmaps=true repack -ad &&
git multi-pack-index write &&
git rev-list --test-bitmap HEAD
'
test_expect_success 'multi-pack-index and alternates' '
git init --bare alt.git &&
echo $(pwd)/alt.git/objects >.git/objects/info/alternates &&
echo content1 >file1 &&
altblob=$(GIT_DIR=alt.git git hash-object -w file1) &&
git cat-file blob $altblob &&
git rev-list --all
'
compare_results_with_midx "with alternate (local midx)"
test_expect_success 'multi-pack-index in an alternate' '
mv .git/objects/pack/* alt.git/objects/pack &&
test_commit add_local_objects &&
git repack --local &&
git multi-pack-index write &&
midx_read_expect 1 3 4 $objdir &&
git reset --hard HEAD~1 &&
rm -f .git/objects/pack/*
'
compare_results_with_midx "with alternate (remote midx)"
# usage: corrupt_data <file> <pos> [<data>]
corrupt_data () {
file=$1
pos=$2
data="${3:-\0}"
printf "$data" | dd of="$file" bs=1 seek="$pos" conv=notrunc
}
# Force 64-bit offsets by manipulating the idx file.
# This makes the IDX file _incorrect_ so be careful to clean up after!
test_expect_success 'force some 64-bit offsets with pack-objects' '
mkdir objects64 &&
mkdir objects64/pack &&
for i in $(test_seq 1 11)
do
generate_objects 11 || return 1
done &&
commit_and_list_objects &&
pack64=$(git pack-objects --index-version=2,0x40 objects64/pack/test-64 <obj-list) &&
idx64=objects64/pack/test-64-$pack64.idx &&
chmod u+w $idx64 &&
corrupt_data $idx64 $(test_oid idxoff) "\02" &&
# objects64 is not a real repository, but can serve as an alternate
# anyway so we can write a MIDX into it
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
( cd ../objects64 && pwd ) >.git/objects/info/alternates &&
midx64=$(git multi-pack-index --object-dir=../objects64 write)
) &&
midx_read_expect 1 63 5 objects64 " large-offsets"
'
test_expect_success 'verify multi-pack-index with 64-bit offsets' '
git multi-pack-index verify --object-dir=objects64
'
NUM_OBJECTS=63
MIDX_OFFSET_OID_FANOUT=$((MIDX_OFFSET_PACKNAMES + 54))
MIDX_OFFSET_OID_LOOKUP=$((MIDX_OFFSET_OID_FANOUT + 256 * $MIDX_OID_FANOUT_WIDTH))
MIDX_OFFSET_OBJECT_OFFSETS=$(($MIDX_OFFSET_OID_LOOKUP + $NUM_OBJECTS * $HASH_LEN))
MIDX_OFFSET_LARGE_OFFSETS=$(($MIDX_OFFSET_OBJECT_OFFSETS + $NUM_OBJECTS * $MIDX_OFFSET_WIDTH))
MIDX_BYTE_LARGE_OFFSET=$(($MIDX_OFFSET_LARGE_OFFSETS + 3))
test_expect_success 'verify incorrect 64-bit offset' '
corrupt_midx_and_verify $MIDX_BYTE_LARGE_OFFSET "\07" objects64 \
"incorrect object offset"
'
test_expect_success 'setup expire tests' '
mkdir dup &&
(
cd dup &&
git init &&
test-tool genrandom "data" 4096 >large_file.txt &&
git update-index --add large_file.txt &&
for i in $(test_seq 1 20)
do
test_commit $i || exit 1
done &&
git branch A HEAD &&
git branch B HEAD~8 &&
git branch C HEAD~13 &&
git branch D HEAD~16 &&
git branch E HEAD~18 &&
git pack-objects --revs .git/objects/pack/pack-A <<-EOF &&
refs/heads/A
^refs/heads/B
EOF
git pack-objects --revs .git/objects/pack/pack-B <<-EOF &&
refs/heads/B
^refs/heads/C
EOF
git pack-objects --revs .git/objects/pack/pack-C <<-EOF &&
refs/heads/C
^refs/heads/D
EOF
git pack-objects --revs .git/objects/pack/pack-D <<-EOF &&
refs/heads/D
^refs/heads/E
EOF
git pack-objects --revs .git/objects/pack/pack-E <<-EOF &&
refs/heads/E
EOF
git multi-pack-index write &&
cp -r .git/objects/pack .git/objects/pack-backup
)
'
test_expect_success 'expire does not remove any packs' '
(
cd dup &&
ls .git/objects/pack >expect &&
git multi-pack-index expire &&
ls .git/objects/pack >actual &&
test_cmp expect actual
)
'
test_expect_success 'expire progress off for redirected stderr' '
(
cd dup &&
git multi-pack-index expire 2>err &&
test_line_count = 0 err
)
'
test_expect_success 'expire force progress on for stderr' '
(
cd dup &&
GIT_PROGRESS_DELAY=0 git multi-pack-index expire --progress 2>err &&
test_file_not_empty err
)
'
test_expect_success 'expire with the --no-progress option' '
(
cd dup &&
GIT_PROGRESS_DELAY=0 git multi-pack-index expire --no-progress 2>err &&
test_line_count = 0 err
)
'
test_expect_success 'expire removes unreferenced packs' '
(
cd dup &&
git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
refs/heads/A
^refs/heads/C
EOF
git multi-pack-index write &&
ls .git/objects/pack | grep -v -e pack-[AB] >expect &&
git multi-pack-index expire &&
ls .git/objects/pack >actual &&
test_cmp expect actual &&
ls .git/objects/pack/ | grep idx >expect-idx &&
test-tool read-midx .git/objects | grep idx >actual-midx &&
test_cmp expect-idx actual-midx &&
git multi-pack-index verify &&
git fsck
)
'
test_expect_success 'repack with minimum size does not alter existing packs' '
(
cd dup &&
rm -rf .git/objects/pack &&
mv .git/objects/pack-backup .git/objects/pack &&
test-tool chmtime =-5 .git/objects/pack/pack-D* &&
test-tool chmtime =-4 .git/objects/pack/pack-C* &&
test-tool chmtime =-3 .git/objects/pack/pack-B* &&
test-tool chmtime =-2 .git/objects/pack/pack-A* &&
ls .git/objects/pack >expect &&
MINSIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | head -n 1) &&
git multi-pack-index repack --batch-size=$MINSIZE &&
ls .git/objects/pack >actual &&
test_cmp expect actual
)
'
test_expect_success 'repack respects repack.packKeptObjects=false' '
test_when_finished rm -f dup/.git/objects/pack/*keep &&
(
cd dup &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 5 idx-list &&
ls .git/objects/pack/*.pack | sed "s/\.pack/.keep/" >keep-list &&
test_line_count = 5 keep-list &&
for keep in $(cat keep-list)
do
touch $keep || exit 1
done &&
git multi-pack-index repack --batch-size=0 &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 5 idx-list &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 5 midx-list &&
THIRD_SMALLEST_SIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | sed -n 3p) &&
BATCH_SIZE=$((THIRD_SMALLEST_SIZE + 1)) &&
git multi-pack-index repack --batch-size=$BATCH_SIZE &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 5 idx-list &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 5 midx-list
)
'
test_expect_success 'repack creates a new pack' '
(
cd dup &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 5 idx-list &&
THIRD_SMALLEST_SIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | head -n 3 | tail -n 1) &&
BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
git multi-pack-index repack --batch-size=$BATCH_SIZE &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 6 idx-list &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 6 midx-list
)
'
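# Illustration only (not part of the upstream suite): a minimal sketch
# of how "git multi-pack-index repack --batch-size=<n>" is documented
# to choose its batch. Packs are walked from oldest to newest; a pack
# is picked when its expected size is below the batch size, and the
# walk stops once the running total reaches the batch size. A batch
# size of zero selects every pack. The helper name and its input
# format (one "<expected-size> <name>" pair per line on stdin, oldest
# first) are assumptions made for this sketch; the real selection is
# done in C and also does nothing when fewer than two packs qualify,
# which this sketch does not model.
sketch_select_batch () {
	batch_size=$1
	total=0
	while read -r size name
	do
		if test "$batch_size" -ne 0
		then
			# Stop once the selected packs fill the batch.
			test "$total" -ge "$batch_size" && break
			# Skip packs that are too large for the batch.
			test "$size" -lt "$batch_size" || continue
		fi
		echo "$name"
		total=$((total + size))
	done
}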
test_expect_success 'repack (all) ignores cruft pack' '
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
test_commit base &&
test_commit --no-tag unreachable &&
git reset --hard base &&
git reflog expire --all --expire=all &&
git repack --cruft -d &&
git multi-pack-index write &&
find $objdir/pack | sort >before &&
git multi-pack-index repack --batch-size=0 &&
find $objdir/pack | sort >after &&
test_cmp before after
)
'
test_expect_success 'repack (--batch-size) ignores cruft pack' '
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
test_commit_bulk 5 &&
test_commit --no-tag unreachable &&
git reset --hard HEAD^ &&
git reflog expire --all --expire=all &&
git repack --cruft -d &&
test_commit four &&
find $objdir/pack -type f -name "*.pack" | sort >before &&
git repack -d &&
find $objdir/pack -type f -name "*.pack" | sort >after &&
pack="$(comm -13 before after)" &&
test_file_size "$pack" >sz &&
# Set --batch-size to twice the size of the pack created
# in the previous step, since this is enough to
# accommodate it and the cruft pack.
#
# This means that the MIDX machinery *could* combine the
# new and cruft packs together.
#
# We ensure that it does not below.
batch="$((($(cat sz) * 2)))" &&
git multi-pack-index write &&
find $objdir/pack | sort >before &&
git multi-pack-index repack --batch-size=$batch &&
find $objdir/pack | sort >after &&
test_cmp before after
)
'
test_expect_success 'expire removes repacked packs' '
(
cd dup &&
ls -al .git/objects/pack/*pack &&
ls -S .git/objects/pack/*pack | head -n 4 >expect &&
git multi-pack-index expire &&
ls -S .git/objects/pack/*pack >actual &&
test_cmp expect actual &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 4 midx-list
)
'
test_expect_success 'expire works when adding new packs' '
(
cd dup &&
git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
refs/heads/A
^refs/heads/B
EOF
git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
refs/heads/B
^refs/heads/C
EOF
git pack-objects --revs .git/objects/pack/pack-combined <<-EOF &&
refs/heads/C
^refs/heads/D
EOF
git multi-pack-index write &&
git pack-objects --revs .git/objects/pack/a-pack <<-EOF &&
refs/heads/D
^refs/heads/E
EOF
git multi-pack-index write &&
git pack-objects --revs .git/objects/pack/z-pack <<-EOF &&
refs/heads/E
EOF
git multi-pack-index expire &&
ls .git/objects/pack/ | grep idx >expect &&
test-tool read-midx .git/objects | grep idx >actual &&
test_cmp expect actual &&
git multi-pack-index verify
)
'
test_expect_success 'expire respects .keep files' '
(
cd dup &&
git pack-objects --revs .git/objects/pack/pack-all <<-EOF &&
refs/heads/A
EOF
git multi-pack-index write &&
PACKA=$(ls .git/objects/pack/a-pack*\.pack | sed s/\.pack\$//) &&
touch $PACKA.keep &&
git multi-pack-index expire &&
test_path_is_file $PACKA.idx &&
test_path_is_file $PACKA.keep &&
test_path_is_file $PACKA.pack &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 2 midx-list
)
'
test_expect_success 'expiring unreferenced cruft pack retains pack' '
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
test_commit base &&
test_commit --no-tag unreachable &&
unreachable=$(git rev-parse HEAD) &&
git reset --hard base &&
git reflog expire --all --expire=all &&
git repack --cruft -d &&
mtimes="$(ls $objdir/pack/pack-*.mtimes)" &&
echo "base..$unreachable" >in &&
pack="$(git pack-objects --revs --delta-base-offset \
$objdir/pack/pack <in)" &&
# Preferring the contents of "$pack" will leave the
# cruft pack unreferenced (i.e., none of the objects
# contained in the cruft pack will have their MIDX copy
# selected from the cruft pack).
git multi-pack-index write --preferred-pack="pack-$pack.pack" &&
git multi-pack-index expire &&
test_path_is_file "$mtimes"
)
'
test_expect_success 'repack --batch-size=0 repacks everything' '
cp -r dup dup2 &&
(
cd dup &&
rm .git/objects/pack/*.keep &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 2 idx-list &&
git multi-pack-index repack --batch-size=0 &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 3 idx-list &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 3 midx-list &&
git multi-pack-index expire &&
ls -al .git/objects/pack/*idx >idx-list &&
test_line_count = 1 idx-list &&
git multi-pack-index repack --batch-size=0 &&
ls -al .git/objects/pack/*idx >new-idx-list &&
test_cmp idx-list new-idx-list
)
'
test_expect_success 'repack --batch-size=<large> repacks everything' '
(
cd dup2 &&
rm .git/objects/pack/*.keep &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 2 idx-list &&
git multi-pack-index repack --batch-size=2000000 &&
ls .git/objects/pack/*idx >idx-list &&
test_line_count = 3 idx-list &&
test-tool read-midx .git/objects | grep idx >midx-list &&
test_line_count = 3 midx-list &&
git multi-pack-index expire &&
ls -al .git/objects/pack/*idx >idx-list &&
test_line_count = 1 idx-list
)
'
test_expect_success 'load reverse index when missing .idx, .pack' '
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
git config core.multiPackIndex true &&
test_commit base &&
git repack -ad &&
git multi-pack-index write &&
git rev-parse HEAD >tip &&
pack=$(ls .git/objects/pack/pack-*.pack) &&
idx=$(ls .git/objects/pack/pack-*.idx) &&
mv $idx $idx.bak &&
git cat-file --batch-check="%(objectsize:disk)" <tip &&
mv $idx.bak $idx &&
mv $pack $pack.bak &&
git cat-file --batch-check="%(objectsize:disk)" <tip
)
'
test_expect_success 'usage shown without sub-command' '
test_expect_code 129 git multi-pack-index 2>err &&
! test_grep "unrecognized subcommand" err
'
test_expect_success 'complains when run outside of a repository' '
nongit test_must_fail git multi-pack-index write 2>err &&
grep "not a git repository" err
'
test_expect_success 'repack with delta islands' '
git init repo &&
test_when_finished "rm -fr repo" &&
(
cd repo &&
test_commit first &&
git repack &&
test_commit second &&
git repack &&
git multi-pack-index write &&
git -c repack.useDeltaIslands=true multi-pack-index repack
)
'
corrupt_chunk () {
midx=.git/objects/pack/multi-pack-index &&
test_when_finished "rm -rf $midx" &&
git repack -ad --write-midx &&
corrupt_chunk_file $midx "$@"
}
test_expect_success 'reader notices too-small oid fanout chunk' '
corrupt_chunk OIDF clear 00000000 &&
test_must_fail git log 2>err &&
cat >expect <<-\EOF &&
error: multi-pack-index OID fanout is of the wrong size
fatal: multi-pack-index required OID fanout chunk missing or corrupted
EOF
test_cmp expect err
'
test_expect_success 'reader notices too-small oid lookup chunk' '
corrupt_chunk OIDL clear 00000000 &&
test_must_fail git log 2>err &&
cat >expect <<-\EOF &&
error: multi-pack-index OID lookup chunk is the wrong size
fatal: multi-pack-index required OID lookup chunk missing or corrupted
EOF
test_cmp expect err
'
test_expect_success 'reader notices too-small pack names chunk' '
# There is no NUL to terminate the name here, so the
# chunk is too short.
corrupt_chunk PNAM clear 70656666 &&
test_must_fail git log 2>err &&
cat >expect <<-\EOF &&
fatal: multi-pack-index pack-name chunk is too short
EOF
test_cmp expect err
'
test_expect_success 'reader handles unaligned chunks' '
# A 9-byte PNAM means all of the subsequent chunks
# will no longer be 4-byte aligned, but it is still
# a valid one-pack chunk on its own (it is "foo.pack\0").
corrupt_chunk PNAM clear 666f6f2e7061636b00 &&
git -c core.multipackindex=false log >expect.out &&
git -c core.multipackindex=true log >out 2>err &&
test_cmp expect.out out &&
cat >expect.err <<-\EOF &&
error: chunk id 4f494446 not 4-byte aligned
EOF
test_cmp expect.err err
'
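# Illustration only (not part of the upstream suite): the PNAM payloads
# passed to corrupt_chunk in the tests above are hex-encoded bytes, so
# "70656666" spells "peff" and "666f6f2e7061636b00" is "foo.pack"
# followed by a NUL. The helper below is a hypothetical sketch of how
# such a payload could be derived from a string; its name is an
# assumption, and the trailing "00" terminator must be appended by the
# caller.
sketch_hex_of () {
	printf "%s" "$1" | od -An -tx1 | tr -d " \n"
}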
test_expect_success 'reader notices too-small object offset chunk' '
corrupt_chunk OOFF clear 00000000 &&
test_must_fail git log 2>err &&
cat >expect <<-\EOF &&
error: multi-pack-index object offset chunk is the wrong size
fatal: multi-pack-index required object offsets chunk missing or corrupted
EOF
test_cmp expect err
'
test_expect_success 'reader bounds-checks large offset table' '
# re-use the objects64 dir here to cheaply get access to a midx
# with large offsets.
git init repo &&
test_when_finished "rm -rf repo" &&
(
cd repo &&
(cd ../objects64 && pwd) >.git/objects/info/alternates &&
git multi-pack-index --object-dir=../objects64 write &&
midx=../objects64/pack/multi-pack-index &&
corrupt_chunk_file $midx LOFF clear &&
# using only %(objectsize) is important here; see the commit
# message for more details
test_must_fail git cat-file --batch-all-objects \
--batch-check="%(objectsize)" 2>err &&
cat >expect <<-\EOF &&
fatal: multi-pack-index large offset out of bounds
EOF
test_cmp expect err
)
'
test_expect_success 'reader notices too-small revindex chunk' '
# We only get a revindex with bitmaps (and likewise only
# load it when they are asked for).
test_config repack.writeBitmaps true &&
corrupt_chunk RIDX clear 00000000 &&
git -c core.multipackIndex=false rev-list \
--all --use-bitmap-index >expect.out &&
git -c core.multipackIndex=true rev-list \
--all --use-bitmap-index >out 2>err &&
test_cmp expect.out out &&
cat >expect.err <<-\EOF &&
error: multi-pack-index reverse-index chunk is the wrong size
warning: multi-pack bitmap is missing required reverse index
EOF
test_cmp expect.err err
'
test_expect_success 'reader notices out-of-bounds fanout' '
# This is similar to the out-of-bounds fanout test in t5318. The values
# in adjacent entries should be large but not identical (they
# are used as hi/lo starts for a binary search, which would then abort
# immediately).
corrupt_chunk OIDF 0 $(printf "%02x000000" $(test_seq 0 254)) &&
test_must_fail git log 2>err &&
cat >expect <<-\EOF &&
error: oid fanout out of order: fanout[254] = fe000000 > 5c = fanout[255]
fatal: multi-pack-index required OID fanout chunk missing or corrupted
EOF
test_cmp expect err
'
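The corruption payload in the out-of-bounds fanout test is built by a single printf call. The sketch below shows the same construction on a shortened argument list. It relies only on standard printf behaviour (the format string is reused for every argument); the variable name `payload` is illustrative and not part of the test suite.

```shell
# printf applies its format once per argument, so each number becomes
# one 4-byte fanout entry written as 8 hex digits: the value in the
# high byte, followed by three zero bytes.
payload=$(printf "%02x000000" 0 1 2 254)
echo "$payload"
```

In the real test, `test_seq 0 254` supplies 255 arguments, so entries 0 through 254 are overwritten with large, strictly increasing values while fanout[255] is left alone. That is consistent with the expected error, which reports fanout[254] = fe000000 as greater than the untouched final entry.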
midx: implement `BTMP` chunk

When a multi-pack bitmap is used to implement verbatim pack reuse (that is, when verbatim chunks from an on-disk packfile are copied directly[^1]), it does so by using its "preferred pack" as the source for pack-reuse.

This allows repositories to pack the majority of their objects into a single (often large) pack, and then use it as the single source for verbatim pack reuse. This increases the number of objects that are reused verbatim (and consequently, decreases the amount of time it takes to generate many packs). But this performance comes at a cost, which is that the preferred packfile must pace its growth with that of the entire repository in order to maintain the utility of verbatim pack reuse.

As repositories grow beyond what we can reasonably store in a single packfile, the utility of verbatim pack reuse diminishes. Or, at the very least, it becomes increasingly more expensive to maintain as the pack grows larger and larger.

It would be beneficial to be able to perform this same optimization over multiple packs, provided some modest constraints (most importantly, that the set of packs eligible for verbatim reuse are disjoint with respect to the subset of their objects being sent).

If we assume that the packs which we treat as candidates for verbatim reuse are disjoint with respect to any of their objects we may output, we need to make only modest modifications to the verbatim pack-reuse code itself. Most notably, we need to remove the assumption that the bits in the reachability bitmap corresponding to objects from the single reuse pack begin at the first bit position.

Future patches will unwind these assumptions and reimplement their existing functionality as special cases of the more general assumptions (e.g. that reuse bits can start anywhere within the bitset, but happen to start at 0 for all existing cases).

This patch does not yet relax any of those assumptions. Instead, it implements a foundational data-structure, the "Bitmapped Packs" (`BTMP`) chunk of the multi-pack index. The `BTMP` chunk's contents are described in detail here. Importantly, the `BTMP` chunk contains information to map regions of a multi-pack index's reachability bitmap to the packs whose objects they represent.

For now, this chunk is only written, not read (outside of the test-tool used in this patch to test the new chunk's behavior). Future patches will begin to make use of this new chunk.

[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the pack that wasn't used verbatim.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-14 22:23:51 +00:00
test_expect_success 'bitmapped packs are stored via the BTMP chunk' '
test_when_finished "rm -fr repo" &&
git init repo &&
(
cd repo &&
for i in 1 2 3 4 5
do
test_commit "$i" &&
git repack -d || return 1
done &&
find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename |
sort >packs &&
git multi-pack-index write --stdin-packs <packs &&
test_must_fail test-tool read-midx --bitmap $objdir 2>err &&
cat >expect <<-\EOF &&
error: MIDX does not contain the BTMP chunk
EOF
test_cmp expect err &&
git multi-pack-index write --stdin-packs --bitmap \
--preferred-pack="$(head -n1 <packs)" <packs &&
test-tool read-midx --bitmap $objdir >actual &&
for i in $(test_seq $(wc -l <packs))
do
sed -ne "${i}s/\.idx$/\.pack/p" packs &&
echo " bitmap_pos: $((($i - 1) * 3))" &&
echo " bitmap_nr: 3" || return 1
done >expect &&
test_cmp expect actual
)
'
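The expected output in the BTMP test follows a simple pattern: each pack covers a contiguous run of bits in the multi-pack bitmap, and the next pack starts where the previous one ends. The sketch below reproduces that arithmetic outside of Git. The pack names are hypothetical, and the count of three objects per pack is an assumption based on the test's setup (each `test_commit` followed by `git repack -d` presumably yields one blob, one tree, and one commit per pack).

```shell
# Hypothetical index names standing in for the test's "packs" file.
packs="pack-aaa.idx
pack-bbb.idx
pack-ccc.idx"
objects_per_pack=3  # assumption: blob + tree + commit per pack

i=0
expect=$(
    echo "$packs" | while read -r idx
    do
        # Same substitution the test performs with sed: .idx -> .pack
        echo "${idx%.idx}.pack"
        # Pack number i (counting from 0) starts at bit i * objects_per_pack.
        echo " bitmap_pos: $((i * objects_per_pack))"
        echo " bitmap_nr: $objects_per_pack"
        i=$((i + 1))
    done
)
echo "$expect"
```

With the five packs created by the test, the same formula gives starting positions 0, 3, 6, 9, and 12, which is what `$((($i - 1) * 3))` computes for `i` running from 1 to 5.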
test_done