git/t/t6115-rev-list-du.sh


rev-list: add --disk-usage option for calculating disk usage

It can sometimes be useful to see which refs are contributing to the
overall repository size (e.g., does some branch have a bunch of objects
not found elsewhere in history, which indicates that deleting it would
shrink the size of a clone).

You can find that out by generating a list of objects, getting their
sizes from cat-file, and then summing them, like:

    git rev-list --objects --no-object-names main..branch |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'

Though note that the caveats from git-cat-file(1) apply here. We "blame"
base objects more than their deltas, even though the relationship could
easily be flipped. Still, it can be a useful rough measure.

But one problem is that it's slow to run. Teaching rev-list to sum up
the sizes can be much faster for two reasons:

  1. It skips all of the piping of object names and sizes.

  2. If bitmaps are in use, for objects that are in the bitmapped
     packfile we can skip the oid_object_info() lookup entirely, and
     just ask the revindex for the on-disk size.

This patch implements a --disk-usage option which produces the same
answer in a fraction of the time. Here are some timings using a clone of
torvalds/linux:

  [rev-list piped to cat-file, no bitmaps]
  $ time git rev-list --objects --no-object-names --all |
    git cat-file --buffer --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real 0m29.635s
  user 0m38.003s
  sys  0m1.093s

  [internal, no bitmaps]
  $ time git rev-list --disk-usage --objects --all
  1459938510
  real 0m31.262s
  user 0m30.885s
  sys  0m0.376s

Even though the wall-clock time is slightly worse due to parallelism,
notice the CPU savings between the two. We saved 21% of the CPU just by
avoiding the pipes.

But the real win is with bitmaps. If we use them without the new option:

  [rev-list piped to cat-file, bitmaps]
  $ time git rev-list --objects --no-object-names --all --use-bitmap-index |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$total += $_; END { print $total }'
  1459938510
  real 0m6.244s
  user 0m8.452s
  sys  0m0.311s

then we're faster to generate the list of objects, but we still spend a
lot of time piping and looking things up. But if we do both together:

  [internal, bitmaps]
  $ time git rev-list --disk-usage --objects --all --use-bitmap-index
  1459938510
  real 0m0.219s
  user 0m0.169s
  sys  0m0.049s

then we get the same answer much faster.

For "--all", that answer will correspond closely to "du objects/pack",
of course. But we're actually checking reachability here, so we're still
fast when we ask for more interesting things:

  $ time git rev-list --disk-usage --use-bitmap-index v5.0..v5.10
  374798628
  real 0m0.429s
  user 0m0.356s
  sys  0m0.072s

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

2021-02-09 10:53:50 +00:00
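
The summing stage of the pipeline described in the commit message can be sketched in plain shell. This is only an illustration under stated assumptions: `sum_sizes` is a hypothetical helper name (it is not part of git or of this test script), it stands in for the perl one-liner, and the three sizes fed to it are made-up numbers rather than real cat-file output.

```shell
#!/bin/sh
# sum_sizes is a hypothetical helper (not part of git): it reads one
# integer per line on stdin and prints the total. It plays the role of
# the perl one-liner in the commit message.
sum_sizes () {
	total=0
	while read -r size
	do
		# Skip blank lines so the arithmetic expansion stays valid.
		test -n "$size" || continue
		total=$((total + size))
	done
	echo "$total"
}

# Stand-in for cat-file output: three made-up on-disk sizes in bytes.
printf '%s\n' 120 45 300 | sum_sizes
```

In a real repository, the `printf` stand-in would be replaced by `git rev-list --objects --no-object-names main..branch | git cat-file --batch-check='%(objectsize:disk)'`, and the total should match what `git rev-list --disk-usage --objects main..branch` prints directly.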
#!/bin/sh
test_description='basic tests of rev-list --disk-usage'
. ./test-lib.sh
# we want a mix of reachable and unreachable, as well as
# objects in the bitmapped pack and some outside of it
test_expect_success 'set up repository' '
	test_commit --no-tag one &&
	test_commit --no-tag two &&
	git repack -adb &&
	git reset --hard HEAD^ &&
	test_commit --no-tag three &&
	test_commit --no-tag four &&
	git reset --hard HEAD^
'
# We don't want to hardcode sizes, because they depend on the exact details of
# packing, zlib, etc. We'll assume that the regular rev-list and cat-file
# machinery works and compare the --disk-usage output to that.
disk_usage_slow () {
	git rev-list --no-object-names "$@" |
	git cat-file --batch-check="%(objectsize:disk)" |
	perl -lne '$total += $_; END { print $total}'
}
# check behavior with given rev-list options; note that
# whitespace is not preserved in args
check_du () {
	args=$*

	test_expect_success "generate expected size ($args)" "
		disk_usage_slow $args >expect
	"

	test_expect_success "rev-list --disk-usage without bitmaps ($args)" "
		git rev-list --disk-usage $args >actual &&
		test_cmp expect actual
	"

	test_expect_success "rev-list --disk-usage with bitmaps ($args)" "
		git rev-list --disk-usage --use-bitmap-index $args >actual &&
		test_cmp expect actual
	"
}
check_du HEAD
check_du --objects HEAD
check_du --objects HEAD^..HEAD
pack-bitmap: drop --unpacked non-commit objects from results

When performing revision queries with `--objects` and
`--use-bitmap-index`, the output may incorrectly contain objects which
are packed, even when the `--unpacked` option is given. This affects
traversals, but also other querying operations, like `--count`,
`--disk-usage`, etc.

Like in the previous commit, the fix is to exclude those objects from
the result set before they are shown to the user (or, in this case,
before the bitmap containing the result of the traversal is enumerated
and its objects listed).

This is performed by a new function in pack-bitmap.c, called
`filter_packed_objects_from_bitmap()`. Note that we do not have to
inspect individual bits in the result bitmap, since we know that the
first N (where N is the number of objects in the bitmap's pack/MIDX)
bits correspond to objects which are packed by definition.

In other words, for an object to have a bitmap position (not in the
extended index), it must appear in either the bitmap's pack or one of
the packs in its MIDX.

This presents an appealing optimization to us, which is that we can
simply memset() the corresponding number of `eword_t`'s to zero,
provided that we handle any objects which spill into the next word (but
don't occupy all 64 bits of the word itself).

We only have to handle objects in the bitmap's extended index. These
objects may (or may not) appear in one or more pack(s). Since these
objects are known to not appear in either the bitmap's MIDX or pack,
they may be stored as loose, appear in other pack(s), or both.

Before returning a bitmap containing the result of the traversal back to
the caller, drop any bits from the extended index which appear in one or
more packs. This implements the correct behavior for rev-list operations
which use the bitmap index to compute their result.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>

2023-11-06 22:56:33 +00:00
test_expect_success 'setup for --unpacked tests' '
	git repack -adb &&
	test_commit unpacked
'
check_du --all --objects --unpacked
# As mentioned above, don't hardcode the expected size; compare against
# the output from git cat-file instead.
test_expect_success 'rev-list --disk-usage=human' '
	git rev-list --objects HEAD --disk-usage=human >actual &&
	disk_usage_slow --objects HEAD >actual_size &&
	grep "$(cat actual_size) bytes" actual
'
test_expect_success 'rev-list --disk-usage=human with bitmaps' '
	git rev-list --objects HEAD --use-bitmap-index --disk-usage=human >actual &&
	disk_usage_slow --objects HEAD >actual_size &&
	grep "$(cat actual_size) bytes" actual
'
test_expect_success 'rev-list use --disk-usage improperly' '
	test_must_fail git rev-list --objects HEAD --disk-usage=typo 2>err &&
	cat >expect <<-\EOF &&
	fatal: invalid value for '\''--disk-usage=<format>'\'': '\''typo'\'', the only allowed format is '\''human'\''
	EOF
	test_cmp expect err
'
test_done