mirror of https://github.com/git/git synced 2024-07-07 19:39:27 +00:00
git/builtin
Han Xin aaf81223f4 unpack-objects: use stream_loose_object() to unpack large objects
Make use of the stream_loose_object() function introduced in the
preceding commit to unpack large objects. Before this we'd need to
malloc() the size of the blob before unpacking it, which could cause
OOM with very large blobs.

We could use the new streaming interface to unpack all blobs, but
doing so would be much slower, as demonstrated e.g. with this
benchmark using git-hyperfine[0]:

	rm -rf /tmp/scalar.git &&
	git clone --bare https://github.com/Microsoft/scalar.git /tmp/scalar.git &&
	mv /tmp/scalar.git/objects/pack/*.pack /tmp/scalar.git/my.pack &&
	git hyperfine \
		-r 2 --warmup 1 \
		-L rev origin/master,HEAD -L v "10,512,1k,1m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/scalar.git/my.pack'

With this change, lower core.bigFileThreshold settings make unpacking
slower, but we get lower memory use in return:

	Summary
	  './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master' ran
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.01 ± 0.02 times faster than './git -C dest.git -c core.bigFileThreshold=1m unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.02 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'origin/master'
	    1.09 ± 0.01 times faster than './git -C dest.git -c core.bigFileThreshold=1k unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.10 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'
	    1.11 ± 0.00 times faster than './git -C dest.git -c core.bigFileThreshold=10 unpack-objects </tmp/scalar.git/my.pack' in 'HEAD'

A better benchmark to demonstrate the benefits of that is this one,
which creates an artificial repo with 1, 25, 50, 75 and 100MB blobs:

	rm -rf /tmp/repo &&
	git init /tmp/repo &&
	(
		cd /tmp/repo &&
		for i in 1 25 50 75 100
		do
			dd if=/dev/urandom of=blob.$i count=$(($i*1024)) bs=1024
		done &&
		git add blob.* &&
		git commit -mblobs &&
		git gc &&
		PACK=$(echo .git/objects/pack/pack-*.pack) &&
		cp "$PACK" my.pack
	) &&
	git hyperfine \
		--show-output \
		-L rev origin/master,HEAD -L v "512,50m,100m" \
		-s 'make' \
		-p 'git init --bare dest.git' \
		-c 'rm -rf dest.git' \
		'/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold={v} unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum'

Using this test we'll always use >100MB of memory on
origin/master (around 105MB), but max out at e.g. ~55MB if we set
core.bigFileThreshold=50m.

The relevant "Maximum resident set size" lines were manually added
below the relevant benchmark:

  '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master' ran
        Maximum resident set size (kbytes): 107080
    1.02 ± 0.78 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 106968
    1.09 ± 0.79 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'origin/master'
        Maximum resident set size (kbytes): 107032
    1.42 ± 1.07 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=100m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 107072
    1.83 ± 1.02 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=50m unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 55704
    2.16 ± 1.19 times faster than '/usr/bin/time -v ./git -C dest.git -c core.bigFileThreshold=512 unpack-objects </tmp/repo/my.pack 2>&1 | grep Maximum' in 'HEAD'
        Maximum resident set size (kbytes): 4564

This shows that if you have enough memory this new streaming method is
slower the lower you set the streaming threshold, but the benefit is
more bounded memory use.
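
The memory figures above can be reasoned about with a small sketch. It
assumes (this is not stated in the message itself) that only blobs
strictly larger than core.bigFileThreshold are streamed, that smaller
or equal ones are still buffered whole, and that "k", "m" and "g"
suffixes multiply by powers of 1024. The helper names to_bytes and
largest_buffered_mb are hypothetical, made up for this illustration:

```shell
#!/bin/sh
# Convert a git-style size value ("512", "1k", "50m", "1g") to bytes.
# Assumption: suffixes scale by powers of 1024.
to_bytes() {
	case "$1" in
	*[kK]) echo $(( ${1%?} * 1024 )) ;;
	*[mM]) echo $(( ${1%?} * 1024 * 1024 )) ;;
	*[gG]) echo $(( ${1%?} * 1024 * 1024 * 1024 )) ;;
	*) echo "$1" ;;
	esac
}

# Print the size in MB of the largest test blob (1, 25, 50, 75, 100MB)
# that would still be held in memory as a whole for a given threshold.
largest_buffered_mb() {
	limit=$(to_bytes "$1")
	largest=0
	for mb in 1 25 50 75 100
	do
		size=$((mb * 1024 * 1024))
		# Assumed rule: stream only when size > threshold.
		if [ "$size" -le "$limit" ] && [ "$mb" -gt "$largest" ]
		then
			largest=$mb
		fi
	done
	echo "$largest"
}

for v in 512 50m 100m
do
	echo "core.bigFileThreshold=$v: largest buffered blob: $(largest_buffered_mb "$v")MB"
done
```

Under those assumptions the largest buffered blob is 0MB, 50MB and
100MB for thresholds of 512, 50m and 100m respectively, which is in
line with the ~4.5MB, ~55MB and ~107MB peaks measured on HEAD.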

An earlier version of this patch introduced a new
"core.bigFileStreamingThreshold" instead of re-using the existing
"core.bigFileThreshold" variable[1]. As noted in a detailed overview
of its users in [2], the existing variable has several different
meanings.

Still, we consider it good enough to simply re-use it. While it's
possible that someone might want to e.g. consider objects "small" for
the purposes of diffing but "big" for the purposes of writing them,
such use-cases are probably too obscure to worry about. We can always
split up "core.bigFileThreshold" in the future if there's a need for
that.

0. https://github.com/avar/git-hyperfine/
1. https://lore.kernel.org/git/20211210103435.83656-1-chiyutianyi@gmail.com/
2. https://lore.kernel.org/git/20220120112114.47618-5-chiyutianyi@gmail.com/

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Derrick Stolee <stolee@gmail.com>
Helped-by: Jiang Xin <zhiyou.jx@alibaba-inc.com>
Signed-off-by: Han Xin <chiyutianyi@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-06-13 10:22:36 -07:00
add.c Merge branch 'ns/batch-fsync' 2022-06-03 14:30:34 -07:00
am.c Merge branch 'ab/date-mode-release' 2022-02-25 15:47:36 -08:00
annotate.c
apply.c apply.c: remove unnecessary include 2022-04-06 09:42:14 -07:00
archive.c use xopen() to handle fatal open(2) failures 2021-08-25 14:39:08 -07:00
bisect--helper.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
blame.c Merge branch 'ea/progress-partial-blame' 2022-05-10 17:41:11 -07:00
branch.c Merge branch 'gc/branch-recurse-submodules' 2022-02-18 13:53:29 -08:00
bugreport.c hook-list.h: add a generated list of hooks, like config-list.h 2021-09-27 09:44:54 -07:00
bundle.c bundle: call strvec_clear() on allocated strvec 2022-03-04 13:24:18 -08:00
cat-file.c Merge branch 'jc/cat-file-batch-default-format-optim' 2022-03-23 14:09:31 -07:00
check-attr.c
check-ignore.c dir.[ch]: replace dir_init() with DIR_INIT 2021-07-01 12:32:22 -07:00
check-mailmap.c shortlog: remove unused(?) "repo-abbrev" feature 2021-01-12 14:04:42 -08:00
check-ref-format.c
checkout--worker.c pkt-line.[ch]: remove unused packet_read_line_buf() 2021-10-15 13:09:40 -07:00
checkout-index.c checkout-index: integrate with sparse index 2022-01-13 13:49:45 -08:00
checkout.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
clean.c Merge branch 'vd/sparse-clean-etc' 2022-02-17 16:25:05 -08:00
clone.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
column.c column: fix parsing of the '--nl' option 2021-08-26 14:36:27 -07:00
commit-graph.c commit-graph: fix memory leak in misused string_list API 2022-03-04 13:24:18 -08:00
commit-tree.c use xopen() to handle fatal open(2) failures 2021-08-25 14:39:08 -07:00
commit.c Merge branch 'ab/commit-plug-leaks' 2022-05-23 14:39:54 -07:00
config.c Merge branch 'mf/fix-type-in-config-h' 2022-03-16 17:53:07 -07:00
count-objects.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
credential-cache--daemon.c unix-socket: add backlog size option to unix_stream_listen() 2021-03-15 14:32:51 -07:00
credential-cache.c credential-cache: check for windows specific errors 2021-09-14 09:30:54 -07:00
credential-store.c Use a better name for the function interpolating paths 2021-07-26 12:17:16 -07:00
credential.c doc: fix git credential synopsis 2021-10-28 09:57:09 -07:00
describe.c i18n: turn even more messages into "cannot be used together" ones 2022-01-05 13:31:00 -08:00
diff-files.c Merge branch 'jc/diffcore-rotate' 2021-02-25 16:43:30 -08:00
diff-index.c diff-index: restore -c/--cc options handling 2021-09-07 11:11:35 -07:00
diff-tree.c 2.36 gitk/diff-tree --stdin regression fix 2022-04-26 09:26:35 -07:00
diff.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
difftool.c i18n: factorize more 'incompatible options' messages 2022-02-04 13:58:28 -08:00
env--helper.c
fast-export.c Merge branch 'rs/fast-export-pathspec-fix' into maint 2022-05-05 14:36:25 -07:00
fast-import.c Merge branch 'ns/core-fsyncmethod' 2022-03-25 16:38:24 -07:00
fetch-pack.c Merge branch 'rc/fetch-refetch' 2022-04-04 10:56:23 -07:00
fetch.c Merge branch 'jc/avoid-redundant-submodule-fetch' 2022-05-25 16:42:49 -07:00
fmt-merge-msg.c merge: allow to pretend a merge is made into a different branch 2021-12-20 14:55:02 -08:00
for-each-ref.c for-each-ref: delay parsing of --sort=<atom> options 2021-10-20 14:33:07 -07:00
for-each-repo.c builtin/for-each-repo: remove unnecessary argv copy to plug leak 2021-07-26 12:19:20 -07:00
fsck.c run-command API users: use strvec_pushl(), not argv construction 2021-11-25 22:15:07 -08:00
fsmonitor--daemon.c fsmonitor--daemon: use a cookie file to sync with file system 2022-03-25 16:04:17 -07:00
gc.c Merge branch 'tb/cruft-packs' 2022-06-03 14:30:37 -07:00
get-tar-commit-id.c
grep.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
hash-object.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
help.c Merge branch 'ab/help-fixes' 2022-03-09 13:38:24 -08:00
hook.c git hook run: add an --ignore-missing flag 2022-01-07 15:19:34 -08:00
index-pack.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
init-db.c i18n: refactor "foo and bar are mutually exclusive" 2022-01-05 13:29:23 -08:00
interpret-trailers.c
log.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
ls-files.c ls-files: support --recurse-submodules --stage 2022-02-23 16:41:55 -08:00
ls-remote.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
ls-tree.c Merge branch 'tl/ls-tree-oid-only' 2022-04-06 15:21:59 -07:00
mailinfo.c mailinfo: allow squelching quoted CRLF warning 2021-05-10 15:06:22 +09:00
mailsplit.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
merge-base.c merge-base: free() allocated "struct commit **" list 2022-03-04 13:24:17 -08:00
merge-file.c xdiff: implement a zealous diff3, or "zdiff3" 2021-12-01 14:45:58 -08:00
merge-index.c merge-index: ensure full index 2021-04-14 13:47:21 -07:00
merge-ours.c builtins + test helpers: use return instead of exit() in cmd_* 2021-06-09 09:15:58 +09:00
merge-recursive.c gettext API users: don't explicitly cast ngettext()'s "n" 2022-03-07 11:57:52 -08:00
merge-tree.c xdiff users: use designated initializers for out_line 2021-05-11 12:47:31 +09:00
merge.c hooks: fix an obscure TOCTOU "did we just run a hook?" race 2022-03-07 13:00:53 -08:00
mktag.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
mktree.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
multi-pack-index.c multi-pack-index: use --object-dir real path 2022-04-25 11:31:12 -07:00
mv.c mv: refuse to move sparse paths 2021-09-28 10:31:02 -07:00
name-rev.c Merge branch 'rs/name-rev-fix-free-after-use' into maint 2022-05-05 14:36:24 -07:00
notes.c Merge branch 'ab/object-file-api-updates' 2022-03-16 17:53:08 -07:00
pack-objects.c Merge branch 'tb/cruft-packs' 2022-06-03 14:30:37 -07:00
pack-redundant.c tree-wide: apply equals-null.cocci 2022-05-02 09:50:37 -07:00
pack-refs.c
patch-id.c patch-id: fix scan_hunk_header on diffs with 1 line of before/after 2022-02-02 11:24:23 -08:00
prune-packed.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
prune.c Merge branch 'ns/tmp-objdir' 2022-01-03 16:24:15 -08:00
pull.c Merge branch 'gc/pull-recurse-submodules' 2022-05-20 15:26:57 -07:00
push.c push: new config option "push.autoSetupRemote" supports "simple" push 2022-04-29 11:20:55 -07:00
range-diff.c column, range-diff: downcase option description 2021-03-29 14:06:08 -07:00
read-tree.c read-tree: make three-way merge sparse-aware 2022-03-01 12:36:01 -08:00
rebase.c Merge branch 'ea/rebase-code-simplify' 2022-05-11 13:56:22 -07:00
receive-pack.c Merge branch 'tb/receive-pack-code-cleanup' 2022-05-25 16:42:49 -07:00
reflog.c reflog: fix 'show' subcommand's argv 2022-03-28 15:45:46 -07:00
remote-ext.c
remote-fd.c
remote.c builtin/remote.c: teach -v to list filters for promisor remotes 2022-05-09 10:53:58 -07:00
repack.c Merge branch 'tb/cruft-packs' 2022-06-03 14:30:37 -07:00
replace.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
rerere.c xdiff users: use designated initializers for out_line 2021-05-11 12:47:31 +09:00
reset.c reset: show --no-refresh in the short-help 2022-03-24 13:36:21 -07:00
rev-list.c Merge branch 'ds/partial-bundles' 2022-03-21 15:14:24 -07:00
rev-parse.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
revert.c Merge branch 'ds/mergies-with-sparse-index' 2021-09-20 15:20:45 -07:00
rm.c Merge branch 'ja/i18n-similar-messages' 2022-01-10 11:52:56 -08:00
send-pack.c i18n: factorize "invalid value" messages 2022-02-04 13:58:28 -08:00
shortlog.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
show-branch.c Merge branch 'jc/show-branch-g-current' 2022-05-25 16:42:47 -07:00
show-index.c builtin/show-index: set the algorithm for object IDs 2021-04-27 16:31:39 +09:00
show-ref.c refs: switch peel_ref() to peel_iterated_oid() 2021-01-21 15:51:31 -08:00
sparse-checkout.c Merge branch 'ds/sparse-sparse-checkout' 2022-06-03 14:30:35 -07:00
stash.c stash: apply stash using 'merge_ort_nonrecursive()' 2022-05-10 16:45:12 -07:00
stripspace.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
submodule--helper.c Merge branch 'jx/l10n-workflow-change' 2022-06-03 14:30:36 -07:00
symbolic-ref.c symbolic-ref: don't leak shortened refname in check_symref() 2021-03-14 15:57:59 -07:00
tag.c Merge branch 'ep/maint-equals-null-cocci' 2022-05-20 15:26:59 -07:00
unpack-file.c
unpack-objects.c unpack-objects: use stream_loose_object() to unpack large objects 2022-06-13 10:22:36 -07:00
update-index.c Merge branch 'ns/batch-fsync' 2022-06-03 14:30:34 -07:00
update-ref.c update-ref: fix streaming of status updates 2021-09-03 11:35:15 -07:00
update-server-info.c i18n: remove from i18n strings that do not hold translatable parts 2022-02-04 13:58:28 -08:00
upload-archive.c upload-archive: use regular "struct child_process" pattern 2021-11-25 22:15:07 -08:00
upload-pack.c upload-pack: document and rename --advertise-refs 2021-08-05 08:59:37 -07:00
var.c var: add GIT_DEFAULT_BRANCH variable 2021-11-03 13:25:36 -07:00
verify-commit.c
verify-pack.c
verify-tag.c
worktree.c Merge branch 'pw/worktree-list-with-z' 2022-04-04 10:56:25 -07:00
write-tree.c