git/builtin
Taylor Blau 37dc6d8104 builtin/repack.c: implement support for --max-cruft-size
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.

When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:

  - Repacking periodically, either by packing anything loose in the
    repository (via `git repack -d`) or producing a geometric sequence
    of packs (via `git repack --geometric=<d> -d`).

  - Every so often, splitting the repository into two packs: one cruft
    pack to store the unreachable objects, and one non-cruft pack to
    store the reachable objects.

Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.

This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).

This all works, but can be costly from an I/O perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if ever) pruned.

Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).

At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that missing support.

Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:

  - Sort a list of any existing cruft packs in ascending order of pack
    size.

  - Starting from the beginning of the list, group cruft packs together
    while the accumulated size is smaller than the maximum specified
    pack size.

  - Combine the objects in these cruft packs together into a new cruft
    pack, along with any other unreachable objects which have since
    entered the repository.
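
The selection steps above can be sketched roughly as follows. This is an
illustrative model only, not the code in builtin/repack.c: the
`CruftPack` type, the `select_packs_to_combine` helper, and the pack
names are hypothetical, and the strict "smaller than" comparison at the
boundary is an assumption taken from the wording above.

```python
from typing import List, NamedTuple, Tuple


class CruftPack(NamedTuple):
    name: str  # hypothetical pack identifier
    size: int  # on-disk size in bytes


def select_packs_to_combine(
    packs: List[CruftPack], max_cruft_size: int
) -> Tuple[List[CruftPack], List[CruftPack]]:
    """Split existing cruft packs into (combine, keep).

    Packs are visited in ascending order of size and added to the
    `combine` group while the accumulated size stays below
    `max_cruft_size`. Everything else is left untouched in `keep`.
    """
    combine: List[CruftPack] = []
    keep: List[CruftPack] = []
    total = 0
    for pack in sorted(packs, key=lambda p: p.size):
        if total + pack.size < max_cruft_size:
            combine.append(pack)
            total += pack.size
        else:
            # Because the list is sorted, every later pack lands here too.
            keep.append(pack)
    return combine, keep


packs = [
    CruftPack("pack-a", 700),
    CruftPack("pack-b", 100),
    CruftPack("pack-c", 250),
]
combine, keep = select_packs_to_combine(packs, max_cruft_size=500)
print([p.name for p in combine])  # packs rolled into the new cruft pack
print([p.name for p in keep])     # packs left as they are
```

In this model the packs in `combine` would be rewritten into the new
cruft pack together with any newly unreachable objects, while the packs
in `keep` stay on disk unchanged.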

Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
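
To get a feel for the difference, here is a toy model of the bytes
written over repeated repacks. It is a sketch under simplifying
assumptions, not a measurement of Git: each round adds a fixed number of
bytes of new unreachable objects, pack sizes simply add up (no
compression or delta effects), nothing is ever pruned, and the
`bytes_written` helper is hypothetical.

```python
from typing import List, Optional


def bytes_written(rounds: int, new_bytes: int,
                  max_cruft_size: Optional[int] = None) -> int:
    """Total bytes written to cruft packs over `rounds` repacks."""
    packs: List[int] = []  # sizes of the cruft packs currently on disk
    written = 0
    for _ in range(rounds):
        if max_cruft_size is None:
            # Single cruft pack: everything is rewritten every time.
            merged, kept = packs, []
        else:
            # Combine the smallest packs while they stay below the limit.
            merged, kept, total = [], [], 0
            for size in sorted(packs):
                if total + size < max_cruft_size:
                    merged.append(size)
                    total += size
                else:
                    kept.append(size)
        new_pack = sum(merged) + new_bytes
        written += new_pack
        packs = kept + [new_pack]
    return written


unbounded = bytes_written(rounds=100, new_bytes=10)
bounded = bytes_written(rounds=100, new_bytes=10, max_cruft_size=100)
print(unbounded, bounded)
```

With these made-up numbers the single-pack scheme writes 50500 bytes in
total, while the size-limited scheme writes 5500, because each pack is
frozen once it reaches the limit and is no longer rewritten.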

When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.

This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for which all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05 13:26:11 -07:00
add.c Merge branch 'jk/unused-post-2.42-part2' 2023-09-13 10:07:56 -07:00
am.c Merge branch 'ob/am-msgfix' 2023-09-29 09:04:16 -07:00
annotate.c strvec: rename struct fields 2020-07-30 19:18:06 -07:00
apply.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
archive.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
bisect.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
blame.c Merge branch 'cw/compat-util-header-cleanup' 2023-07-17 11:30:42 -07:00
branch.c Merge branch 'rj/branch-in-use-error-message' 2023-08-24 09:32:33 -07:00
bugreport.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
bundle.c Merge branch 'rs/bundle-parseopt-cleanup' 2023-08-07 11:57:18 -07:00
cat-file.c Merge branch 'cw/compat-util-header-cleanup' 2023-07-17 11:30:42 -07:00
check-attr.c check-attr: integrate with sparse-index 2023-08-11 09:44:52 -07:00
check-ignore.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
check-mailmap.c builtin.h: remove unneccessary includes 2023-06-21 13:39:54 -07:00
check-ref-format.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
checkout--worker.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
checkout-index.c parse-options: prefer opt->value to globals in callbacks 2023-09-05 14:48:17 -07:00
checkout.c Merge branch 'jc/unresolve-removal' 2023-10-02 11:20:00 -07:00
clean.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
clone.c Merge branch 'jc/transport-parseopt-fix' 2023-07-27 15:26:37 -07:00
column.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
commit-graph.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
commit-tree.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
commit.c Merge branch 'js/empty-index-fixes' 2023-07-08 11:23:07 -07:00
config.c Merge branch 'cw/compat-util-header-cleanup' 2023-07-17 11:30:42 -07:00
count-objects.c count-objects: mark unused parameter in alternates callback 2023-07-13 17:24:00 -07:00
credential-cache--daemon.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
credential-cache.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
credential-store.c Merge branch 'cw/strbuf-cleanup' 2023-07-06 11:54:46 -07:00
credential.c builtins: mark unused prefix parameters 2023-03-28 14:11:24 -07:00
describe.c Merge branch 'jk/unused-post-2.42-part2' 2023-09-13 10:07:56 -07:00
diagnose.c object-file.h: move declarations for object-file.c functions from cache.h 2023-04-11 08:52:10 -07:00
diff-files.c diff: drop useless "status" parameter from diff_result_code() 2023-08-21 15:33:24 -07:00
diff-index.c diff: drop useless "status" parameter from diff_result_code() 2023-08-21 15:33:24 -07:00
diff-tree.c diff: drop useless "status" parameter from diff_result_code() 2023-08-21 15:33:24 -07:00
diff.c diff --stat: add config option to limit filename width 2023-09-18 09:39:07 -07:00
difftool.c Merge branch 'cw/compat-util-header-cleanup' 2023-07-17 11:30:42 -07:00
fast-export.c parse-options: prefer opt->value to globals in callbacks 2023-09-05 14:48:17 -07:00
fast-import.c Merge branch 'ew/hash-with-openssl-evp' 2023-09-13 10:07:57 -07:00
fetch-pack.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
fetch.c Merge branch 'jk/unused-post-2.42-part2' 2023-09-13 10:07:56 -07:00
fmt-merge-msg.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
for-each-ref.c Merge branch 'tb/refs-exclusion-and-packed-refs' 2023-07-21 13:47:26 -07:00
for-each-repo.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
fsck.c fsck: use enum object_type for fsck_walk callback 2023-08-19 21:17:32 -07:00
fsmonitor--daemon.c run-command: mark unused parameters in start_bg_wait callbacks 2023-09-18 15:56:15 -07:00
gc.c builtin/repack.c: implement support for --max-cruft-size 2023-10-05 13:26:11 -07:00
get-tar-commit-id.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
grep.c Merge branch 'rs/grep-no-no-or' 2023-09-18 13:53:13 -07:00
hash-object.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
help.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
hook.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
index-pack.c Merge branch 'ew/hash-with-openssl-evp' 2023-09-13 10:07:57 -07:00
init-db.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
interpret-trailers.c interpret-trailers: mark unused "unset" parameters in option callbacks 2023-09-05 14:48:17 -07:00
log.c diff --stat: add config option to limit filename width 2023-09-18 09:39:07 -07:00
ls-files.c Merge branch 'rs/strbuf-expand-step' 2023-07-06 11:54:45 -07:00
ls-remote.c git-compat-util.h: remove unneccessary include of wildmatch.h 2023-06-21 13:39:54 -07:00
ls-tree.c ls-tree: mark unused parameter in callback 2023-08-29 17:56:24 -07:00
mailinfo.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
mailsplit.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
merge-base.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
merge-file.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
merge-index.c read-cache*.h: move declarations for read-cache.c functions from cache.h 2023-06-21 13:39:53 -07:00
merge-ours.c diff.h: remove unnecessary include of oidset.h 2023-06-21 13:39:53 -07:00
merge-recursive.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
merge-tree.c merge-tree: mark unused parameter in traverse callback 2023-07-13 17:24:00 -07:00
merge.c diff --stat: add config option to limit filename width 2023-09-18 09:39:07 -07:00
mktag.c fsck: mark unused parameters in various fsck callbacks 2023-07-13 17:24:00 -07:00
mktree.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
multi-pack-index.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
mv.c Merge branch 'jc/mv-d-to-d-error-message-fix' 2023-08-29 13:51:43 -07:00
name-rev.c name-rev: use OPT_HIDDEN_BOOL for --peel-tag 2023-09-05 14:58:44 -07:00
notes.c Merge branch 'tl/notes-separator' 2023-07-06 11:54:47 -07:00
pack-objects.c Merge branch 'jk/unused-post-2.42-part2' 2023-09-13 10:07:56 -07:00
pack-redundant.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
pack-refs.c pack-refs: teach pack-refs --include option 2023-05-12 14:54:14 -07:00
patch-id.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
prune-packed.c treewide: be explicit about dependence on gettext.h 2023-03-21 10:56:51 -07:00
prune.c Merge branch 'en/header-split-cache-h-part-3' 2023-06-29 16:43:21 -07:00
pull.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
push.c Merge branch 'jc/transport-parseopt-fix' 2023-07-27 15:26:37 -07:00
range-diff.c diff.h: remove unnecessary include of oidset.h 2023-06-21 13:39:53 -07:00
read-tree.c parse-options: mark unused "opt" parameter in callbacks 2023-09-05 14:48:17 -07:00
rebase.c diff --stat: add config option to limit filename width 2023-09-18 09:39:07 -07:00
receive-pack.c Merge branch 'ts/unpacklimit-config-fix' 2023-08-30 13:50:41 -07:00
reflog.c Merge branch 'gc/config-context' 2023-07-06 11:54:48 -07:00
remote-ext.c builtins: annotate always-empty prefix parameters 2023-03-28 14:11:24 -07:00
remote-fd.c builtins: annotate always-empty prefix parameters 2023-03-28 14:11:24 -07:00
remote.c Merge branch 'jc/parse-options-short-help' 2023-08-04 10:52:31 -07:00
repack.c builtin/repack.c: implement support for --max-cruft-size 2023-10-05 13:26:11 -07:00
replace.c replace: mark unused parameter in each_mergetag_fn callback 2023-07-13 17:24:00 -07:00
rerere.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
reset.c Merge branch 'jc/parse-options-reset' 2023-07-27 15:26:37 -07:00
rev-list.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
rev-parse.c Merge branch 'jk/unused-parameter' 2023-07-25 12:05:24 -07:00
revert.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
rm.c git-compat-util: move alloc macros to git-compat-util.h 2023-07-05 11:42:31 -07:00
send-pack.c config: add ctx arg to config_fn_t 2023-06-28 14:06:39 -07:00
shortlog.c diff.h: remove unnecessary include of oidset.h 2023-06-21 13:39:53 -07:00
show-branch.c Merge branch 'jc/parse-options-show-branch' 2023-07-27 15:26:37 -07:00
show-index.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
show-ref.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
sparse-checkout.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
stash.c Merge branch 'jk/unused-post-2.42' 2023-09-07 15:06:07 -07:00
stripspace.c Merge branch 'cw/strbuf-cleanup' 2023-07-06 11:54:46 -07:00
submodule--helper.c diff: drop useless "status" parameter from diff_result_code() 2023-08-21 15:33:24 -07:00
symbolic-ref.c git-compat-util: move strbuf.c funcs to its header 2023-07-05 11:41:18 -07:00
tag.c Merge branch 'jk/unused-parameter' 2023-07-25 12:05:24 -07:00
unpack-file.c treewide: remove unnecessary includes for wrapper.h 2023-07-05 11:41:59 -07:00
unpack-objects.c Merge branch 'ew/hash-with-openssl-evp' 2023-09-13 10:07:57 -07:00
update-index.c Merge branch 'jc/unresolve-removal' 2023-10-02 11:20:00 -07:00
update-ref.c update-ref: mark unused parameter in parser callbacks 2023-08-29 17:56:26 -07:00
update-server-info.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
upload-archive.c repository: remove unnecessary include of path.h 2023-06-21 13:39:53 -07:00
upload-pack.c Merge branch 'en/header-split-cache-h-part-3' 2023-06-29 16:43:21 -07:00
var.c Merge branch 'bc/more-git-var' 2023-09-13 10:07:57 -07:00
verify-commit.c object-store-ll.h: split this header out of object-store.h 2023-06-21 13:39:54 -07:00
verify-pack.c builtin.h: remove unneccessary includes 2023-06-21 13:39:54 -07:00
verify-tag.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00
worktree.c worktree: mark unused parameters in each_ref_fn callback 2023-08-29 17:56:24 -07:00
write-tree.c cache.h: remove this no-longer-used header 2023-06-21 13:39:53 -07:00