When a multi-pack bitmap is used to implement verbatim pack reuse (that
is, when verbatim chunks from an on-disk packfile are copied
directly[^1]), it does so by using its "preferred pack" as the source
for pack-reuse.
This allows repositories to pack the majority of their objects into a
single (often large) pack, and then use it as the single source for
verbatim pack reuse. This increases the number of objects that are
reused verbatim (and consequently, decreases the amount of time it takes
to generate many packs). But this performance comes at a cost, which is
that the preferred packfile must pace its growth with that of the entire
repository in order to maintain the utility of verbatim pack reuse.
As repositories grow beyond what we can reasonably store in a single
packfile, the utility of verbatim pack reuse diminishes. Or, at the very
least, it becomes increasingly expensive to maintain as the pack
grows larger and larger.
It would be beneficial to be able to perform this same optimization over
multiple packs, provided some modest constraints (most importantly, that
the set of packs eligible for verbatim reuse is disjoint with respect
to the subset of their objects being sent).
If we assume that the packs which we treat as candidates for verbatim
reuse are disjoint with respect to any of their objects we may output,
we need to make only modest modifications to the verbatim pack-reuse
code itself. Most notably, we need to remove the assumption that the
bits in the reachability bitmap corresponding to objects from the single
reuse pack begin at the first bit position.
Future patches will unwind these assumptions and reimplement their
existing functionality as special cases of the more general assumptions
(e.g. that reuse bits can start anywhere within the bitset, but happen
to start at 0 for all existing cases).
This patch does not yet relax any of those assumptions. Instead, it
implements a foundational data-structure, the "Bitmapped Packs" (`BTMP`)
chunk of the multi-pack index. The `BTMP` chunk's contents are described
in detail here. Importantly, the `BTMP` chunk contains information to
map regions of a multi-pack index's reachability bitmap to the packs
whose objects they represent.
For now, this chunk is only written, not read (outside of the test-tool
used in this patch to test the new chunk's behavior). Future patches
will begin to make use of this new chunk.
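To make the mapping concrete, here is a sketch of the kind of per-pack
record the `BTMP` chunk stores (field names are illustrative, not
necessarily the ones used in midx.h):

    struct bitmapped_pack_entry {
        uint32_t bitmap_pos; /* first bit position occupied by an
                              * object selected from this pack */
        uint32_t bitmap_nr;  /* number of bits whose objects are
                              * selected from this pack */
    };

One such entry is written per pack, in the same order the packs appear
in the existing `PNAM` chunk.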
[^1]: Modulo patching any `OFS_DELTA`'s that cross over a region of the
pack that wasn't used verbatim.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up for jk/chunk-bounds topic.
* jk/chunk-bounds-more:
commit-graph: mark chunk error messages for translation
commit-graph: drop verify_commit_graph_lite()
commit-graph: check order while reading fanout chunk
commit-graph: use fanout value for graph size
commit-graph: abort as soon as we see a bogus chunk
commit-graph: clarify missing-chunk error messages
commit-graph: drop redundant call to "lite" verification
midx: check consistency of fanout table
commit-graph: handle overflow in chunk_size checks
The commit-graph, midx, and pack idx on-disk formats all have oid fanout
tables which are fed to bsearch_hash(). If these tables do not increase
monotonically, then the binary search may not only produce bogus values,
it may cause out of bounds reads.
We fixed this for commit graphs in 4169d89645 (commit-graph: check
consistency of fanout table, 2023-10-09). That commit argued that we did
not need to do the same for midx and pack idx files, because they
already did this check. However, that is wrong. We _do_ check the fanout
table for pack idx files when we load them, but we only do so for midx
files when running "git multi-pack-index verify". So it is possible to
get an out-of-bounds read by running a normal command with a specially
crafted midx file.
Let's fix this using the same solution (and roughly the same test) we
did for the commit-graph in 4169d89645. This replaces the same check
from "multi-pack-index verify", because verify uses the same read
routines, we'd bail on reading the midx much sooner now. So let's make
sure to copy its verbose error message.
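For reference, a minimal sketch of the kind of check the midx fanout
parser now performs (simplified from the real callback):

    int i;
    for (i = 0; i < 255; i++) {
        if (get_be32(chunk + 4 * i) > get_be32(chunk + 4 * (i + 1))) {
            error(_("oid fanout out of order"));
            return 1; /* reject the chunk, aborting the midx load */
        }
    }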
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Another step to deprecate test_i18ngrep.
* jc/test-i18ngrep:
tests: teach callers of test_i18ngrep to use test_grep
test framework: further deprecate test_i18ngrep
They are equivalents and the former still exists, so as long as the
only change this commit makes is to rewrite test_i18ngrep to
test_grep, there won't be any new bug, even if there still are
callers of test_i18ngrep remaining in the tree, or when merged to
other topics that add new uses of test_i18ngrep.
This patch was produced more or less with
git grep -l -e 'test_i18ngrep ' 't/t[0-9][0-9][0-9][0-9]-*.sh' |
xargs perl -p -i -e 's/test_i18ngrep /test_grep /'
and a good way to sanity check the result yourself is to run the
above in a checkout of c4603c1c (test framework: further deprecate
test_i18ngrep, 2023-10-31) and compare the resulting working tree
contents with the result of applying this patch to the same commit.
You'll see that test_i18ngrep in a few t/lib-*.sh files was corrected
by hand, in addition to the mechanical conversion above.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The test t5319.88 ("reader bounds-checks large offset table") can fail
intermittently. The failure mode looks like this:
1. An earlier test sets up "objects64", a directory that can be used
to produce a midx with a corrupted large-offsets table. To get the
large offsets, it corrupts the normal ".idx" file to have a fake
large offset, and then builds a midx from that.
That midx now has a large offset table, which is what we want. But
we also have a .idx on disk that has a corrupted entry. We'll call
the object with the corrupted large-offset "X".
2. In t5319.88, we further corrupt the midx by reducing the size of
the large-offset chunk (because our goal is to make sure we do not
do an out-of-bounds read on it).
3. We then enumerate all of the objects with "cat-file --batch-check
--batch-all-objects", expecting to see a complaint when we try to
show object X. We use --batch-all-objects because our objects64
repo doesn't actually have any refs (but if we check them all, one
of them will be the failing one). The default batch-check format
includes %(objecttype) and %(objectsize), both of which require us
to access the actual pack data (and thus requires looking at the
offset).
4a. Usually, this succeeds. We try to output object X, do a lookup via
the midx for the type/size lookup, and run into the corrupt
large-offset table.
4b. But sometimes we hit a different error. If another object points
to X as a delta base, then trying to find the type of that object
requires walking the delta chain to the base entry (since only the
base has the concrete type; deltas themselves are either OFS_DELTA
or REF_DELTA).
Normally this would not require separate offset lookups at all, as
deltas are usually stored as OFS_DELTA, specifying the relative
offset to the base. But the corrupt idx created in step 1 is done
directly with "git pack-objects" and does not pass the
--delta-base-offset option, meaning we have REF_DELTA entries!
Those do have to consult an index to find the location of the base
object, and they use the pack .idx to do this. The same pack .idx
that we know is corrupted from step 1!
Git does notice the error, but it does so by seeing the corrupt
.idx file, not the corrupt midx file, and the error it reports is
different, causing the test to fail.
The set of objects created in the test is deterministic. But the delta
selection seems not to be (which is not too surprising, as it is
multi-threaded). I have seen the failure in Windows CI but haven't
reproduced it locally (not even with --stress). Re-running a failed
Windows CI job tends to work. But when I download and examine the trash
directory from a failed run, it shows a different set of deltas than I
get locally. But the exact source of non-determinism isn't that
important; our test should be robust against any order.
There are a few options to fix this:
a. It would be OK for the "objects64" setup to "unbreak" the .idx file
after generating the midx. But then it would be hard for subsequent
tests to reuse it, since it is the corrupted idx that forces the
midx to have a large offset table.
b. The "objects64" setup could use --delta-base-offset. This would fix
our problem, but earlier tests have many hard-coded offsets. Using
OFS_DELTA would change the locations of objects in the pack (this
might even be OK because I think most of the offsets are within the
.idx file, but it seems brittle and I'm afraid to touch it).
c. Our cat-file output is in oid order by default. Since we store
bases before deltas, if we went in pack order (using the
"--unordered" flag), we'd always see our corrupt X before any delta
which depends on it. But using "--unordered" means we skip the midx
entirely. That makes sense, since it is just enumerating all of
the packs, using the offsets found in their .idx files directly.
So it doesn't work for our test.
d. We could ask directly about object X, rather than enumerating all
of them. But that requires further hard-coding of the oid (both
sha1 and sha256) of object X. I'd prefer not to introduce more
brittleness.
e. We can use a --batch-check format that looks at the pack data, but
doesn't have to chase deltas. The problem in this case is
%(objecttype), which has to walk to the base. But %(objectsize)
does not; we can get the value directly from the delta itself.
Another option would be %(deltabase), where we report the REF_DELTA
name but don't look at its data.
I've gone with option (e) here. It's kind of subtle, but it's simple and
has no side effects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we load a revindex from disk, we check the size of the file
compared to the number of objects we expect it to have. But when we use
a RIDX chunk stored directly in the midx, we just access the memory
directly. This can lead to out-of-bounds memory access for a corrupted
or malicious multi-pack-index file.
We can catch this by recording the RIDX chunk size, and then checking it
against the expected size when we "load" the revindex. Note that this
check is much simpler than the one that load_revindex_from_disk() does,
because we just have the data array with no header (so we do not need
to account for the header size, nor do we need to bother validating
the header values).
The test confirms both that we catch this case, and that we continue the
process (the revindex is required to use the midx bitmaps, but we
fall back to a non-bitmap traversal).
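A sketch of the added check, with illustrative field names (the chunk
length is recorded while parsing the table of contents):

    /* when "loading" the RIDX chunk as the revindex */
    if (m->chunk_revindex_len != st_mult(sizeof(uint32_t), m->num_objects))
        return error(_("multi-pack-index reverse-index chunk is the wrong size"));

Each revindex entry is a single uint32_t, so the expected size is just
one word per object; no header needs to be accounted for.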
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we see a large offset bit in the regular midx offset table, we
use the entry as an index into a separate large offset table (just like
a pack idx does). But we don't bounds-check the access to that large
offset table (nor even record its size when we parse the chunk!).
The equivalent code for a regular pack idx is in check_pack_index_ptr().
But things are a bit simpler here because of the chunked format: we can
just check our array index directly.
As a bonus, we can get rid of the st_mult() here. If our array
bounds-check is successful, then we know that the result will fit in a
size_t (and the bounds check uses a division to avoid overflow
entirely).
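A sketch of the shape of the fix (names illustrative):

    if (off >= m->chunk_large_offsets_len / sizeof(uint64_t))
        die(_("multi-pack-index large offset out of bounds"));
    return get_be64(m->chunk_large_offsets + sizeof(uint64_t) * off);

Dividing the recorded chunk length by the entry width sidesteps the
multiplication overflow, and once the index check passes, the byte
offset is known to fit in a size_t.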
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The object offset chunk has one fixed-size entry for each object in the
midx. But since we don't check its size, we may access out-of-bounds
memory if we see a corrupt or malicious midx file.
Since the entries are fixed-size, the total length can be known up-front,
and we can just check it while parsing the chunk (this is similar to
what we do when opening pack idx files, which contain a similar offset
table).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certainly UBSan will complain, since it is undefined behavior.
Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.
We have two options here:
1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
latter handles alignment itself. It's possible that it's slightly
slower (though in practice I'm not sure how true that is,
especially for these code paths which then go on to do a binary
search).
2. Enforce the alignment when reading the chunks. This is easy to do,
since the table-of-contents reader can check it in one spot.
I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.
The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).
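One possible shape for the check, performed once per entry as the
table of contents is read (names illustrative; the real interface may
differ):

    if (chunk_offset % expected_alignment)
        return error(_("chunk is not properly aligned"));

where midx passes an expected_alignment of 4 and the commit-graph
passes 1 (accepting any alignment).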
The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We parse the pack-name chunk as a series of NUL-terminated strings. But
since we don't look at the chunk size, there's nothing to guarantee that
we don't parse off the end of the chunk (or even off the end of the
mapped file).
We can record the length, and then as we parse make sure that we never
walk past it.
The new test exercises the case, though note that it does not actually
segfault before this patch. It hits a NUL byte somewhere in one of the
other chunks, and comes up with a garbage pack name. You could construct
one that reads out-of-bounds (e.g., a PNAM chunk at the end of file),
but this case is simple and sufficient to check that we detect the
problem.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When reading an on-disk multi-pack-index, we take the number of objects
in the midx from the final value of the fanout table. But we just
blindly assume that the chunk containing the actual oid entries is the
correct size. This can lead to us reading out-of-bounds memory if the
lookup chunk is too small (or if the fanout is corrupted; when they
don't agree we cannot tell which one is wrong).
Note that we bump the assignment of m->num_objects into the fanout
parser callback, so that it's set when we parse the lookup table
(otherwise we'd have to manually record the lookup table size and check
it later).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When we load the oid-fanout chunk, our callback checks that its size is
reasonable and returns an error if not. However, the caller only checks
our return value against CHUNK_NOT_FOUND, so we end up ignoring the
error completely! Using a too-small fanout table means we end up
accessing random memory for the fanout and segfault.
We can fix this by checking for any non-zero return value, rather than
just CHUNK_NOT_FOUND, and adjusting our error message to cover both
cases. We could handle each error code individually, but there's not
much point for such a rare case. The extra message produced in the
callback makes it clear what is going on.
The same pattern is used in the adjacent code. Those cases are actually
OK for now because they do not use a custom callback, so the only error
they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an
accident waiting to happen (especially as we convert some of them away
from pair_chunk). The error messages are more verbose, but it should be
rare for a user to see these anyway.
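The resulting pattern looks roughly like this, using the chunk-format
API's read_chunk() with the custom callback:

    if (read_chunk(cf, MIDX_CHUNKID_OIDFANOUT, midx_read_oid_fanout, m)) {
        error(_("multi-pack-index required OID fanout chunk missing or corrupted"));
        goto cleanup_fail;
    }

Any non-zero return, whether CHUNK_NOT_FOUND or an error propagated
from the callback, now aborts loading the midx.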
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a similar spirit as the previous commit, address a bug where `git
fsck` produces output when calling `git multi-pack-index verify` even
when invoked with `--no-progress`.
$ git.compile fsck --connectivity-only --no-progress --no-dangling
Verifying OID order in multi-pack-index: 100% (605677/605677), done.
Sorting objects by packfile: 100% (605678/605678), done.
Verifying object offsets: 100% (605678/605678), done.
The three lines produced by `git fsck` come from `git multi-pack-index
verify`, but should be squelched due to `--no-progress`.
The MIDX machinery learned to generate these progress messages as early
as 430efb8a74 (midx: add progress indicators in multi-pack-index
verify, 2019-03-21), but did not respect `--progress` or `--no-progress`
until ad60096d1c (midx: honor the MIDX_PROGRESS flag in
verify_midx_file, 2019-10-21).
But the `git multi-pack-index verify` step was added to fsck in
66ec0390e7 (fsck: verify multi-pack-index, 2018-09-13), pre-dating any
of the above patches.
Pass `--[no-]progress` as appropriate to ensure that we don't produce
output when told not to.
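A sketch of the fix in builtin/fsck.c (variable names are illustrative;
the real code may build the child process differently):

    struct child_process midx_verify = CHILD_PROCESS_INIT;

    midx_verify.git_cmd = 1;
    strvec_pushl(&midx_verify.args, "multi-pack-index", "verify", NULL);
    strvec_push(&midx_verify.args,
                show_progress ? "--progress" : "--no-progress");
    run_command(&midx_verify);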
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When asked to write a multi-pack-index the user can specify a preferred
pack that is used as a tie breaker when multiple packs contain the same
objects. When this packfile cannot be found, we just pick the first pack
that is getting tracked by the newly written multi-pack-index as a
fallback.
Picking the fallback can fail in the case where we're asked to write a
multi-pack-index with no packfiles at all: picking the fallback value
will cause a segfault as we blindly index into the array of packfiles,
which would be empty.
Fix this bug by resetting the preferred packfile index to `-1` before
searching for the preferred pack. This fixes the segfault as we already
check for whether the index is `> -1`. If it is not, we simply don't
pick a preferred packfile at all.
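A sketch of the fixed lookup (names illustrative):

    ctx.preferred_pack_idx = -1;
    if (preferred_pack_name) {
        for (i = 0; i < ctx.nr; i++) {
            if (!strcmp(ctx.info[i].pack_name, preferred_pack_name)) {
                ctx.preferred_pack_idx = i;
                break;
            }
        }
    }

    if (ctx.preferred_pack_idx > -1)
        ctx.info[ctx.preferred_pack_idx].preferred = 1;

With zero packs the loop body never runs, the index stays at `-1`, and
no preferred pack is picked, rather than indexing an empty array.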
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 647982bb71 (delta-islands: free island_marks and bitmaps, 2023-02-03)
we introduced logic to free `island_marks` in order to reduce heap
memory usage in git-pack-objects(1). That commit causes segfaults in
the case where the command does not load delta islands at all, e.g.
when reading object IDs from standard input. One such crash can be hit
when repacking multi-pack-indices with delta islands enabled.
The root cause of this bug is that we unconditionally dereference the
`island_marks` variable even in the case where it is a `NULL` pointer,
which is fixed by making it conditional. Note that we still leave the
logic in place to set the pointer to `-1` to detect use-after-free bugs
even when there are no loaded island marks at all.
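A sketch of the fix (treat the hashmap details as illustrative):

    void free_island_marks(void)
    {
        if (island_marks) {
            /* ... release each island bitmap, then the map itself ... */
            kh_destroy_oid_map(island_marks);
        }
        /* poison the pointer to detect use-after-free bugs */
        island_marks = (void *)(uintptr_t)-1;
    }

The NULL check makes freeing safe when delta islands were never
loaded, while the poison value still catches any later dereference.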
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Acked-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply the same treatment with respect to cruft packs as a few commits
ago, this time to `repack` with a non-zero `--batch-size`.
Since the case of a non-zero `--batch-size` is handled separately (in
`fill_included_packs_batch()` instead of `fill_included_packs_all()`), a
separate fix must be applied for this case.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `repack` sub-command of the `git multi-pack-index` builtin creates a
new pack aggregating smaller packs contained in the MIDX up to some
given `--batch-size`.
When `--batch-size=0`, this instructs the MIDX builtin to repack
everything contained in the MIDX into a single pack.
In a similar spirit as a previous commit, it is undesirable to repack the
contents of a cruft pack in this step. Teach `repack` to ignore any
cruft pack(s) when `--batch-size=0` for the same reason(s).
(The case of a non-zero `--batch-size` will be handled in a subsequent
commit).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `expire` sub-command unlinks any packs that are (a) contained in the
MIDX, but (b) have no objects referenced by the MIDX.
This sub-command ignores `.keep` packs, which remain on-disk even if
they have no objects referenced by the MIDX. Cruft packs, however,
aren't given the same treatment: if none of the objects contained in the
cruft pack are selected from the cruft pack by the MIDX, then the cruft
pack is eligible to be expired.
This is less than desirable, since the cruft pack has important
metadata about the individual object mtimes, which is useful to
determine how quickly an object should age out of the repository when
pruning.
Ordinarily, we wouldn't expect the contents of a cruft pack to be
duplicated across non-cruft packs (and we'd expect to see the MIDX
select all cruft objects from other sources even less often). But
nonetheless, it is still possible to trick the `expire` sub-command into
removing the `.mtimes` file in this circumstance.
Teach the `expire` sub-command to ignore cruft packs in the same manner
as it does `.keep` packs, in order to keep their metadata around, even
when they are unreferenced by the MIDX.
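A sketch of the expire loop's guard after this change (names
illustrative):

    if (count[i])
        continue; /* the MIDX still references objects in this pack */
    if (prepare_midx_pack(r, m, i))
        continue;
    if (m->packs[i]->pack_keep || m->packs[i]->is_cruft)
        continue; /* never expire .keep or cruft packs */
    /* otherwise, the pack is eligible for expiry */

Cruft packs thus survive expiry exactly as `.keep` packs do, keeping
their `.mtimes` metadata on disk.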
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Broken &&-chains in the test scripts have been corrected.
* es/test-chain-lint:
t6000-t9999: detect and signal failure within loop
t5000-t5999: detect and signal failure within loop
t4000-t4999: detect and signal failure within loop
t0000-t3999: detect and signal failure within loop
tests: simplify by dropping unnecessary `for` loops
tests: apply modern idiom for exiting loop upon failure
tests: apply modern idiom for signaling test failure
tests: fix broken &&-chains in `{...}` groups
tests: fix broken &&-chains in `$(...)` command substitutions
tests: fix broken &&-chains in compound statements
tests: use test_write_lines() to generate line-oriented output
tests: simplify construction of large blocks of text
t9107: use shell parameter expansion to avoid breaking &&-chain
t6300: make `%(raw:size) --shell` test more robust
t5516: drop unnecessary subshell and command invocation
t4202: clarify intent by creating expected content less cleverly
t1020: avoid aborting entire test script when one test fails
t1010: fix unnoticed failure on Windows
t/lib-pager: use sane_unset() to avoid breaking &&-chain
Failures within `for` and `while` loops can go unnoticed if not detected
and signaled manually since the loop itself does not abort when a
contained command fails, nor will a failure necessarily be detected when
the loop finishes since the loop returns the exit code of the last
command it ran on the final iteration, which may not be the command
which failed. Therefore, detect and signal failures manually within
loops using the idiom `|| return 1` (or `|| exit 1` within subshells).
Signed-off-by: Eric Sunshine <sunshine@sunshineco.com>
Reviewed-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
One of the tests in t5319 corrupts the checksum of the midx file by
writing a single 0xff over the final byte, and then confirms that we
detect the problem. This usually works fine, but would break if the
actual checksum ended with that same byte already.
It seems like this should happen in 1 out of 256 test runs, but it
turns out to happen less often in practice. The contents of the midx are mostly
deterministic because it's based on the objects, and we remove most
sources of randomness by setting GIT_COMMITTER_DATE, etc. However,
there's still some randomness: some objects are duplicated between
packs, and the midx must decide which to use, which can be based on
timing.
So very occasionally we can end up with a real 0xff byte, and the test
fails. The most robust fix would be to read out the final byte and then
change it to something else (e.g., adding 1 mod 256). But that's awkward
to do in shell. Let's just blindly corrupt 10 bytes instead of 1, which
reduces our chances of an accidental noop to 1 in 2^80.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It is wrong to read some settings directly from the config
subsystem, as things like feature.experimental can affect their
default values.
* gc/use-repo-settings:
gc: perform incremental repack when implicitly enabled
fsck: verify multi-pack-index when implicitly enabled
fsck: verify commit graph when implicitly enabled
"git repack" has been taught to generate multi-pack reachability
bitmaps.
* tb/repack-write-midx:
test-read-midx: fix leak of bitmap_index struct
builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
builtin/repack.c: make largest pack preferred
builtin/repack.c: support writing a MIDX while repacking
builtin/repack.c: extract showing progress to a variable
builtin/repack.c: rename variables that deal with non-kept packs
builtin/repack.c: keep track of existing packs unconditionally
midx: preliminary support for `--refs-snapshot`
builtin/multi-pack-index.c: support `--stdin-packs` mode
midx: expose `write_midx_file_only()` publicly
Like the previous commit, change fsck to check the
"core_multi_pack_index" variable set in "repo-settings.c" instead of
reading the "core.multiPackIndex" config variable. This fixes a bug
where we wouldn't verify midx if the config key was missing. This bug
was introduced in 18e449f86b (midx: enable core.multiPackIndex by
default, 2020-09-25) where core.multiPackIndex was turned on by default.
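A sketch of the pattern (prepare_repo_settings() fills in defaults,
including those influenced by feature.experimental, before the flag is
read):

    prepare_repo_settings(the_repository);
    if (the_repository->settings.core_multi_pack_index) {
        /* spawn "git multi-pack-index verify", as before */
    }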
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Glen Choo <chooglen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Regression in "git commit-graph" command line parsing has been
corrected.
* tb/commit-graph-usage-fix:
builtin/multi-pack-index.c: disable top-level --[no-]progress
builtin/commit-graph.c: don't accept common --[no-]progress
To power a new `--write-midx` mode, `git repack` will want to write a
multi-pack index containing a certain set of packs in the repository.
This new option will be used by `git repack` to write a MIDX which
contains only the packs which will survive after the repack (that is, it
will exclude any packs which are about to be deleted).
This patch effectively exposes the function implemented in the previous
commit via the `git multi-pack-index` builtin. An alternative approach
would have been to call that function from the `git repack` builtin
directly, but this introduces awkward problems around closing and
reopening the object store, so the MIDX will be written out-of-process.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a similar spirit as the previous patch, let sub-commands which
support showing or hiding a progress meter handle parsing the
`--progress` or `--no-progress` option, but do not expose it as an
option to the top-level `multi-pack-index` builtin.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This test is specifically about generating a midx still respecting a
pack-based bitmap file. Generating a MIDX bitmap would confuse the test.
Let's override the 'GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP' variable to
make sure we don't do so.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Opening multiple instances of the same MIDX can lead to problems like two
separate packed_git structures which represent the same pack being added
to the repository's object store.
The above scenario can happen because prepare_midx_pack() checks if
`m->packs[pack_int_id]` is NULL in order to determine if a pack has been
opened and installed in the repository before. But a caller can
construct two copies of the same MIDX by calling get_multi_pack_index()
and load_multi_pack_index() since the former manipulates the
object store directly but the latter is a lower-level routine which
allocates a new MIDX for each call.
So if prepare_midx_pack() is called on multiple MIDXs with the same
pack_int_id, then that pack will be installed twice in the object
store's packed_git pointer.
This can lead to problems in, e.g., the pack-bitmap code, which does
something like the following (in pack-bitmap.c:open_pack_bitmap()):
struct bitmap_index *bitmap_git = ...;
for (p = get_all_packs(r); p; p = p->next) {
if (open_pack_bitmap_1(bitmap_git, p) == 0)
ret = 0;
}
which is a problem if two copies of the same pack exist in the
packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a
conditional like the following:
if (bitmap_git->pack || bitmap_git->midx) {
/* ignore extra bitmap file; we can only handle one */
warning("ignoring extra bitmap file: %s", packfile->pack_name);
close(fd);
return -1;
}
Avoid this scenario by not letting write_midx_internal() open a MIDX
that isn't also pointed at by the object store. So long as this is the
case, other routines should prefer to open MIDXs with
get_multi_pack_index() or reprepare_packed_git() instead of creating
instances on their own. Because get_multi_pack_index() returns
`r->object_store->multi_pack_index` if it is non-NULL, we'll only have
one instance of a MIDX open at one time, avoiding these problems.
To encourage this, drop the `struct multi_pack_index *` parameter from
`write_midx_internal()`, and rely instead on the `object_dir` to find
(or initialize) the correct MIDX instance.
Likewise, replace the call to `close_midx()` with
`close_object_store()`, since we're about to replace the MIDX with a new
one and should invalidate the object store's memory of any MIDX that
might have existed beforehand.
Note that this now forbids passing object directories that don't belong
to alternate repositories over `--object-dir`, since before we would
have happily opened a MIDX in any directory, but now restrict ourselves
to only those reachable by `r->objects->multi_pack_index` (and alternate
MIDXs that we can see by walking the `next` pointer).
As far as I can tell, supporting arbitrary directories with
`--object-dir` was a historical accident, since even the documentation
says `<alt>` when referring to the value passed to this option.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The soon-to-be-implemented multi-pack bitmap treats the object in the first
bit position specially by assuming that all objects in the pack it was
selected from are also represented from that pack in the MIDX. In other
words, the pack from which the first object was selected must also have
all of its other objects selected from that same pack in the MIDX in
case of any duplicates.
But this assumption relies on the fact that there is at least one object
in that pack to begin with; otherwise the object in the first bit
position isn't from a preferred pack, in which case we can no longer
assume that all objects in that pack were also selected from the same
pack.
Guard this assumption by checking the number of objects in the given
preferred pack, and failing if the given pack is empty.
To make sure we can safely perform this check, open any packs which are
contained in an existing MIDX via prepare_midx_pack(). The same is done
for new packs via the add_pack_to_midx() callback, but packs picked up
from a previous MIDX will not yet have these opened.
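A sketch of the guard (assuming the preferred pack has already been
located and opened, as described above):

    struct packed_git *preferred = ctx.info[ctx.preferred_pack_idx].p;

    if (!preferred->num_objects) {
        error(_("cannot select preferred pack %s with no objects"),
              preferred_pack_name);
        ret = 1;
        goto cleanup;
    }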
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
If using --object-dir to point into an object directory which belongs to
a different repository than the one in the current working directory,
such as:
git init repo
git -C repo ... # add some objects
cd alternate
git multi-pack-index --object-dir ../repo/.git/objects write
the binary will segfault trying to access the object-dir via the repo
it found, which is not fully initialized. Worse, if we later call
clear_midx_files_ext(), we will use `the_repository` and remove files
out of the wrong object directory.
Fix this by using the given object_dir (or the object directory of
`the_repository` if `--object-dir` wasn't given) to properly clean up
the *.rev files, avoiding the crash.
Original-patch-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The multi-pack-index command supports working with arbitrary object
directories via the `--object-dir` flag. Though this has historically
worked in arbitrary repositories (including when the command itself was
run outside of a Git repository), this has been somewhat of an accident.
For example, running:
git multi-pack-index write --object-dir=/path/to/repo/objects
outside of a Git repository causes a BUG(). This is because the
top-level `cmd_multi_pack_index()` function stops parsing when it sees
"write", and then fills in the default object directory (the result of
calling `get_object_directory()`) before handing off to
`cmd_multi_pack_index_write()`. But there is no repository to
initialize, and so calling `get_object_directory()` results in a BUG()
(indicating that the current repository is not initialized).
Another case where this doesn't quite work as expected is when operating
in a SHA-256 repository. To see the failure, try this in your shell:
git init --object-format=sha256 repo
git -C repo commit --allow-empty -m base
git -C repo repack -d
git multi-pack-index --object-dir=$(pwd)/repo/.git/objects write
and observe that we cannot open the `.idx` file in "repo", because the
outermost process assumes that any repository that it works in also uses
the default value of `the_hash_algo` (at the time of writing, SHA-1).
There may be compelling reasons for trying to work around these bugs,
but working in arbitrary `--object-dir`'s is non-standard enough (and
likewise, these bugs prevalent enough) that I don't think any workflows
would be broken by abandoning this behavior.
Accordingly, restrict the `multi-pack-index` builtin to only work when
inside of a Git repository (i.e., its main utility becomes selecting
which alternate to operate in), which avoids both of the bugs above.
(Note that you can still trigger a bug when writing a MIDX in an
alternate which does not use the same object format as the repository
which it is an alternate of, but that is an unrelated bug to this one).
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The code that gives an error message in "git multi-pack-index" when
no subcommand is given tried to print a NULL pointer as a string,
which has been corrected.
* tb/reverse-midx:
multi-pack-index: fix potential segfault without sub-command
Since cd57bc41bb (builtin/multi-pack-index.c: display usage on
unrecognized command, 2021-03-30) we have used a "usage" label to avoid
having two separate callers of usage_with_options (one when no arguments
are given, and another for unrecognized sub-commands).
But the first caller has been broken since cd57bc41bb, since it will
happily jump to usage without arguments, and then pass argv[0] to the
"unrecognized subcommand" error.
Many compilers will save us from a segfault here, but the end result is
ugly, since it mentions an unrecognized subcommand when we didn't even
pass one, and (on GCC) includes "(null)" in its output.
Move the "usage" label down past the error about unrecognized
subcommands so that it is only triggered when it should be. While we're
at it, bulk up our test coverage in this area, too.
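The reordered code looks roughly like this:

    if (!argc)
        goto usage;

    /* ... dispatch recognized sub-commands, returning on a match ... */

    error(_("unrecognized subcommand: %s"), argv[0]);
usage:
    usage_with_options(builtin_multi_pack_index_usage, options);

so the "unrecognized subcommand" error (and its argv[0] dereference) is
reached only when a sub-command was actually given.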
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
'git multi-pack-index verify' inspects the data in an existing MIDX for
correctness by checking that the recorded object offsets are correct,
and so on.
But it does not check that the file's trailing checksum matches the data
that it records. So, if an on-disk corruption happened to occur in the
final few bytes (and all other data was recorded correctly), we would:
- get a clean result from 'git multi-pack-index verify', but
- be unable to reuse the existing MIDX when writing a new one (since
we now check for checksum mismatches before reusing a MIDX)
Teach the 'verify' sub-command to recognize corruption in the checksum
by calling midx_checksum_valid().
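midx_checksum_valid() itself can be a thin wrapper over the hashfile
API, roughly:

    int midx_checksum_valid(struct multi_pack_index *m)
    {
        return hashfile_checksum_valid(m->data, m->data_len);
    }

i.e., re-hash everything up to the trailing checksum and compare the
result against the recorded trailer.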
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When writing a new multi-pack index, Git tries to reuse as much of the
data from an existing MIDX as possible, like object offsets. This is
done to avoid re-opening a bunch of *.idx files unnecessarily, but can
lead to problems if the data we are reusing is corrupt.
That's because we'll blindly reuse data from an existing MIDX without
checking its trailing checksum for validity. So if there is memory
corruption while writing a MIDX, or disk corruption in the intervening
period between writing and reuse, we'll blindly propagate those bad
values forward.
Suppose we experience a memory corruption while writing a MIDX such that
we write an incorrect object offset (or alternatively, the disk corrupts
the data after being written, but before being reused). Then when we go
to write a new MIDX, we'll reuse the bad object offset without checking
its validity. This means that the MIDX we just wrote is broken, but its
trailing checksum is intact, since we never bothered to look at the
values before writing.
In the above, a "git multi-pack-index verify" would have caught the
problem before writing, but writing a new MIDX wouldn't have noticed
anything wrong, blindly carrying forward the corrupt offset.
Individual pack indexes check their validity by verifying the crc32
attached to each entry when carrying data forward during a repack.
We could solve this problem for MIDXs in the same way, but individual
crc32's don't make much sense, since their entries are so small.
Likewise, checking the whole file on every read may be prohibitively
expensive if a repository has a lot of objects, packs, or both.
But we can check the trailing checksum when reusing an existing MIDX
when writing a new one. And a corrupt MIDX need not stop us from writing
a new one, since we can just avoid reusing the existing one at all and
pretend as if we are writing a new MIDX from scratch.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When multiple packs in the multi-pack index contain the same object, the
MIDX machinery must make a choice about which pack it associates with
that object. Prior to this patch, the lowest-ordered[1] pack was always
selected.
Pack selection for duplicate objects is relatively unimportant today,
but it will become important for multi-pack bitmaps. This is because we
can only invoke the pack-reuse mechanism when all of the bits for reused
objects come from the reuse pack (in order to ensure that all reused
deltas can find their base objects in the same pack).
To encourage the pack selection process to prefer one pack over another
(the pack to be preferred is the one a caller would like to later use as
a reuse pack), introduce the concept of a "preferred pack". When
provided, the MIDX code will always prefer an object found in a
preferred pack over any other.
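In the sorting step that deduplicates objects across packs, the
preference can be expressed as an extra tie-breaker in the entry
comparator; a sketch with illustrative field names:

    static int midx_oid_compare(const void *_a, const void *_b)
    {
        const struct pack_midx_entry *a = _a;
        const struct pack_midx_entry *b = _b;
        int cmp = oidcmp(&a->oid, &b->oid);

        if (cmp)
            return cmp;
        if (a->preferred > b->preferred) /* preferred entries sort first */
            return -1;
        if (a->preferred < b->preferred)
            return 1;
        return a->pack_int_id - b->pack_int_id;
    }

Duplicates are adjacent after sorting and the first copy wins, so an
object present in the preferred pack is always selected from it.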
No format changes are required to store the preferred pack, since it
will be able to be inferred with a corresponding MIDX bitmap, by looking
up the pack associated with the object in the first bit position (this
ordering is described in detail in a subsequent commit).
[1]: the ordering is specified by MIDX internals; for our purposes we
can consider the "lowest ordered" pack to be "the one with the
most-recent mtime".
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* ds/chunked-file-api:
commit-graph.c: display correct number of chunks when writing
chunk-format: add technical docs
chunk-format: restore duplicate chunk checks
midx: use 64-bit multiplication for chunk sizes
midx: use chunk-format read API
commit-graph: use chunk-format read API
chunk-format: create read chunk API
midx: use chunk-format API in write_midx_internal()
midx: drop chunk progress during write
midx: return success/failure in chunk write methods
midx: add num_large_offsets to write_midx_context
midx: add pack_perm to write_midx_context
midx: add entries to write_midx_context
midx: use context in write_midx_pack_names()
midx: rename pack_info to write_midx_context
commit-graph: use chunk-format write API
chunk-format: create chunk format write API
commit-graph: anonymize data in chunk_write_fn
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.
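A sketch of the resulting midx-loading code (simplified):

    if (read_table_of_contents(cf, m->data, midx_size,
                               MIDX_HEADER_SIZE, m->num_chunks))
        goto cleanup_fail;

    if (pair_chunk(cf, MIDX_CHUNKID_PACKNAMES,
                   &m->chunk_pack_names) == CHUNK_NOT_FOUND)
        die(_("multi-pack-index missing required pack-name chunk"));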
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the next patch, we'll add support for unconditionally enabling the
'pack.writeReverseIndex' setting with a new GIT_TEST_WRITE_REV_INDEX
environment variable.
This causes a little bit of fallout with tests that, for example,
compare the list of files in the pack directory and are unprepared to
see .rev files in the output.
Those locations can be cleaned up to look for specific file extensions,
rather than take everything in the pack directory (for instance) and
then grep out unwanted items.
Once the pack.writeReverseIndex option has been thoroughly
tested, we will default it to 'true', removing GIT_TEST_WRITE_REV_INDEX,
and making it possible to revert this patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Processes that access packdata while the .idx file gets removed
(e.g. while repacking) did not fail or fall back gracefully as they
could.
* tb/idx-midx-race-fix:
midx.c: protect against disappearing packs
packfile.c: protect against disappearing indexes
When a packed object is stored in a multi-pack index, but that pack has
racily gone away, the MIDX code simply calls die(), when it could be
returning an error to the caller, which would in turn lead to
re-scanning the pack directory.
A pack can racily disappear, for example, due to a simultaneous 'git
repack -ad'.
You can also reproduce this with two terminals, where one is running:
git init
while true; do
git commit -q --allow-empty -m foo
git repack -ad
git multi-pack-index write
done
(in effect, constantly writing new MIDXs), and the other is running:
obj=$(git rev-parse HEAD)
while true; do
echo $obj | git cat-file --batch-check='%(objectsize:disk)' || break
done
That will sometimes hit the "error preparing packfile from
multi-pack-index" message, which this patch fixes.
Right now, that path to discovering a missing pack looks something like
'find_pack_entry()' calling 'fill_midx_entry()' and eventually making
its way to call 'nth_midxed_pack_entry()'.
'nth_midxed_pack_entry()' already checks 'is_pack_valid()' and
propagates an error if the pack is invalid. So, this works if the pack
has gone away between calling 'prepare_midx_pack()' and before calling
'is_pack_valid()', but not if it disappears before then.
Catch the case where the pack has already disappeared before
'prepare_midx_pack()' by returning an error in that case, too.
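A sketch of the change in 'fill_midx_entry()' (simplified):

    if (prepare_midx_pack(r, m, pack_int_id))
        return 0; /* pack went away; let the caller re-scan */
    p = m->packs[pack_int_id];
    if (!is_pack_valid(p))
        return 0; /* as before: pack vanished after being prepared */

Returning an error instead of die()-ing lets 'find_pack_entry()' fall
back to re-scanning the pack directory.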
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 17c35c8969 (packfile: skip loading index if in multi-pack-index,
2018-07-12) we stopped loading the .idx file for packs that are
contained within a multi-pack index.
This saves us the effort of loading an .idx and doing some lightweight
validity checks by way of 'packfile.c:load_idx()', but introduces a race
between processes that need to load the index (e.g., to generate a
reverse index) and processes that can delete the index.
For example, running the following in your shell:
$ git init repo && cd repo
$ git commit --allow-empty -m 'base'
$ git repack -ad && git multi-pack-index write
followed by:
$ rm -f .git/objects/pack/pack-*.idx
$ git rev-parse HEAD | git cat-file --batch-check='%(objectsize:disk)'
will result in a segfault prior to this patch. What's happening here is
that we notice that the pack is in the multi-pack index, and so don't
check that it still has a .idx. When we then try and load that index to
generate a reverse index, we don't have it, so the call to
'find_pack_revindex()' in 'packfile.c:packed_object_info()' returns
NULL, and then dereferencing it causes a segfault.
Of course, we don't ever expect someone to remove the index file by
hand, or to be in a state where we never wrote it to begin with (yet
find that pack in the multi-pack-index). But, this can happen in a
timing race with 'git repack -ad', which removes all existing packs
after writing a new pack containing all of their objects.
Avoid this by reverting the hunk of 17c35c8969 which stops loading the
index when the pack is contained in a MIDX. This makes the latter half
of 17c35c8969 useless, since we'll always have a non-NULL
'p->index_data', in which case that if statement isn't guarding
anything.
These two together effectively revert 17c35c8969, and avoid the race
explained above.
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The previous change cleaned up loose objects using the
'loose-objects' task, which can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.
One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.
Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index has facilitated
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).
The 'incremental-repack' task runs the following steps:
1. 'git multi-pack-index write' creates a multi-pack-index file if
one did not exist, and otherwise will update the multi-pack-index
with any new pack-files that appeared since the last write. This
is particularly relevant with the background fetch job.
When the multi-pack-index sees two copies of the same object, it
stores the offset data into the newer pack-file. This means that
some old pack-files could become "unreferenced" which I will use
to mean "a pack-file that is in the pack-file list of the
multi-pack-index but none of the objects in the multi-pack-index
reference a location inside that pack-file."
2. 'git multi-pack-index expire' deletes any unreferenced pack-files
and updates the multi-pack-index to drop those pack-files from the
list. This is safe to do as concurrent Git processes will see the
multi-pack-index and not open those packs when looking for object
contents. (Similar to the 'loose-objects' job, there are some Git
commands that open pack-files regardless of the multi-pack-index,
but they are rarely used. Further, a user that self-selects to
use background operations would likely refrain from using those
commands.)
3. 'git multi-pack-index repack --batch-size=<size>' collects a set
of pack-files that are listed in the multi-pack-index and creates
a new pack-file containing the objects whose offsets are listed
by the multi-pack-index to be in those pack-files. The set of pack-
files is selected greedily by sorting the pack-files by modified
time and adding a pack-file to the set if its "expected size" is
smaller than the batch size until the total expected size of the
selected pack-files is at least the batch size. The "expected
size" is calculated by taking the size of the pack-file divided
by the number of objects in the pack-file and multiplied by the
number of objects from the multi-pack-index with offset in that
pack-file. The expected size approximates how much data from that
pack-file will contribute to the resulting pack-file size (see the
sketch below). The intention is that the resulting pack-file will be
close in size to the provided batch size.
The next run of the incremental-repack task will delete these
repacked pack-files during the 'expire' step.
In this version, the batch size is set to "0" which ignores the
size restrictions when selecting the pack-files. It instead
selects all pack-files and repacks all packed objects into a
single pack-file. This will be updated in the next change, but
it requires doing some calculations that are better isolated to
a separate change.
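A sketch of the "expected size" computation referenced above (names
illustrative; st_mult() guards the multiplication against overflow):

    size_t expected_size;

    /* bytes this pack is expected to contribute to the new pack */
    expected_size = st_mult(p->pack_size, referenced_objects);
    expected_size /= p->num_objects;

    if (expected_size < batch_size) {
        /* select this pack, stopping once the running total
         * of expected sizes reaches the batch size */
        total_size += expected_size;
    }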
These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29)).
These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.
By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that the multi-pack-index may be written as part of auto maintenance
at the end of a command, reduce the progress output when the operations
are quick. Use start_delayed_progress() instead of start_progress().
Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that
the progress indicators are conditional.
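A sketch of the change (the progress title is illustrative):

    if (flags & MIDX_PROGRESS)
        ctx.progress = start_delayed_progress(
                _("Writing chunks to multi-pack-index"), num_chunks);

start_delayed_progress() only shows the meter if the operation takes
longer than a couple of seconds (tunable via GIT_PROGRESS_DELAY), so
quick maintenance runs stay silent.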
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When a packfile is removed by "git repack", multi-pack-index gets
cleared; the code was taught to do so less aggressively by first
checking if the midx actually refers to a pack that no longer
exists.
* tb/repack-clearing-midx:
midx: traverse the local MIDX first
builtin/repack.c: invalidate MIDX only when necessary
In 525e18c04b (midx: clear midx on repack, 2018-07-12), 'git repack'
learned to remove a multi-pack-index file if it added or removed a pack
from the object store.
This mechanism is a little over-eager, since it is only necessary to
drop a MIDX if 'git repack' removes a pack that the MIDX references.
Adding a pack outside of the MIDX does not require invalidating the
MIDX, and likewise for removing a pack the MIDX does not know about.
Teach 'git repack' to check for this by loading the MIDX, and checking
whether the to-be-removed pack is known to the MIDX. This requires a
slightly odd alteration to a test in t5319, which is explained with a
comment. A new test is added to show that the MIDX is left alone when
both packs known to it are marked as .keep, but two packs unknown to it
are removed and combined into one new pack.
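A sketch of the new check in builtin/repack.c (simplified; `buf` holds
the name of a pack about to be removed):

    if (!midx_cleared && m && midx_contains_pack(m, buf.buf)) {
        clear_midx_file(the_repository);
        midx_cleared = 1; /* invalidate the MIDX at most once */
    }

Packs the MIDX has never heard of can be added or removed without
touching it.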
Helped-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>