Commit graph

10 commits

Author SHA1 Message Date
Taylor Blau 7f514b7a5e midx: read RIDX chunk when present
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index
is read from it instead of the on-disk .rev file. Since we need to
encode the object order in the MIDX itself for correctness reasons,
there is no point in storing the same data again outside of the MIDX.

So, this patch stops writing separate .rev files, and reads it out of
the MIDX itself. This is possible to do with relatively little new code,
since the format of the RIDX chunk is identical to the data in the .rev
file. In other words, we can implement this by pointing the
`revindex_data` field at the reverse index chunk of the MIDX instead of
the .rev file without any other changes.

Note that we have two knobs that are adjusted for the new tests:
GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls
whether the MIDX .rev is written at all, and the latter controls whether
we read the MIDX's RIDX chunk.

Both are necessary to ensure that the test added at the beginning of
this series continues to work. This is because we always need to write
the RIDX chunk in the MIDX in order to change its checksum, but we want
to make sure reading the existing .rev file still works (since the RIDX
chunk takes precedence by default).

Arguably this isn't a very interesting mode to test, because the
precedence rules mean that we'll always read the RIDX chunk over the
.rev file. But it makes it impossible for a user to induce corruption in
their repository by adjusting the test knobs (since if we had an
either/or knob they could stop writing the RIDX chunk, allowing them to
tweak the MIDX's object order without changing its checksum).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:53 -08:00
Taylor Blau 791170fa2b t5326: move tests to t/lib-bitmap.sh
In t5326, we have a handful of tests that we would like to run twice:
once using the MIDX's new `RIDX` chunk as the source of the
reverse-index cache, and once using the separate `.rev` file.

But because these tests mutate the state of the underlying repository,
and then make assumptions about those mutations occurring in a certain
sequence, simply running the tests twice in the same repository is
awkward.

Instead, extract the core of interesting tests into t/lib-bitmap.sh to
prepare for them to be run twice, each in a separate test script. This
means that they can each operate on a separate repository, removing any
concerns about mutating state.

For now, this patch is a strict cut-and-paste of some tests from t5326.
The tests which did not move are not interesting with respect to the
source of their reverse index data.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:53 -08:00
Taylor Blau f0ed59afcc t5326: extract test_rev_exists
To determine which source of data is used for the MIDX's reverse index
cache, introduce a helper which forces loading the reverse index, and
then looks for the special trace2 event introduced in a previous commit.

For now, this helper just looks for when the legacy MIDX .rev file was
loaded, but in a subsequent commit will become parameterized over the
the reverse index's source.

This function replaces checking for the existence of the .rev file. We
could write a similar helper to ensure that the .rev file is cleaned up
after repacking, but it will make subsequent tests more difficult to
write, and provides marginal value since we already check that the MIDX
.bitmap file is removed.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:53 -08:00
Taylor Blau 90a8ea47d8 t5326: drop unnecessary setup
The core.multiPackIndex config became true by default back in 18e449f86b
(midx: enable core.multiPackIndex by default, 2020-09-25), so it is no
longer necessary to enable it explicitly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:53 -08:00
Taylor Blau 95e8383bac midx.c: make changing the preferred pack safe
The previous patch demonstrates a bug where a MIDX's auxiliary object
order can become out of sync with a MIDX bitmap.

This is because of two confounding factors:

  - First, the object order is stored in a file which is named according
    to the multi-pack index's checksum, and the MIDX does not store the
    object order. This means that the object order can change without
    altering the checksum.

  - But the .rev file is moved into place with finalize_object_file(),
    which link(2)'s the file into place instead of renaming it. For us,
    that means that a modified .rev file will not be moved into place if
    MIDX's checksum was unchanged.

This fix is to force the MIDX's checksum to change when the preferred
pack changes but the set of packs contained in the MIDX does not. In
other words, when the object order changes, the MIDX's checksum needs to
change with it (regardless of whether the MIDX is tracking the same or
different packs).

This prevents a race whereby changing the object order (but not the
packs themselves) enables a reader to see the new .rev file with the old
MIDX, or similarly seeing the new bitmap with the old object order.

But why can't we just stop hardlinking the .rev into place instead
adding additional data to the MIDX? Suppose that's what we did. Then
when we go to generate the new bitmap, we'll load the old MIDX bitmap,
along with the MIDX that it references. That's fine, since the new MIDX
isn't moved into place until after the new bitmap is generated. But the
new object order *has* been moved into place. So we'll read the old
bitmaps in the new order when generating the new bitmap file, meaning
that without this secondary change, bitmap generation itself would
become a victim of the race described here.

This can all be prevented by forcing the MIDX's checksum to change when
the object order does. By embedding the entire object order into the
MIDX, we do just that. That is, the MIDX's checksum will change in
response to any perturbation of the underlying object order. In t5326,
this will cause the MIDX's checksum to update (even without changing the
set of packs in the MIDX), preventing the stale read problem.

Note that this makes it safe to continue to link(2) the MIDX .rev file
into place, since it is now impossible to have a .rev file that is
out-of-sync with the MIDX whose checksum it references. (But we will do
away with MIDX .rev files later in this series anyway, so this is
somewhat of a moot point).

In theory, it is possible to store a "fingerprint" of the full object
order here, so long as that fingerprint changes at least as often as the
full object order does. Some possibilities here include storing the
identity of the preferred pack, along with the mtimes of the
non-preferred packs in a consistent order. But storing a limited part of
the information makes it difficult to reason about whether or not there
are gaps between the two that would cause us to get bitten by this bug
again.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:52 -08:00
Taylor Blau 61fd31a179 t5326: demonstrate bitmap corruption after permutation
This patch demonstrates a cause of bitmap corruption that can occur when
the contents of the multi-pack index does not change, but the underlying
object order does.

In this example, we have a MIDX containing two packs, each with a
distinct set of objects (pack A corresponds to the tree, blob, and
commit from the first patch, and pack B corresponds to the second
patch).

First, a MIDX is written where the 'A' pack is preferred. As expected,
the bitmaps generated there are in-tact. But then, we generate an
identical MIDX with a different object order: this time preferring pack
'B'.

Due to a bug which will be explained and fixed in the following commit,
the MIDX is updated, but the .rev file is not, causing the .bitmap file
to be read incorrectly. Specifically, the .bitmap file will contain
correct data, but the auxiliary object order in the .rev file is stale,
causing readers to get confused by reading the new bitmaps using the old
object order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:52 -08:00
Junio C Hamano 0b69bb0fb1 Merge branch 'tb/repack-write-midx'
"git repack" has been taught to generate multi-pack reachability
bitmaps.

* tb/repack-write-midx:
  test-read-midx: fix leak of bitmap_index struct
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: rename variables that deal with non-kept packs
  builtin/repack.c: keep track of existing packs unconditionally
  midx: preliminary support for `--refs-snapshot`
  builtin/multi-pack-index.c: support `--stdin-packs` mode
  midx: expose `write_midx_file_only()` publicly
2021-10-18 15:47:57 -07:00
Taylor Blau 08944d1c22 midx: preliminary support for --refs-snapshot
To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.

This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.

This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).

By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.

This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.

The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).

When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-28 21:20:56 -07:00
Taylor Blau 54156af0d6 t5326: test propagating hashcache values
Now that we both can propagate values from the hashcache, and respect
the configuration to enable the hashcache at all, test that both of
these function correctly by hardening their behavior with a test.

Like the hash-cache in classic single-pack bitmaps, this helps more
proportionally the more up-to-date your bitmap coverage is. When our
bitmap coverage is out-of-date with the ref tips, we spend more time
proportionally traversing, and all of that traversal gets the name-hash
filled in.

But for the up-to-date bitmaps, this helps quite a bit. These numbers
are on git.git, with `pack.threads=1` to help see the difference
reflected in the overall runtime.

    Test                            origin/tb/multi-pack-bitmaps   HEAD
    -------------------------------------------------------------------------------------
    5326.4: simulated clone         1.87(1.80+0.07)                1.46(1.42+0.03) -21.9%
    5326.5: simulated fetch         2.66(2.61+0.04)                1.47(1.43+0.04) -44.7%
    5326.6: pack to file (bitmap)   2.74(2.62+0.12)                1.89(1.82+0.07) -31.0%

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-17 14:34:48 -07:00
Taylor Blau c51f5a6437 t5326: test multi-pack bitmap behavior
This patch introduces a new test, t5326, which tests the basic
functionality of multi-pack bitmaps.

Some trivial behavior is tested, such as:

  - Whether bitmaps can be generated with more than one pack.
  - Whether clones can be served with all objects in the bitmap.
  - Whether follow-up fetches can be served with some objects outside of
    the server's bitmap

These use lib-bitmap's tests (which in turn were pulled from t5310), and
we cover cases where the MIDX represents both a single pack and multiple
packs.

In addition, some non-trivial and MIDX-specific behavior is tested, too,
including:

  - Whether multi-pack bitmaps behave correctly with respect to the
    pack-reuse machinery when the base for some object is selected from
    a different pack than the delta.
  - Whether multi-pack bitmaps correctly respect the
    pack.preferBitmapTips configuration.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 13:56:43 -07:00