mirror of https://github.com/git/git synced 2024-07-05 00:58:49 +00:00

Documentation/technical: describe multi-pack reverse indexes

As a prerequisite to implementing multi-pack bitmaps, motivate and
describe the format and ordering of the multi-pack reverse index.

The subsequent patch will implement reading this format, and the patch
after that will implement writing it while producing a multi-pack index.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Taylor Blau 2021-03-30 11:04:23 -04:00 committed by Junio C Hamano
parent 62f2c1b509
commit b25fd24c00


@@ -379,3 +379,86 @@ CHUNK DATA:
TRAILER:
Index checksum of the above contents.

== multi-pack-index reverse indexes

Similar to the pack-based reverse index, the multi-pack index can also
be used to generate a reverse index.

Instead of mapping between offset, pack-, and index position, this
reverse index maps between an object's position within the MIDX, and
that object's position within a pseudo-pack that the MIDX describes
(i.e., the ith entry of the multi-pack reverse index holds the MIDX
position of the ith object in pseudo-pack order).

To clarify the difference between these orderings, consider a multi-pack
reachability bitmap (which does not yet exist, but is what we are
building towards here). Each bit needs to correspond to an object in the
MIDX, and so we need an efficient mapping from bit position to MIDX
position.

One solution is to let bits occupy the same position in the oid-sorted
index stored by the MIDX. But because oids are effectively random, the
resulting reachability bitmaps would have no locality, and thus compress
poorly. (This is the reason that single-pack bitmaps use the pack
ordering, and not the .idx ordering, for the same purpose.)
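
To illustrate the locality argument, here is a minimal Python sketch. It
is not Git code: the object numbering, the `fake_oid` helper, and the
assumption that the reachable objects form one contiguous stretch in
pack order are all made up for illustration. It counts runs of equal
bits, which is only a rough stand-in for how well a run-length style
encoding could compress each bitmap.

```python
import hashlib


def count_runs(bits):
    """Count maximal runs of equal bits; fewer runs suit run-length encodings."""
    if not bits:
        return 0
    return 1 + sum(1 for prev, cur in zip(bits, bits[1:]) if prev != cur)


def fake_oid(obj):
    """Hypothetical stand-in for an object ID: SHA-1 of the object's number."""
    return hashlib.sha1(str(obj).encode()).hexdigest()


# Assumption: 1000 objects numbered by pack position, the first 400 reachable.
pack_order = list(range(1000))
reachable = set(range(400))

# Bitmap laid out in pack order: reachable objects sit next to each other.
pack_bits = [1 if obj in reachable else 0 for obj in pack_order]

# Bitmap laid out in oid order: the same objects, scattered by their hash.
oid_order = sorted(pack_order, key=fake_oid)
oid_bits = [1 if obj in reachable else 0 for obj in oid_order]

pack_runs = count_runs(pack_bits)
oid_runs = count_runs(oid_bits)
print("runs in pack order:", pack_runs)
print("runs in oid order:", oid_runs)
```

Under these assumptions the pack-ordered bitmap is one run of set bits
followed by one run of clear bits, while the oid-ordered bitmap breaks
into many short runs.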

So we'd like to define an ordering for the whole MIDX based around
pack ordering, which has far better locality (and thus compresses more
efficiently). We can think of a pseudo-pack created by the concatenation
of all of the packs in the MIDX. E.g., if we had a MIDX with three packs
(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an
ordering of the objects like:

    |a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|

where the ordering of the packs is defined by the MIDX's pack list,
and then the ordering of objects within each pack is the same as the
order in the actual packfile.

Given the list of packs and their counts of objects, you can
naïvely reconstruct that pseudo-pack ordering (e.g., counting positions
from zero, the object at position 27 must be (c,2) because packs "a"
and "b" consumed the first 25 slots). But there's a catch. Objects may
be duplicated between packs, in which case the MIDX only stores one
pointer to the object (and thus we'd want only one slot in the bitmap).
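
The naive reconstruction can be sketched as a prefix-sum lookup. This is
an illustration only: `naive_lookup`, `pack_names`, and `object_counts`
are hypothetical names, and both pseudo-pack positions and in-pack
indexes are counted from zero, as in the diagram.

```python
from bisect import bisect_right
from itertools import accumulate


def naive_lookup(pack_names, object_counts, position):
    """Map a 0-based pseudo-pack position to (pack name, 0-based index in pack).

    Assumes no object appears in more than one pack.
    """
    ends = list(accumulate(object_counts))  # running totals, e.g. [10, 25, 45]
    total = ends[-1] if ends else 0
    if not 0 <= position < total:
        raise IndexError(f"position {position} is outside the pseudo-pack")
    pack = bisect_right(ends, position)      # first pack whose end is past position
    start = ends[pack] - object_counts[pack]  # slots used by the earlier packs
    return pack_names[pack], position - start


names, counts = ["a", "b", "c"], [10, 15, 20]
print(naive_lookup(names, counts, 0))   # ('a', 0)
print(naive_lookup(names, counts, 10))  # ('b', 0)
print(naive_lookup(names, counts, 25))  # ('c', 0)
```

The arithmetic only holds while every slot is distinct. Once an object
is present in two packs, the MIDX holds fewer than 45 entries for this
example and the computed positions no longer line up with bit positions.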

Callers could handle duplicates themselves by reading objects in order
of their bit-position, but that's linear in the number of objects, and
much too expensive for ordinary bitmap lookups. Building a reverse index
solves this, since it is the logical inverse of the index, and that
index has already removed duplicates. But building a reverse index on
the fly can be expensive. Since we already have an on-disk format for
pack-based reverse indexes, let's reuse it for the MIDX's pseudo-pack,
too.

Objects from the MIDX are ordered as follows to string together the
pseudo-pack. Let `pack(o)` return the pack from which `o` was selected
by the MIDX, and define an ordering of packs based on their numeric ID
(as stored by the MIDX). Let `offset(o)` return the object offset of `o`
within `pack(o)`. Then, compare `o1` and `o2` as follows:

- If one of `pack(o1)` and `pack(o2)` is preferred and the other
is not, then the preferred one sorts first.
+
(This is a detail that allows the MIDX bitmap to determine which
pack should be used by the pack-reuse mechanism, since it can ask
the MIDX for the pack containing the object at bit position 0).

- If `pack(o1) ≠ pack(o2)`, then sort the two objects in ascending
order based on the pack ID (i.e., in the order of the MIDX's pack
list, as in the diagram above).

- Otherwise, `pack(o1) = pack(o2)`, and the objects are sorted in
pack-order (i.e., `o1` sorts ahead of `o2` exactly when `offset(o1)
< offset(o2)`).

In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
objects in packs stored by the MIDX, laid out in pack order, and the
packs arranged in MIDX order (with the preferred pack coming first).
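
The comparison rules can be sketched as a sort key in Python. This is a
hedged illustration, not the implementation: `Entry`,
`build_reverse_index`, and `invert` are hypothetical names, and the oids
and offsets are made up. The sketch also assumes that the non-preferred
packs are arranged in the order of the MIDX's pack list (lower pack ID
first), matching the a, b, c diagram and the "MIDX order" of the summary
above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    oid: str      # object ID (short made-up hex strings in this sketch)
    pack_id: int  # numeric ID of the pack the MIDX selected for the object
    offset: int   # offset of the object within that pack


def build_reverse_index(entries, preferred_pack):
    """Return rev, where rev[i] is the MIDX position of the i-th object
    in pseudo-pack order."""
    # MIDX position = index in the oid-sorted list. The MIDX keeps a single
    # entry per object, so duplicates are already gone at this point.
    by_oid = sorted(entries, key=lambda e: e.oid)

    def pseudo_pack_key(midx_pos):
        e = by_oid[midx_pos]
        # False sorts before True, so the preferred pack comes first; then
        # pack ID (assumed: lower first), then offset within the pack.
        return (e.pack_id != preferred_pack, e.pack_id, e.offset)

    return sorted(range(len(by_oid)), key=pseudo_pack_key)


def invert(rev):
    """Map each MIDX position to its pseudo-pack (bit) position."""
    inv = [0] * len(rev)
    for bit_pos, midx_pos in enumerate(rev):
        inv[midx_pos] = bit_pos
    return inv


entries = [
    Entry("1a", pack_id=0, offset=12),
    Entry("9c", pack_id=0, offset=300),
    Entry("4f", pack_id=1, offset=12),
    Entry("e2", pack_id=1, offset=150),
    Entry("07", pack_id=2, offset=12),
]
rev = build_reverse_index(entries, preferred_pack=1)
print(rev)          # [2, 4, 1, 3, 0]
print(invert(rev))  # [4, 2, 0, 3, 1]
```

Here `rev[0]` is MIDX position 2 (oid `4f`), the first object of the
preferred pack, so bit position 0 identifies the preferred pack as the
first rule intends.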

Finally, note that the MIDX's reverse index is not stored as a chunk in
the multi-pack-index itself. This is done because the reverse index
includes the checksum of the pack or MIDX to which it belongs. The
MIDX's checksum covers the MIDX's own contents, so a chunk inside the
MIDX cannot contain that checksum. To avoid races when rewriting
the MIDX, a MIDX reverse index includes the MIDX's checksum in its
filename (e.g., `multi-pack-index-xyz.rev`).
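
As a rough sketch of the naming scheme, a reader could derive the
reverse index filename from the MIDX it has just opened. The helper
`midx_rev_filename` is hypothetical, and the sketch assumes that the
trailing checksum is a 20-byte hash and that the filename uses its
hexadecimal form.

```python
def midx_rev_filename(midx_bytes, hash_len=20):
    """Build the .rev filename from the checksum in the MIDX trailer.

    Assumptions: the trailer is the last `hash_len` bytes of the MIDX, and
    the filename embeds the checksum as lowercase hex.
    """
    if len(midx_bytes) < hash_len:
        raise ValueError("MIDX data is shorter than its trailing checksum")
    checksum = midx_bytes[-hash_len:]
    return f"multi-pack-index-{checksum.hex()}.rev"


# Made-up MIDX contents: some payload followed by a 20-byte trailer.
fake_midx = b"MIDX" + b"payload" + bytes(range(20))
print(midx_rev_filename(fake_midx))
```

Because the name is tied to the checksum, a reader holding an old MIDX
looks for the old `.rev` file and never pairs it with a reverse index
written for a newer MIDX.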