From 0074cc299493ff4567641e9af5492eb4799ca0ef Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:10 -0400 Subject: [PATCH 01/26] Documentation/gitpacking.txt: initial commit Introduce a new manual page, gitpacking(7) to collect useful information about advanced packing concepts in Git. In future commits in this series, this manual page will expand to describe the new pseudo-merge bitmaps feature, as well as include examples, relevant configuration bits, use-cases, and so on. Outside of this series, this manual page may absorb similar pieces from other parts of Git's documentation about packing. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/Makefile | 1 + Documentation/gitpacking.txt | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+) create mode 100644 Documentation/gitpacking.txt diff --git a/Documentation/Makefile b/Documentation/Makefile index 3f2383a12c..920b6248aa 100644 --- a/Documentation/Makefile +++ b/Documentation/Makefile @@ -51,6 +51,7 @@ MAN7_TXT += gitdiffcore.txt MAN7_TXT += giteveryday.txt MAN7_TXT += gitfaq.txt MAN7_TXT += gitglossary.txt +MAN7_TXT += gitpacking.txt MAN7_TXT += gitnamespaces.txt MAN7_TXT += gitremote-helpers.txt MAN7_TXT += gitrevisions.txt diff --git a/Documentation/gitpacking.txt b/Documentation/gitpacking.txt new file mode 100644 index 0000000000..50e9900d84 --- /dev/null +++ b/Documentation/gitpacking.txt @@ -0,0 +1,34 @@ +gitpacking(7) +============= + +NAME +---- +gitpacking - Advanced concepts related to packing in Git + +SYNOPSIS +-------- +gitpacking + +DESCRIPTION +----------- + +This document aims to describe some advanced concepts related to packing +in Git. + +Many concepts are currently described scattered between manual pages of +various Git commands, including linkgit:git-pack-objects[1], +linkgit:git-repack[1], and others, as well as linkgit:gitformat-pack[5], +and parts of the `Documentation/technical` tree. + +There are many aspects of packing in Git that are not covered in this +document that instead live in the aforementioned areas. Over time, those +scattered bits may coalesce into this document. + +SEE ALSO +-------- +linkgit:git-pack-objects[1] +linkgit:git-repack[1] + +GIT +--- +Part of the linkgit:git[1] suite From 40864ac9028e379d45f6fe3a3f4b072502bd7631 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:13 -0400 Subject: [PATCH 02/26] Documentation/gitpacking.txt: describe pseudo-merge bitmaps Add some details to the gitpacking(7) manual page which motivate and describe pseudo-merge bitmaps. The exact on-disk format and many of the configuration knobs will be described in subsequent commits. Helped-by: Jeff King Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/gitpacking.txt | 72 ++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) diff --git a/Documentation/gitpacking.txt b/Documentation/gitpacking.txt index 50e9900d84..f24396f017 100644 --- a/Documentation/gitpacking.txt +++ b/Documentation/gitpacking.txt @@ -24,6 +24,78 @@ There are many aspects of packing in Git that are not covered in this document that instead live in the aforementioned areas. Over time, those scattered bits may coalesce into this document. +== Pseudo-merge bitmaps + +NOTE: Pseudo-merge bitmaps are considered an experimental feature, so +the configuration and many of the ideas are subject to change. + +=== Background + +Reachability bitmaps are most efficient when we have on-disk stored +bitmaps for one or more of the starting points of a traversal. For this +reason, Git prefers storing bitmaps for commits at the tips of refs, +because traversals tend to start with those points. + +But if you have a large number of refs, it's not feasible to store a +bitmap for _every_ ref tip. It takes up space, and just OR-ing all of +those bitmaps together is expensive. + +One way we can deal with that is to create bitmaps that represent +_groups_ of refs. When a traversal asks about the entire group, then we +can use this single bitmap instead of considering each ref individually. +Because these bitmaps represent the set of objects which would be +reachable in a hypothetical merge of all of the commits, we call them +pseudo-merge bitmaps. + +=== Overview + +A "pseudo-merge bitmap" is used to refer to a pair of bitmaps, as +follows: + +Commit bitmap:: + + A bitmap whose set bits describe the set of commits included in the + pseudo-merge's "merge" bitmap (as below). + +Merge bitmap:: + + A bitmap whose set bits describe the reachability closure over the set + of commits in the pseudo-merge's "commits" bitmap (as above). An + identical bitmap would be generated for an octopus merge with the same + set of parents as described in the commits bitmap. + +Pseudo-merge bitmaps can accelerate bitmap traversals when all commits +for a given pseudo-merge are listed on either side of the traversal, +either directly (by explicitly asking for them as part of the `HAVES` +or `WANTS`) or indirectly (by encountering them during a fill-in +traversal). + +=== Use-cases + +For example, suppose there exists a pseudo-merge bitmap with a large +number of commits, all of which are listed in the `WANTS` section of +some bitmap traversal query. When pseudo-merge bitmaps are enabled, the +bitmap machinery can quickly determine there is a pseudo-merge which +satisfies some subset of the wanted objects on either side of the query. +Then, we can inflate the EWAH-compressed bitmap, and `OR` it in to the +resulting bitmap. By contrast, without pseudo-merge bitmaps, we would +have to repeat the decompression and `OR`-ing step over a potentially +large number of individual bitmaps, which can take proportionally more +time. + +Another benefit of pseudo-merges arises when there is some combination +of (a) a large number of references, with (b) poor bitmap coverage, and +(c) deep, nested trees, making fill-in traversal relatively expensive. +For example, suppose that there are a large enough number of tags where +bitmapping each of the tags individually is infeasible. Without +pseudo-merge bitmaps, computing the result of, say, `git rev-list +--use-bitmap-index --count --objects --tags` would likely require a +large amount of fill-in traversal. But when a large quantity of those +tags are stored together in a pseudo-merge bitmap, the bitmap machinery +can take advantage of the fact that we only care about the union of +objects reachable from all of those tags, and answer the query much +faster. + SEE ALSO -------- linkgit:git-pack-objects[1] From 2bfc24ecf60e21b4cc4b849ad48a44d42ada36c3 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:16 -0400 Subject: [PATCH 03/26] Documentation/technical: describe pseudo-merge bitmaps format Prepare to implement pseudo-merge bitmaps over the next several commits by first describing the serialization format which will store the new pseudo-merge bitmaps themselves. This format is implemented as an optional extension within the bitmap v1 format, making it compatible with previous versions of Git, as well as the original .bitmap implementation within JGit. The format is described in detail in the patch contents below, but the high-level description is as follows: - An array of pseudo-merge bitmaps, each containing a pair of EWAH bitmaps: one describing the set of pseudo-merge "parents", and another describing the set of object(s) reachable from those parents. - A lookup table to determine which pseudo-merge(s) a given commit appears in. An optional extended lookup table follows when there is at least one commit which appears in multiple pseudo-merge groups. - Trailing metadata, including the number of pseudo-merge(s), number of unique parents, the offset within the .bitmap file for the pseudo-merge commit lookup table, and the size of the optional extension itself. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/technical/bitmap-format.txt | 132 ++++++++++++++++++++++ 1 file changed, 132 insertions(+) diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index f5d200939b..ee7775a258 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -255,3 +255,135 @@ triplet is - xor_row (4 byte integer, network byte order): :: The position of the triplet whose bitmap is used to compress this one, or `0xffffffff` if no such bitmap exists. + +Pseudo-merge bitmaps +-------------------- + +If the `BITMAP_OPT_PSEUDO_MERGES` flag is set, a variable number of +bytes (preceding the name-hash cache, commit lookup table, and trailing +checksum) of the `.bitmap` file is used to store pseudo-merge bitmaps. + +For more information on what pseudo-merges are, why they are useful, and +how to configure them, see the information in linkgit:gitpacking[7]. + +=== File format + +If enabled, pseudo-merge bitmaps are stored in an optional section at +the end of a `.bitmap` file. The format is as follows: + +.... ++-------------------------------------------+ +| .bitmap File | ++-------------------------------------------+ +| | +| Pseudo-merge bitmaps (Variable Length) | +| +---------------------------+ | +| | commits_bitmap (EWAH) | | +| +---------------------------+ | +| | merge_bitmap (EWAH) | | +| +---------------------------+ | +| | ++-------------------------------------------+ +| | +| Lookup Table | +| +---------------------------+ | +| | commit_pos (4 bytes) | | +| +---------------------------+ | +| | offset (8 bytes) | | +| +------------+--------------+ | +| | +| Offset Cases: | +| ------------- | +| | +| 1. MSB Unset: single pseudo-merge bitmap | +| + offset to pseudo-merge bitmap | +| | +| 2. MSB Set: multiple pseudo-merges | +| + offset to extended lookup table | +| | ++-------------------------------------------+ +| | +| Extended Lookup Table (Optional) | +| +----+----------+----------+----------+ | +| | N | Offset 1 | .... | Offset N | | +| +----+----------+----------+----------+ | +| | | 8 bytes | .... | 8 bytes | | +| +----+----------+----------+----------+ | +| | ++-------------------------------------------+ +| | +| Pseudo-merge Metadata | +| +-----------------------------------+ | +| | # pseudo-merges (4 bytes) | | +| +-----------------------------------+ | +| | # commits (4 bytes) | | +| +-----------------------------------+ | +| | Lookup offset (8 bytes) | | +| +-----------------------------------+ | +| | Extension size (8 bytes) | | +| +-----------------------------------+ | +| | ++-------------------------------------------+ +.... + +* One or more pseudo-merge bitmaps, each containing: + + ** `commits_bitmap`, an EWAH-compressed bitmap describing the set of + commits included in the this psuedo-merge. + + ** `merge_bitmap`, an EWAH-compressed bitmap describing the union of + the set of objects reachable from all commits listed in the + `commits_bitmap`. + +* A lookup table, mapping pseudo-merged commits to the pseudo-merges + they belong to. Entries appear in increasing order of each commit's + bit position. Each entry is 12 bytes wide, and is comprised of the + following: + + ** `commit_pos`, a 4-byte unsigned value (in network byte-order) + containing the bit position for this commit. + + ** `offset`, an 8-byte unsigned value (also in network byte-order) + containing either one of two possible offsets, depending on whether or + not the most-significant bit is set. + + *** If unset (i.e. `offset & ((uint64_t)1<<63) == 0`), the offset + (relative to the beginning of the `.bitmap` file) at which the + pseudo-merge bitmap for this commit can be read. This indicates + only a single pseudo-merge bitmap contains this commit. + + *** If set (i.e. `offset & ((uint64_t)1<<63) != 0`), the offset + (again relative to the beginning of the `.bitmap` file) at which + the extended offset table can be located describing the set of + pseudo-merge bitmaps which contain this commit. This indicates + that multiple pseudo-merge bitmaps contain this commit. + +* An (optional) extended lookup table (written if and only if there is + at least one commit which appears in more than one pseudo-merge). + There are as many entries as commits which appear in multiple + pseudo-merges. Each entry contains the following: + + ** `N`, a 4-byte unsigned value equal to the number of pseudo-merges + which contain a given commit. + + ** An array of `N` 8-byte unsigned values, each of which is + interpreted as an offset (relative to the beginning of the + `.bitmap` file) at which a pseudo-merge bitmap for this commit can + be read. These values occur in no particular order. + +* Positions for all pseudo-merges, each stored as an 8-byte unsigned + value (in network byte-order) containing the offset (relative to the + beginning of the `.bitmap` file) of each consecutive pseudo-merge. + +* A 4-byte unsigned value (in network byte-order) equal to the number of + pseudo-merges. + +* A 4-byte unsigned value (in network byte-order) equal to the number of + unique commits which appear in any pseudo-merge. + +* An 8-byte unsigned value (in network byte-order) equal to the number + of bytes between the start of the pseudo-merge section and the + beginning of the lookup table. + +* An 8-byte unsigned value (in network byte-order) equal to the number + of bytes in the pseudo-merge section (including this field). From 10a96af8dc22ec890ffacafca52959bd61bd1a28 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:20 -0400 Subject: [PATCH 04/26] ewah: implement `ewah_bitmap_is_subset()` In order to know whether a given pseudo-merge (comprised of a "parents" and "objects" bitmaps) is "satisfied" and can be OR'd into the bitmap result, we need to be able to quickly determine whether the "parents" bitmap is a subset of the current set of objects reachable on either side of a traversal. Implement a helper function to prepare for that, which determines whether an EWAH bitmap (the parents bitmap from the pseudo-merge) is a subset of a non-EWAH bitmap (in this case, the results bitmap from either side of the traversal). This function makes use of the EWAH iterator to avoid inflating any part of the EWAH bitmap after we determine it is not a subset of the non-EWAH bitmap. This "fail-fast" allows us to avoid a potentially large amount of wasted effort. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- ewah/bitmap.c | 43 +++++++++++++++++++++++++++++++++++++++++++ ewah/ewok.h | 6 ++++++ 2 files changed, 49 insertions(+) diff --git a/ewah/bitmap.c b/ewah/bitmap.c index ac7e0af622..d352fec54c 100644 --- a/ewah/bitmap.c +++ b/ewah/bitmap.c @@ -138,6 +138,49 @@ void bitmap_or(struct bitmap *self, const struct bitmap *other) self->words[i] |= other->words[i]; } +int ewah_bitmap_is_subset(struct ewah_bitmap *self, struct bitmap *other) +{ + struct ewah_iterator it; + eword_t word; + size_t i; + + ewah_iterator_init(&it, self); + + for (i = 0; i < other->word_alloc; i++) { + if (!ewah_iterator_next(&word, &it)) { + /* + * If we reached the end of `self`, and haven't + * rejected `self` as a possible subset of + * `other` yet, then we are done and `self` is + * indeed a subset of `other`. + */ + return 1; + } + if (word & ~other->words[i]) { + /* + * Otherwise, compare the next two pairs of + * words. If the word from `self` has bit(s) not + * in the word from `other`, `self` is not a + * subset of `other`. + */ + return 0; + } + } + + /* + * If we got to this point, there may be zero or more words + * remaining in `self`, with no remaining words left in `other`. + * If there are any bits set in the remaining word(s) in `self`, + * then `self` is not a subset of `other`. + */ + while (ewah_iterator_next(&word, &it)) + if (word) + return 0; + + /* `self` is definitely a subset of `other` */ + return 1; +} + void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other) { size_t original_size = self->word_alloc; diff --git a/ewah/ewok.h b/ewah/ewok.h index c11d76c6f3..2b6c4ac499 100644 --- a/ewah/ewok.h +++ b/ewah/ewok.h @@ -179,7 +179,13 @@ void bitmap_unset(struct bitmap *self, size_t pos); int bitmap_get(struct bitmap *self, size_t pos); void bitmap_free(struct bitmap *self); int bitmap_equals(struct bitmap *self, struct bitmap *other); + +/* + * Both `bitmap_is_subset()` and `ewah_bitmap_is_subset()` return 1 if the set + * of bits in 'self' are a subset of the bits in 'other'. Returns 0 otherwise. + */ int bitmap_is_subset(struct bitmap *self, struct bitmap *other); +int ewah_bitmap_is_subset(struct ewah_bitmap *self, struct bitmap *other); struct ewah_bitmap * bitmap_to_ewah(struct bitmap *bitmap); struct bitmap *ewah_to_bitmap(struct ewah_bitmap *ewah); From 4722e06edcbb57142387081684b41eda2a6cf635 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:23 -0400 Subject: [PATCH 05/26] pack-bitmap: move some initialization to `bitmap_writer_init()` The pack-bitmap-writer machinery uses a oidmap (backed by khash.h) to map from commits selected for bitmaps (by OID) to a bitmapped_commit structure (containing the bitmap itself, among other things like its XOR offset, etc.) This map was initialized at the end of `bitmap_writer_build()`. New entries are added in `pack-bitmap-write.c::store_selected()`, which is called by the bitmap_builder machinery (which is responsible for traversing history and generating the actual bitmaps). Reorganize when this field is initialized and when entries are added to it so that we can quickly determine whether a commit is a candidate for pseudo-merge selection, or not (since it was already selected to receive a bitmap, and thus storing it in a pseudo-merge would be redundant). The changes are as follows: - Introduce a new `bitmap_writer_init()` function which initializes the `writer.bitmaps` field (instead of waiting until the end of `bitmap_writer_build()`). - Add map entries in `push_bitmapped_commit()` (which is called via `bitmap_writer_select_commits()`) with OID keys and NULL values to track whether or not we *expect* to write a bitmap for some given commit. - Validate that a NULL entry is found matching the given key when we store a selected bitmap. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- builtin/pack-objects.c | 3 ++- midx-write.c | 2 +- pack-bitmap-write.c | 24 ++++++++++++++++++------ pack-bitmap.h | 2 +- 4 files changed, 22 insertions(+), 9 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index 26a6d0d791..6209264e60 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1340,7 +1340,8 @@ static void write_pack_file(void) hash_to_hex(hash)); if (write_bitmap_index) { - bitmap_writer_init(&bitmap_writer); + bitmap_writer_init(&bitmap_writer, + the_repository); bitmap_writer_set_checksum(&bitmap_writer, hash); bitmap_writer_build_type_index(&bitmap_writer, &to_pack, written_list, nr_written); diff --git a/midx-write.c b/midx-write.c index 7c0c08c64b..c747d1a6af 100644 --- a/midx-write.c +++ b/midx-write.c @@ -820,7 +820,7 @@ static int write_midx_bitmap(const char *midx_name, for (i = 0; i < pdata->nr_objects; i++) index[i] = &pdata->objects[i].idx; - bitmap_writer_init(&writer); + bitmap_writer_init(&writer, the_repository); bitmap_writer_show_progress(&writer, flags & MIDX_PROGRESS); bitmap_writer_build_type_index(&writer, pdata, index, pdata->nr_objects); diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 6cae670412..d887015583 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -27,9 +27,12 @@ struct bitmapped_commit { uint32_t commit_pos; }; -void bitmap_writer_init(struct bitmap_writer *writer) +void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r) { memset(writer, 0, sizeof(struct bitmap_writer)); + if (writer->bitmaps) + BUG("bitmap writer already initialized"); + writer->bitmaps = kh_init_oid_map(); } void bitmap_writer_free(struct bitmap_writer *writer) @@ -128,11 +131,21 @@ void bitmap_writer_build_type_index(struct bitmap_writer *writer, static inline void push_bitmapped_commit(struct bitmap_writer *writer, struct commit *commit) { + int hash_ret; + khiter_t hash_pos; + if (writer->selected_nr >= writer->selected_alloc) { writer->selected_alloc = (writer->selected_alloc + 32) * 2; REALLOC_ARRAY(writer->selected, writer->selected_alloc); } + hash_pos = kh_put_oid_map(writer->bitmaps, commit->object.oid, + &hash_ret); + if (!hash_ret) + die(_("duplicate entry when writing bitmap index: %s"), + oid_to_hex(&commit->object.oid)); + kh_value(writer->bitmaps, hash_pos) = NULL; + writer->selected[writer->selected_nr].commit = commit; writer->selected[writer->selected_nr].bitmap = NULL; writer->selected[writer->selected_nr].write_as = NULL; @@ -483,14 +496,14 @@ static void store_selected(struct bitmap_writer *writer, { struct bitmapped_commit *stored = &writer->selected[ent->idx]; khiter_t hash_pos; - int hash_ret; stored->bitmap = bitmap_to_ewah(ent->bitmap); - hash_pos = kh_put_oid_map(writer->bitmaps, commit->object.oid, &hash_ret); - if (hash_ret == 0) - die("Duplicate entry when writing index: %s", + hash_pos = kh_get_oid_map(writer->bitmaps, commit->object.oid); + if (hash_pos == kh_end(writer->bitmaps)) + die(_("attempted to store non-selected commit: '%s'"), oid_to_hex(&commit->object.oid)); + kh_value(writer->bitmaps, hash_pos) = stored; } @@ -506,7 +519,6 @@ int bitmap_writer_build(struct bitmap_writer *writer, uint32_t *mapping; int closed = 1; /* until proven otherwise */ - writer->bitmaps = kh_init_oid_map(); writer->to_pack = to_pack; if (writer->show_progress) diff --git a/pack-bitmap.h b/pack-bitmap.h index 3091095f33..f87e60153d 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -114,7 +114,7 @@ struct bitmap_writer { unsigned char pack_checksum[GIT_MAX_RAWSZ]; }; -void bitmap_writer_init(struct bitmap_writer *writer); +void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r); void bitmap_writer_show_progress(struct bitmap_writer *writer, int show); void bitmap_writer_set_checksum(struct bitmap_writer *writer, const unsigned char *sha1); From 89f47c45df7ff65f555548718641c4610f466f08 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:26 -0400 Subject: [PATCH 06/26] pseudo-merge.ch: initial commit Add a new (empty) header file to contain the implementation for selecting, reading, and applying pseudo-merge bitmaps. For now this header and its corresponding implementation are left empty, but they will evolve over the course of subsequent commit(s). Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Makefile | 1 + pseudo-merge.c | 2 ++ pseudo-merge.h | 6 ++++++ 3 files changed, 9 insertions(+) create mode 100644 pseudo-merge.c create mode 100644 pseudo-merge.h diff --git a/Makefile b/Makefile index 1e31acc72e..6a3d164fdf 100644 --- a/Makefile +++ b/Makefile @@ -1119,6 +1119,7 @@ LIB_OBJS += prompt.o LIB_OBJS += protocol.o LIB_OBJS += protocol-caps.o LIB_OBJS += prune-packed.o +LIB_OBJS += pseudo-merge.o LIB_OBJS += quote.o LIB_OBJS += range-diff.o LIB_OBJS += reachable.o diff --git a/pseudo-merge.c b/pseudo-merge.c new file mode 100644 index 0000000000..37e037ba27 --- /dev/null +++ b/pseudo-merge.c @@ -0,0 +1,2 @@ +#include "git-compat-util.h" +#include "pseudo-merge.h" diff --git a/pseudo-merge.h b/pseudo-merge.h new file mode 100644 index 0000000000..cab8ff6960 --- /dev/null +++ b/pseudo-merge.h @@ -0,0 +1,6 @@ +#ifndef PSEUDO_MERGE_H +#define PSEUDO_MERGE_H + +#include "git-compat-util.h" + +#endif From 0d41b18317ece0d46e2cfdf1d99ba2b2166c0248 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:29 -0400 Subject: [PATCH 07/26] pack-bitmap-write: support storing pseudo-merge commits Prepare to write pseudo-merge bitmaps by annotating individual bitmapped commits (which are represented by the `bitmapped_commit` structure) with an extra bit indicating whether or not they are a pseudo-merge. In subsequent commits, pseudo-merge bitmaps will be generated by allocating a fake commit node with parents covering the full set of commits represented by the pseudo-merge bitmap. These commits will be added to the set of "selected" commits as usual, but will be written specially instead of being included with the rest of the selected commits. Mechanically speaking, there are two parts of this change: - The bitmapped_commit struct gets a new bit indicating whether it is a pseudo-merge, or an ordinary commit selected for bitmaps. - A handful of changes to only write out the non-pseudo-merge commits when enumerating through the selected array (see the new `bitmap_writer_selected_nr()` function). Pseudo-merge commits appear after all non-pseudo-merge commits, so it is safe to enumerate through the selected array like so: for (i = 0; i < bitmap_writer_selected_nr(); i++) if (writer.selected[i].pseudo_merge) BUG("unexpected pseudo-merge"); without encountering the BUG(). Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- object.h | 2 +- pack-bitmap-write.c | 96 +++++++++++++++++++++++++++++---------------- pack-bitmap.h | 3 ++ 3 files changed, 67 insertions(+), 34 deletions(-) diff --git a/object.h b/object.h index 99b9c8f114..e6f9e89d3c 100644 --- a/object.h +++ b/object.h @@ -81,7 +81,7 @@ void object_array_init(struct object_array *array); * reflog.c: 10--12 * builtin/show-branch.c: 0-------------------------------------------26 * builtin/unpack-objects.c: 2021 - * pack-bitmap.h: 22 + * pack-bitmap.h: 2122 */ #define FLAG_BITS 28 diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index d887015583..60eb1e71c9 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -25,8 +25,14 @@ struct bitmapped_commit { int flags; int xor_offset; uint32_t commit_pos; + unsigned pseudo_merge : 1; }; +static inline int bitmap_writer_nr_selected_commits(struct bitmap_writer *writer) +{ + return writer->selected_nr - writer->pseudo_merges_nr; +} + void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r) { memset(writer, 0, sizeof(struct bitmap_writer)); @@ -129,27 +135,31 @@ void bitmap_writer_build_type_index(struct bitmap_writer *writer, */ static inline void push_bitmapped_commit(struct bitmap_writer *writer, - struct commit *commit) + struct commit *commit, + unsigned pseudo_merge) { - int hash_ret; - khiter_t hash_pos; - if (writer->selected_nr >= writer->selected_alloc) { writer->selected_alloc = (writer->selected_alloc + 32) * 2; REALLOC_ARRAY(writer->selected, writer->selected_alloc); } - hash_pos = kh_put_oid_map(writer->bitmaps, commit->object.oid, - &hash_ret); - if (!hash_ret) - die(_("duplicate entry when writing bitmap index: %s"), - oid_to_hex(&commit->object.oid)); - kh_value(writer->bitmaps, hash_pos) = NULL; + if (!pseudo_merge) { + int hash_ret; + khiter_t hash_pos = kh_put_oid_map(writer->bitmaps, + commit->object.oid, + &hash_ret); + + if (!hash_ret) + die(_("duplicate entry when writing bitmap index: %s"), + oid_to_hex(&commit->object.oid)); + kh_value(writer->bitmaps, hash_pos) = NULL; + } writer->selected[writer->selected_nr].commit = commit; writer->selected[writer->selected_nr].bitmap = NULL; writer->selected[writer->selected_nr].write_as = NULL; writer->selected[writer->selected_nr].flags = 0; + writer->selected[writer->selected_nr].pseudo_merge = pseudo_merge; writer->selected_nr++; } @@ -180,16 +190,20 @@ static void compute_xor_offsets(struct bitmap_writer *writer) while (next < writer->selected_nr) { struct bitmapped_commit *stored = &writer->selected[next]; - int best_offset = 0; struct ewah_bitmap *best_bitmap = stored->bitmap; struct ewah_bitmap *test_xor; + if (stored->pseudo_merge) + goto next; + for (i = 1; i <= MAX_XOR_OFFSET_SEARCH; ++i) { int curr = next - i; if (curr < 0) break; + if (writer->selected[curr].pseudo_merge) + continue; test_xor = ewah_pool_new(); ewah_xor(writer->selected[curr].bitmap, stored->bitmap, test_xor); @@ -205,6 +219,7 @@ static void compute_xor_offsets(struct bitmap_writer *writer) } } +next: stored->xor_offset = best_offset; stored->write_as = best_bitmap; @@ -217,7 +232,8 @@ struct bb_commit { struct bitmap *commit_mask; struct bitmap *bitmap; unsigned selected:1, - maximal:1; + maximal:1, + pseudo_merge:1; unsigned idx; /* within selected array */ }; @@ -255,17 +271,18 @@ static void bitmap_builder_init(struct bitmap_builder *bb, revs.first_parent_only = 1; for (i = 0; i < writer->selected_nr; i++) { - struct commit *c = writer->selected[i].commit; - struct bb_commit *ent = bb_data_at(&bb->data, c); + struct bitmapped_commit *bc = &writer->selected[i]; + struct bb_commit *ent = bb_data_at(&bb->data, bc->commit); ent->selected = 1; ent->maximal = 1; + ent->pseudo_merge = bc->pseudo_merge; ent->idx = i; ent->commit_mask = bitmap_new(); bitmap_set(ent->commit_mask, i); - add_pending_object(&revs, &c->object, ""); + add_pending_object(&revs, &bc->commit->object, ""); } if (prepare_revision_walk(&revs)) @@ -444,8 +461,13 @@ static int fill_bitmap_commit(struct bitmap_writer *writer, struct commit *c = prio_queue_get(queue); if (old_bitmap && mapping) { - struct ewah_bitmap *old = bitmap_for_commit(old_bitmap, c); + struct ewah_bitmap *old; struct bitmap *remapped = bitmap_new(); + + if (commit->object.flags & BITMAP_PSEUDO_MERGE) + old = NULL; + else + old = bitmap_for_commit(old_bitmap, c); /* * If this commit has an old bitmap, then translate that * bitmap and add its bits to this one. No need to walk @@ -464,12 +486,14 @@ static int fill_bitmap_commit(struct bitmap_writer *writer, * Mark ourselves and queue our tree. The commit * walk ensures we cover all parents. */ - pos = find_object_pos(writer, &c->object.oid, &found); - if (!found) - return -1; - bitmap_set(ent->bitmap, pos); - prio_queue_put(tree_queue, - repo_get_commit_tree(the_repository, c)); + if (!(c->object.flags & BITMAP_PSEUDO_MERGE)) { + pos = find_object_pos(writer, &c->object.oid, &found); + if (!found) + return -1; + bitmap_set(ent->bitmap, pos); + prio_queue_put(tree_queue, + repo_get_commit_tree(the_repository, c)); + } for (p = c->parents; p; p = p->next) { pos = find_object_pos(writer, &p->item->object.oid, @@ -499,6 +523,9 @@ static void store_selected(struct bitmap_writer *writer, stored->bitmap = bitmap_to_ewah(ent->bitmap); + if (ent->pseudo_merge) + return; + hash_pos = kh_get_oid_map(writer->bitmaps, commit->object.oid); if (hash_pos == kh_end(writer->bitmaps)) die(_("attempted to store non-selected commit: '%s'"), @@ -631,7 +658,7 @@ void bitmap_writer_select_commits(struct bitmap_writer *writer, if (indexed_commits_nr < 100) { for (i = 0; i < indexed_commits_nr; ++i) - push_bitmapped_commit(writer, indexed_commits[i]); + push_bitmapped_commit(writer, indexed_commits[i], 0); return; } @@ -664,7 +691,7 @@ void bitmap_writer_select_commits(struct bitmap_writer *writer, } } - push_bitmapped_commit(writer, chosen); + push_bitmapped_commit(writer, chosen, 0); i += next + 1; display_progress(writer->progress, i); @@ -701,8 +728,11 @@ static void write_selected_commits_v1(struct bitmap_writer *writer, { int i; - for (i = 0; i < writer->selected_nr; ++i) { + for (i = 0; i < bitmap_writer_nr_selected_commits(writer); ++i) { struct bitmapped_commit *stored = &writer->selected[i]; + if (stored->pseudo_merge) + BUG("unexpected pseudo-merge among selected: %s", + oid_to_hex(&stored->commit->object.oid)); if (offsets) offsets[i] = hashfile_total(f); @@ -735,10 +765,10 @@ static void write_lookup_table(struct bitmap_writer *writer, struct hashfile *f, uint32_t i; uint32_t *table, *table_inv; - ALLOC_ARRAY(table, writer->selected_nr); - ALLOC_ARRAY(table_inv, writer->selected_nr); + ALLOC_ARRAY(table, bitmap_writer_nr_selected_commits(writer)); + ALLOC_ARRAY(table_inv, bitmap_writer_nr_selected_commits(writer)); - for (i = 0; i < writer->selected_nr; i++) + for (i = 0; i < bitmap_writer_nr_selected_commits(writer); i++) table[i] = i; /* @@ -746,16 +776,16 @@ static void write_lookup_table(struct bitmap_writer *writer, struct hashfile *f, * bitmap corresponds to j'th bitmapped commit (among the selected * commits) in lex order of OIDs. */ - QSORT_S(table, writer->selected_nr, table_cmp, writer); + QSORT_S(table, bitmap_writer_nr_selected_commits(writer), table_cmp, writer); /* table_inv helps us discover that relationship (i'th bitmap * to j'th commit by j = table_inv[i]) */ - for (i = 0; i < writer->selected_nr; i++) + for (i = 0; i < bitmap_writer_nr_selected_commits(writer); i++) table_inv[table[i]] = i; trace2_region_enter("pack-bitmap-write", "writing_lookup_table", the_repository); - for (i = 0; i < writer->selected_nr; i++) { + for (i = 0; i < bitmap_writer_nr_selected_commits(writer); i++) { struct bitmapped_commit *selected = &writer->selected[table[i]]; uint32_t xor_offset = selected->xor_offset; uint32_t xor_row; @@ -827,7 +857,7 @@ void bitmap_writer_finish(struct bitmap_writer *writer, memcpy(header.magic, BITMAP_IDX_SIGNATURE, sizeof(BITMAP_IDX_SIGNATURE)); header.version = htons(default_version); header.options = htons(flags | options); - header.entry_count = htonl(writer->selected_nr); + header.entry_count = htonl(bitmap_writer_nr_selected_commits(writer)); hashcpy(header.checksum, writer->pack_checksum); hashwrite(f, &header, sizeof(header) - GIT_MAX_RAWSZ + the_hash_algo->rawsz); @@ -839,7 +869,7 @@ void bitmap_writer_finish(struct bitmap_writer *writer, if (options & BITMAP_OPT_LOOKUP_TABLE) CALLOC_ARRAY(offsets, index_nr); - for (i = 0; i < writer->selected_nr; i++) { + for (i = 0; i < bitmap_writer_nr_selected_commits(writer); i++) { struct bitmapped_commit *stored = &writer->selected[i]; int commit_pos = oid_pos(&stored->commit->object.oid, index, index_nr, oid_access); diff --git a/pack-bitmap.h b/pack-bitmap.h index f87e60153d..6937a0f090 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -21,6 +21,7 @@ struct bitmap_disk_header { unsigned char checksum[GIT_MAX_RAWSZ]; }; +#define BITMAP_PSEUDO_MERGE (1u<<21) #define NEEDS_BITMAP (1u<<22) /* @@ -109,6 +110,8 @@ struct bitmap_writer { struct bitmapped_commit *selected; unsigned int selected_nr, selected_alloc; + uint32_t pseudo_merges_nr; + struct progress *progress; int show_progress; unsigned char pack_checksum[GIT_MAX_RAWSZ]; From 245a7f2e01dc86baa08ceb70937585c6389cb9db Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:32 -0400 Subject: [PATCH 08/26] pack-bitmap: implement `bitmap_writer_has_bitmapped_object_id()` Prepare to implement pseudo-merge bitmap selection by implementing a necessary new function, `bitmap_writer_has_bitmapped_object_id()`. This function returns whether or not the bitmap_writer selected the given object ID for bitmapping. This will allow the pseudo-merge machinery to reject candidates for pseudo-merges if they have already been selected as an ordinary bitmap tip. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap-write.c | 6 ++++++ pack-bitmap.h | 2 ++ 2 files changed, 8 insertions(+) diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 60eb1e71c9..299aa8af6f 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -130,6 +130,12 @@ void bitmap_writer_build_type_index(struct bitmap_writer *writer, } } +int bitmap_writer_has_bitmapped_object_id(struct bitmap_writer *writer, + const struct object_id *oid) +{ + return kh_get_oid_map(writer->bitmaps, *oid) != kh_end(writer->bitmaps); +} + /** * Compute the actual bitmaps */ diff --git a/pack-bitmap.h b/pack-bitmap.h index 6937a0f090..e175f28e0d 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -125,6 +125,8 @@ void bitmap_writer_build_type_index(struct bitmap_writer *writer, struct packing_data *to_pack, struct pack_idx_entry **index, uint32_t index_nr); +int bitmap_writer_has_bitmapped_object_id(struct bitmap_writer *writer, + const struct object_id *oid); uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git, struct packing_data *mapping); int rebuild_bitmap(const uint32_t *reposition, From c059c8795e926cfe3ad8cb7a47f54c071227fcfa Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:36 -0400 Subject: [PATCH 09/26] pack-bitmap: make `bitmap_writer_push_bitmapped_commit()` public The pseudo-merge selection code will be added in a subsequent commit, and will need a way to push the allocated commit structures into the bitmap writer from a separate compilation unit. Make the `bitmap_writer_push_bitmapped_commit()` function part of the pack-bitmap.h header in order to make this possible. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap-write.c | 9 ++++----- pack-bitmap.h | 2 ++ 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 299aa8af6f..bc19b33ad1 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -140,9 +140,8 @@ int bitmap_writer_has_bitmapped_object_id(struct bitmap_writer *writer, * Compute the actual bitmaps */ -static inline void push_bitmapped_commit(struct bitmap_writer *writer, - struct commit *commit, - unsigned pseudo_merge) +void bitmap_writer_push_commit(struct bitmap_writer *writer, + struct commit *commit, unsigned pseudo_merge) { if (writer->selected_nr >= writer->selected_alloc) { writer->selected_alloc = (writer->selected_alloc + 32) * 2; @@ -664,7 +663,7 @@ void bitmap_writer_select_commits(struct bitmap_writer *writer, if (indexed_commits_nr < 100) { for (i = 0; i < indexed_commits_nr; ++i) - push_bitmapped_commit(writer, indexed_commits[i], 0); + bitmap_writer_push_commit(writer, indexed_commits[i], 0); return; } @@ -697,7 +696,7 @@ void bitmap_writer_select_commits(struct bitmap_writer *writer, } } - push_bitmapped_commit(writer, chosen, 0); + bitmap_writer_push_commit(writer, chosen, 0); i += next + 1; display_progress(writer->progress, i); diff --git a/pack-bitmap.h b/pack-bitmap.h index e175f28e0d..a7e2f56c97 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -127,6 +127,8 @@ void bitmap_writer_build_type_index(struct bitmap_writer *writer, uint32_t index_nr); int bitmap_writer_has_bitmapped_object_id(struct bitmap_writer *writer, const struct object_id *oid); +void bitmap_writer_push_commit(struct bitmap_writer *writer, + struct commit *commit, unsigned pseudo_merge); uint32_t *create_bitmap_mapping(struct bitmap_index *bitmap_git, struct packing_data *mapping); int rebuild_bitmap(const uint32_t *reposition, From 5831f8ac41a137b93d5481cc48078cb5628538e3 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:39 -0400 Subject: [PATCH 10/26] config: introduce `git_config_double()` Future commits will want to parse a double-precision floating point value from configuration, but we have no way to parse such a value prior to this patch. The core of the routine is implemented in git_parse_double(). Unlike git_parse_unsigned() and git_parse_signed(), however, the function implemented here only works on type "double", and not related types like "float", or "long double". This is because "float" and "long double" use different functions to convert from ASCII strings to floating point values (strtof() and strtold(), respectively). Likewise, there is no pointer type that can assign to any of these values (except for "void *"), so the only way to define this trio of functions would be with a macro expansion that is parameterized over the floating point type and conversion function. That is all doable, but likely to be overkill given our current needs, which is only to parse double-precision floats. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- config.c | 9 +++++++++ config.h | 7 +++++++ parse.c | 29 +++++++++++++++++++++++++++++ parse.h | 1 + 4 files changed, 46 insertions(+) diff --git a/config.c b/config.c index ae3652b08f..8a8332617f 100644 --- a/config.c +++ b/config.c @@ -1243,6 +1243,15 @@ ssize_t git_config_ssize_t(const char *name, const char *value, return ret; } +double git_config_double(const char *name, const char *value, + const struct key_value_info *kvi) +{ + double ret; + if (!git_parse_double(value, &ret)) + die_bad_number(name, value, kvi); + return ret; +} + static const struct fsync_component_name { const char *name; enum fsync_component component_bits; diff --git a/config.h b/config.h index f4966e3749..f5f306f373 100644 --- a/config.h +++ b/config.h @@ -261,6 +261,13 @@ unsigned long git_config_ulong(const char *, const char *, ssize_t git_config_ssize_t(const char *, const char *, const struct key_value_info *); +/** + * Identically to `git_config_double`, but for double-precision floating point + * values. + */ +double git_config_double(const char *, const char *, + const struct key_value_info *); + /** * Same as `git_config_bool`, except that integers are returned as-is, and * an `is_bool` flag is unset. diff --git a/parse.c b/parse.c index 42d691a0fb..7a60a4f816 100644 --- a/parse.c +++ b/parse.c @@ -125,6 +125,35 @@ int git_parse_ssize_t(const char *value, ssize_t *ret) return 1; } +int git_parse_double(const char *value, double *ret) +{ + char *end; + double val; + uintmax_t factor; + + if (!value || !*value) { + errno = EINVAL; + return 0; + } + + errno = 0; + val = strtod(value, &end); + if (errno == ERANGE) + return 0; + if (end == value) { + errno = EINVAL; + return 0; + } + factor = get_unit_factor(end); + if (!factor) { + errno = EINVAL; + return 0; + } + val *= factor; + *ret = val; + return 1; +} + int git_parse_maybe_bool_text(const char *value) { if (!value) diff --git a/parse.h b/parse.h index 07d2193d69..6bb9a54d9a 100644 --- a/parse.h +++ b/parse.h @@ -6,6 +6,7 @@ int git_parse_ssize_t(const char *, ssize_t *); int git_parse_ulong(const char *, unsigned long *); int git_parse_int(const char *value, int *ret); int git_parse_int64(const char *value, int64_t *ret); +int git_parse_double(const char *value, double *ret); /** * Same as `git_config_bool`, except that it returns -1 on error rather From faf558b23ef55b18f395d1f7a7c2714ccc1320e2 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:42 -0400 Subject: [PATCH 11/26] pseudo-merge: implement support for selecting pseudo-merge commits Teach the new pseudo-merge machinery how to select non-bitmapped commits for inclusion in different pseudo-merge group(s) based on a handful of criteria. Note that the selected pseudo-merge commits aren't actually used or written anywhere yet. This will be done in the following commit. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/config.txt | 2 + Documentation/config/bitmap-pseudo-merge.txt | 91 ++++ Documentation/gitpacking.txt | 83 ++++ pack-bitmap-write.c | 21 + pack-bitmap.h | 2 + pseudo-merge.c | 454 +++++++++++++++++++ pseudo-merge.h | 94 ++++ 7 files changed, 747 insertions(+) create mode 100644 Documentation/config/bitmap-pseudo-merge.txt diff --git a/Documentation/config.txt b/Documentation/config.txt index 70b448b132..bbedb7b9a0 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -383,6 +383,8 @@ include::config/apply.txt[] include::config/attr.txt[] +include::config/bitmap-pseudo-merge.txt[] + include::config/blame.txt[] include::config/branch.txt[] diff --git a/Documentation/config/bitmap-pseudo-merge.txt b/Documentation/config/bitmap-pseudo-merge.txt new file mode 100644 index 0000000000..1f264eca99 --- /dev/null +++ b/Documentation/config/bitmap-pseudo-merge.txt @@ -0,0 +1,91 @@ +NOTE: The configuration options in `bitmapPseudoMerge.*` are considered +EXPERIMENTAL and may be subject to change or be removed entirely in the +future. For more information about the pseudo-merge bitmap feature, see +the "Pseudo-merge bitmaps" section of linkgit:gitpacking[7]. + +bitmapPseudoMerge..pattern:: + Regular expression used to match reference names. Commits + pointed to by references matching this pattern (and meeting + the below criteria, like `bitmapPseudoMerge..sampleRate` + and `bitmapPseudoMerge..threshold`) will be considered + for inclusion in a pseudo-merge bitmap. ++ +Commits are grouped into pseudo-merge groups based on whether or not +any reference(s) that point at a given commit match the pattern, which +is an extended regular expression. ++ +Within a pseudo-merge group, commits may be further grouped into +sub-groups based on the capture groups in the pattern. These +sub-groupings are formed from the regular expressions by concatenating +any capture groups from the regular expression, with a '-' dash in +between. ++ +For example, if the pattern is `refs/tags/`, then all tags (provided +they meet the below criteria) will be considered candidates for the +same pseudo-merge group. However, if the pattern is instead +`refs/remotes/([0-9])+/tags/`, then tags from different remotes will +be grouped into separate pseudo-merge groups, based on the remote +number. + +bitmapPseudoMerge..decay:: + Determines the rate at which consecutive pseudo-merge bitmap + groups decrease in size. Must be non-negative. This parameter + can be thought of as `k` in the function `f(n) = C * n^-k`, + where `f(n)` is the size of the `n`th group. ++ +Setting the decay rate equal to `0` will cause all groups to be the +same size. Setting the decay rate equal to `1` will cause the `n`th +group to be `1/n` the size of the initial group. Higher values of the +decay rate cause consecutive groups to shrink at an increasing rate. +The default is `1`. ++ +If all groups are the same size, it is possible that groups containing +newer commits will be able to be used less often than earlier groups, +since it is more likely that the references pointing at newer commits +will be updated more often than a reference pointing at an old commit. + +bitmapPseudoMerge..sampleRate:: + Determines the proportion of non-bitmapped commits (among + reference tips) which are selected for inclusion in an + unstable pseudo-merge bitmap. Must be between `0` and `1` + (inclusive). The default is `1`. + +bitmapPseudoMerge..threshold:: + Determines the minimum age of non-bitmapped commits (among + reference tips, as above) which are candidates for inclusion + in an unstable pseudo-merge bitmap. The default is + `1.week.ago`. + +bitmapPseudoMerge..maxMerges:: + Determines the maximum number of pseudo-merge commits among + which commits may be distributed. ++ +For pseudo-merge groups whose pattern does not contain any capture +groups, this setting is applied for all commits matching the regular +expression. For patterns that have one or more capture groups, this +setting is applied for each distinct capture group. ++ +For example, if your capture group is `refs/tags/`, then this setting +will distribute all tags into a maximum of `maxMerges` pseudo-merge +commits. However, if your capture group is, say, +`refs/remotes/([0-9]+)/tags/`, then this setting will be applied to +each remote's set of tags individually. ++ +Must be non-negative. The default value is 64. + +bitmapPseudoMerge..stableThreshold:: + Determines the minimum age of commits (among reference tips, + as above, however stable commits are still considered + candidates even when they have been covered by a bitmap) which + are candidates for a stable a pseudo-merge bitmap. The default + is `1.month.ago`. ++ +Setting this threshold to a smaller value (e.g., 1.week.ago) will cause +more stable groups to be generated (which impose a one-time generation +cost) but those groups will likely become stale over time. Using a +larger value incurs the opposite penalty (fewer stable groups which are +more useful). + +bitmapPseudoMerge..stableSize:: + Determines the size (in number of commits) of a stable + psuedo-merge bitmap. The default is `512`. diff --git a/Documentation/gitpacking.txt b/Documentation/gitpacking.txt index f24396f017..4a6fcba6f7 100644 --- a/Documentation/gitpacking.txt +++ b/Documentation/gitpacking.txt @@ -96,6 +96,89 @@ can take advantage of the fact that we only care about the union of objects reachable from all of those tags, and answer the query much faster. +=== Configuration + +Reference tips are grouped into different pseudo-merge groups according +to two criteria. A reference name matches one or more of the defined +pseudo-merge patterns, and optionally one or more capture groups within +that pattern which further partition the group. + +Within a group, commits may be considered "stable", or "unstable" +depending on their age. These are adjusted by setting the +`bitmapPseudoMerge..stableThreshold` and +`bitmapPseudoMerge..threshold` configuration values, respectively. + +All stable commits are grouped into pseudo-merges of equal size +(`bitmapPseudoMerge..stableSize`). If the `stableSize` +configuration is set to, say, 100, then the first 100 commits (ordered +by committer date) which are older than the `stableThreshold` value will +form one group, the next 100 commits will form another group, and so on. + +Among unstable commits, the pseudo-merge machinery will attempt to +combine older commits into large groups as opposed to newer commits +which will appear in smaller groups. This is based on the heuristic that +references whose tip commit is older are less likely to be modified to +point at a different commit than a reference whose tip commit is newer. + +The size of groups is determined by a power-law decay function, and the +decay parameter roughly corresponds to "k" in `f(n) = C*n^(-k/100)`, +where `f(n)` describes the size of the `n`-th pseudo-merge group. The +sample rate controls what percentage of eligible commits are considered +as candidates. The threshold parameter indicates the minimum age (so as +to avoid including too-recent commits in a pseudo-merge group, making it +less likely to be valid). The "maxMerges" parameter sets an upper-bound +on the number of pseudo-merge commits an individual group + +The "stable"-related parameters control "stable" pseudo-merge groups, +comprised of a fixed number of commits which are older than the +configured "stable threshold" value and may be grouped together in +chunks of "stableSize" in order of age. + +The exact configuration for pseudo-merges is as follows: + +include::config/bitmap-pseudo-merge.txt[] + +=== Examples + +Suppose that you have a repository with a large number of references, +and you want a bare-bones configuration of pseudo-merge bitmaps that +will enhance bitmap coverage of the `refs/` namespace. You may start +wiht a configuration like so: + + [bitmapPseudoMerge "all"] + pattern = "refs/" + threshold = now + stableThreshold = never + sampleRate = 100 + maxMerges = 64 + +This will create pseudo-merge bitmaps for all references, regardless of +their age, and group them into 64 pseudo-merge commits. + +If you wanted to separate tags from branches when generating +pseudo-merge commits, you would instead define the pattern with a +capture group, like so: + + [bitmapPseudoMerge "all"] + pattern = "refs/(heads/tags)/" + +Suppose instead that you are working in a fork-network repository, with +each fork specified by some numeric ID, and whose refs reside in +`refs/virtual/NNN/` (where `NNN` is the numeric ID corresponding to some +fork) in the network. In this instance, you may instead write something +like: + + [bitmapPseudoMerge "all"] + pattern = "refs/virtual/([0-9]+)/(heads|tags)/" + threshold = now + stableThreshold = never + sampleRate = 100 + maxMerges = 64 + +Which would generate pseudo-merge group identifiers like "1234-heads", +and "5678-tags" (for branches in fork "1234", and tags in remote "5678", +respectively). + SEE ALSO -------- linkgit:git-pack-objects[1] diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index bc19b33ad1..d5884ea5e9 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -17,6 +17,7 @@ #include "trace2.h" #include "tree.h" #include "tree-walk.h" +#include "pseudo-merge.h" struct bitmapped_commit { struct commit *commit; @@ -39,11 +40,25 @@ void bitmap_writer_init(struct bitmap_writer *writer, struct repository *r) if (writer->bitmaps) BUG("bitmap writer already initialized"); writer->bitmaps = kh_init_oid_map(); + writer->pseudo_merge_commits = kh_init_oid_map(); + + string_list_init_dup(&writer->pseudo_merge_groups); + + load_pseudo_merges_from_config(&writer->pseudo_merge_groups); +} + +static void free_pseudo_merge_commit_idx(struct pseudo_merge_commit_idx *idx) +{ + if (!idx) + return; + free(idx->pseudo_merge); + free(idx); } void bitmap_writer_free(struct bitmap_writer *writer) { uint32_t i; + struct pseudo_merge_commit_idx *idx; if (!writer) return; @@ -55,6 +70,10 @@ void bitmap_writer_free(struct bitmap_writer *writer) kh_destroy_oid_map(writer->bitmaps); + kh_foreach_value(writer->pseudo_merge_commits, idx, + free_pseudo_merge_commit_idx(idx)); + kh_destroy_oid_map(writer->pseudo_merge_commits); + for (i = 0; i < writer->selected_nr; i++) { struct bitmapped_commit *bc = &writer->selected[i]; if (bc->write_as != bc->bitmap) @@ -703,6 +722,8 @@ void bitmap_writer_select_commits(struct bitmap_writer *writer, } stop_progress(&writer->progress); + + select_pseudo_merges(writer, indexed_commits, indexed_commits_nr); } diff --git a/pack-bitmap.h b/pack-bitmap.h index a7e2f56c97..1e730ea1e5 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -110,6 +110,8 @@ struct bitmap_writer { struct bitmapped_commit *selected; unsigned int selected_nr, selected_alloc; + struct string_list pseudo_merge_groups; + kh_oid_map_t *pseudo_merge_commits; /* oid -> pseudo merge(s) */ uint32_t pseudo_merges_nr; struct progress *progress; diff --git a/pseudo-merge.c b/pseudo-merge.c index 37e037ba27..0f6854c753 100644 --- a/pseudo-merge.c +++ b/pseudo-merge.c @@ -1,2 +1,456 @@ #include "git-compat-util.h" #include "pseudo-merge.h" +#include "date.h" +#include "oid-array.h" +#include "strbuf.h" +#include "config.h" +#include "string-list.h" +#include "refs.h" +#include "pack-bitmap.h" +#include "commit.h" +#include "alloc.h" +#include "progress.h" + +#define DEFAULT_PSEUDO_MERGE_DECAY 1.0 +#define DEFAULT_PSEUDO_MERGE_MAX_MERGES 64 +#define DEFAULT_PSEUDO_MERGE_SAMPLE_RATE 1 +#define DEFAULT_PSEUDO_MERGE_THRESHOLD approxidate("1.week.ago") +#define DEFAULT_PSEUDO_MERGE_STABLE_THRESHOLD approxidate("1.month.ago") +#define DEFAULT_PSEUDO_MERGE_STABLE_SIZE 512 + +static double gitexp(double base, int exp) +{ + double result = 1; + while (1) { + if (exp % 2) + result *= base; + exp >>= 1; + if (!exp) + break; + base *= base; + } + return result; +} + +static uint32_t pseudo_merge_group_size(const struct pseudo_merge_group *group, + const struct pseudo_merge_matches *matches, + uint32_t i) +{ + double C = 0.0f; + uint32_t n; + + /* + * The size of pseudo-merge groups decays according to a power series, + * which looks like: + * + * f(n) = C * n^-k + * + * , where 'n' is the n-th pseudo-merge group, 'f(n)' is its size, 'k' + * is the decay rate, and 'C' is a scaling value. + * + * The value of C depends on the number of groups, decay rate, and total + * number of commits. It is computed such that if there are M and N + * total groups and commits, respectively, that: + * + * N = f(0) + f(1) + ... f(M-1) + * + * Rearranging to isolate C, we get: + * + * N = \sum_{n=1}^M C / n^k + * + * N / C = \sum_{n=1}^M n^-k + * + * C = N / \sum_{n=1}^M n^-k + * + * For example, if we have a decay rate of 'k' being equal to 1.5, 'N' + * total commits equal to 10,000, and 'M' being equal to 6 groups, then + * the (rounded) group sizes are: + * + * { 5469, 1934, 1053, 684, 489, 372 } + * + * increasing the number of total groups, say to 10, scales the group + * sizes appropriately: + * + * { 5012, 1772, 964, 626, 448, 341, 271, 221, 186, 158 } + */ + for (n = 0; n < group->max_merges; n++) + C += 1.0 / gitexp(n + 1, group->decay); + C = matches->unstable_nr / C; + + return (uint32_t)((C / gitexp(i + 1, group->decay)) + 0.5); +} + +static void pseudo_merge_group_init(struct pseudo_merge_group *group) +{ + memset(group, 0, sizeof(struct pseudo_merge_group)); + + strmap_init_with_options(&group->matches, NULL, 0); + + group->decay = DEFAULT_PSEUDO_MERGE_DECAY; + group->max_merges = DEFAULT_PSEUDO_MERGE_MAX_MERGES; + group->sample_rate = DEFAULT_PSEUDO_MERGE_SAMPLE_RATE; + group->threshold = DEFAULT_PSEUDO_MERGE_THRESHOLD; + group->stable_threshold = DEFAULT_PSEUDO_MERGE_STABLE_THRESHOLD; + group->stable_size = DEFAULT_PSEUDO_MERGE_STABLE_SIZE; +} + +static int pseudo_merge_config(const char *var, const char *value, + const struct config_context *ctx, + void *cb_data) +{ + struct string_list *list = cb_data; + struct string_list_item *item; + struct pseudo_merge_group *group; + struct strbuf buf = STRBUF_INIT; + const char *sub, *key; + size_t sub_len; + int ret = 0; + + if (parse_config_key(var, "bitmappseudomerge", &sub, &sub_len, &key)) + goto done; + + if (!sub_len) + goto done; + + strbuf_add(&buf, sub, sub_len); + + item = string_list_lookup(list, buf.buf); + if (!item) { + item = string_list_insert(list, buf.buf); + + item->util = xmalloc(sizeof(struct pseudo_merge_group)); + pseudo_merge_group_init(item->util); + } + + group = item->util; + + if (!strcmp(key, "pattern")) { + struct strbuf re = STRBUF_INIT; + + free(group->pattern); + if (*value != '^') + strbuf_addch(&re, '^'); + strbuf_addstr(&re, value); + + group->pattern = xcalloc(1, sizeof(regex_t)); + if (regcomp(group->pattern, re.buf, REG_EXTENDED)) + die(_("failed to load pseudo-merge regex for %s: '%s'"), + sub, re.buf); + + strbuf_release(&re); + } else if (!strcmp(key, "decay")) { + group->decay = git_config_double(var, value, ctx->kvi); + if (group->decay < 0) { + warning(_("%s must be non-negative, using default"), var); + group->decay = DEFAULT_PSEUDO_MERGE_DECAY; + } + } else if (!strcmp(key, "samplerate")) { + group->sample_rate = git_config_double(var, value, ctx->kvi); + if (!(0 <= group->sample_rate && group->sample_rate <= 1)) { + warning(_("%s must be between 0 and 1, using default"), var); + group->sample_rate = DEFAULT_PSEUDO_MERGE_SAMPLE_RATE; + } + } else if (!strcmp(key, "threshold")) { + if (git_config_expiry_date(&group->threshold, var, value)) { + ret = -1; + goto done; + } + } else if (!strcmp(key, "maxmerges")) { + group->max_merges = git_config_int(var, value, ctx->kvi); + if (group->max_merges < 0) { + warning(_("%s must be non-negative, using default"), var); + group->max_merges = DEFAULT_PSEUDO_MERGE_MAX_MERGES; + } + } else if (!strcmp(key, "stablethreshold")) { + if (git_config_expiry_date(&group->stable_threshold, var, value)) { + ret = -1; + goto done; + } + } else if (!strcmp(key, "stablesize")) { + group->stable_size = git_config_int(var, value, ctx->kvi); + if (group->stable_size <= 0) { + warning(_("%s must be positive, using default"), var); + group->stable_size = DEFAULT_PSEUDO_MERGE_STABLE_SIZE; + } + } + +done: + strbuf_release(&buf); + + return ret; +} + +void load_pseudo_merges_from_config(struct string_list *list) +{ + struct string_list_item *item; + + git_config(pseudo_merge_config, list); + + for_each_string_list_item(item, list) { + struct pseudo_merge_group *group = item->util; + if (!group->pattern) + die(_("pseudo-merge group '%s' missing required pattern"), + item->string); + if (group->threshold < group->stable_threshold) + die(_("pseudo-merge group '%s' has unstable threshold " + "before stable one"), item->string); + } +} + +static int find_pseudo_merge_group_for_ref(const char *refname, + const struct object_id *oid, + int flags UNUSED, + void *_data) +{ + struct bitmap_writer *writer = _data; + struct object_id peeled; + struct commit *c; + uint32_t i; + int has_bitmap; + + if (!peel_iterated_oid(oid, &peeled)) + oid = &peeled; + + c = lookup_commit(the_repository, oid); + if (!c) + return 0; + + has_bitmap = bitmap_writer_has_bitmapped_object_id(writer, oid); + + for (i = 0; i < writer->pseudo_merge_groups.nr; i++) { + struct pseudo_merge_group *group; + struct pseudo_merge_matches *matches; + struct strbuf group_name = STRBUF_INIT; + regmatch_t captures[16]; + size_t j; + + group = writer->pseudo_merge_groups.items[i].util; + if (regexec(group->pattern, refname, ARRAY_SIZE(captures), + captures, 0)) + continue; + + if (captures[ARRAY_SIZE(captures) - 1].rm_so != -1) + warning(_("pseudo-merge regex from config has too many capture " + "groups (max=%"PRIuMAX")"), + (uintmax_t)ARRAY_SIZE(captures) - 2); + + for (j = !!group->pattern->re_nsub; j < ARRAY_SIZE(captures); j++) { + regmatch_t *match = &captures[j]; + if (match->rm_so == -1) + continue; + + if (group_name.len) + strbuf_addch(&group_name, '-'); + + strbuf_add(&group_name, refname + match->rm_so, + match->rm_eo - match->rm_so); + } + + matches = strmap_get(&group->matches, group_name.buf); + if (!matches) { + matches = xcalloc(1, sizeof(*matches)); + strmap_put(&group->matches, strbuf_detach(&group_name, NULL), + matches); + } + + if (c->date <= group->stable_threshold) { + ALLOC_GROW(matches->stable, matches->stable_nr + 1, + matches->stable_alloc); + matches->stable[matches->stable_nr++] = c; + } else if (c->date <= group->threshold && !has_bitmap) { + ALLOC_GROW(matches->unstable, matches->unstable_nr + 1, + matches->unstable_alloc); + matches->unstable[matches->unstable_nr++] = c; + } + + strbuf_release(&group_name); + } + + return 0; +} + +static struct commit *push_pseudo_merge(struct pseudo_merge_group *group) +{ + struct commit *merge; + + ALLOC_GROW(group->merges, group->merges_nr + 1, group->merges_alloc); + + merge = alloc_commit_node(the_repository); + merge->object.parsed = 1; + merge->object.flags |= BITMAP_PSEUDO_MERGE; + + group->merges[group->merges_nr++] = merge; + + return merge; +} + +static struct pseudo_merge_commit_idx *pseudo_merge_idx(kh_oid_map_t *pseudo_merge_commits, + const struct object_id *oid) + +{ + struct pseudo_merge_commit_idx *pmc; + int hash_ret; + khiter_t hash_pos = kh_put_oid_map(pseudo_merge_commits, *oid, + &hash_ret); + + if (hash_ret) { + CALLOC_ARRAY(pmc, 1); + kh_value(pseudo_merge_commits, hash_pos) = pmc; + } else { + pmc = kh_value(pseudo_merge_commits, hash_pos); + } + + return pmc; +} + +#define MIN_PSEUDO_MERGE_SIZE 8 + +static void select_pseudo_merges_1(struct bitmap_writer *writer, + struct pseudo_merge_group *group, + struct pseudo_merge_matches *matches) +{ + uint32_t i, j; + uint32_t stable_merges_nr; + + if (!matches->stable_nr && !matches->unstable_nr) + return; /* all tips in this group already have bitmaps */ + + stable_merges_nr = matches->stable_nr / group->stable_size; + if (matches->stable_nr % group->stable_size) + stable_merges_nr++; + + /* make stable_merges_nr pseudo merges for stable commits */ + for (i = 0, j = 0; i < stable_merges_nr; i++) { + struct commit *merge; + struct commit_list **p; + + merge = push_pseudo_merge(group); + p = &merge->parents; + + /* + * For each pseudo-merge created above, add parents to the + * allocated commit node from the stable set of commits + * (un-bitmapped, newer than the stable threshold). + */ + do { + struct commit *c; + struct pseudo_merge_commit_idx *pmc; + + if (j >= matches->stable_nr) + break; + + c = matches->stable[j++]; + /* + * Here and below, make sure that we keep our mapping of + * commits -> pseudo-merge(s) which include the key'd + * commit up-to-date. + */ + pmc = pseudo_merge_idx(writer->pseudo_merge_commits, + &c->object.oid); + + ALLOC_GROW(pmc->pseudo_merge, pmc->nr + 1, pmc->alloc); + + pmc->pseudo_merge[pmc->nr++] = writer->pseudo_merges_nr; + p = commit_list_append(c, p); + } while (j % group->stable_size); + + bitmap_writer_push_commit(writer, merge, 1); + writer->pseudo_merges_nr++; + } + + /* make up to group->max_merges pseudo merges for unstable commits */ + for (i = 0, j = 0; i < group->max_merges; i++) { + struct commit *merge; + struct commit_list **p; + uint32_t size, end; + + merge = push_pseudo_merge(group); + p = &merge->parents; + + size = pseudo_merge_group_size(group, matches, i); + end = size < MIN_PSEUDO_MERGE_SIZE ? matches->unstable_nr : j + size; + + /* + * For each pseudo-merge commit created above, add parents to + * the allocated commit node from the unstable set of commits + * (newer than the stable threshold). + * + * Account for the sample rate, since not every candidate from + * the set of stable commits will be included as a pseudo-merge + * parent. + */ + for (; j < end && j < matches->unstable_nr; j++) { + struct commit *c = matches->unstable[j]; + struct pseudo_merge_commit_idx *pmc; + + if (j % (uint32_t)(1.0 / group->sample_rate)) + continue; + + pmc = pseudo_merge_idx(writer->pseudo_merge_commits, + &c->object.oid); + + ALLOC_GROW(pmc->pseudo_merge, pmc->nr + 1, pmc->alloc); + + pmc->pseudo_merge[pmc->nr++] = writer->pseudo_merges_nr; + p = commit_list_append(c, p); + } + + bitmap_writer_push_commit(writer, merge, 1); + writer->pseudo_merges_nr++; + if (end >= matches->unstable_nr) + break; + } +} + +static int commit_date_cmp(const void *va, const void *vb) +{ + timestamp_t a = (*(const struct commit **)va)->date; + timestamp_t b = (*(const struct commit **)vb)->date; + + if (a < b) + return -1; + else if (a > b) + return 1; + return 0; +} + +static void sort_pseudo_merge_matches(struct pseudo_merge_matches *matches) +{ + QSORT(matches->stable, matches->stable_nr, commit_date_cmp); + QSORT(matches->unstable, matches->unstable_nr, commit_date_cmp); +} + +void select_pseudo_merges(struct bitmap_writer *writer, + struct commit **commits, size_t commits_nr) +{ + struct progress *progress = NULL; + uint32_t i; + + if (!writer->pseudo_merge_groups.nr) + return; + + if (writer->show_progress) + progress = start_progress("Selecting pseudo-merge commits", + writer->pseudo_merge_groups.nr); + + for_each_ref(find_pseudo_merge_group_for_ref, writer); + + for (i = 0; i < writer->pseudo_merge_groups.nr; i++) { + struct pseudo_merge_group *group; + struct hashmap_iter iter; + struct strmap_entry *e; + + group = writer->pseudo_merge_groups.items[i].util; + strmap_for_each_entry(&group->matches, &iter, e) { + struct pseudo_merge_matches *matches = e->value; + + sort_pseudo_merge_matches(matches); + + select_pseudo_merges_1(writer, group, matches); + } + + display_progress(progress, i + 1); + } + + stop_progress(&progress); +} diff --git a/pseudo-merge.h b/pseudo-merge.h index cab8ff6960..f809cf42ae 100644 --- a/pseudo-merge.h +++ b/pseudo-merge.h @@ -2,5 +2,99 @@ #define PSEUDO_MERGE_H #include "git-compat-util.h" +#include "strmap.h" +#include "khash.h" +#include "ewah/ewok.h" + +struct commit; +struct string_list; +struct bitmap_index; +struct bitmap_writer; + +/* + * A pseudo-merge group tracks the set of non-bitmapped reference tips + * that match the given pattern. + * + * Within those matches, they are further segmented by separating + * consecutive capture groups with '-' dash character capture groups + * with '-' dash characters. + * + * Those groups are then ordered by committer date and partitioned + * into individual pseudo-merge(s) according to the decay, max_merges, + * sample_rate, and threshold parameters. + */ +struct pseudo_merge_group { + regex_t *pattern; + + /* capture group(s) -> struct pseudo_merge_matches */ + struct strmap matches; + + /* + * The individual pseudo-merge(s) that are generated from the + * above array of matches, partitioned according to the below + * parameters. + */ + struct commit **merges; + size_t merges_nr; + size_t merges_alloc; + + /* + * Pseudo-merge grouping parameters. See git-config(1) for + * more information. + */ + double decay; + int max_merges; + double sample_rate; + int stable_size; + timestamp_t threshold; + timestamp_t stable_threshold; +}; + +struct pseudo_merge_matches { + struct commit **stable; + struct commit **unstable; + size_t stable_nr, stable_alloc; + size_t unstable_nr, unstable_alloc; +}; + +/* + * Read the repository's configuration: + * + * - bitmapPseudoMerge..pattern + * - bitmapPseudoMerge..decay + * - bitmapPseudoMerge..sampleRate + * - bitmapPseudoMerge..threshold + * - bitmapPseudoMerge..maxMerges + * - bitmapPseudoMerge..stableThreshold + * - bitmapPseudoMerge..stableSize + * + * and populates the given `list` with pseudo-merge groups. String + * entry keys are the pseudo-merge group names, and the values are + * pointers to the pseudo_merge_group structure itself. + */ +void load_pseudo_merges_from_config(struct string_list *list); + +/* + * A pseudo-merge commit index (pseudo_merge_commit_idx) maps a + * particular (non-pseudo-merge) commit to the list of pseudo-merge(s) + * it appears in. + */ +struct pseudo_merge_commit_idx { + uint32_t *pseudo_merge; + size_t nr, alloc; +}; + +/* + * Selects pseudo-merges from a list of commits, populating the given + * string_list of pseudo-merge groups. + * + * Populates the pseudo_merge_commits map with a commit_idx + * corresponding to each commit in the list. Counts the total number + * of pseudo-merges generated. + * + * Optionally shows a progress meter. + */ +void select_pseudo_merges(struct bitmap_writer *writer, + struct commit **commits, size_t commits_nr); #endif From 53ea3ec47990e9ad00a238fd962f36e5d87a6386 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:46 -0400 Subject: [PATCH 12/26] pack-bitmap-write.c: write pseudo-merge table Now that the pack-bitmap writer machinery understands how to select and store pseudo-merge commits, teach it how to write the new optional pseudo-merge .bitmap extension. No readers yet exist for this new extension to the .bitmap format. The following commits will take any preparatory step(s) necessary before then implementing the routines necessary to read this new table. In the meantime, the new `write_pseudo_merges()` function implements writing this new format as described by a previous commit in Documentation/technical/bitmap-format.txt. Writing this table is fairly straightforward and consists of a few sub-components: - a pair of bitmaps for each pseudo-merge (one for the pseudo-merge "parents", and another for the objects reachable from those parents) - for each commit, the offset of either (a) the pseudo-merge it belongs to, or (b) an extended lookup table if it belongs to >1 pseudo-merge groups - if there are any commits belonging to >1 pseudo-merge group, the extended lookup tables (which each consist of the number of pseudo-merge groups a commit appears in, and then that many 4-byte unsigned ) Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap-write.c | 131 ++++++++++++++++++++++++++++++++++++++++++++ pack-bitmap.h | 1 + 2 files changed, 132 insertions(+) diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index d5884ea5e9..47250398aa 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -18,6 +18,7 @@ #include "tree.h" #include "tree-walk.h" #include "pseudo-merge.h" +#include "oid-array.h" struct bitmapped_commit { struct commit *commit; @@ -771,6 +772,130 @@ static void write_selected_commits_v1(struct bitmap_writer *writer, } } +static void write_pseudo_merges(struct bitmap_writer *writer, + struct hashfile *f) +{ + struct oid_array commits = OID_ARRAY_INIT; + struct bitmap **commits_bitmap = NULL; + off_t *pseudo_merge_ofs = NULL; + off_t start, table_start, next_ext; + + uint32_t base = bitmap_writer_nr_selected_commits(writer); + size_t i, j = 0; + + CALLOC_ARRAY(commits_bitmap, writer->pseudo_merges_nr); + CALLOC_ARRAY(pseudo_merge_ofs, writer->pseudo_merges_nr); + + for (i = 0; i < writer->pseudo_merges_nr; i++) { + struct bitmapped_commit *merge = &writer->selected[base + i]; + struct commit_list *p; + + if (!merge->pseudo_merge) + BUG("found non-pseudo merge commit at %"PRIuMAX, (uintmax_t)i); + + commits_bitmap[i] = bitmap_new(); + + for (p = merge->commit->parents; p; p = p->next) + bitmap_set(commits_bitmap[i], + find_object_pos(writer, &p->item->object.oid, + NULL)); + } + + start = hashfile_total(f); + + for (i = 0; i < writer->pseudo_merges_nr; i++) { + struct ewah_bitmap *commits_ewah = bitmap_to_ewah(commits_bitmap[i]); + + pseudo_merge_ofs[i] = hashfile_total(f); + + dump_bitmap(f, commits_ewah); + dump_bitmap(f, writer->selected[base+i].write_as); + + ewah_free(commits_ewah); + } + + next_ext = st_add(hashfile_total(f), + st_mult(kh_size(writer->pseudo_merge_commits), + sizeof(uint64_t))); + + table_start = hashfile_total(f); + + commits.alloc = kh_size(writer->pseudo_merge_commits); + CALLOC_ARRAY(commits.oid, commits.alloc); + + for (i = kh_begin(writer->pseudo_merge_commits); i != kh_end(writer->pseudo_merge_commits); i++) { + if (!kh_exist(writer->pseudo_merge_commits, i)) + continue; + oid_array_append(&commits, &kh_key(writer->pseudo_merge_commits, i)); + } + + oid_array_sort(&commits); + + /* write lookup table (non-extended) */ + for (i = 0; i < commits.nr; i++) { + int hash_pos; + struct pseudo_merge_commit_idx *c; + + hash_pos = kh_get_oid_map(writer->pseudo_merge_commits, + commits.oid[i]); + if (hash_pos == kh_end(writer->pseudo_merge_commits)) + BUG("could not find pseudo-merge commit %s", + oid_to_hex(&commits.oid[i])); + + c = kh_value(writer->pseudo_merge_commits, hash_pos); + + hashwrite_be32(f, find_object_pos(writer, &commits.oid[i], + NULL)); + if (c->nr == 1) + hashwrite_be64(f, pseudo_merge_ofs[c->pseudo_merge[0]]); + else if (c->nr > 1) { + if (next_ext & ((uint64_t)1<<63)) + die(_("too many pseudo-merges")); + hashwrite_be64(f, next_ext | ((uint64_t)1<<63)); + next_ext = st_add3(next_ext, + sizeof(uint32_t), + st_mult(c->nr, sizeof(uint64_t))); + } else + BUG("expected commit '%s' to have at least one " + "pseudo-merge", oid_to_hex(&commits.oid[i])); + } + + /* write lookup table (extended) */ + for (i = 0; i < commits.nr; i++) { + int hash_pos; + struct pseudo_merge_commit_idx *c; + + hash_pos = kh_get_oid_map(writer->pseudo_merge_commits, + commits.oid[i]); + if (hash_pos == kh_end(writer->pseudo_merge_commits)) + BUG("could not find pseudo-merge commit %s", + oid_to_hex(&commits.oid[i])); + + c = kh_value(writer->pseudo_merge_commits, hash_pos); + if (c->nr == 1) + continue; + + hashwrite_be32(f, c->nr); + for (j = 0; j < c->nr; j++) + hashwrite_be64(f, pseudo_merge_ofs[c->pseudo_merge[j]]); + } + + /* write positions for all pseudo merges */ + for (i = 0; i < writer->pseudo_merges_nr; i++) + hashwrite_be64(f, pseudo_merge_ofs[i]); + + hashwrite_be32(f, writer->pseudo_merges_nr); + hashwrite_be32(f, kh_size(writer->pseudo_merge_commits)); + hashwrite_be64(f, table_start - start); + hashwrite_be64(f, hashfile_total(f) - start + sizeof(uint64_t)); + + for (i = 0; i < writer->pseudo_merges_nr; i++) + bitmap_free(commits_bitmap[i]); + + free(pseudo_merge_ofs); + free(commits_bitmap); +} + static int table_cmp(const void *_va, const void *_vb, void *_data) { struct bitmap_writer *writer = _data; @@ -878,6 +1003,9 @@ void bitmap_writer_finish(struct bitmap_writer *writer, int fd = odb_mkstemp(&tmp_file, "pack/tmp_bitmap_XXXXXX"); + if (writer->pseudo_merges_nr) + options |= BITMAP_OPT_PSEUDO_MERGES; + f = hashfd(fd, tmp_file.buf); memcpy(header.magic, BITMAP_IDX_SIGNATURE, sizeof(BITMAP_IDX_SIGNATURE)); @@ -907,6 +1035,9 @@ void bitmap_writer_finish(struct bitmap_writer *writer, write_selected_commits_v1(writer, f, offsets); + if (options & BITMAP_OPT_PSEUDO_MERGES) + write_pseudo_merges(writer, f); + if (options & BITMAP_OPT_LOOKUP_TABLE) write_lookup_table(writer, f, offsets); diff --git a/pack-bitmap.h b/pack-bitmap.h index 1e730ea1e5..db9ae554fa 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -37,6 +37,7 @@ enum pack_bitmap_opts { BITMAP_OPT_FULL_DAG = 0x1, BITMAP_OPT_HASH_CACHE = 0x4, BITMAP_OPT_LOOKUP_TABLE = 0x10, + BITMAP_OPT_PSEUDO_MERGES = 0x20, }; enum pack_bitmap_flags { From 79621f3e4151dfe8ef8fdb2800c59dc65f4867e6 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:49 -0400 Subject: [PATCH 13/26] pack-bitmap: extract `read_bitmap()` function The pack-bitmap machinery uses the `read_bitmap_1()` function to read a bitmap from within the mmap'd region corresponding to the .bitmap file. As as side-effect of calling this function, `read_bitmap_1()` increments the `index->map_pos` variable to reflect the number of bytes read. Extract the core of this routine to a separate function (that operates over a `const unsigned char *`, a `size_t` and a `size_t *` pointer) instead of a `struct bitmap_index *` pointer. This function (called `read_bitmap()`) is part of the pack-bitmap.h API so that it can be used within the upcoming portion of the implementation in pseduo-merge.ch. Rewrite the existing function, `read_bitmap_1()`, in terms of its more generic counterpart. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 24 +++++++++++++++--------- pack-bitmap.h | 2 ++ 2 files changed, 17 insertions(+), 9 deletions(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index 35c5ef9d3c..3519edb896 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -129,17 +129,13 @@ static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st) return composed; } -/* - * Read a bitmap from the current read position on the mmaped - * index, and increase the read position accordingly - */ -static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index) +struct ewah_bitmap *read_bitmap(const unsigned char *map, + size_t map_size, size_t *map_pos) { struct ewah_bitmap *b = ewah_pool_new(); - ssize_t bitmap_size = ewah_read_mmap(b, - index->map + index->map_pos, - index->map_size - index->map_pos); + ssize_t bitmap_size = ewah_read_mmap(b, map + *map_pos, + map_size - *map_pos); if (bitmap_size < 0) { error(_("failed to load bitmap index (corrupted?)")); @@ -147,10 +143,20 @@ static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index) return NULL; } - index->map_pos += bitmap_size; + *map_pos += bitmap_size; + return b; } +/* + * Read a bitmap from the current read position on the mmaped + * index, and increase the read position accordingly + */ +static struct ewah_bitmap *read_bitmap_1(struct bitmap_index *index) +{ + return read_bitmap(index->map, index->map_size, &index->map_pos); +} + static uint32_t bitmap_num_objects(struct bitmap_index *index) { if (index->midx) diff --git a/pack-bitmap.h b/pack-bitmap.h index db9ae554fa..21aabf805e 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -160,4 +160,6 @@ int bitmap_is_preferred_refname(struct repository *r, const char *refname); int verify_bitmap_files(struct repository *r); +struct ewah_bitmap *read_bitmap(const unsigned char *map, + size_t map_size, size_t *map_pos); #endif From 0f81b9cb2c5984ae1b090eb79d139e0a7d9ad8df Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:52 -0400 Subject: [PATCH 14/26] pseudo-merge: scaffolding for reads Implement scaffolding within the new pseudo-merge compilation unit necessary to use the pseudo-merge API from within the pack-bitmap.c machinery. The core of this scaffolding is two-fold: - The `pseudo_merge` structure itself, which represents an individual pseudo-merge bitmap. It has fields for both bitmaps, as well as metadata about its position within the memory-mapped region, and a few extra bits indicating whether or not it is satisfied, and which bitmaps(s, if any) have been read, since they are initialized lazily. - The `pseudo_merge_map` structure, which holds an array of pseudo_merges, as well as a pointer to the memory-mapped region containing the pseudo-merge serialization from within a .bitmap file. Note that the `bitmap_index` structure is defined statically within the pack-bitmap.o compilation unit, so we can't take in a `struct bitmap_index *`. Instead, wrap the primary components necessary to read the pseudo-merges in this new structure to avoid exposing the implementation details of the `bitmap_index` structure. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pseudo-merge.c | 10 ++++++++ pseudo-merge.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 75 insertions(+) diff --git a/pseudo-merge.c b/pseudo-merge.c index 0f6854c753..f0080d53c0 100644 --- a/pseudo-merge.c +++ b/pseudo-merge.c @@ -454,3 +454,13 @@ void select_pseudo_merges(struct bitmap_writer *writer, stop_progress(&progress); } + +void free_pseudo_merge_map(struct pseudo_merge_map *pm) +{ + uint32_t i; + for (i = 0; i < pm->nr; i++) { + ewah_pool_free(pm->v[i].commits); + ewah_pool_free(pm->v[i].bitmap); + } + free(pm->v); +} diff --git a/pseudo-merge.h b/pseudo-merge.h index f809cf42ae..e9216baace 100644 --- a/pseudo-merge.h +++ b/pseudo-merge.h @@ -97,4 +97,69 @@ struct pseudo_merge_commit_idx { void select_pseudo_merges(struct bitmap_writer *writer, struct commit **commits, size_t commits_nr); +/* + * Represents a serialized view of a file containing pseudo-merge(s) + * (see Documentation/technical/bitmap-format.txt for a specification + * of the format). + */ +struct pseudo_merge_map { + /* + * An array of pseudo-merge(s), lazily loaded from the .bitmap + * file. + */ + struct pseudo_merge *v; + size_t nr; + size_t commits_nr; + + /* + * Pointers into a memory-mapped view of the .bitmap file: + * + * - map: the beginning of the .bitmap file + * - commits: the beginning of the pseudo-merge commit index + * - map_size: the size of the .bitmap file + */ + const unsigned char *map; + const unsigned char *commits; + + size_t map_size; +}; + +/* + * An individual pseudo-merge, storing a pair of lazily-loaded + * bitmaps: + * + * - commits: the set of commit(s) that are part of the pseudo-merge + * - bitmap: the set of object(s) reachable from the above set of + * commits. + * + * The `at` and `bitmap_at` fields are used to store the locations of + * each of the above bitmaps in the .bitmap file. + */ +struct pseudo_merge { + struct ewah_bitmap *commits; + struct ewah_bitmap *bitmap; + + off_t at; + off_t bitmap_at; + + /* + * `satisfied` indicates whether the given pseudo-merge has been + * used. + * + * `loaded_commits` and `loaded_bitmap` indicate whether the + * respective bitmaps have been loaded and read from the + * .bitmap file. + */ + unsigned satisfied : 1, + loaded_commits : 1, + loaded_bitmap : 1; +}; + +/* + * Frees the given pseudo-merge map, releasing any memory held by (a) + * parsed EWAH bitmaps, or (b) the array of pseudo-merges itself. Does + * not free the memory-mapped view of the .bitmap file. + */ +void free_pseudo_merge_map(struct pseudo_merge_map *pm); + #endif From 7c0fae88442824cfb4a3160d61a00c54287d56a7 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:55 -0400 Subject: [PATCH 15/26] pack-bitmap.c: read pseudo-merge extension Now that the scaffolding for reading the pseudo-merge extension has been laid, teach the pack-bitmap machinery to read the pseudo-merge extension when present. Note that pseudo-merges themselves are not yet used during traversal, this step will be taken by a future commit. In the meantime, read the table and initialize the pseudo_merge_map structure introduced by a previous commit. When the pseudo-merge extension is present, `load_bitmap_header()` performs basic sanity checks to make sure that the table is well-formed. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/pack-bitmap.c b/pack-bitmap.c index 3519edb896..fc9c3e2fc4 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -20,6 +20,7 @@ #include "list-objects-filter-options.h" #include "midx.h" #include "config.h" +#include "pseudo-merge.h" /* * An entry on the bitmap index, representing the bitmap for a given @@ -86,6 +87,9 @@ struct bitmap_index { */ unsigned char *table_lookup; + /* This contains the pseudo-merge cache within 'map' (if found). */ + struct pseudo_merge_map pseudo_merges; + /* * Extended index. * @@ -205,6 +209,41 @@ static int load_bitmap_header(struct bitmap_index *index) index->table_lookup = (void *)(index_end - table_size); index_end -= table_size; } + + if (flags & BITMAP_OPT_PSEUDO_MERGES) { + unsigned char *pseudo_merge_ofs; + size_t table_size; + uint32_t i; + + if (sizeof(table_size) > index_end - index->map - header_size) + return error(_("corrupted bitmap index file (too short to fit pseudo-merge table header)")); + + table_size = get_be64(index_end - 8); + if (table_size > index_end - index->map - header_size) + return error(_("corrupted bitmap index file (too short to fit pseudo-merge table)")); + + if (git_env_bool("GIT_TEST_USE_PSEUDO_MERGES", 1)) { + const unsigned char *ext = (index_end - table_size); + + index->pseudo_merges.map = index->map; + index->pseudo_merges.map_size = index->map_size; + index->pseudo_merges.commits = ext + get_be64(index_end - 16); + index->pseudo_merges.commits_nr = get_be32(index_end - 20); + index->pseudo_merges.nr = get_be32(index_end - 24); + + CALLOC_ARRAY(index->pseudo_merges.v, + index->pseudo_merges.nr); + + pseudo_merge_ofs = index_end - 24 - + (index->pseudo_merges.nr * sizeof(uint64_t)); + for (i = 0; i < index->pseudo_merges.nr; i++) { + index->pseudo_merges.v[i].at = get_be64(pseudo_merge_ofs); + pseudo_merge_ofs += sizeof(uint64_t); + } + } + + index_end -= table_size; + } } index->entry_count = ntohl(header->entry_count); From 955747b4daaac33a11b1f5362227f1839cff41d3 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:26:58 -0400 Subject: [PATCH 16/26] pseudo-merge: implement support for reading pseudo-merge commits Implement the basic API for reading pseudo-merge bitmaps, which consists of four basic functions: - pseudo_merge_bitmap() - use_pseudo_merge() - apply_pseudo_merges_for_commit() - cascade_pseudo_merges() These functions are all documented in pseudo-merge.h, but their rough descriptions are as follows: - pseudo_merge_bitmap() reads and inflates the objects EWAH bitmap for a given pseudo-merge - use_pseudo_merge() does the same as pseudo_merge_bitmap(), but on the commits EWAH bitmap, not the objects bitmap - apply_pseudo_merges_for_commit() applies all satisfied pseudo-merge commits for a given result set, and cascades any yet-unsatisfied pseudo-merges if any were applied in the previous step - cascade_pseudo_merges() applies all pseudo-merges which are satisfied but have not been previously applied, repeating this process until no more pseudo-merges can be applied The core of the API is the latter two functions, which are responsible for applying pseudo-merges during the object traversal implemented in the pack-bitmap machinery. The other two functions (pseudo_merge_bitmap(), and use_pseudo_merge()) are low-level ways to interact with the pseudo-merge machinery, which will be useful in future commits. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pseudo-merge.c | 235 +++++++++++++++++++++++++++++++++++++++++++++++++ pseudo-merge.h | 44 +++++++++ 2 files changed, 279 insertions(+) diff --git a/pseudo-merge.c b/pseudo-merge.c index f0080d53c0..7d13101149 100644 --- a/pseudo-merge.c +++ b/pseudo-merge.c @@ -10,6 +10,7 @@ #include "commit.h" #include "alloc.h" #include "progress.h" +#include "hex.h" #define DEFAULT_PSEUDO_MERGE_DECAY 1.0 #define DEFAULT_PSEUDO_MERGE_MAX_MERGES 64 @@ -464,3 +465,237 @@ void free_pseudo_merge_map(struct pseudo_merge_map *pm) } free(pm->v); } + +struct pseudo_merge_commit_ext { + uint32_t nr; + const unsigned char *ptr; +}; + +static int pseudo_merge_ext_at(const struct pseudo_merge_map *pm, + struct pseudo_merge_commit_ext *ext, size_t at) +{ + if (at >= pm->map_size) + return error(_("extended pseudo-merge read out-of-bounds " + "(%"PRIuMAX" >= %"PRIuMAX")"), + (uintmax_t)at, (uintmax_t)pm->map_size); + if (at + 4 >= pm->map_size) + return error(_("extended pseudo-merge entry is too short " + "(%"PRIuMAX" >= %"PRIuMAX")"), + (uintmax_t)(at + 4), (uintmax_t)pm->map_size); + + ext->nr = get_be32(pm->map + at); + ext->ptr = pm->map + at + sizeof(uint32_t); + + return 0; +} + +struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm, + struct pseudo_merge *merge) +{ + if (!merge->loaded_commits) + BUG("cannot use unloaded pseudo-merge bitmap"); + + if (!merge->loaded_bitmap) { + size_t at = merge->bitmap_at; + + merge->bitmap = read_bitmap(pm->map, pm->map_size, &at); + merge->loaded_bitmap = 1; + } + + return merge->bitmap; +} + +struct pseudo_merge *use_pseudo_merge(const struct pseudo_merge_map *pm, + struct pseudo_merge *merge) +{ + if (!merge->loaded_commits) { + size_t pos = merge->at; + + merge->commits = read_bitmap(pm->map, pm->map_size, &pos); + merge->bitmap_at = pos; + merge->loaded_commits = 1; + } + return merge; +} + +static struct pseudo_merge *pseudo_merge_at(const struct pseudo_merge_map *pm, + struct object_id *oid, + size_t want) +{ + size_t lo = 0; + size_t hi = pm->nr; + + while (lo < hi) { + size_t mi = lo + (hi - lo) / 2; + size_t got = pm->v[mi].at; + + if (got == want) + return use_pseudo_merge(pm, &pm->v[mi]); + else if (got < want) + hi = mi; + else + lo = mi + 1; + } + + warning(_("could not find pseudo-merge for commit %s at offset %"PRIuMAX), + oid_to_hex(oid), (uintmax_t)want); + + return NULL; +} + +struct pseudo_merge_commit { + uint32_t commit_pos; + uint64_t pseudo_merge_ofs; +}; + +#define PSEUDO_MERGE_COMMIT_RAWSZ (sizeof(uint32_t)+sizeof(uint64_t)) + +static void read_pseudo_merge_commit_at(struct pseudo_merge_commit *merge, + const unsigned char *at) +{ + merge->commit_pos = get_be32(at); + merge->pseudo_merge_ofs = get_be64(at + sizeof(uint32_t)); +} + +static int nth_pseudo_merge_ext(const struct pseudo_merge_map *pm, + struct pseudo_merge_commit_ext *ext, + struct pseudo_merge_commit *merge, + uint32_t n) +{ + size_t ofs; + + if (n >= ext->nr) + return error(_("extended pseudo-merge lookup out-of-bounds " + "(%"PRIu32" >= %"PRIu32")"), n, ext->nr); + + ofs = get_be64(ext->ptr + st_mult(n, sizeof(uint64_t))); + if (ofs >= pm->map_size) + return error(_("out-of-bounds read: (%"PRIuMAX" >= %"PRIuMAX")"), + (uintmax_t)ofs, (uintmax_t)pm->map_size); + + read_pseudo_merge_commit_at(merge, pm->map + ofs); + + return 0; +} + +static unsigned apply_pseudo_merge(const struct pseudo_merge_map *pm, + struct pseudo_merge *merge, + struct bitmap *result, + struct bitmap *roots) +{ + if (merge->satisfied) + return 0; + + if (!ewah_bitmap_is_subset(merge->commits, roots ? roots : result)) + return 0; + + bitmap_or_ewah(result, pseudo_merge_bitmap(pm, merge)); + if (roots) + bitmap_or_ewah(roots, pseudo_merge_bitmap(pm, merge)); + merge->satisfied = 1; + + return 1; +} + +static int pseudo_merge_commit_cmp(const void *va, const void *vb) +{ + struct pseudo_merge_commit merge; + uint32_t key = *(uint32_t*)va; + + read_pseudo_merge_commit_at(&merge, vb); + + if (key < merge.commit_pos) + return -1; + if (key > merge.commit_pos) + return 1; + return 0; +} + +static struct pseudo_merge_commit *find_pseudo_merge(const struct pseudo_merge_map *pm, + uint32_t pos) +{ + if (!pm->commits_nr) + return NULL; + + return bsearch(&pos, pm->commits, pm->commits_nr, + PSEUDO_MERGE_COMMIT_RAWSZ, pseudo_merge_commit_cmp); +} + +int apply_pseudo_merges_for_commit(const struct pseudo_merge_map *pm, + struct bitmap *result, + struct commit *commit, uint32_t commit_pos) +{ + struct pseudo_merge *merge; + struct pseudo_merge_commit *merge_commit; + int ret = 0; + + merge_commit = find_pseudo_merge(pm, commit_pos); + if (!merge_commit) + return 0; + + if (merge_commit->pseudo_merge_ofs & ((uint64_t)1<<63)) { + struct pseudo_merge_commit_ext ext = { 0 }; + off_t ofs = merge_commit->pseudo_merge_ofs & ~((uint64_t)1<<63); + uint32_t i; + + if (pseudo_merge_ext_at(pm, &ext, ofs) < -1) { + warning(_("could not read extended pseudo-merge table " + "for commit %s"), + oid_to_hex(&commit->object.oid)); + return ret; + } + + for (i = 0; i < ext.nr; i++) { + if (nth_pseudo_merge_ext(pm, &ext, merge_commit, i) < 0) + return ret; + + merge = pseudo_merge_at(pm, &commit->object.oid, + merge_commit->pseudo_merge_ofs); + + if (!merge) + return ret; + + if (apply_pseudo_merge(pm, merge, result, NULL)) + ret++; + } + } else { + merge = pseudo_merge_at(pm, &commit->object.oid, + merge_commit->pseudo_merge_ofs); + + if (!merge) + return ret; + + if (apply_pseudo_merge(pm, merge, result, NULL)) + ret++; + } + + if (ret) + cascade_pseudo_merges(pm, result, NULL); + + return ret; +} + +int cascade_pseudo_merges(const struct pseudo_merge_map *pm, + struct bitmap *result, + struct bitmap *roots) +{ + unsigned any_satisfied; + int ret = 0; + + do { + struct pseudo_merge *merge; + uint32_t i; + + any_satisfied = 0; + + for (i = 0; i < pm->nr; i++) { + merge = use_pseudo_merge(pm, &pm->v[i]); + if (apply_pseudo_merge(pm, merge, result, roots)) { + any_satisfied |= 1; + ret++; + } + } + } while (any_satisfied); + + return ret; +} diff --git a/pseudo-merge.h b/pseudo-merge.h index e9216baace..755edc054a 100644 --- a/pseudo-merge.h +++ b/pseudo-merge.h @@ -162,4 +162,48 @@ struct pseudo_merge { */ void free_pseudo_merge_map(struct pseudo_merge_map *pm); +/* + * Loads the bitmap corresponding to the given pseudo-merge from the + * map, if it has not already been loaded. + */ +struct ewah_bitmap *pseudo_merge_bitmap(const struct pseudo_merge_map *pm, + struct pseudo_merge *merge); + +/* + * Loads the pseudo-merge and its commits bitmap from the given + * pseudo-merge map, if it has not already been loaded. + */ +struct pseudo_merge *use_pseudo_merge(const struct pseudo_merge_map *pm, + struct pseudo_merge *merge); + +/* + * Applies pseudo-merge(s) containing the given commit to the bitmap + * "result". + * + * If any pseudo-merge(s) were satisfied, returns the number + * satisfied, otherwise returns 0. If any were satisfied, the + * remaining unsatisfied pseudo-merges are cascaded (see below). + */ +int apply_pseudo_merges_for_commit(const struct pseudo_merge_map *pm, + struct bitmap *result, + struct commit *commit, uint32_t commit_pos); + +/* + * Applies pseudo-merge(s) which are satisfied according to the + * current bitmap in result (or roots, see below). If any + * pseudo-merges were satisfied, repeat the process over unsatisfied + * pseudo-merge commits until no more pseudo-merges are satisfied. + * + * Result is the bitmap to which the pseudo-merge(s) are applied. + * Roots (if given) is a bitmap of the traversal tip(s) for either + * side of a reachability traversal. + * + * Roots may given instead of a populated results bitmap at the + * beginning of a traversal on either side where the reachability + * closure over tips is not yet known. + */ +int cascade_pseudo_merges(const struct pseudo_merge_map *pm, + struct bitmap *result, + struct bitmap *roots); + #endif From 0481cbf912e23d96076e3a7e149e76a0327a7a70 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:02 -0400 Subject: [PATCH 17/26] ewah: implement `ewah_bitmap_popcount()` Some of the pseudo-merge test helpers (which will be introduced in the following commit) will want to indicate the total number of commits in or objects reachable from a pseudo-merge. Implement a popcount() function that operates on EWAH bitmaps to quickly determine how many bits are set in each of the respective bitmaps. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- ewah/bitmap.c | 14 ++++++++++++++ ewah/ewok.h | 1 + 2 files changed, 15 insertions(+) diff --git a/ewah/bitmap.c b/ewah/bitmap.c index d352fec54c..dc2ca190f1 100644 --- a/ewah/bitmap.c +++ b/ewah/bitmap.c @@ -212,6 +212,20 @@ size_t bitmap_popcount(struct bitmap *self) return count; } +size_t ewah_bitmap_popcount(struct ewah_bitmap *self) +{ + struct ewah_iterator it; + eword_t word; + size_t count = 0; + + ewah_iterator_init(&it, self); + + while (ewah_iterator_next(&word, &it)) + count += ewah_bit_popcount64(word); + + return count; +} + int bitmap_is_empty(struct bitmap *self) { size_t i; diff --git a/ewah/ewok.h b/ewah/ewok.h index 2b6c4ac499..7074a6347b 100644 --- a/ewah/ewok.h +++ b/ewah/ewok.h @@ -195,6 +195,7 @@ void bitmap_or_ewah(struct bitmap *self, struct ewah_bitmap *other); void bitmap_or(struct bitmap *self, const struct bitmap *other); size_t bitmap_popcount(struct bitmap *self); +size_t ewah_bitmap_popcount(struct ewah_bitmap *self); int bitmap_is_empty(struct bitmap *self); #endif From 71eca9ab7945b23c057c14259a99c3c1bd6872f9 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:05 -0400 Subject: [PATCH 18/26] pack-bitmap: implement test helpers for pseudo-merge Implement three new sub-commands for the "bitmap" test-helper: - t/helper test-tool bitmap dump-pseudo-merges - t/helper test-tool bitmap dump-pseudo-merge-commits - t/helper test-tool bitmap dump-pseudo-merge-objects These three helpers dump the list of pseudo merges, the "parents" of the nth pseudo-merges, and the set of objects reachable from those parents, respectively. These helpers will be useful in subsequent patches when we add test coverage for pseudo-merge bitmaps. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 126 +++++++++++++++++++++++++++++++++++++++++ pack-bitmap.h | 3 + t/helper/test-bitmap.c | 34 ++++++++--- 3 files changed, 156 insertions(+), 7 deletions(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index fc9c3e2fc4..c13074673a 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -2443,6 +2443,132 @@ int test_bitmap_hashes(struct repository *r) return 0; } +static void bit_pos_to_object_id(struct bitmap_index *bitmap_git, + uint32_t bit_pos, + struct object_id *oid) +{ + uint32_t index_pos; + + if (bitmap_is_midx(bitmap_git)) + index_pos = pack_pos_to_midx(bitmap_git->midx, bit_pos); + else + index_pos = pack_pos_to_index(bitmap_git->pack, bit_pos); + + nth_bitmap_object_oid(bitmap_git, oid, index_pos); +} + +int test_bitmap_pseudo_merges(struct repository *r) +{ + struct bitmap_index *bitmap_git; + uint32_t i; + + bitmap_git = prepare_bitmap_git(r); + if (!bitmap_git || !bitmap_git->pseudo_merges.nr) + goto cleanup; + + for (i = 0; i < bitmap_git->pseudo_merges.nr; i++) { + struct pseudo_merge *merge; + struct ewah_bitmap *commits_bitmap, *merge_bitmap; + + merge = use_pseudo_merge(&bitmap_git->pseudo_merges, + &bitmap_git->pseudo_merges.v[i]); + commits_bitmap = merge->commits; + merge_bitmap = pseudo_merge_bitmap(&bitmap_git->pseudo_merges, + merge); + + printf("at=%"PRIuMAX", commits=%"PRIuMAX", objects=%"PRIuMAX"\n", + (uintmax_t)merge->at, + (uintmax_t)ewah_bitmap_popcount(commits_bitmap), + (uintmax_t)ewah_bitmap_popcount(merge_bitmap)); + } + +cleanup: + free_bitmap_index(bitmap_git); + return 0; +} + +static void dump_ewah_object_ids(struct bitmap_index *bitmap_git, + struct ewah_bitmap *bitmap) + +{ + struct ewah_iterator it; + eword_t word; + uint32_t pos = 0; + + ewah_iterator_init(&it, bitmap); + + while (ewah_iterator_next(&word, &it)) { + struct object_id oid; + uint32_t offset; + + for (offset = 0; offset < BITS_IN_EWORD; offset++) { + if (!(word >> offset)) + break; + + offset += ewah_bit_ctz64(word >> offset); + + bit_pos_to_object_id(bitmap_git, pos + offset, &oid); + printf("%s\n", oid_to_hex(&oid)); + } + pos += BITS_IN_EWORD; + } +} + +int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n) +{ + struct bitmap_index *bitmap_git; + struct pseudo_merge *merge; + int ret = 0; + + bitmap_git = prepare_bitmap_git(r); + if (!bitmap_git || !bitmap_git->pseudo_merges.nr) + goto cleanup; + + if (n >= bitmap_git->pseudo_merges.nr) { + ret = error(_("pseudo-merge index out of range " + "(%"PRIu32" >= %"PRIuMAX")"), + n, (uintmax_t)bitmap_git->pseudo_merges.nr); + goto cleanup; + } + + merge = use_pseudo_merge(&bitmap_git->pseudo_merges, + &bitmap_git->pseudo_merges.v[n]); + dump_ewah_object_ids(bitmap_git, merge->commits); + +cleanup: + free_bitmap_index(bitmap_git); + return ret; +} + +int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n) +{ + struct bitmap_index *bitmap_git; + struct pseudo_merge *merge; + int ret = 0; + + bitmap_git = prepare_bitmap_git(r); + if (!bitmap_git || !bitmap_git->pseudo_merges.nr) + goto cleanup; + + if (n >= bitmap_git->pseudo_merges.nr) { + ret = error(_("pseudo-merge index out of range " + "(%"PRIu32" >= %"PRIuMAX")"), + n, (uintmax_t)bitmap_git->pseudo_merges.nr); + goto cleanup; + } + + merge = use_pseudo_merge(&bitmap_git->pseudo_merges, + &bitmap_git->pseudo_merges.v[n]); + + dump_ewah_object_ids(bitmap_git, + pseudo_merge_bitmap(&bitmap_git->pseudo_merges, + merge)); + +cleanup: + free_bitmap_index(bitmap_git); + return ret; +} + int rebuild_bitmap(const uint32_t *reposition, struct ewah_bitmap *source, struct bitmap *dest) diff --git a/pack-bitmap.h b/pack-bitmap.h index 21aabf805e..4466b5ad0f 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -73,6 +73,9 @@ void traverse_bitmap_commit_list(struct bitmap_index *, void test_bitmap_walk(struct rev_info *revs); int test_bitmap_commits(struct repository *r); int test_bitmap_hashes(struct repository *r); +int test_bitmap_pseudo_merges(struct repository *r); +int test_bitmap_pseudo_merge_commits(struct repository *r, uint32_t n); +int test_bitmap_pseudo_merge_objects(struct repository *r, uint32_t n); #define GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL \ "GIT_TEST_PACK_USE_BITMAP_BOUNDARY_TRAVERSAL" diff --git a/t/helper/test-bitmap.c b/t/helper/test-bitmap.c index af43ee1cb5..6af2b42678 100644 --- a/t/helper/test-bitmap.c +++ b/t/helper/test-bitmap.c @@ -13,21 +13,41 @@ static int bitmap_dump_hashes(void) return test_bitmap_hashes(the_repository); } +static int bitmap_dump_pseudo_merges(void) +{ + return test_bitmap_pseudo_merges(the_repository); +} + +static int bitmap_dump_pseudo_merge_commits(uint32_t n) +{ + return test_bitmap_pseudo_merge_commits(the_repository, n); +} + +static int bitmap_dump_pseudo_merge_objects(uint32_t n) +{ + return test_bitmap_pseudo_merge_objects(the_repository, n); +} + int cmd__bitmap(int argc, const char **argv) { setup_git_directory(); - if (argc != 2) - goto usage; - - if (!strcmp(argv[1], "list-commits")) + if (argc == 2 && !strcmp(argv[1], "list-commits")) return bitmap_list_commits(); - if (!strcmp(argv[1], "dump-hashes")) + if (argc == 2 && !strcmp(argv[1], "dump-hashes")) return bitmap_dump_hashes(); + if (argc == 2 && !strcmp(argv[1], "dump-pseudo-merges")) + return bitmap_dump_pseudo_merges(); + if (argc == 3 && !strcmp(argv[1], "dump-pseudo-merge-commits")) + return bitmap_dump_pseudo_merge_commits(atoi(argv[2])); + if (argc == 3 && !strcmp(argv[1], "dump-pseudo-merge-objects")) + return bitmap_dump_pseudo_merge_objects(atoi(argv[2])); -usage: usage("\ttest-tool bitmap list-commits\n" - "\ttest-tool bitmap dump-hashes"); + "\ttest-tool bitmap dump-hashes\n" + "\ttest-tool bitmap dump-pseudo-merges\n" + "\ttest-tool bitmap dump-pseudo-merge-commits \n" + "\ttest-tool bitmap dump-pseudo-merge-objects "); return -1; } From 8e41468ef3b0b8435918bae2b5bda2a5d3facdca Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:08 -0400 Subject: [PATCH 19/26] t/test-lib-functions.sh: support `--notick` in `test_commit_bulk()` One of the tests we'll want to add for pseudo-merge bitmaps needs to be able to generate a large number of commits at a specific date. Support the `--notick` option (with identical semantics to the `--notick` option for `test_commit()`) within `test_commit_bulk` as a prerequisite for that. Callers can then set the various _DATE variables themselves. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- t/test-lib-functions.sh | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh index 862d80c974..427b375b39 100644 --- a/t/test-lib-functions.sh +++ b/t/test-lib-functions.sh @@ -458,6 +458,7 @@ test_commit_bulk () { indir=. ref=HEAD n=1 + notick= message='commit %s' filename='%s.t' contents='content %s' @@ -488,6 +489,9 @@ test_commit_bulk () { filename="${1#--*=}-%s.t" contents="${1#--*=} %s" ;; + --notick) + notick=yes + ;; -*) BUG "invalid test_commit_bulk option: $1" ;; @@ -507,7 +511,10 @@ test_commit_bulk () { while test "$total" -gt 0 do - test_tick && + if test -z "$notick" + then + test_tick + fi && echo "commit $ref" printf 'author %s <%s> %s\n' \ "$GIT_AUTHOR_NAME" \ From 11d45a6e6a0639705b7ccf767e897e1870ece89b Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:11 -0400 Subject: [PATCH 20/26] pack-bitmap.c: use pseudo-merges during traversal Now that all of the groundwork has been laid to support reading and using pseudo-merges, make use of that work in this commit by teaching the pack-bitmap machinery to use pseudo-merge(s) when available during traversal. The basic operation is as follows: - When enumerating objects on either side of a reachability query, first see if any subset of the roots satisfies some pseudo-merge bitmap. If it does, apply that pseudo-merge bitmap. - If any pseudo-merge bitmap(s) were applied in the previous step, OR them into the result[^1]. Then repeat the process over all pseudo-merge bitmaps (we'll refer to this as "cascading" pseudo-merges). Once this is done, OR in the resulting bitmap. - If there is no fill-in traversal to be done, return the bitmap for that side of the reachability query. If there is fill-in traversal, then for each commit we encounter via show_commit(), check to see if any unsatisfied pseudo-merges containing that commit as one of its parents has been made satisfied by the presence of that commit. If so, OR in the object set from that pseudo-merge bitmap, and then cascade. If not, continue traversal. A similar implementation is present in the boundary-based bitmap traversal routines. [^1]: Importantly, we cannot OR in the entire set of roots along with the objects reachable from whatever pseudo-merge bitmaps were satisfied. This may leave some dangling bits corresponding to any unsatisfied root(s) getting OR'd into the resulting bitmap, tricking other parts of the traversal into thinking we already have a reachability closure over those commit(s) when we do not. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 112 ++++++++++- t/t5333-pseudo-merge-bitmaps.sh | 328 ++++++++++++++++++++++++++++++++ 2 files changed, 439 insertions(+), 1 deletion(-) create mode 100755 t/t5333-pseudo-merge-bitmaps.sh diff --git a/pack-bitmap.c b/pack-bitmap.c index c13074673a..e61058dada 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -114,6 +114,9 @@ struct bitmap_index { unsigned int version; }; +static int pseudo_merges_satisfied_nr; +static int pseudo_merges_cascades_nr; + static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st) { struct ewah_bitmap *parent; @@ -1006,6 +1009,22 @@ static void show_commit(struct commit *commit UNUSED, { } +static unsigned apply_pseudo_merges_for_commit_1(struct bitmap_index *bitmap_git, + struct bitmap *result, + struct commit *commit, + uint32_t commit_pos) +{ + int ret; + + ret = apply_pseudo_merges_for_commit(&bitmap_git->pseudo_merges, + result, commit, commit_pos); + + if (ret) + pseudo_merges_satisfied_nr += ret; + + return ret; +} + static int add_to_include_set(struct bitmap_index *bitmap_git, struct include_data *data, struct commit *commit, @@ -1026,6 +1045,10 @@ static int add_to_include_set(struct bitmap_index *bitmap_git, } bitmap_set(data->base, bitmap_pos); + if (apply_pseudo_merges_for_commit_1(bitmap_git, data->base, commit, + bitmap_pos)) + return 0; + return 1; } @@ -1151,6 +1174,20 @@ static void show_boundary_object(struct object *object UNUSED, BUG("should not be called"); } +static unsigned cascade_pseudo_merges_1(struct bitmap_index *bitmap_git, + struct bitmap *result, + struct bitmap *roots) +{ + int ret = cascade_pseudo_merges(&bitmap_git->pseudo_merges, + result, roots); + if (ret) { + pseudo_merges_cascades_nr++; + pseudo_merges_satisfied_nr += ret; + } + + return ret; +} + static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, struct rev_info *revs, struct object_list *roots) @@ -1160,6 +1197,7 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, unsigned int i; unsigned int tmp_blobs, tmp_trees, tmp_tags; int any_missing = 0; + int existing_bitmaps = 0; cb.bitmap_git = bitmap_git; cb.base = bitmap_new(); @@ -1167,6 +1205,25 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, revs->ignore_missing_links = 1; + if (bitmap_git->pseudo_merges.nr) { + struct bitmap *roots_bitmap = bitmap_new(); + struct object_list *objects = NULL; + + for (objects = roots; objects; objects = objects->next) { + struct object *object = objects->item; + int pos; + + pos = bitmap_position(bitmap_git, &object->oid); + if (pos < 0) + continue; + + bitmap_set(roots_bitmap, pos); + } + + if (!cascade_pseudo_merges_1(bitmap_git, cb.base, roots_bitmap)) + bitmap_free(roots_bitmap); + } + /* * OR in any existing reachability bitmaps among `roots` into * `cb.base`. @@ -1178,8 +1235,10 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, continue; if (add_commit_to_bitmap(bitmap_git, &cb.base, - (struct commit *)object)) + (struct commit *)object)) { + existing_bitmaps = 1; continue; + } any_missing = 1; } @@ -1187,6 +1246,9 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, if (!any_missing) goto cleanup; + if (existing_bitmaps) + cascade_pseudo_merges_1(bitmap_git, cb.base, NULL); + tmp_blobs = revs->blob_objects; tmp_trees = revs->tree_objects; tmp_tags = revs->blob_objects; @@ -1242,6 +1304,13 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, return cb.base; } +static void unsatisfy_all_pseudo_merges(struct bitmap_index *bitmap_git) +{ + uint32_t i; + for (i = 0; i < bitmap_git->pseudo_merges.nr; i++) + bitmap_git->pseudo_merges.v[i].satisfied = 0; +} + static struct bitmap *find_objects(struct bitmap_index *bitmap_git, struct rev_info *revs, struct object_list *roots, @@ -1249,9 +1318,32 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git, { struct bitmap *base = NULL; int needs_walk = 0; + unsigned existing_bitmaps = 0; struct object_list *not_mapped = NULL; + unsatisfy_all_pseudo_merges(bitmap_git); + + if (bitmap_git->pseudo_merges.nr) { + struct bitmap *roots_bitmap = bitmap_new(); + struct object_list *objects = NULL; + + for (objects = roots; objects; objects = objects->next) { + struct object *object = objects->item; + int pos; + + pos = bitmap_position(bitmap_git, &object->oid); + if (pos < 0) + continue; + + bitmap_set(roots_bitmap, pos); + } + + base = bitmap_new(); + if (!cascade_pseudo_merges_1(bitmap_git, base, roots_bitmap)) + bitmap_free(roots_bitmap); + } + /* * Go through all the roots for the walk. The ones that have bitmaps * on the bitmap index will be `or`ed together to form an initial @@ -1262,11 +1354,21 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git, */ while (roots) { struct object *object = roots->item; + roots = roots->next; + if (base) { + int pos = bitmap_position(bitmap_git, &object->oid); + if (pos > 0 && bitmap_get(base, pos)) { + object->flags |= SEEN; + continue; + } + } + if (object->type == OBJ_COMMIT && add_commit_to_bitmap(bitmap_git, &base, (struct commit *)object)) { object->flags |= SEEN; + existing_bitmaps = 1; continue; } @@ -1282,6 +1384,9 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git, roots = not_mapped; + if (existing_bitmaps) + cascade_pseudo_merges_1(bitmap_git, base, NULL); + /* * Let's iterate through all the roots that don't have bitmaps to * check if we can determine them to be reachable from the existing @@ -1866,6 +1971,11 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs, object_list_free(&wants); object_list_free(&haves); + trace2_data_intmax("bitmap", the_repository, "pseudo_merges_satisfied", + pseudo_merges_satisfied_nr); + trace2_data_intmax("bitmap", the_repository, "pseudo_merges_cascades", + pseudo_merges_cascades_nr); + return bitmap_git; cleanup: diff --git a/t/t5333-pseudo-merge-bitmaps.sh b/t/t5333-pseudo-merge-bitmaps.sh new file mode 100755 index 0000000000..4c9aebcffd --- /dev/null +++ b/t/t5333-pseudo-merge-bitmaps.sh @@ -0,0 +1,328 @@ +#!/bin/sh + +test_description='pseudo-merge bitmaps' + +GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP=0 + +. ./test-lib.sh + +test_pseudo_merges () { + test-tool bitmap dump-pseudo-merges +} + +test_pseudo_merge_commits () { + test-tool bitmap dump-pseudo-merge-commits "$1" +} + +test_pseudo_merges_satisfied () { + test_trace2_data bitmap pseudo_merges_satisfied "$1" +} + +test_pseudo_merges_cascades () { + test_trace2_data bitmap pseudo_merges_cascades "$1" +} + +tag_everything () { + git rev-list --all --no-object-names >in && + perl -lne ' + print "create refs/tags/" . $. . " " . $1 if /([0-9a-f]+)/ + ' expect && + + : >trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt \ + git rev-list --count --all --objects --use-bitmap-index >actual && + + test_pseudo_merges_satisfied 0 merges && + test_must_be_empty merges && + test_cmp expect actual +' + +test_expect_success 'pseudo-merges accurately represent their objects' ' + test_config bitmapPseudoMerge.test.pattern "refs/tags/" && + test_config bitmapPseudoMerge.test.maxMerges 8 && + test_config bitmapPseudoMerge.test.stableThreshold never && + + git repack -adb && + + test_pseudo_merges >merges && + test_line_count = 8 merges && + + for i in $(test_seq 0 $(($(wc -l commits && + + git rev-list --objects --no-object-names --stdin expect.raw && + test-tool bitmap dump-pseudo-merge-objects $i >actual.raw && + + sort -u expect && + sort -u actual && + + test_cmp expect actual || return 1 + done +' + +test_expect_success 'bitmap traversal with pseudo-merges' ' + : >trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt \ + git rev-list --count --all --objects --use-bitmap-index >actual && + git rev-list --count --all --objects >expect && + + test_pseudo_merges_satisfied 8 trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt \ + git rev-list --count --all --objects --use-bitmap-index >actual && + git rev-list --count --all --objects >expect && + + test_pseudo_merges_satisfied 8 merges && + test_line_count = 1 merges && + test_pseudo_merge_commits 0 >commits && + + test-tool bitmap list-commits >bitmaps && + bitmaps_nr="$(wc -l expect && + + test $(cat expect) -eq $(wc -l merges && + test_line_count = 1 merges && + + test_pseudo_merge_commits 0 >oids && + git cat-file --batch commits && + + test $(wc -l in && + git update-ref --stdin merges && + merges_nr="$(wc -l oids && + git cat-file --batch commits && + + expect="$(grep -c "^committer.*$old +0000$" commits)" && + actual="$(wc -l oids && + git cat-file --batch commits && + test $(wc -l err && + + cat >expect <<-EOF && + fatal: pseudo-merge group ${SQ}test${SQ} has unstable threshold before stable one + EOF + + test_cmp expect err +' + +test_expect_success 'pseudo-merge pattern with capture groups' ' + git init pseudo-merge-captures && + ( + cd pseudo-merge-captures && + + test_commit_bulk 128 && + tag_everything && + + for r in $(test_seq 8) + do + test_commit_bulk 16 && + + git rev-list HEAD~16.. >in && + + perl -lne "print \"create refs/remotes/$r/tags/\$. \$_\"" refs && + + test_pseudo_merges >merges && + for m in $(test_seq 0 $(($(wc -l oids && + grep -f oids refs | + perl -lne "print \$1 if /refs\/remotes\/([0-9]+)/" | + sort -u || return 1 + done >remotes && + + test $(wc -l merges && + test_line_count = 2 merges && + + test_pseudo_merge_commits 0 >commits-0.raw && + test_pseudo_merge_commits 1 >commits-1.raw && + + sort commits-0.raw >commits-0 && + sort commits-1.raw >commits-1 && + + comm -12 commits-0 commits-1 >overlap && + + test_line_count -gt 0 overlap + ) +' + +test_expect_success 'pseudo-merge overlap traversal' ' + ( + cd pseudo-merge-overlap && + + : >trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt \ + git rev-list --count --all --objects --use-bitmap-index >actual && + git rev-list --count --all --objects >expect && + + test_pseudo_merges_satisfied 2 trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt \ + git rev-list --count --all --objects --use-bitmap-index >actual && + git rev-list --count --all --objects >expect && + + test_pseudo_merges_satisfied 2 Date: Thu, 23 May 2024 17:27:15 -0400 Subject: [PATCH 21/26] pack-bitmap: extra trace2 information Add some extra trace2 lines to capture the number of bitmap lookups that are hits versus misses, as well as the number of reachability roots that have bitmap coverage (versus those that do not). Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/pack-bitmap.c b/pack-bitmap.c index e61058dada..1966b3b95f 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -116,6 +116,10 @@ struct bitmap_index { static int pseudo_merges_satisfied_nr; static int pseudo_merges_cascades_nr; +static int existing_bitmaps_hits_nr; +static int existing_bitmaps_misses_nr; +static int roots_with_bitmaps_nr; +static int roots_without_bitmaps_nr; static struct ewah_bitmap *lookup_stored_bitmap(struct stored_bitmap *st) { @@ -1040,10 +1044,14 @@ static int add_to_include_set(struct bitmap_index *bitmap_git, partial = bitmap_for_commit(bitmap_git, commit); if (partial) { + existing_bitmaps_hits_nr++; + bitmap_or_ewah(data->base, partial); return 0; } + existing_bitmaps_misses_nr++; + bitmap_set(data->base, bitmap_pos); if (apply_pseudo_merges_for_commit_1(bitmap_git, data->base, commit, bitmap_pos)) @@ -1099,8 +1107,12 @@ static int add_commit_to_bitmap(struct bitmap_index *bitmap_git, { struct ewah_bitmap *or_with = bitmap_for_commit(bitmap_git, commit); - if (!or_with) + if (!or_with) { + existing_bitmaps_misses_nr++; return 0; + } + + existing_bitmaps_hits_nr++; if (!*base) *base = ewah_to_bitmap(or_with); @@ -1407,8 +1419,12 @@ static struct bitmap *find_objects(struct bitmap_index *bitmap_git, object->flags &= ~UNINTERESTING; add_pending_object(revs, object, ""); needs_walk = 1; + + roots_without_bitmaps_nr++; } else { object->flags |= SEEN; + + roots_with_bitmaps_nr++; } } @@ -1975,6 +1991,14 @@ struct bitmap_index *prepare_bitmap_walk(struct rev_info *revs, pseudo_merges_satisfied_nr); trace2_data_intmax("bitmap", the_repository, "pseudo_merges_cascades", pseudo_merges_cascades_nr); + trace2_data_intmax("bitmap", the_repository, "bitmap/hits", + existing_bitmaps_hits_nr); + trace2_data_intmax("bitmap", the_repository, "bitmap/misses", + existing_bitmaps_misses_nr); + trace2_data_intmax("bitmap", the_repository, "bitmap/roots_with_bitmap", + roots_with_bitmaps_nr); + trace2_data_intmax("bitmap", the_repository, "bitmap/roots_without_bitmap", + roots_without_bitmaps_nr); return bitmap_git; From 94c1addf862619250a1584c05a43f80f1c8d7d72 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:18 -0400 Subject: [PATCH 22/26] ewah: `bitmap_equals_ewah()` Prepare to reuse existing pseudo-merge bitmaps by implementing a `bitmap_equals_ewah()` helper. This helper will be used to see if a raw bitmap (containing the set of parents for some pseudo-merge) is equal to any existing pseudo-merge's commits bitmap (which are stored as EWAH-compressed bitmaps on disk). Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- ewah/bitmap.c | 19 +++++++++++++++++++ ewah/ewok.h | 1 + 2 files changed, 20 insertions(+) diff --git a/ewah/bitmap.c b/ewah/bitmap.c index dc2ca190f1..55928dada8 100644 --- a/ewah/bitmap.c +++ b/ewah/bitmap.c @@ -261,6 +261,25 @@ int bitmap_equals(struct bitmap *self, struct bitmap *other) return 1; } +int bitmap_equals_ewah(struct bitmap *self, struct ewah_bitmap *other) +{ + struct ewah_iterator it; + eword_t word; + size_t i = 0; + + ewah_iterator_init(&it, other); + + while (ewah_iterator_next(&word, &it)) + if (word != (i < self->word_alloc ? self->words[i++] : 0)) + return 0; + + for (; i < self->word_alloc; i++) + if (self->words[i]) + return 0; + + return 1; +} + int bitmap_is_subset(struct bitmap *self, struct bitmap *other) { size_t common_size, i; diff --git a/ewah/ewok.h b/ewah/ewok.h index 7074a6347b..5e357e2493 100644 --- a/ewah/ewok.h +++ b/ewah/ewok.h @@ -179,6 +179,7 @@ void bitmap_unset(struct bitmap *self, size_t pos); int bitmap_get(struct bitmap *self, size_t pos); void bitmap_free(struct bitmap *self); int bitmap_equals(struct bitmap *self, struct bitmap *other); +int bitmap_equals_ewah(struct bitmap *self, struct ewah_bitmap *other); /* * Both `bitmap_is_subset()` and `ewah_bitmap_is_subset()` return 1 if the set From 7252d9a036fabb10f60dc09937fc39e252f3c4d4 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:21 -0400 Subject: [PATCH 23/26] pseudo-merge: implement support for finding existing merges This patch implements support for reusing existing pseudo-merge commits when writing bitmaps when there is an existing pseudo-merge bitmap which has exactly the same set of parents as one that we are about to write. Note that unstable pseudo-merges are likely to change between consecutive repacks, and so are generally poor candidates for reuse. However, stable pseudo-merges (see the configuration option 'bitmapPseudoMerge..stableThreshold') are by definition unlikely to change between runs (as they represent long-running branches). Because there is no index from a *set* of pseudo-merge parents to a matching pseudo-merge bitmap, we have to construct the bitmap corresponding to the set of parents for each pending pseudo-merge commit and see if a matching bitmap exists. This is technically quadratic in the number of pseudo-merges, but is OK in practice for a couple of reasons: - non-matching pseudo-merge bitmaps are rejected quickly as soon as they differ in a single bit - already-matched pseudo-merge bitmaps are discarded from subsequent rounds of search - the number of pseudo-merges is generally small, even for large repositories In order to do this, implement (a) a function that finds a matching pseudo-merge given some uncompressed bitset describing its parents, (b) a function that computes the bitset of parents for a given pseudo-merge commit, and (c) call that function before computing the set of reachable objects for some pending pseudo-merge. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap-write.c | 15 +++++++- pack-bitmap.c | 32 ++++++++++++++++ pack-bitmap.h | 2 + pseudo-merge.c | 55 ++++++++++++++++++++++++++++ pseudo-merge.h | 7 ++++ t/t5333-pseudo-merge-bitmaps.sh | 65 +++++++++++++++++++++++++++++++++ 6 files changed, 174 insertions(+), 2 deletions(-) diff --git a/pack-bitmap-write.c b/pack-bitmap-write.c index 47250398aa..6e8060f8a0 100644 --- a/pack-bitmap-write.c +++ b/pack-bitmap-write.c @@ -19,6 +19,10 @@ #include "tree-walk.h" #include "pseudo-merge.h" #include "oid-array.h" +#include "config.h" +#include "alloc.h" +#include "refs.h" +#include "strmap.h" struct bitmapped_commit { struct commit *commit; @@ -465,6 +469,7 @@ static int fill_bitmap_tree(struct bitmap_writer *writer, } static int reused_bitmaps_nr; +static int reused_pseudo_merge_bitmaps_nr; static int fill_bitmap_commit(struct bitmap_writer *writer, struct bb_commit *ent, @@ -490,7 +495,7 @@ static int fill_bitmap_commit(struct bitmap_writer *writer, struct bitmap *remapped = bitmap_new(); if (commit->object.flags & BITMAP_PSEUDO_MERGE) - old = NULL; + old = pseudo_merge_bitmap_for_commit(old_bitmap, c); else old = bitmap_for_commit(old_bitmap, c); /* @@ -501,7 +506,10 @@ static int fill_bitmap_commit(struct bitmap_writer *writer, if (old && !rebuild_bitmap(mapping, old, remapped)) { bitmap_or(ent->bitmap, remapped); bitmap_free(remapped); - reused_bitmaps_nr++; + if (commit->object.flags & BITMAP_PSEUDO_MERGE) + reused_pseudo_merge_bitmaps_nr++; + else + reused_bitmaps_nr++; continue; } bitmap_free(remapped); @@ -631,6 +639,9 @@ int bitmap_writer_build(struct bitmap_writer *writer, the_repository); trace2_data_intmax("pack-bitmap-write", the_repository, "building_bitmaps_reused", reused_bitmaps_nr); + trace2_data_intmax("pack-bitmap-write", the_repository, + "building_bitmaps_pseudo_merge_reused", + reused_pseudo_merge_bitmaps_nr); stop_progress(&writer->progress); diff --git a/pack-bitmap.c b/pack-bitmap.c index 1966b3b95f..70230e2647 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -1316,6 +1316,37 @@ static struct bitmap *find_boundary_objects(struct bitmap_index *bitmap_git, return cb.base; } +struct ewah_bitmap *pseudo_merge_bitmap_for_commit(struct bitmap_index *bitmap_git, + struct commit *commit) +{ + struct commit_list *p; + struct bitmap *parents; + struct pseudo_merge *match = NULL; + + if (!bitmap_git->pseudo_merges.nr) + return NULL; + + parents = bitmap_new(); + + for (p = commit->parents; p; p = p->next) { + int pos = bitmap_position(bitmap_git, &p->item->object.oid); + if (pos < 0 || pos >= bitmap_num_objects(bitmap_git)) + goto done; + + bitmap_set(parents, pos); + } + + match = pseudo_merge_for_parents(&bitmap_git->pseudo_merges, + parents); + +done: + bitmap_free(parents); + if (match) + return pseudo_merge_bitmap(&bitmap_git->pseudo_merges, match); + + return NULL; +} + static void unsatisfy_all_pseudo_merges(struct bitmap_index *bitmap_git) { uint32_t i; @@ -2809,6 +2840,7 @@ void free_bitmap_index(struct bitmap_index *b) */ close_midx_revindex(b->midx); } + free_pseudo_merge_map(&b->pseudo_merges); free(b); } diff --git a/pack-bitmap.h b/pack-bitmap.h index 4466b5ad0f..1171e6d989 100644 --- a/pack-bitmap.h +++ b/pack-bitmap.h @@ -142,6 +142,8 @@ int rebuild_bitmap(const uint32_t *reposition, struct bitmap *dest); struct ewah_bitmap *bitmap_for_commit(struct bitmap_index *bitmap_git, struct commit *commit); +struct ewah_bitmap *pseudo_merge_bitmap_for_commit(struct bitmap_index *bitmap_git, + struct commit *commit); void bitmap_writer_select_commits(struct bitmap_writer *writer, struct commit **indexed_commits, unsigned int indexed_commits_nr); diff --git a/pseudo-merge.c b/pseudo-merge.c index 7d13101149..a117520996 100644 --- a/pseudo-merge.c +++ b/pseudo-merge.c @@ -699,3 +699,58 @@ int cascade_pseudo_merges(const struct pseudo_merge_map *pm, return ret; } + +struct pseudo_merge *pseudo_merge_for_parents(const struct pseudo_merge_map *pm, + struct bitmap *parents) +{ + struct pseudo_merge *match = NULL; + size_t i; + + if (!pm->nr) + return NULL; + + /* + * NOTE: this loop is quadratic in the worst-case (where no + * matching pseudo-merge bitmaps are found), but in practice + * this is OK for a few reasons: + * + * - Rejecting pseudo-merge bitmaps that do not match the + * given commit is done quickly (i.e. `bitmap_equals_ewah()` + * returns early when we know the two bitmaps aren't equal. + * + * - Already matched pseudo-merge bitmaps (which we track with + * the `->satisfied` bit here) are skipped as potential + * candidates. + * + * - The number of pseudo-merges should be small (in the + * hundreds for most repositories). + * + * If in the future this semi-quadratic behavior does become a + * problem, another approach would be to keep track of which + * pseudo-merges are still "viable" after enumerating the + * pseudo-merge commit's parents: + * + * - A pseudo-merge bitmap becomes non-viable when the bit(s) + * corresponding to one or more parent(s) of the given + * commit are not set in a candidate pseudo-merge's commits + * bitmap. + * + * - After processing all bits, enumerate the remaining set of + * viable pseudo-merge bitmaps, and check that their + * popcount() matches the number of parents in the given + * commit. + */ + for (i = 0; i < pm->nr; i++) { + struct pseudo_merge *candidate = use_pseudo_merge(pm, &pm->v[i]); + if (!candidate || candidate->satisfied) + continue; + if (!bitmap_equals_ewah(parents, candidate->commits)) + continue; + + match = candidate; + match->satisfied = 1; + break; + } + + return match; +} diff --git a/pseudo-merge.h b/pseudo-merge.h index 755edc054a..2aca01d056 100644 --- a/pseudo-merge.h +++ b/pseudo-merge.h @@ -206,4 +206,11 @@ int cascade_pseudo_merges(const struct pseudo_merge_map *pm, struct bitmap *result, struct bitmap *roots); +/* + * Returns a pseudo-merge which contains the exact set of commits + * listed in the "parents" bitamp, or NULL if none could be found. + */ +struct pseudo_merge *pseudo_merge_for_parents(const struct pseudo_merge_map *pm, + struct bitmap *parents); + #endif diff --git a/t/t5333-pseudo-merge-bitmaps.sh b/t/t5333-pseudo-merge-bitmaps.sh index 4c9aebcffd..f052f395a7 100755 --- a/t/t5333-pseudo-merge-bitmaps.sh +++ b/t/t5333-pseudo-merge-bitmaps.sh @@ -22,6 +22,10 @@ test_pseudo_merges_cascades () { test_trace2_data bitmap pseudo_merges_cascades "$1" } +test_pseudo_merges_reused () { + test_trace2_data pack-bitmap-write building_bitmaps_pseudo_merge_reused "$1" +} + tag_everything () { git rev-list --all --no-object-names >in && perl -lne ' @@ -325,4 +329,65 @@ test_expect_success 'pseudo-merge overlap stale traversal' ' ) ' +test_expect_success 'pseudo-merge reuse' ' + git init pseudo-merge-reuse && + ( + cd pseudo-merge-reuse && + + stable="1641013200" && # 2022-01-01 + unstable="1672549200" && # 2023-01-01 + + GIT_COMMITTER_DATE="$stable +0000" && + export GIT_COMMITTER_DATE && + test_commit_bulk --notick 128 && + GIT_COMMITTER_DATE="$unstable +0000" && + export GIT_COMMITTER_DATE && + test_commit_bulk --notick 128 && + + tag_everything && + + git \ + -c bitmapPseudoMerge.test.pattern="refs/tags/" \ + -c bitmapPseudoMerge.test.maxMerges=1 \ + -c bitmapPseudoMerge.test.threshold=now \ + -c bitmapPseudoMerge.test.stableThreshold=$(($unstable - 1)) \ + -c bitmapPseudoMerge.test.stableSize=512 \ + repack -adb && + + test_pseudo_merges >merges && + test_line_count = 2 merges && + + test_pseudo_merge_commits 0 >stable-oids.before && + test_pseudo_merge_commits 1 >unstable-oids.before && + + : >trace2.txt && + GIT_TRACE2_EVENT=$PWD/trace2.txt git \ + -c bitmapPseudoMerge.test.pattern="refs/tags/" \ + -c bitmapPseudoMerge.test.maxMerges=2 \ + -c bitmapPseudoMerge.test.threshold=now \ + -c bitmapPseudoMerge.test.stableThreshold=$(($unstable - 1)) \ + -c bitmapPseudoMerge.test.stableSize=512 \ + repack -adb && + + test_pseudo_merges_reused 1 merges && + test_line_count = 3 merges && + + test_pseudo_merge_commits 0 >stable-oids.after && + for i in 1 2 + do + test_pseudo_merge_commits $i || return 1 + done >unstable-oids.after && + + sort -u expect && + sort -u actual && + test_cmp expect actual && + + sort -u expect && + sort -u actual && + test_cmp expect actual + ) +' + test_done From 0b7500dc66ffcb6b1ccc3332715936a59c6b5ce4 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Thu, 23 May 2024 17:27:24 -0400 Subject: [PATCH 24/26] t/perf: implement performance tests for pseudo-merge bitmaps Implement a straightforward performance test demonstrating the benefit of pseudo-merge bitmaps by measuring how long it takes to count reachable objects in a few different scenarios: - without bitmaps, to demonstrate a reasonable baseline - with bitmaps, but without pseudo-merges - with bitmaps and pseudo-merges Results from running this test on git.git are as follows: Test this tree ----------------------------------------------------------------------------------- 5333.2: git rev-list --count --all --objects (no bitmaps) 3.54(3.45+0.08) 5333.3: git rev-list --count --all --objects (no pseudo-merges) 0.43(0.40+0.03) 5333.4: git rev-list --count --all --objects (with pseudo-merges) 0.12(0.11+0.01) On a private repository which is much larger, and has many spikey parts of history that aren't merged into the 'master' branch, the results are as follows: Test this tree --------------------------------------------------------------------------------------- 5333.1: git rev-list --count --all --objects (no bitmaps) 122.29(121.31+0.97) 5333.2: git rev-list --count --all --objects (no pseudo-merges) 21.88(21.30+0.58) 5333.3: git rev-list --count --all --objects (with pseudo-merges) 5.05(4.77+0.28) Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- t/perf/p5333-pseudo-merge-bitmaps.sh | 32 ++++++++++++++++++++++++++++ 1 file changed, 32 insertions(+) create mode 100755 t/perf/p5333-pseudo-merge-bitmaps.sh diff --git a/t/perf/p5333-pseudo-merge-bitmaps.sh b/t/perf/p5333-pseudo-merge-bitmaps.sh new file mode 100755 index 0000000000..2e8b1d2635 --- /dev/null +++ b/t/perf/p5333-pseudo-merge-bitmaps.sh @@ -0,0 +1,32 @@ +#!/bin/sh + +test_description='pseudo-merge bitmaps' +. ./perf-lib.sh + +test_perf_large_repo + +test_expect_success 'setup' ' + git \ + -c bitmapPseudoMerge.all.pattern="refs/" \ + -c bitmapPseudoMerge.all.threshold=now \ + -c bitmapPseudoMerge.all.stableThreshold=never \ + -c bitmapPseudoMerge.all.maxMerges=64 \ + -c pack.writeBitmapLookupTable=true \ + repack -adb +' + +test_perf 'git rev-list --count --all --objects (no bitmaps)' ' + git rev-list --objects --all +' + +test_perf 'git rev-list --count --all --objects (no pseudo-merges)' ' + GIT_TEST_USE_PSEUDO_MERGES=0 \ + git rev-list --objects --all --use-bitmap-index +' + +test_perf 'git rev-list --count --all --objects (with pseudo-merges)' ' + GIT_TEST_USE_PSEUDO_MERGES=1 \ + git rev-list --objects --all --use-bitmap-index +' + +test_done From 20c49432e44f1c98e37f5df97c8137ecd8c1af77 Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Fri, 14 Jun 2024 15:23:55 -0400 Subject: [PATCH 25/26] Documentation/technical/bitmap-format.txt: add missing position table While investigating a benign Coverity warning on the new pseudo-merge implementation, I was struggling to understand the (paraphrased) below: ofs = index_end - 24 - (index->pseudo_merges.nr * sizeof(uint64_t)); for (i = 0; i < index->pseudo_merges.nr; i++) { index->pseudo_merges.v[i].at = get_be64(ofs); ofs += sizeof(uint64_t); } , in pack-bitmap.c::load_bitmap_header(). Looking at the documentation, the diagram describing the on-disk format (prior to this patch) suggested that the optional extended lookup table immediately preceded the trailing metadata portion. If that were the case, that would make the above code from load_bitmap_header() incorrect, as we'd be blindly reading into the extended offset table. But later on in the documentation there is a description of the pseudo-merge position table as immediately preceding the trailing metadata portion of the extension. And indeed, we do write the position table in pack-bitmap-write.c: /* write positions for all pseudo merges */ for (i = 0; i < writer->pseudo_merges_nr; i++) hashwrite_be64(f, pseudo_merge_ofs[i]); hashwrite_be32(f, writer->pseudo_merges_nr); hashwrite_be32(f, kh_size(writer->pseudo_merge_commits)); hashwrite_be64(f, table_start - start); hashwrite_be64(f, hashfile_total(f) - start + sizeof(uint64_t)); So this is purely a case of the diagram being out of sync with the textual description and actual implementation of the format specification. Add the missing component back to the format diagram to avoid further confusion in this area. Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- Documentation/technical/bitmap-format.txt | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt index ee7775a258..bfb0ec7beb 100644 --- a/Documentation/technical/bitmap-format.txt +++ b/Documentation/technical/bitmap-format.txt @@ -312,6 +312,15 @@ the end of a `.bitmap` file. The format is as follows: | | +-------------------------------------------+ | | +| Pseudo-merge position table | +| +----+----------+----------+----------+ | +| | N | Offset 1 | .... | Offset N | | +| +----+----------+----------+----------+ | +| | | 8 bytes | .... | 8 bytes | | +| +----+----------+----------+----------+ | +| | ++-------------------------------------------+ +| | | Pseudo-merge Metadata | | +-----------------------------------+ | | | # pseudo-merges (4 bytes) | | From a83e21de6b7630c1cdf3298d68b120dd9eaecd0f Mon Sep 17 00:00:00 2001 From: Taylor Blau Date: Fri, 14 Jun 2024 15:23:58 -0400 Subject: [PATCH 26/26] pack-bitmap.c: ensure pseudo-merge offset reads are bounded After reading the pseudo-merge extension's metadata table, we allocate an array to store information about each pseudo-merge, including its byte offset within the .bitmap file itself. This is done like so: pseudo_merge_ofs = index_end - 24 - (index->pseudo_merges.nr * sizeof(uint64_t)); for (i = 0; i < index->pseudo_merges.nr; i++) { index->pseudo_merges.v[i].at = get_be64(pseudo_merge_ofs); pseudo_merge_ofs += sizeof(uint64_t); } But if the pseudo-merge table is corrupt, we'll keep calling get_be64() past the end of the pseudo-merge extension, potentially reading off the end of the mmap'd region. Prevent this by ensuring that we have at least `table_size - 24` many bytes available to read (adding 24 to the left-hand side of our inequality to account for the length of the metadata component). This is sufficient to prevent us from reading off the end of the pseudo-merge extension, and ensures that all of the get_be64() calls below are in bounds. Helped-by: Jeff King Signed-off-by: Taylor Blau Signed-off-by: Junio C Hamano --- pack-bitmap.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/pack-bitmap.c b/pack-bitmap.c index 70230e2647..ae13d7ee3b 100644 --- a/pack-bitmap.c +++ b/pack-bitmap.c @@ -238,6 +238,11 @@ static int load_bitmap_header(struct bitmap_index *index) index->pseudo_merges.commits_nr = get_be32(index_end - 20); index->pseudo_merges.nr = get_be32(index_end - 24); + if (st_add(st_mult(index->pseudo_merges.nr, + sizeof(uint64_t)), + 24) > table_size) + return error(_("corrupted bitmap index file, pseudo-merge table too short")); + CALLOC_ARRAY(index->pseudo_merges.v, index->pseudo_merges.nr);