Commit graph

21398 commits

Author SHA1 Message Date
Junio C Hamano 922cc26e41 Merge branch 'ps/do-not-trust-commit-graph-blindly-for-existence' into kn/rev-list-missing-fix
* ps/do-not-trust-commit-graph-blindly-for-existence:
  commit: detect commits that exist in commit-graph but not in the ODB
  commit-graph: introduce envvar to disable commit existence checks
2023-11-01 12:06:55 +09:00
Patrick Steinhardt 7a5d604443 commit: detect commits that exist in commit-graph but not in the ODB
Commit graphs can become stale and contain references to commits that do
not exist in the object database anymore. Theoretically, this can lead
to a scenario where we are able to successfully look up any such commit
via the commit graph even though such a lookup would fail if done via
the object database directly.

As the commit graph is mostly intended as a sort of cache to speed up
parsing of commits we do not want to have diverging behaviour in a
repository with and a repository without commit graphs, no matter
whether they are stale or not. As commits are otherwise immutable, the
only thing that we really need to care about is thus the presence or
absence of a commit.

To address potentially stale commit data that may exist in the graph,
our `lookup_commit_in_graph()` function will check for the commit's
existence in both the commit graph, but also in the object database. So
even if we were able to look up the commit's data in the graph, we would
still pretend as if the commit didn't exist if it is missing in the
object database.

We don't have the same safety net in `parse_commit_in_graph_one()`
though. This function is mostly used internally in "commit-graph.c"
itself to validate the commit graph, and this usage is fine. We do
expose its functionality via `parse_commit_in_graph()` though, which
gets called by `repo_parse_commit_internal()`, and that function is in
turn used in many places in our codebase.

For all I can see this function is never used to directly turn an object
ID into a commit object without additional safety checks before or after
this lookup. What it is being used for though is to walk history via the
parent chain of commits. So when commits in the parent chain of a graph
walk are missing it is possible that we wouldn't notice if that missing
commit was part of the commit graph. Thus, a query like `git rev-parse
HEAD~2` can succeed even if the intermittent commit is missing.

It's unclear whether there are additional ways in which such stale
commit graphs can lead to problems. In any case, it feels like this is a
bigger bug waiting to happen when we gain additional direct or indirect
callers of `repo_parse_commit_internal()`. So let's fix the inconsistent
behaviour by checking for object existence via the object database, as
well.

This check of course comes with a performance penalty. The following
benchmarks have been executed in a clone of linux.git with stable tags
added:

    Benchmark 1: git -c core.commitGraph=true rev-list --topo-order --all (git = master)
      Time (mean ± σ):      2.913 s ±  0.018 s    [User: 2.363 s, System: 0.548 s]
      Range (min … max):    2.894 s …  2.950 s    10 runs

    Benchmark 2: git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):      3.834 s ±  0.052 s    [User: 3.276 s, System: 0.556 s]
      Range (min … max):    3.780 s …  3.961 s    10 runs

    Benchmark 3: git -c core.commitGraph=false rev-list --topo-order --all (git = master)
      Time (mean ± σ):     13.841 s ±  0.084 s    [User: 13.152 s, System: 0.687 s]
      Range (min … max):   13.714 s … 13.995 s    10 runs

    Benchmark 4: git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
      Time (mean ± σ):     13.762 s ±  0.116 s    [User: 13.094 s, System: 0.667 s]
      Range (min … max):   13.645 s … 14.038 s    10 runs

    Summary
      git -c core.commitGraph=true rev-list --topo-order --all (git = master) ran
        1.32 ± 0.02 times faster than git -c core.commitGraph=true rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.72 ± 0.05 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = pks-commit-graph-inconsistency)
        4.75 ± 0.04 times faster than git -c core.commitGraph=false rev-list --topo-order --all (git = master)

We look at a ~30% regression in general, but in general we're still a
whole lot faster than without the commit graph. To counteract this, the
new check can be turned off with the `GIT_COMMIT_GRAPH_PARANOIA` envvar.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-01 12:04:06 +09:00
Patrick Steinhardt e04838ea82 commit-graph: introduce envvar to disable commit existence checks
Our `lookup_commit_in_graph()` helper tries to look up commits from the
commit graph and, if it doesn't exist there, falls back to parsing it
from the object database instead. This is intended to speed up the
lookup of any such commit that exists in the database. There is an edge
case though where the commit exists in the graph, but not in the object
database. To avoid returning such stale commits the helper function thus
double checks that any such commit parsed from the graph also exists in
the object database. This makes the function safe to use even when
commit graphs aren't updated regularly.

We're about to introduce the same pattern into other parts of our code
base though, namely `repo_parse_commit_internal()`. Here the extra
sanity check is a bit of a tougher sell: `lookup_commit_in_graph()` was
a newly introduced helper, and as such there was no performance hit by
adding this sanity check. If we added `repo_parse_commit_internal()`
with that sanity check right from the beginning as well, this would
probably never have been an issue to begin with. But by retrofitting it
with this sanity check now we do add a performance regression to
preexisting code, and thus there is a desire to avoid this or at least
give an escape hatch.

In practice, there is no inherent reason why either of those functions
should have the sanity check whereas the other one does not: either both
of them are able to detect this issue or none of them should be. This
also means that the default of whether we do the check should likely be
the same for both. To err on the side of caution, we thus rather want to
make `repo_parse_commit_internal()` stricter than to loosen the checks
that we already have in `lookup_commit_in_graph()`.

The escape hatch is added in the form of a new GIT_COMMIT_GRAPH_PARANOIA
environment variable that mirrors GIT_REF_PARANOIA. If enabled, which is
the default, we will double check that commits looked up in the commit
graph via `lookup_commit_in_graph()` also exist in the object database.
This same check will also be added in `repo_parse_commit_internal()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-11-01 12:04:06 +09:00
Junio C Hamano 5edbcead42 Merge branch 'bc/racy-4gb-files'
The index file has room only for lower 32-bit of the file size in
the cached stat information, which means cached stat information
will have 0 in its sd_size member for a file whose size is multiple
of 4GiB.  This is mistaken for a racily clean path.  Avoid it by
storing a bogus sd_size value instead for such files.

* bc/racy-4gb-files:
  Prevent git from rehashing 4GiB files
  t: add a test helper to truncate files
2023-10-23 13:56:37 -07:00
Junio C Hamano 626f689f79 Merge branch 'jc/fail-stash-to-store-non-stash'
Feeding "git stash store" with a random commit that was not created
by "git stash create" now errors out.

* jc/fail-stash-to-store-non-stash:
  stash: be careful what we store
2023-10-23 13:56:37 -07:00
Junio C Hamano 755fb09163 Merge branch 'so/diff-merges-dd'
"git log" and friends learned "--dd" that is a short-hand for
"--diff-merges=first-parent -p".

* so/diff-merges-dd:
  completion: complete '--dd'
  diff-merges: introduce '--dd' option
  diff-merges: improve --diff-merges documentation
2023-10-23 13:56:37 -07:00
Junio C Hamano f32af12cee Merge branch 'jk/chunk-bounds'
The codepaths that read "chunk" formatted files have been corrected
to pay attention to the chunk size and notice broken files.

* jk/chunk-bounds: (21 commits)
  t5319: make corrupted large-offset test more robust
  chunk-format: drop pair_chunk_unsafe()
  commit-graph: detect out-of-order BIDX offsets
  commit-graph: check bounds when accessing BIDX chunk
  commit-graph: check bounds when accessing BDAT chunk
  commit-graph: bounds-check generation overflow chunk
  commit-graph: check size of generations chunk
  commit-graph: bounds-check base graphs chunk
  commit-graph: detect out-of-bounds extra-edges pointers
  commit-graph: check size of commit data chunk
  midx: check size of revindex chunk
  midx: bounds-check large offset chunk
  midx: check size of object offset chunk
  midx: enforce chunk alignment on reading
  midx: check size of pack names chunk
  commit-graph: check consistency of fanout table
  midx: check size of oid lookup chunk
  commit-graph: check size of oid fanout chunk
  midx: stop ignoring malformed oid fanout chunk
  t: add library for munging chunk-format files
  ...
2023-10-23 13:56:36 -07:00
Junio C Hamano 92741d83c0 Merge branch 'ak/pretty-decorate-more-fix'
Unlike "git log --pretty=%D", "git log --pretty="%(decorate)" did
not auto-initialize the decoration subsystem, which has been
corrected.

* ak/pretty-decorate-more-fix:
  pretty: fix ref filtering for %(decorate) formats
2023-10-20 16:23:11 -07:00
Junio C Hamano 6b1e2254d6 Merge branch 'vd/loose-ref-iteration-optimization'
The code to iterate over loose references have been optimized to
reduce the number of lstat() system calls.

* vd/loose-ref-iteration-optimization:
  files-backend.c: avoid stat in 'loose_fill_ref_dir'
  dir.[ch]: add 'follow_symlink' arg to 'get_dtype'
  dir.[ch]: expose 'get_dtype'
  ref-cache.c: fix prefix matching in ref iteration
2023-10-20 16:23:11 -07:00
Junio C Hamano c662038629 Merge branch 'ty/merge-tree-strategy-options'
"git merge-tree" learned to take strategy backend specific options
via the "-X" option, like "git merge" does.

* ty/merge-tree-strategy-options:
  merge: introduce {copy|clear}_merge_options()
  merge-tree: add -X strategy option
2023-10-20 16:23:11 -07:00
Junio C Hamano 7906b5c957 Merge branch 'jc/merge-ort-attr-index-fix'
Fix "git merge-tree" to stop segfaulting when the --attr-source
option is used.

* jc/merge-ort-attr-index-fix:
  merge-ort: initialize repo in index state
2023-10-18 13:25:42 -07:00
Junio C Hamano 79861babe2 Merge branch 'tb/repack-max-cruft-size'
"git repack" learned "--max-cruft-size" to prevent cruft packs from
growing without bounds.

* tb/repack-max-cruft-size:
  repack: free existing_cruft array after use
  builtin/repack.c: avoid making cruft packs preferred
  builtin/repack.c: implement support for `--max-cruft-size`
  builtin/repack.c: parse `--max-pack-size` with OPT_MAGNITUDE
  t7700: split cruft-related tests to t7704
2023-10-18 13:25:41 -07:00
Jeff King 7538f9d89b t5319: make corrupted large-offset test more robust
The test t5319.88 ("reader bounds-checks large offset table") can fail
intermittently. The failure mode looks like this:

  1. An earlier test sets up "objects64", a directory that can be used
     to produce a midx with a corrupted large-offsets table. To get the
     large offsets, it corrupts the normal ".idx" file to have a fake
     large offset, and then builds a midx from that.

     That midx now has a large offset table, which is what we want. But
     we also have a .idx on disk that has a corrupted entry. We'll call
     the object with the corrupted large-offset "X".

  2. In t5319.88, we further corrupt the midx by reducing the size of
     the large-offset chunk (because our goal is to make sure we do not
     do an out-of-bounds read on it).

  3. We then enumerate all of the objects with "cat-file --batch-check
     --batch-all-objects", expecting to see a complaint when we try to
     show object X. We use --batch-all-objects because our objects64
     repo doesn't actually have any refs (but if we check them all, one
     of them will be the failing one). The default batch-check format
     includes %(objecttype) and %(objectsize), both of which require us
     to access the actual pack data (and thus requires looking at the
     offset).

  4a. Usually, this succeeds. We try to output object X, do a lookup via
      the midx for the type/size lookup, and run into the corrupt
      large-offset table.

  4b. But sometimes we hit a different error. If another object points
      to X as a delta base, then trying to find the type of that object
      requires walking the delta chain to the base entry (since only the
      base has the concrete type; deltas themselves are either OFS_DELTA
      or REF_DELTA).

      Normally this would not require separate offset lookups at all, as
      deltas are usually stored as OFS_DELTA, specifying the relative
      offset to the base. But the corrupt idx created in step 1 is done
      directly with "git pack-objects" and does not pass the
      --delta-base-offset option, meaning we have REF_DELTA entries!
      Those do have to consult an index to find the location of the base
      object, and they use the pack .idx to do this. The same pack .idx
      that we know is corrupted from step 1!

      Git does notice the error, but it does so by seeing the corrupt
      .idx file, not the corrupt midx file, and the error it reports is
      different, causing the test to fail.

The set of objects created in the test is deterministic. But the delta
selection seems not to be (which is not too surprising, as it is
multi-threaded). I have seen the failure in Windows CI but haven't
reproduced it locally (not even with --stress). Re-running a failed
Windows CI job tends to work. But when I download and examine the trash
directory from a failed run, it shows a different set of deltas than I
get locally. But the exact source of non-determinism isn't that
important; our test should be robust against any order.

There are a few options to fix this:

  a. It would be OK for the "objects64" setup to "unbreak" the .idx file
     after generating the midx. But then it would be hard for subsequent
     tests to reuse it, since it is the corrupted idx that forces the
     midx to have a large offset table.

  b. The "objects64" setup could use --delta-base-offset. This would fix
     our problem, but earlier tests have many hard-coded offsets. Using
     OFS_DELTA would change the locations of objects in the pack (this
     might even be OK because I think most of the offsets are within the
     .idx file, but it seems brittle and I'm afraid to touch it).

  c. Our cat-file output is in oid order by default. Since we store
     bases before deltas, if we went in pack order (using the
     "--unordered" flag), we'd always see our corrupt X before any delta
     which depends on it. But using "--unordered" means we skip the midx
     entirely. That makes sense, since it is just enumerating all of
     the packs, using the offsets found in their .idx files directly.
     So it doesn't work for our test.

  d. We could ask directly about object X, rather than enumerating all
     of them. But that requires further hard-coding of the oid (both
     sha1 and sha256) of object X. I'd prefer not to introduce more
     brittleness.

  e. We can use a --batch-check format that looks at the pack data, but
     doesn't have to chase deltas. The problem in this case is
     %(objecttype), which has to walk to the base. But %(objectsize)
     does not; we can get the value directly from the delta itself.
     Another option would be %(deltabase), where we report the REF_DELTA
     name but don't look at its data.

I've gone with option (e) here. It's kind of subtle, but it's simple and
has no side effects.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-14 10:17:25 -07:00
Junio C Hamano 2920971a7f Merge branch 'jk/decoration-and-other-leak-fixes'
Leakfix.

* jk/decoration-and-other-leak-fixes:
  daemon: free listen_addr before returning
  revision: clear decoration structs during release_revisions()
  decorate: add clear_decoration() function
2023-10-13 14:18:28 -07:00
Junio C Hamano 09dcbb486d Merge branch 'ar/diff-index-merge-base-fix'
"git diff --merge-base X other args..." insisted that X must be a
commit and errored out when given an annotated tag that peels to a
commit, but we only need it to be a committish.  This has been
corrected.

* ar/diff-index-merge-base-fix:
  diff: fix --merge-base with annotated tags
2023-10-13 14:18:28 -07:00
Junio C Hamano b32f5b6b34 Merge branch 'js/submodule-fix-misuse-of-path-and-name'
In .gitmodules files, submodules are keyed by their names, and the
path to the submodule whose name is $name is specified by the
submodule.$name.path variable.  There were a few codepaths that
mixed the name and path up when consulting the submodule database,
which have been corrected.  It took long for these bugs to be found
as the name of a submodule initially is the same as its path, and
the problem does not surface until it is moved to a different path,
which apparently happens very rarely.

* js/submodule-fix-misuse-of-path-and-name:
  t7420: test that we correctly handle renamed submodules
  t7419: test that we correctly handle renamed submodules
  t7419, t7420: use test_cmp_config instead of grepping .gitmodules
  t7419: actually test the branch switching
  submodule--helper: return error from set-url when modifying failed
  submodule--helper: use submodule_from_path in set-{url,branch}
2023-10-13 14:18:28 -07:00
Junio C Hamano a45eddec40 Merge branch 'jk/commit-graph-leak-fixes'
Leakfix.

* jk/commit-graph-leak-fixes:
  commit-graph: clear oidset after finishing write
  commit-graph: free write-context base_graph_name during cleanup
  commit-graph: free write-context entries before overwriting
  commit-graph: free graph struct that was not added to chain
  commit-graph: delay base_graph assignment in add_graph_to_chain()
  commit-graph: free all elements of graph chain
  commit-graph: move slab-clearing to close_commit_graph()
  merge: free result of repo_get_merge_bases()
  commit-reach: free temporary list in get_octopus_merge_bases()
  t6700: mark test as leak-free
2023-10-13 14:18:28 -07:00
Junio C Hamano c75e91499b Merge branch 'la/trailer-test-and-doc-updates'
Test coverage for trailers has been improved.

* la/trailer-test-and-doc-updates:
  trailer doc: <token> is a <key> or <keyAlias>, not both
  trailer doc: separator within key suppresses default separator
  trailer doc: emphasize the effect of configuration variables
  trailer --unfold help: prefer "reformat" over "join"
  trailer --parse docs: add explanation for its usefulness
  trailer --only-input: prefer "configuration variables" over "rules"
  trailer --parse help: expose aliased options
  trailer --no-divider help: describe usual "---" meaning
  trailer: trailer location is a place, not an action
  trailer doc: narrow down scope of --where and related flags
  trailer: add tests to check defaulting behavior with --no-* flags
  trailer test description: this tests --where=after, not --where=before
  trailer tests: make test cases self-contained
2023-10-13 14:18:27 -07:00
Jason Hatton 5143ac07b1 Prevent git from rehashing 4GiB files
The index stores file sizes using a uint32_t. This causes any file
that is a multiple of 2^32 to have a cached file size of zero.
Zero is a special value used by racily clean. This causes git to
rehash every file that is a multiple of 2^32 every time git status
or git commit is run.

This patch mitigates the problem by making all files that are a
multiple of 2^32 appear to have a size of 1<<31 instead of zero.

The value of 1<<31 is chosen to keep it as far away from zero
as possible to help prevent things getting mixed up with unpatched
versions of git.

An example would be to have a 2^32 sized file in the index of
patched git. Patched git would save the file as 2^31 in the cache.
An unpatched git would very much see the file has changed in size
and force it to rehash the file, which is safe. The file would
have to grow or shrink by exactly 2^31 and retain all of its
ctime, mtime, and other attributes for old git to not notice
the change.

This patch does not change the behavior of any file that is not
an exact multiple of 2^32.

Signed-off-by: Jason D. Hatton <jhatton@globalfinishing.com>
Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-13 13:33:35 -07:00
brian m. carlson 678eb55f5d t: add a test helper to truncate files
In a future commit, we're going to work with some large files which will
be at least 4 GiB in size.  To take advantage of the sparseness
functionality on most Unix systems and avoid running the system out of
disk, it would be convenient to use truncate(2) to simply create a
sparse file of sufficient size.

However, the GNU truncate(1) utility isn't portable, so let's write a
tiny test helper that does the work for us.

Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-13 13:33:35 -07:00
Junio C Hamano d9b6634589 stash: be careful what we store
"git stash store" is meant to store what "git stash create"
produces, as these two are implementation details of the end-user
facing "git stash save" command.  Even though it is clearly
documented as such, users would try silly things like "git stash
store HEAD" to render their stash unusable.

Worse yet, because "git stash drop" does not allow such a stash
entry to be removed, "git stash clear" would be the only way to
recover from such a mishap.  Reuse the logic that allows "drop" to
refrain from working on such a stash entry to teach "store" to avoid
storing an object that is not a stash entry in the first place.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-11 16:27:30 -07:00
Junio C Hamano 1fdedb7c7d Merge branch 'cc/repack-sift-filtered-objects-to-separate-pack'
"git repack" machinery learns to pay attention to the "--filter="
option.

* cc/repack-sift-filtered-objects-to-separate-pack:
  gc: add `gc.repackFilterTo` config option
  repack: implement `--filter-to` for storing filtered out objects
  gc: add `gc.repackFilter` config option
  repack: add `--filter=<filter-spec>` option
  pack-bitmap-write: rebuild using new bitmap when remapping
  repack: refactor finding pack prefix
  repack: refactor finishing pack-objects command
  t/helper: add 'find-pack' test-tool
  pack-objects: allow `--filter` without `--stdout`
2023-10-10 11:39:15 -07:00
Junio C Hamano afb0d0880a Merge branch 'ds/init-diffstat-width'
Code clean-up.

* ds/init-diffstat-width:
  diff --stat: set the width defaults in a helper function
2023-10-10 11:39:14 -07:00
Junio C Hamano a7a2d10421 Merge branch 'cw/prelim-cleanup'
Shuffle some bits across headers and sources to prepare for
libification effort.

* cw/prelim-cleanup:
  parse: separate out parsing functions from config.h
  config: correct bad boolean env value error message
  wrapper: reduce scope of remove_or_warn()
  hex-ll: separate out non-hash-algo functions
2023-10-10 11:39:14 -07:00
Jeff King 12192a9db9 commit-graph: detect out-of-order BIDX offsets
The BIDX chunk tells us the offsets at which each commit's Bloom filters
can be found in the BDAT chunk. We compute the length of each filter by
checking the offsets of neighbors and subtracting them.

If the offsets are out of order, then we'll get a negative length, which
we then store as a very large unsigned value. This can cause us to read
out-of-bounds memory, as we access the hash data modulo "filter->len *
BITS_PER_WORD".

We can easily detect this case when loading the individual filters.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:02 -07:00
Jeff King 581e0f8b18 commit-graph: check bounds when accessing BIDX chunk
We load the bloom_filter_indexes chunk using pair_chunk(), so we have no
idea how big it is. This can lead to out-of-bounds reads if it is
smaller than expected, since we index it based on the number of commits
found elsewhere in the graph file.

We can check the chunk size up front, like we do for CDAT and other
chunks with one fixed-size record per commit.

The test case demonstrates the problem. It actually won't segfault,
because we end up reading random data from the follow-on chunk (BDAT in
this case), and the bounds checks added in the previous patch complain.
But this is by no means assured, and you can craft a commit-graph file
with BIDX at the end (or a smaller BDAT) that does segfault.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 920f400e91 commit-graph: check bounds when accessing BDAT chunk
When loading Bloom filters from a commit-graph file, we use the offset
values in the BIDX chunk to index into the memory mapped for the BDAT
chunk. But since we don't record how big the BDAT chunk is, we just
trust that the BIDX offsets won't cause us to read outside of the chunk
memory. A corrupted or malicious commit-graph file will cause us to
segfault (in practice this isn't a very interesting attack, since
commit-graph files are local-only, and the worst case is an
out-of-bounds read).

We can't fix this by checking the chunk size during parsing, since the
data in the BDAT chunk doesn't have a fixed size (that's why we need the
BIDX in the first place). So we'll fix it in two parts:

  1. Record the BDAT chunk size during parsing, and then later check
     that the BIDX offsets we look up are within bounds.

  2. Because the offsets are relative to the end of the BDAT header, we
     must also make sure that the BDAT chunk is at least as large as the
     expected header size. Otherwise, we overflow when trying to move
     past the header, even for an offset of "0". We can check this
     early, during the parsing stage.

The error messages are rather verbose, but since this is not something
you'd expect to see outside of severe bugs or corruption, it makes sense
to err on the side of too many details. Sadly we can't mention the
filename during the chunk-parsing stage, as we haven't set g->filename
at this point, nor passed it down through the stack.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King ee6a792412 commit-graph: bounds-check generation overflow chunk
If the generation entry in a commit-graph doesn't fit, we instead insert
an offset into a generation overflow chunk. But since we don't record
the size of the chunk, we may read outside the chunk if the offset we
find on disk is malicious or corrupted.

We can't check the size of the chunk up-front; it will vary based on how
many entries need overflow. So instead, we'll do a bounds-check before
accessing the chunk memory. Unfortunately there is no error-return from
this function, so we'll just have to die(), which is what it does for
other forms of corruption.

As with other cases, we can drop the st_mult() call, since we know our
bounds-checked value will fit within a size_t.

Before this patch, the test here actually "works" because we read
garbage data from the next chunk. And since that garbage data happens
not to provide a generation number which changes the output, it appears
to work. We could construct a case that actually segfaults or produces
wrong output, but it would be a bit tricky. For our purposes its
sufficient to check that we've detected the bounds error.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 4a3c34662b commit-graph: check size of generations chunk
We neither check nor record the size of the generations chunk we parse
from a commit-graph file. This should have one uint32_t for each commit
in the file; if it is smaller (due to corruption, etc), we may read
outside the mapped memory.

The included test segfaults without this patch, as it shrinks the size
considerably (and the chunk is near the end of the file, so we read off
the end of the array rather than accidentally reading another chunk).

We can fix this by checking the size up front (like we do for other
fixed-size chunks, like CDAT).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 6cf61d0db5 commit-graph: bounds-check base graphs chunk
When we are loading a commit-graph chain, we check that each slice of the
chain points to the appropriate set of base graphs via its BASE chunk.
But since we don't record the size of the chunk, we may access
out-of-bounds memory if the file is corrupted.

Since we know the number of entries we expect to find (based on the
position within the commit-graph-chain file), we can just check the size
up front.

In theory this would also let us drop the st_mult() call a few lines
later when we actually access the memory, since we know that the
computed offset will fit in a size_t. But because the operands
"g->hash_len" and "n" have types "unsigned char" and "int", we'd have to
cast to size_t first. Leaving the st_mult() does that cast, and makes it
more obvious that we don't have an overflow problem.

Note that the test does not actually segfault before this patch, since
it just reads garbage from the chunk after BASE (and indeed, it even
rejects the file because that garbage does not have the expected hash
value). You could construct a file with BASE at the end that did
segfault, but corrupting the existing one is easy, and we can check
stderr for the expected message.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 9622610e55 commit-graph: detect out-of-bounds extra-edges pointers
If an entry in a commit-graph file has more than 2 parents, the
fixed-size parent fields instead point to an offset within an "extra
edges" chunk. We blindly follow these, assuming that the chunk is
present and sufficiently large; this can lead to an out-of-bounds read
for a corrupt or malicious file.

We can fix this by recording the size of the chunk and adding a
bounds-check in fill_commit_in_graph(). There are a few tricky bits:

  1. We'll switch from working with a pointer to an offset. This makes
     some corner cases just fall out naturally:

      a. If we did not find an EDGE chunk at all, our size will
         correctly be zero (so everything is "out of bounds").

      b. Comparing "size / 4" lets us make sure we have at least 4 bytes
         to read, and we never compute a pointer more than one element
         past the end of the array (computing a larger pointer is
         probably OK in practice, but is technically undefined
         behavior).

      c. The current code casts to "uint32_t *". Replacing it with an
         offset avoids any comparison between different types of pointer
         (since the chunk is stored as "unsigned char *").

  2. This is the first case in which fill_commit_in_graph() may return
     anything but success. We need to make sure to roll back the
     "parsed" flag (and any parents we might have added before running
     out of buffer) so that the caller can cleanly fall back to
     loading the commit object itself.

     It's a little non-trivial to do this, and we might benefit from
     factoring it out. But we can wait on that until we actually see a
     second case where we return an error.

As a bonus, this lets us drop the st_mult() call. Since we've already
done a bounds check, we know there won't be any integer overflow (it
would imply our buffer is larger than a size_t can hold).

The included test does not actually segfault before this patch (though
you could construct a case where it does). Instead, it reads garbage
from the next chunk which results in it complaining about a bogus parent
id. This is sufficient for our needs, though (we care that the fallback
succeeds, and that stderr mentions the out-of-bounds read).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King b72df612af commit-graph: check size of commit data chunk
We expect a commit-graph file to have a fixed-size data record for each
commit in the file (and we know the number of commits to expct from the
size of the lookup table). If we encounter a file where this is too
small, we'll look past the end of the chunk (and possibly even off the
mapped memory).

We can fix this by checking the size up front when we record the
pointer.

The included test doesn't segfault, since it ends up reading bytes
from another chunk. But it produces nonsense results, since the values
it reads are garbage. Our test notices this by comparing the output to a
non-corrupted run of the same command (and of course we also check that
the expected error is printed to stderr).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King c0fe9b2da5 midx: check size of revindex chunk
When we load a revindex from disk, we check the size of the file
compared to the number of objects we expect it to have. But when we use
a RIDX chunk stored directly in the midx, we just access the memory
directly. This can lead to out-of-bounds memory access for a corrupted
or malicious multi-pack-index file.

We can catch this by recording the RIDX chunk size, and then checking it
against the expected size when we "load" the revindex. Note that this
check is much simpler than the one that load_revindex_from_disk() does,
because we just have the data array with no header (so we do not need
to account for the header size, and nor do we need to bother validating
the header values).

The test confirms both that we catch this case, and that we continue the
process (the revindex is required to use the midx bitmaps, but we
fallback to a non-bitmap traversal).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 2abd56e9b2 midx: bounds-check large offset chunk
When we see a large offset bit in the regular midx offset table, we
use the entry as an index into a separate large offset table (just like
a pack idx does). But we don't bounds-check the access to that large
offset table (nor even record its size when we parse the chunk!).

The equivalent code for a regular pack idx is in check_pack_index_ptr().
But things are a bit simpler here because of the chunked format: we can
just check our array index directly.

As a bonus, we can get rid of the st_mult() here. If our array
bounds-check is successful, then we know that the result will fit in a
size_t (and the bounds check uses a division to avoid overflow
entirely).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 0924869b4e midx: check size of object offset chunk
The object offset chunk has one fixed-size entry for each object in the
midx. But since we don't check its size, we may access out-of-bounds
memory if we see a corrupt or malicious midx file.

Sine the entries are fixed-size, the total length can be known up-front,
and we can just check it while parsing the chunk (this is similar to
what we do when opening pack idx files, which contain a similar offset
table).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King c9b9fefc13 midx: enforce chunk alignment on reading
The midx reader assumes chunks are aligned to a 4-byte boundary: we
treat the fanout chunk as an array of uint32_t, indexing it to feed the
results to ntohl(). Without aligning the chunks, we may violate the
CPU's alignment constraints. Though many platforms allow this, some do
not. And certanily UBSan will complain, since it is undefined behavior.

Even though most chunks are naturally 4-byte-aligned (because they are
storing uint32_t or larger types), PNAM is not. It stores NUL-terminated
pack names, so you can have a valid chunk with any length. The writing
side handles this by 4-byte-aligning the chunk, introducing a few extra
NULs as necessary. But since we don't check this on the reading side, we
may end up with a misaligned fanout and trigger the undefined behavior.

We have two options here:

  1. Swap out ntohl(fanout[i]) for get_be32(fanout+i) everywhere. The
     latter handles alignment itself. It's possible that it's slightly
     slower (though in practice I'm not sure how true that is,
     especially for these code paths which then go on to do a binary
     search).

  2. Enforce the alignment when reading the chunks. This is easy to do,
     since the table-of-contents reader can check it in one spot.

I went with the second option here, just because it places less burden
on maintenance going forward (it is OK to continue using ntohl), and we
know it can't have any performance impact on the actual reads.

The commit-graph code uses the same chunk API. It's usually also 4-byte
aligned, but some chunks are not (like Bloom filter BDAT chunks). So
we'll pass "1" here to allow any alignment. It doesn't suffer from the
same problem as midx with its fanout because the fanout chunk is always
the first (and the rest of the format dictates that the first chunk will
start aligned).

The new test shows the effect on a midx with a misaligned PNAM chunk.
Note that the midx-reading code treats chunk-toc errors as soft, falling
back to the non-midx path rather than calling die(), as we do for other
parsing errors. Arguably we should make all of these behave the same,
but that's out of scope for this patch. For now the test just expects
the fallback behavior.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 72a9a08283 midx: check size of pack names chunk
We parse the pack-name chunk as a series of NUL-terminated strings. But
since we don't look at the chunk size, there's nothing to guarantee that
we don't parse off the end of the chunk (or even off the end of the
mapped file).

We can record the length, and then as we parse make sure that we never
walk past it.

The new test exercises the case, though note that it does not actually
segfault before this patch. It hits a NUL byte somewhere in one of the
other chunks, and comes up with a garbage pack name. You could construct
one that reads out-of-bounds (e.g., a PNAM chunk at the end of file),
but this case is simple and sufficient to check that we detect the
problem.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:01 -07:00
Jeff King 4169d89645 commit-graph: check consistency of fanout table
We use bsearch_hash() to look up items in the oid index of a
commit-graph. It also has a fanout table to reduce the initial range in
which we'll search. But since the fanout comes from the on-disk file, a
corrupted or malicious file can cause us to look outside of the
allocated index memory.

One solution here would be to pass the total table size to
bsearch_hash(), which could then bounds check the values it reads from
the fanout. But there's an inexpensive up-front check we can do, and
it's the same one used by the midx and pack idx code (both of which
likewise have fanout tables and use bsearch_hash(), but are not affected
by this bug):

  1. We can check the value of the final fanout entry against the size
     of the table we got from the index chunk. These must always match,
     since the fanout is just slicing up the index.

       As a side note, the midx and pack idx code compute it the other
       way around: they use the final fanout value as the object count, and
       check the index size against it. Either is valid; if they
       disagree we cannot know which is wrong (a corrupted fanout value,
       or a too-small table of oids).

  2. We can quickly scan the fanout table to make sure it is
     monotonically increasing. If it is, then we know that every value
     is less than or equal to the final value, and therefore less than
     or equal to the table size.

     It would also be sufficient to just check that each fanout value is
     smaller than the final one, but the midx and pack idx code both do
     a full monotonicity check. It's the same cost, and it catches some
     other corruptions (though not all; the checks done by "commit-graph
     verify" are more complete but more expensive, and our goal here is
     to be fast and memory-safe).

There are two new tests. One just checks the final fanout value (this is
the mirror image of the "too small oid lookup" case added for the midx
in the previous commit; it's flipped here because commit-graph considers
the oid lookup chunk to be the source of truth).

The other actually creates a fanout with many out-of-bounds entries, and
prior to this patch, it does cause the segfault you'd expect. But note
that the error is not "your fanout entry is out-of-bounds", but rather
"fanout value out of order". That's because we leave the final fanout
value in place (to get past the table size check), making the index
non-monotonic (the second-to-last entry is big, but the last one must
remain small to match the actual table).

We need adjustments to a few existing tests, as well:

  - an earlier test in t5318 corrupts the fanout and runs "commit-graph
    verify". Its message is now changed, since we catch the problem
    earlier (during the load step, rather than the careful validation
    step).

  - in t5324, we test that "commit-graph verify --shallow" does not do
    expensive verification on the base file of the chain. But the
    corruption it uses (munging a byte at offset 1000) happens to be in
    the middle of the fanout table. And now we detect that problem in
    the cheaper checks that are performed for every part of the graph.
    We'll push this back to offset 1500, which is only caught by the
    more expensive checksum validation.

    Likewise, there's a later test in t5324 which munges an offset 100
    bytes into a file (also in the fanout table) that is referenced by
    an alternates file. So we now find that corruption during the load
    step, rather than the verification step. At the very least we need
    to change the error message (like the case above in t5318). But it
    is probably good to make sure we handle all parts of the
    verification even for alternate graph files. So let's likewise
    corrupt byte 1500 and make sure we found the invalid checksum.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King fc926567ed midx: check size of oid lookup chunk
When reading an on-disk multi-pack-index, we take the number of objects
in the midx from the final value of the fanout table. But we just
blindly assume that the chunk containing the actual oid entries is the
correct size. This can lead to us reading out-of-bounds memory if the
lookup chunk is too small (or if the fanout is corrupted; when they
don't agree we cannot tell which one is wrong).

Note that we bump the assignment of m->num_objects into the fanout
parser callback, so that it's set when we parse the lookup table
(otherwise we'd have to manually record the lookup table size and check
it later).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King 52e2e8d43d commit-graph: check size of oid fanout chunk
We load the oid fanout chunk with pair_chunk(), which means we never see
the size of the chunk. We just assume the on-disk file uses the
appropriate size, and if it's too small we'll access random memory.

It's easy to check this up-front; the fanout always consists of 256
uint32's, since it is a fanout of the first byte of the hash pointing
into the oid index. These parameters can't be changed without
introducing a new chunk type.

This matches the similar check in the midx OIDF chunk (but note that
rather than checking for the error immediately, the graph code just
leaves parts of the struct NULL and checks for required fields later).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King e3c9600397 midx: stop ignoring malformed oid fanout chunk
When we load the oid-fanout chunk, our callback checks that its size is
reasonable and returns an error if not. However, the caller only checks
our return value against CHUNK_NOT_FOUND, so we end up ignoring the
error completely! Using a too-small fanout table means we end up
accessing random memory for the fanout and segfault.

We can fix this by checking for any non-zero return value, rather than
just CHUNK_NOT_FOUND, and adjusting our error message to cover both
cases. We could handle each error code individually, but there's not
much point for such a rare case. The extra message produced in the
callback makes it clear what is going on.

The same pattern is used in the adjacent code. Those cases are actually
OK for now because they do not use a custom callback, so the only error
they can get is CHUNK_NOT_FOUND. But let's convert them, as this is an
accident waiting to happen (especially as we convert some of them away
from pair_chunk). The error messages are more verbose, but it should be
rare for a user to see these anyway.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Jeff King 86b008ee61 t: add library for munging chunk-format files
When testing corruption of files using the chunk format (like
commit-graphs and midx files), it's helpful to be able to modify bytes
in specific chunks. This requires being able both to read the
table-of-contents (to find the chunk to modify) but also to adjust it
(to account for size changes in the offsets of subsequent chunks).

We have some tests already which corrupt chunk files, but they have some
downsides:

  1. They are very brittle, as they manually compute the expected size
     of a particular instance of the file (e.g., see the definitions
     starting with NUM_OBJECTS in t5319).

  2. Because they rely on manual offsets and don't read the
     table-of-contents, they're limited to overwriting bytes. But there
     are many interesting corruptions that involve changing the sizes of
     chunks (especially smaller-than-expected ones).

This patch adds a perl script which makes such corruptions easy. We'll
use it in subsequent patches.

Note that we could get by with just a big "perl -e" inside the helper
function. I chose to put it in a separate script for two reasons. One,
so we don't have to worry about the extra layer of shell quoting. And
two, the script is kind of big, and running the tests with "-x" would
repeatedly dump it into the log output.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:55:00 -07:00
Victoria Dye 5305474ec4 ref-cache.c: fix prefix matching in ref iteration
Update 'cache_ref_iterator_advance' to skip over refs that are not matched
by the given prefix.

Currently, a ref entry is considered "matched" if the entry name is fully
contained within the prefix:

* prefix: "refs/heads/v1"
* entry: "refs/heads/v1.0"

OR if the prefix is fully contained in the entry name:

* prefix: "refs/heads/v1.0"
* entry: "refs/heads/v1"

The first case is always correct, but the second is only correct if the ref
cache entry is a directory, for example:

* prefix: "refs/heads/example"
* entry: "refs/heads/"

Modify the logic in 'cache_ref_iterator_advance' to reflect these
expectations:

1. If 'overlaps_prefix' returns 'PREFIX_EXCLUDES_DIR', then the prefix and
   ref cache entry do not overlap at all. Skip this entry.
2. If 'overlaps_prefix' returns 'PREFIX_WITHIN_DIR', then the prefix matches
   inside this entry if it is a directory. Skip if the entry is not a
   directory, otherwise iterate over it.
3. Otherwise, 'overlaps_prefix' returned 'PREFIX_CONTAINS_DIR', indicating
   that the cache entry (directory or not) is fully contained by or equal to
   the prefix. Iterate over this entry.

Note that condition 2 relies on the names of directory entries having the
appropriate trailing slash. The existing function documentation of
'create_dir_entry' explicitly calls out the trailing slash requirement, so
this is a safe assumption to make.

This bug generally doesn't have any user-facing impact, since it requires:

1. using a non-empty prefix without a trailing slash in an iteration like
   'for_each_fullref_in',
2. the callback to said iteration not reapplying the original filter (as
   for-each-ref does) to ensure unmatched refs are skipped, and
3. the repository having one or more refs that match part of, but not all
   of, the prefix.

However, there are some niche scenarios that meet those criteria
(specifically, 'rev-parse --bisect' and '(log|show|shortlog) --bisect'). Add
tests covering those cases to demonstrate the fix in this patch.

Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 15:53:13 -07:00
John Cai e95bafc52f merge-ort: initialize repo in index state
initialize_attr_index() does not initialize the repo member of
attr_index. Starting in 44451a2e5e (attr: teach "--attr-source=<tree>"
global option to "git", 2023-05-06), this became a problem because
istate->repo gets passed down the call chain starting in
git_check_attr(). This gets passed all the way down to
replace_refs_enabled(), which segfaults when accessing r->gitdir.

Fix this by initializing the repository in the index state.

Signed-off-by: John Cai <johncai86@gmail.com>
Helped-by: Christian Couder <christian.couder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 14:42:02 -07:00
Sergey Organov c8e5cb0658 diff-merges: introduce '--dd' option
This option provides a shortcut to request diff with respect to first
parent for any kind of commit, universally. It's implemented as pure
synonym for "--diff-merges=first-parent --patch".

Gives user quick and universal way to see what changes, exactly, were
brought to a branch by merges as well as by regular commits.

Signed-off-by: Sergey Organov <sorganov@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 12:47:29 -07:00
Andy Koppe 2b09d16aba pretty: fix ref filtering for %(decorate) formats
Mark pretty formats containing "%(decorate" as requiring decoration in
userformat_find_requirements(), same as "%d" and "%D".

Without this, cmd_log_init_finish() didn't invoke load_ref_decorations()
with the decoration_filter it puts together, and hence filtering options
such as --decorate-refs were quietly ignored.

Amend one of the %(decorate) checks in t4205-log-pretty-formats.sh to
test this.

Signed-off-by: Andy Koppe <andy.koppe@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-09 11:25:13 -07:00
Jeff King badf2fe1c3 daemon: free listen_addr before returning
We build up a string list of listen addresses from the command-line
arguments, but never free it. This causes t5811 to complain of a leak
(though curiously it seems to do so only when compiled with gcc, not
with clang).

To handle this correctly, we have to do a little refactoring:

  - there are two exit points from the main function, depending on
    whether we are entering the main loop or serving a single client
    (since rather than a traditional fork model, we re-exec ourselves
    with the extra "--serve" argument to accommodate Windows).

    We don't need --listen at all in the --serve case, of course, but it
    is passed along by the parent daemon, which simply copies all of the
    command-line options it got.

  - we just "return serve()" to run the main loop, giving us no chance
    to do any cleanup

So let's use a "ret" variable to store the return code, and give
ourselves a single exit point at the end. That gives us one place to do
cleanup.

Note that this code also uses the "use a no-dup string-list, but
allocate strings we add to it" trick, meaning string_list_clear() will
not realize it should free them. We can fix this by switching to a "dup"
string-list, but using the "append_nodup" function to add to it (this is
preferable to tweaking the strdup_strings flag before clearing, as it
puts all the subtle memory-ownership code together).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05 14:54:58 -07:00
Jeff King 8ef8da4842 revision: clear decoration structs during release_revisions()
The point of release_revisions() is to free memory associated with the
rev_info struct, but we have several "struct decoration" members that
are left untouched. Since the previous commit introduced a function to
do that, we can just call it.

We do have to provide some specialized callbacks to map the void
pointers onto real ones (the alternative would be casting the existing
function pointers; this generally works because "void *" is usually
interchangeable with a struct pointer, but it is technically forbidden
by the standard).

Since the line-log code does not expose the type it stores in the
decoration (nor of course the function to free it), I put this behind a
generic line_log_free() entry point. It's possible we may need to add
more line-log specific bits anyway (running t4211 shows a number of
other leaks in the line-log code).

While this doubtless cleans up many leaks triggered by the test suite,
the only script which becomes leak-free is t4217, as it does very little
beyond a simple traversal (its existing leak was from the use of
--children, which is now fixed).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05 14:54:57 -07:00
Jeff King 771868243c decorate: add clear_decoration() function
There's not currently any way to free the resources associated with a
decoration struct. As a result, we have several memory leaks which
cannot easily be plugged.

Let's add a "clear" function and make use of it in the example code of
t9004. This removes the only leak from that script, so we can mark it as
passing the leak sanitizer.

Curiously this leak is found only when running SANITIZE=leak with clang,
but not with gcc.  But it is a bog-standard leak: we allocate some
memory in a local variable struct, and then exit main() without
releasing it. I'm not sure why gcc doesn't find it. After this
patch, both compilers report it as leak-free.

Note that the clear function takes a callback to free the individual
entries. That's not needed for our example (which is just decorating
with ints), but will be for real callers.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05 14:54:55 -07:00
Taylor Blau 3c1e2c2113 builtin/repack.c: avoid making cruft packs preferred
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.

But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.

In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.

If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.

Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-05 13:26:11 -07:00