1
0
mirror of https://github.com/git/git synced 2024-07-07 19:39:27 +00:00
Commit Graph

201 Commits

Author SHA1 Message Date
Junio C Hamano
a50036da1a Merge branch 'tb/cruft-packs'
A mechanism to pack unreachable objects into a "cruft pack",
instead of ejecting them into loose form to be reclaimed later, has
been introduced.

* tb/cruft-packs:
  sha1-file.c: don't freshen cruft packs
  builtin/gc.c: conditionally avoid pruning objects via loose
  builtin/repack.c: add cruft packs to MIDX during geometric repack
  builtin/repack.c: use named flags for existing_packs
  builtin/repack.c: allow configuring cruft pack generation
  builtin/repack.c: support generating a cruft pack
  builtin/pack-objects.c: --cruft with expiration
  reachable: report precise timestamps from objects in cruft packs
  reachable: add options to add_unseen_recent_objects_to_traversal
  builtin/pack-objects.c: --cruft without expiration
  builtin/pack-objects.c: return from create_object_entry()
  t/helper: add 'pack-mtimes' test-tool
  pack-mtimes: support writing pack .mtimes files
  chunk-format.h: extract oid_version()
  pack-write: pass 'struct packing_data' to 'stage_tmp_packfiles'
  pack-mtimes: support reading .mtimes files
  Documentation/technical: add cruft-packs.txt
2022-06-03 14:30:37 -07:00
Taylor Blau
d9fef9d90d chunk-format.h: extract oid_version()
There are three definitions of an identical function which converts
`the_hash_algo` into either 1 (for SHA-1) or 2 (for SHA-256). There is a
copy of this function for writing both the commit-graph and
multi-pack-index file, and another inline definition used to write the
.rev header.

Consolidate these into a single definition in chunk-format.h. It's not
clear that this is the best header to define this function in, but it
should do for now.

(Worth noting, the .rev caller expects a 4-byte unsigned, but the other
two callers work with a single unsigned byte. The consolidated version
uses the latter type, and lets the compiler widen it when required).

Another caller will be added in a subsequent patch.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-26 15:48:26 -07:00
Junio C Hamano
8b28e2e2e4 Merge branch 'ds/midx-normalize-pathname-before-comparison'
The path taken by "git multi-pack-index" command from the end user
was compared with path internally prepared by the tool withut first
normalizing, which lead to duplicated paths not being noticed,
which has been corrected.

* ds/midx-normalize-pathname-before-comparison:
  cache: use const char * for get_object_directory()
  multi-pack-index: use --object-dir real path
  midx: use real paths in lookup_multi_pack_index()
2022-05-04 09:51:29 -07:00
Derrick Stolee
eafcc6de52 midx: use real paths in lookup_multi_pack_index()
This helper looks for a parsed multi-pack-index whose object directory
matches the given object_dir. Before going into the loop over the parsed
multi-pack-indexes, it calls find_odb() to ensure that the given
object_dir is actually a known object directory.

However, find_odb() uses real-path manipulations to compare the input to
the alternate directories. This same real-path comparison is not used in
the loop, leading to potential issues with the strcmp().

Update the method to use the real-path values instead.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-25 11:31:11 -07:00
Ævar Arnfjörð Bjarmason
2108fe4a19 revisions API users: add straightforward release_revisions()
Add a release_revisions() to various users of "struct rev_list" in
those straightforward cases where we only need to add the
release_revisions() call to the end of a block, and don't need to
e.g. refactor anything to use a "goto cleanup" pattern.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-13 23:56:08 -07:00
Neeraj Singh
020406eaa5 core.fsync: introduce granular fsync control infrastructure
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.

If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.

This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.

Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
Junio C Hamano
9b7e531f94 Merge branch 'tb/midx-no-bitmap-for-no-objects'
When there is no object to write .bitmap file for, "git
multi-pack-index" triggered an error, instead of just skipping,
which has been corrected.

* tb/midx-no-bitmap-for-no-objects:
  midx: prevent writing a .bitmap without any objects
2022-02-18 13:53:30 -08:00
Taylor Blau
eb57277ba3 midx: prevent writing a .bitmap without any objects
When trying to write a MIDX, we already prevent the case where there
weren't any packs present, and thus we would have written an empty MIDX.

But there is another "empty" case, which is more interesting, and we
don't yet handle. If we try to write a MIDX which has at least one pack,
but those packs together don't contain any objects, we will encounter a
BUG() when trying to use the bitmap corresponding to that MIDX, like so:

    $ git rev-parse HEAD | git pack-objects --revs --use-bitmap-index --stdout >/dev/null
    BUG: pack-revindex.c:394: pack_pos_to_midx: out-of-bounds object at 0

(note that in the above reproduction, both `--use-bitmap-index` and
`--stdout` are important, since without the former we won't even both to
load the .bitmap, and without the latter we wont attempt pack reuse).

The problem occurs when we try to discover the identity of the
preferred pack to determine which range if any of existing packs we can
reuse verbatim. This path is: `reuse_packfile_objects()` ->
`reuse_partial_packfile_from_bitmap()` -> `midx_preferred_pack()`.

    #4  0x000055555575401f in pack_pos_to_midx (m=0x555555997160, pos=0) at pack-revindex.c:394
    #5  0x00005555557502c8 in midx_preferred_pack (bitmap_git=0x55555599c280) at pack-bitmap.c:1431
    #6  0x000055555575036c in reuse_partial_packfile_from_bitmap (bitmap_git=0x55555599c280, packfile_out=0x5555559666b0 <reuse_packfile>,
        entries=0x5555559666b8 <reuse_packfile_objects>, reuse_out=0x5555559666c0 <reuse_packfile_bitmap>) at pack-bitmap.c:1452
    #7  0x00005555556041f6 in get_object_list_from_bitmap (revs=0x7fffffffcbf0) at builtin/pack-objects.c:3658
    #8  0x000055555560465c in get_object_list (ac=2, av=0x555555997050) at builtin/pack-objects.c:3765
    #9  0x0000555555605e4e in cmd_pack_objects (argc=0, argv=0x7fffffffe920, prefix=0x0) at builtin/pack-objects.c:4154

Since neither the .bitmap or MIDX stores the identity of the
preferred pack, we infer it by trying to load the first object in
pseudo-pack order, and then asking the MIDX which pack was chosen to
represent that object.

But this fails our bounds check, since there are zero objects in the
MIDX to begin with, which results in the BUG().

We could catch this more carefully in `midx_preferred_pack()`, but
signaling the absence of a preferred pack out to all of its callers is
somewhat awkward.

Instead, let's avoid writing a MIDX .bitmap without any objects
altogether. We catch this case in `write_midx_internal()`, and emit a
warning if the caller indicated they wanted to write a bitmap before
clearing out the relevant flags. If we somehow got to
write_midx_bitmap(), then we will call BUG(), but this should now be an
unreachable path.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-09 13:08:06 -08:00
Taylor Blau
7f514b7a5e midx: read RIDX chunk when present
When a MIDX contains the new `RIDX` chunk, ensure that the reverse index
is read from it instead of the on-disk .rev file. Since we need to
encode the object order in the MIDX itself for correctness reasons,
there is no point in storing the same data again outside of the MIDX.

So, this patch stops writing separate .rev files, and reads it out of
the MIDX itself. This is possible to do with relatively little new code,
since the format of the RIDX chunk is identical to the data in the .rev
file. In other words, we can implement this by pointing the
`revindex_data` field at the reverse index chunk of the MIDX instead of
the .rev file without any other changes.

Note that we have two knobs that are adjusted for the new tests:
GIT_TEST_MIDX_WRITE_REV and GIT_TEST_MIDX_READ_RIDX. The former controls
whether the MIDX .rev is written at all, and the latter controls whether
we read the MIDX's RIDX chunk.

Both are necessary to ensure that the test added at the beginning of
this series continues to work. This is because we always need to write
the RIDX chunk in the MIDX in order to change its checksum, but we want
to make sure reading the existing .rev file still works (since the RIDX
chunk takes precedence by default).

Arguably this isn't a very interesting mode to test, because the
precedence rules mean that we'll always read the RIDX chunk over the
.rev file. But it makes it impossible for a user to induce corruption in
their repository by adjusting the test knobs (since if we had an
either/or knob they could stop writing the RIDX chunk, allowing them to
tweak the MIDX's object order without changing its checksum).

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:53 -08:00
Taylor Blau
95e8383bac midx.c: make changing the preferred pack safe
The previous patch demonstrates a bug where a MIDX's auxiliary object
order can become out of sync with a MIDX bitmap.

This is because of two confounding factors:

  - First, the object order is stored in a file which is named according
    to the multi-pack index's checksum, and the MIDX does not store the
    object order. This means that the object order can change without
    altering the checksum.

  - But the .rev file is moved into place with finalize_object_file(),
    which link(2)'s the file into place instead of renaming it. For us,
    that means that a modified .rev file will not be moved into place if
    MIDX's checksum was unchanged.

This fix is to force the MIDX's checksum to change when the preferred
pack changes but the set of packs contained in the MIDX does not. In
other words, when the object order changes, the MIDX's checksum needs to
change with it (regardless of whether the MIDX is tracking the same or
different packs).

This prevents a race whereby changing the object order (but not the
packs themselves) enables a reader to see the new .rev file with the old
MIDX, or similarly seeing the new bitmap with the old object order.

But why can't we just stop hardlinking the .rev into place instead
adding additional data to the MIDX? Suppose that's what we did. Then
when we go to generate the new bitmap, we'll load the old MIDX bitmap,
along with the MIDX that it references. That's fine, since the new MIDX
isn't moved into place until after the new bitmap is generated. But the
new object order *has* been moved into place. So we'll read the old
bitmaps in the new order when generating the new bitmap file, meaning
that without this secondary change, bitmap generation itself would
become a victim of the race described here.

This can all be prevented by forcing the MIDX's checksum to change when
the object order does. By embedding the entire object order into the
MIDX, we do just that. That is, the MIDX's checksum will change in
response to any perturbation of the underlying object order. In t5326,
this will cause the MIDX's checksum to update (even without changing the
set of packs in the MIDX), preventing the stale read problem.

Note that this makes it safe to continue to link(2) the MIDX .rev file
into place, since it is now impossible to have a .rev file that is
out-of-sync with the MIDX whose checksum it references. (But we will do
away with MIDX .rev files later in this series anyway, so this is
somewhat of a moot point).

In theory, it is possible to store a "fingerprint" of the full object
order here, so long as that fingerprint changes at least as often as the
full object order does. Some possibilities here include storing the
identity of the preferred pack, along with the mtimes of the
non-preferred packs in a consistent order. But storing a limited part of
the information makes it difficult to reason about whether or not there
are gaps between the two that would cause us to get bitten by this bug
again.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-27 12:07:52 -08:00
Junio C Hamano
49767c3d9f Merge branch 'tb/plug-pack-bitmap-leaks'
Leakfix.

* tb/plug-pack-bitmap-leaks:
  pack-bitmap.c: more aggressively free in free_bitmap_index()
  pack-bitmap.c: don't leak type-level bitmaps
  midx.c: write MIDX filenames to strbuf
  builtin/multi-pack-index.c: don't leak concatenated options
  builtin/repack.c: avoid leaking child arguments
  builtin/pack-objects.c: don't leak memory via arguments
  t/helper/test-read-midx.c: free MIDX within read_midx_file()
  midx.c: don't leak MIDX from verify_midx_file
  midx.c: clean up chunkfile after reading the MIDX
2021-11-29 15:41:49 -08:00
Taylor Blau
60980aed78 midx.c: write MIDX filenames to strbuf
To ask for the name of a MIDX and its corresponding .rev file, callers
invoke get_midx_filename() and get_midx_rev_filename(), respectively.
These both invoke xstrfmt(), allocating a chunk of memory which must be
freed later on.

This makes callers in pack-bitmap.c somewhat awkward. Specifically,
midx_bitmap_filename(), which is implemented like:

    return xstrfmt("%s-%s.bitmap",
                   get_midx_filename(midx->object_dir),
                   hash_to_hex(get_midx_checksum(midx)));

this leaks the second argument to xstrfmt(), which itself was allocated
with xstrfmt(). This caller could assign both the result of
get_midx_filename() and the outer xstrfmt() to a temporary variable,
remembering to free() the former before returning. But that involves a
wasteful copy.

Instead, get_midx_filename() and get_midx_rev_filename() take a strbuf
as an output parameter. This way midx_bitmap_filename() can manipulate
and pass around a temporary buffer which it detaches back to its caller.

That allows us to implement the function without copying or open-coding
get_midx_filename() in a way that doesn't leak.

Update the other callers of get_midx_filename() and
get_midx_rev_filename() accordingly.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-28 15:32:14 -07:00
Taylor Blau
492cb394fb midx.c: don't leak MIDX from verify_midx_file
The function midx.c:verify_midx_file() allocates a MIDX struct by
calling load_multi_pack_index(). But when cleaning up, it calls free()
without freeing any resources associated with the MIDX.

Call the more appropriate close_midx() which does free those resources,
which causes t5319.3 to pass when Git is compiled with SANITIZE=leak.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-27 16:26:37 -07:00
Taylor Blau
692305ec67 midx.c: clean up chunkfile after reading the MIDX
In order to read the contents of a MIDX, we initialize a chunkfile
structure which can read the table of contents and assign pointers into
different sections of the file for us.

We do call free(), since the chunkfile struct is heap allocated, but not
the more appropriate free_chunkfile(), which also frees memory that the
structure itself owns.

Call that instead to avoid leaking memory in this function.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-21 09:07:55 -07:00
Junio C Hamano
0b69bb0fb1 Merge branch 'tb/repack-write-midx'
"git repack" has been taught to generate multi-pack reachability
bitmaps.

* tb/repack-write-midx:
  test-read-midx: fix leak of bitmap_index struct
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: rename variables that deal with non-kept packs
  builtin/repack.c: keep track of existing packs unconditionally
  midx: preliminary support for `--refs-snapshot`
  builtin/multi-pack-index.c: support `--stdin-packs` mode
  midx: expose `write_midx_file_only()` publicly
2021-10-18 15:47:57 -07:00
Taylor Blau
ae22e8415d midx.c: guard against commit_lock_file() failures
When writing a MIDX, we atomically move the new MIDX into place via
commit_lock_file(), but do not check to see if that call was successful.

Make sure that we do check in order to prevent us from incorrectly
reporting that we wrote a new MIDX if we actually encountered an error.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15 13:08:11 -07:00
Taylor Blau
c0f1f9dec4 midx.c: lookup MIDX by object directory during repack
Apply similar treatment as in the last commit to the MIDX `repack`
operation.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15 13:08:11 -07:00
Taylor Blau
98926e0d01 midx.c: lookup MIDX by object directory during expire
Before a new MIDX can be written, expire_midx_packs() first loads the
existing MIDX, figures out which packs can be expired, and then writes a
new MIDX based on that information.

In order to load the existing MIDX, it uses load_multi_pack_index(),
which mmaps the multi-pack-index file, but does not store the resulting
`struct multi_pack_index *` in the object store.

write_midx_internal() also needs to open the existing MIDX, and it does
so by iterating the results of get_multi_pack_index(), so that it reuses
the same pointer held by the object store. But before it can move the
new MIDX into place, it close_object_store() to munmap() the
multi-pack-index file to accommodate platforms like Windows which don't
allow overwriting files which are memory mapped.

That's where things get weird. Since expire_midx_packs has its own
*separate* memory mapped copy of the MIDX, the MIDX file is still memory
mapped! Interestingly, this doesn't seem to cause a problem in our
tests. (I believe that this has much more to do with my own lack of
familiarity with Windows than it does a lack of coverage in our tests).

In any case, we can side-step the whole issue by teaching
expire_midx_packs() to use the `struct multi_pack_index` pointer it
found via the object store instead of maintain its own copy. That way,
when write_midx_internal() calls `close_object_store()`, we know that
there are no memory mapped copies of the MIDX laying around.

A couple of other small notes about this patch:

  - As far as I can tell, passing `local == 1` to the call to
    load_multi_pack_index() was an error, since object_dir could be an
    alternate. But it doesn't matter, since even though we write
    `m->local = 1`, we never read that field back later on.

  - Setting `m = NULL` after write_midx_internal() was likely to prevent
    a double-free back from when that function took a `struct
    multi_pack_index *` that it called close_midx() on itself. We can
    rely on write_midx_internal() to call that for us now.

Finally, this enforces the same "the value of --object-dir must be the
local object store, or an alternate" rule from f57a739691 (midx: avoid
opening multiple MIDXs when writing, 2021-09-01) to the `expire`
sub-command, too.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15 13:08:11 -07:00
Taylor Blau
504131ac26 midx.c: extract MIDX lookup by object_dir
The first thing that write_midx_internal() does is load the MIDX
corresponding to the given object directory, if one is present.

Prepare for other functions in midx.c to do the same thing by extracting
that operation out to a small helper function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-15 13:08:11 -07:00
Junio C Hamano
823b4281ca Merge branch 'tb/repack-write-midx' into tb/fix-midx-rename-while-mapped
* tb/repack-write-midx:
  test-read-midx: fix leak of bitmap_index struct
  builtin/repack.c: pass `--refs-snapshot` when writing bitmaps
  builtin/repack.c: make largest pack preferred
  builtin/repack.c: support writing a MIDX while repacking
  builtin/repack.c: extract showing progress to a variable
  builtin/repack.c: rename variables that deal with non-kept packs
  builtin/repack.c: keep track of existing packs unconditionally
  midx: preliminary support for `--refs-snapshot`
  builtin/multi-pack-index.c: support `--stdin-packs` mode
  midx: expose `write_midx_file_only()` publicly
2021-10-15 13:07:37 -07:00
Junio C Hamano
9567a670d2 Merge branch 'tb/midx-write-propagate-namehash'
"git multi-pack-index write --bitmap" learns to propagate the
hashcache from original bitmap to resulting bitmap.

* tb/midx-write-propagate-namehash:
  t5326: test propagating hashcache values
  p5326: generate pack bitmaps before writing the MIDX bitmap
  p5326: don't set core.multiPackIndex unnecessarily
  p5326: create missing 'perf-tag' tag
  midx.c: respect 'pack.writeBitmapHashcache' when writing bitmaps
  pack-bitmap.c: propagate namehash values from existing bitmaps
  t/helper/test-bitmap.c: add 'dump-hashes' mode
2021-10-11 10:21:46 -07:00
Taylor Blau
08944d1c22 midx: preliminary support for --refs-snapshot
To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.

This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.

This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).

By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.

This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.

The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).

When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-28 21:20:56 -07:00
Taylor Blau
56d863e979 midx: expose write_midx_file_only() publicly
Expose a variant of the write_midx_file() function which ignores packs
that aren't included in an explicit "allow" list.

This will be used in an upcoming patch to power a new `--stdin-packs`
mode of `git multi-pack-index write` for callers that only want to
include certain packs in a MIDX (and ignore any packs which may have
happened to enter the repository independently, e.g., from pushes).

Those patches will provide test coverage for this new function.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-28 21:20:55 -07:00
Junio C Hamano
28caad63d0 Merge branch 'rs/packfile-bad-object-list-in-oidset'
Replace a handcrafted data structure used to keep track of bad
objects in the packfile API by an oidset.

* rs/packfile-bad-object-list-in-oidset:
  packfile: use oidset for bad objects
  packfile: convert has_packed_and_bad() to object_id
  packfile: convert mark_bad_packed_object() to object_id
  midx: inline nth_midxed_pack_entry()
  oidset: make oidset_size() an inline function
2021-09-23 13:44:46 -07:00
Taylor Blau
caca3c9f07 midx.c: respect 'pack.writeBitmapHashcache' when writing bitmaps
In the previous commit, the bitmap writing code learned to propagate
values from an existing hash-cache extension into the bitmap that it is
writing.

Now that that functionality exists, let's expose it by teaching the 'git
multi-pack-index' builtin to respect the `pack.writeBitmapHashCache`
option so that the hash-cache may be written at all.

Two minor points worth noting here:

  - The 'git multi-pack-index write' sub-command didn't previously read
    any configuration (instead this is handled in the base command). A
    separate handler is added here to respect this write-specific
    config option.

  - I briefly considered adding a 'bitmap_flags' field to the static
    options struct, but decided against it since it would require
    plumbing through a new parameter to the write_midx_file() function.

    Instead, a new MIDX-specific flag is added, which is translated to
    the corresponding bitmap one.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-14 16:34:18 -07:00
René Scharfe
09ef66179b packfile: use oidset for bad objects
Store the object ID of broken pack entries in an oidset instead of
keeping only their hashes in an unsorted array.  The resulting code is
shorter and easier to read.  It also handles the (hopefully) very rare
case of having a high number of bad objects better.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12 16:14:32 -07:00
René Scharfe
893b563505 midx: inline nth_midxed_pack_entry()
fill_midx_entry() finds the position of an object ID and passes it to
nth_midxed_pack_entry(), which uses the position to look up the object
ID for its own purposes.  Inline the latter into the former to avoid
that lookup.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-12 16:14:32 -07:00
Jeff King
bfbb60d328 pack-bitmap: drop repository argument from prepare_midx_bitmap_git()
We never look at the repository argument which is passed. This makes
sense, since the multi_pack_index struct already tells us everything we
need to access the files in its associated object directory.

Signed-off-by: Jeff King <peff@peff.net>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-09 17:32:37 -07:00
Taylor Blau
c528e17966 pack-bitmap: write multi-pack bitmaps
Write multi-pack bitmaps in the format described by
Documentation/technical/bitmap-format.txt, inferring their presence with
the absence of '--bitmap'.

To write a multi-pack bitmap, this patch attempts to reuse as much of
the existing machinery from pack-objects as possible. Specifically, the
MIDX code prepares a packing_data struct that pretends as if a single
packfile has been generated containing all of the objects contained
within the MIDX.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 13:56:43 -07:00
Taylor Blau
0f533c7284 pack-bitmap: read multi-pack bitmaps
This prepares the code in pack-bitmap to interpret the new multi-pack
bitmaps described in Documentation/technical/bitmap-format.txt, which
mostly involves converting bit positions to accommodate looking them up
in a MIDX.

Note that there are currently no writers who write multi-pack bitmaps,
and that this will be implemented in the subsequent commit. Note also
that get_midx_checksum() and get_midx_filename() are made non-static so
they can be called from pack-bitmap.c.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 13:56:43 -07:00
Taylor Blau
f57a739691 midx: avoid opening multiple MIDXs when writing
Opening multiple instance of the same MIDX can lead to problems like two
separate packed_git structures which represent the same pack being added
to the repository's object store.

The above scenario can happen because prepare_midx_pack() checks if
`m->packs[pack_int_id]` is NULL in order to determine if a pack has been
opened and installed in the repository before. But a caller can
construct two copies of the same MIDX by calling get_multi_pack_index()
and load_multi_pack_index() since the former manipulates the
object store directly but the latter is a lower-level routine which
allocates a new MIDX for each call.

So if prepare_midx_pack() is called on multiple MIDXs with the same
pack_int_id, then that pack will be installed twice in the object
store's packed_git pointer.

This can lead to problems in, for e.g., the pack-bitmap code, which does
something like the following (in pack-bitmap.c:open_pack_bitmap()):

    struct bitmap_index *bitmap_git = ...;
    for (p = get_all_packs(r); p; p = p->next) {
      if (open_pack_bitmap_1(bitmap_git, p) == 0)
        ret = 0;
    }

which is a problem if two copies of the same pack exist in the
packed_git list because pack-bitmap.c:open_pack_bitmap_1() contains a
conditional like the following:

    if (bitmap_git->pack || bitmap_git->midx) {
      /* ignore extra bitmap file; we can only handle one */
      warning("ignoring extra bitmap file: %s", packfile->pack_name);
      close(fd);
      return -1;
    }

Avoid this scenario by not letting write_midx_internal() open a MIDX
that isn't also pointed at by the object store. So long as this is the
case, other routines should prefer to open MIDXs with
get_multi_pack_index() or reprepare_packed_git() instead of creating
instances on their own. Because get_multi_pack_index() returns
`r->object_store->multi_pack_index` if it is non-NULL, we'll only have
one instance of a MIDX open at one time, avoiding these problems.

To encourage this, drop the `struct multi_pack_index *` parameter from
`write_midx_internal()`, and rely instead on the `object_dir` to find
(or initialize) the correct MIDX instance.

Likewise, replace the call to `close_midx()` with
`close_object_store()`, since we're about to replace the MIDX with a new
one and should invalidate the object store's memory of any MIDX that
might have existed beforehand.

Note that this now forbids passing object directories that don't belong
to alternate repositories over `--object-dir`, since before we would
have happily opened a MIDX in any directory, but now restrict ourselves
to only those reachable by `r->objects->multi_pack_index` (and alternate
MIDXs that we can see by walking the `next` pointer).

As far as I can tell, supporting arbitrary directories with
`--object-dir` was a historical accident, since even the documentation
says `<alt>` when referring to the value passed to this option.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 13:56:43 -07:00
Taylor Blau
9bb6c2e54f midx: close linked MIDXs, avoid leaking memory
When a repository has at least one alternate, the MIDX belonging to each
alternate is accessed through the `next` pointer on the main object
store's copy of the MIDX. close_midx() didn't bother to close any
of the linked MIDXs. It likewise didn't free the memory pointed to by
`m`, leaving uninitialized bytes with live pointers to them left around
in the heap.

Clean this up by closing linked MIDXs, and freeing up the memory pointed
to by each of them. When callers call close_midx(), then they can
discard the entire linked list of MIDXs and set their pointer to the
head of that list to NULL.

This isn't strictly required for the upcoming patches, but it makes it
much more difficult (though still possible, for e.g., by calling
`close_midx(m->next)` which leaves `m->next` pointing at uninitialized
bytes) to have pointers to uninitialized memory.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 10:58:43 -07:00
Taylor Blau
177c0d6e63 midx: infer preferred pack when not given one
In 9218c6a40c (midx: allow marking a pack as preferred, 2021-03-30), the
multi-pack index code learned how to select a pack which all duplicate
objects are selected from. That is, if an object appears in multiple
packs, select the copy in the preferred pack before breaking ties
according to the other rules like pack mtime and readdir() order.

Not specifying a preferred pack can cause serious problems with
multi-pack reachability bitmaps, because these bitmaps rely on having at
least one pack from which all duplicates are selected. Not having such a
pack causes problems with the code in pack-objects to reuse packs
verbatim (e.g., that code assumes that a delta object in a chunk of pack
sent verbatim will have its base object sent from the same pack).

So why does not marking a pack preferred cause problems here? The reason
is roughly as follows:

  - Ties are broken (when handling duplicate objects) by sorting
    according to midx_oid_compare(), which sorts objects by OID,
    preferred-ness, pack mtime, and finally pack ID (more on that
    later).

  - The psuedo pack-order (described in
    Documentation/technical/pack-format.txt under the section
    "multi-pack-index reverse indexes") is computed by
    midx_pack_order(), and sorts by pack ID and pack offset, with
    preferred packs sorting first.

  - But! Pack IDs come from incrementing the pack count in
    add_pack_to_midx(), which is a callback to
    for_each_file_in_pack_dir(), meaning that pack IDs are assigned in
    readdir() order.

When specifying a preferred pack, all of that works fine, because
duplicate objects are correctly resolved in favor of the copy in the
preferred pack, and the preferred pack sorts first in the object order.

"Sorting first" is critical, because the bitmap code relies on finding
out which pack holds the first object in the MIDX's pseudo pack-order to
determine which pack is preferred.

But if we didn't specify a preferred pack, and the pack which comes
first in readdir() order does not also have the lowest timestamp, then
it's possible that that pack (the one that sorts first in pseudo-pack
order, which the bitmap code will treat as the preferred one) did *not*
have all duplicate objects resolved in its favor, resulting in breakage.

The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack
when none was specified. This forces that pack to have duplicates
resolved in its favor, and (critically) to sort first in pseudo-pack
order.  Unfortunately, testing this behavior portably isn't possible,
since it depends on readdir() order which isn't guaranteed by POSIX.

(Note that multi-pack reachability bitmaps have yet to be implemented;
so in that sense this patch is fixing a bug which does not yet exist.
But by having this patch beforehand, we can prevent the bug from ever
materializing.)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 10:58:43 -07:00
Taylor Blau
5d3cd09a80 midx: reject empty --preferred-pack's
The soon-to-be-implemented multi-pack bitmap treats object in the first
bit position specially by assuming that all objects in the pack it was
selected from are also represented from that pack in the MIDX. In other
words, the pack from which the first object was selected must also have
all of its other objects selected from that same pack in the MIDX in
case of any duplicates.

But this assumption relies on the fact that there is at least one object
in that pack to begin with; otherwise the object in the first bit
position isn't from a preferred pack, in which case we can no longer
assume that all objects in that pack were also selected from the same
pack.

Guard this assumption by checking the number of objects in the given
preferred pack, and failing if the given pack is empty.

To make sure we can safely perform this check, open any packs which are
contained in an existing MIDX via prepare_midx_pack(). The same is done
for new packs via the add_pack_to_midx() callback, but packs picked up
from a previous MIDX will not yet have these opened.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 10:58:43 -07:00
Taylor Blau
f5909d34ca midx: clear auxiliary .rev after replacing the MIDX
When writing a new multi-pack index, write_midx_internal() attempts to
clean up any auxiliary files (currently just the MIDX's `.rev` file, but
soon to include a `.bitmap`, too) corresponding to the MIDX it's
replacing.

This step should happen after the new MIDX is written into place, since
doing so beforehand means that the old MIDX could be read without its
corresponding .rev file.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 10:58:43 -07:00
Taylor Blau
426c00e454 midx: fix *.rev cleanups with --object-dir
If using --object-dir to point into an object directory which belongs to
a different repository than the one in the current working directory,
such as:

  git init repo
  git -C repo ... # add some objects
  cd alternate
  git multi-pack-index --object-dir ../repo/.git/objects write

the binary will segfault trying to access the object-dir via the repo it
found, but that's not fully initialized. Worse, if we later call
clear_midx_files_ext(), we will use `the_repository` and remove files
out of the wrong object directory.

Fix this by using the given object_dir (or the object directory of
`the_repository` if `--object-dir` wasn't given) to properly to clean up
the *.rev files, avoiding the crash.

Original-patch-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-01 10:58:43 -07:00
Junio C Hamano
dd6d3c90ee Merge branch 'ab/attribute-format'
Many "printf"-like helper functions we have have been annotated
with __attribute__() to catch placeholder/parameter mismatches.

* ab/attribute-format:
  advice.h: add missing __attribute__((format)) & fix usage
  *.h: add a few missing __attribute__((format))
  *.c static functions: add missing __attribute__((format))
  sequencer.c: move static function to avoid forward decl
  *.c static functions: don't forward-declare __attribute__
2021-07-28 13:17:59 -07:00
Ævar Arnfjörð Bjarmason
48ca53cac4 *.c static functions: add missing __attribute__((format))
Add missing __attribute__((format)) function attributes to various
"static" functions that take printf arguments.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-07-13 15:20:20 -07:00
Taylor Blau
f89ecf7988 midx: report checksum mismatches during 'verify'
'git multi-pack-index verify' inspects the data in an existing MIDX for
correctness by checking that the recorded object offsets are correct,
and so on.

But it does not check that the file's trailing checksum matches the data
that it records. So, if an on-disk corruption happened to occur in the
final few bytes (and all other data was recorded correctly), we would:

  - get a clean result from 'git multi-pack-index verify', but
  - be unable to reuse the existing MIDX when writing a new one (since
    we now check for checksum mismatches before reusing a MIDX)

Teach the 'verify' sub-command to recognize corruption in the checksum
by calling midx_checksum_valid().

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28 20:36:17 -07:00
Taylor Blau
ec1e28ef9c midx: don't reuse corrupt MIDXs when writing
When writing a new multi-pack index, Git tries to reuse as much of the
data from an existing MIDX as possible, like object offsets. This is
done to avoid re-opening a bunch of *.idx files unnecessarily, but can
lead to problems if the data we are reusing is corrupt.

That's because we'll blindly reuse data from an existing MIDX without
checking its trailing checksum for validity. So if there is memory
corruption while writing a MIDX, or disk corruption in the intervening
period between writing and reuse, we'll blindly propagate those bad
values forward.

Suppose we experience a memory corruption while writing a MIDX such that
we write an incorrect object offset (or alternatively, the disk corrupts
the data after being written, but before being reused). Then when we go
to write a new MIDX, we'll reuse the bad object offset without checking
its validity. This means that the MIDX we just wrote is broken, but its
trailing checksum is in-tact, since we never bothered to look at the
values before writing.

In the above, a "git multi-pack-index verify" would have caught the
problem before writing, but writing a new MIDX wouldn't have noticed
anything wrong, blindly carrying forward the corrupt offset.

Individual pack indexes check their validity by verifying the crc32
attached to each entry when carrying data forward during a repack.
We could solve this problem for MIDXs in the same way, but individual
crc32's don't make much sense, since their entries are so small.
Likewise, checking the whole file on every read may be prohibitively
expensive if a repository has a lot of objects, packs, or both.

But we can check the trailing checksum when reusing an existing MIDX
when writing a new one. And a corrupt MIDX need not stop us from writing
a new one, since we can just avoid reusing the existing one at all and
pretend as if we are writing a new MIDX from scratch.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-06-28 20:36:17 -07:00
brian m. carlson
92e2cab96b Always use oidread to read into struct object_id
In the future, we'll want oidread to automatically set the hash
algorithm member for an object ID we read into it, so ensure we use
oidread instead of hashcpy everywhere we're copying a hash value into a
struct object_id.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-27 16:31:38 +09:00
Junio C Hamano
e6b971fcf5 Merge branch 'tb/reverse-midx'
An on-disk reverse-index to map the in-pack location of an object
back to its object name across multiple packfiles is introduced.

* tb/reverse-midx:
  midx.c: improve cache locality in midx_pack_order_cmp()
  pack-revindex: write multi-pack reverse indexes
  pack-write.c: extract 'write_rev_file_order'
  pack-revindex: read multi-pack reverse indexes
  Documentation/technical: describe multi-pack reverse indexes
  midx: make some functions non-static
  midx: keep track of the checksum
  midx: don't free midx_name early
  midx: allow marking a pack as preferred
  t/helper/test-read-midx.c: add '--show-objects'
  builtin/multi-pack-index.c: display usage on unrecognized command
  builtin/multi-pack-index.c: don't enter bogus cmd_mode
  builtin/multi-pack-index.c: split sub-commands
  builtin/multi-pack-index.c: define common usage with a macro
  builtin/multi-pack-index.c: don't handle 'progress' separately
  builtin/multi-pack-index.c: inline 'flags' with options
2021-04-08 13:23:25 -07:00
Jeff King
3007752461 midx.c: improve cache locality in midx_pack_order_cmp()
There is a lot of pointer dereferencing in the pre-image version of
'midx_pack_order_cmp()', which this patch gets rid of.

Instead of comparing the pack preferred-ness and then the pack id, both
of these checks are done at the same time by using the high-order bit of
the pack id to represent whether it's preferred. Then the pack id and
offset are compared as usual.

This produces the same result so long as there are less than 2^31 packs,
which seems like a likely assumption to make in practice.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
38ff7cabb6 pack-revindex: write multi-pack reverse indexes
Implement the writing half of multi-pack reverse indexes. This is
nothing more than the format describe a few patches ago, with a new set
of helper functions that will be used to clear out stale .rev files
corresponding to old MIDXs.

Unfortunately, a very similar comparison function as the one implemented
recently in pack-revindex.c is reimplemented here, this time accepting a
MIDX-internal type. An effort to DRY these up would create more
indirection and overhead than is necessary, so it isn't pursued here.

Currently, there are no callers which pass the MIDX_WRITE_REV_INDEX
flag, meaning that this is all dead code. But, that won't be the case
for long, since subsequent patches will introduce the multi-pack bitmap,
which will begin passing this field.

(In midx.c:write_midx_internal(), the two adjacent if statements share a
conditional, but are written separately since the first one will
eventually also handle the MIDX_WRITE_BITMAP flag, which does not yet
exist.)

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
f894081dea pack-revindex: read multi-pack reverse indexes
Implement reading for multi-pack reverse indexes, as described in the
previous patch.

Note that these functions don't yet have any callers, and won't until
multi-pack reachability bitmaps are introduced in a later patch series.
In the meantime, this patch implements some of the infrastructure
necessary to support multi-pack bitmaps.

There are three new functions exposed by the revindex API:

  - load_midx_revindex(): loads the reverse index corresponding to the
    given multi-pack index.

  - midx_to_pack_pos() and pack_pos_to_midx(): these convert between the
    multi-pack index and pseudo-pack order.

load_midx_revindex() and pack_pos_to_midx() are both relatively
straightforward.

load_midx_revindex() needs a few functions to be exposed from the midx
API. One to get the checksum of a midx, and another to get the .rev's
filename. Similar to recent changes in the packed_git struct, three new
fields are added to the multi_pack_index struct: one to keep track of
the size, one to keep track of the mmap'd pointer, and another to point
past the header and at the reverse index's data.

pack_pos_to_midx() simply reads the corresponding entry out of the
table.

midx_to_pack_pos() is the trickiest, since it needs to find an object's
position in the psuedo-pack order, but that order can only be recovered
in the .rev file itself. This mapping can be implemented with a binary
search, but note that the thing we're binary searching over isn't an
array of values, but rather a permuted order of those values.

So, when comparing two items, it's helpful to keep in mind the
difference. Instead of a traditional binary search, where you are
comparing two things directly, here we're comparing a (pack, offset)
tuple with an index into the multi-pack index. That index describes
another (pack, offset) tuple, and it is _those_ two tuples that are
compared.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
62f2c1b509 midx: make some functions non-static
In a subsequent commit, pack-revindex.c will become responsible for
sorting a list of objects in the "MIDX pack order" (which will be
defined in the following patch). To do so, it will need to be know the
pack identifier and offset within that pack for each object in the MIDX.

The MIDX code already has functions for doing just that
(nth_midxed_offset() and nth_midxed_pack_int_id()), but they are
statically declared.

Since there is no reason that they couldn't be exposed publicly, and
because they are already doing exactly what the caller in
pack-revindex.c will want, expose them publicly so that they can be
reused there.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
9f19161172 midx: keep track of the checksum
write_midx_internal() uses a hashfile to write the multi-pack index, but
discards its checksum. This makes sense, since nothing that takes place
after writing the MIDX cares about its checksum.

That is about to change in a subsequent patch, when the optional
reverse index corresponding to the MIDX will want to include the MIDX's
checksum.

Store the checksum of the MIDX in preparation for that.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
7240cc4b65 midx: don't free midx_name early
A subsequent patch will need to refer back to 'midx_name' later on in
the function. In fact, this variable is already free()'d later on, so
this makes the later free() no longer redundant.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
Taylor Blau
9218c6a40c midx: allow marking a pack as preferred
When multiple packs in the multi-pack index contain the same object, the
MIDX machinery must make a choice about which pack it associates with
that object. Prior to this patch, the lowest-ordered[1] pack was always
selected.

Pack selection for duplicate objects is relatively unimportant today,
but it will become important for multi-pack bitmaps. This is because we
can only invoke the pack-reuse mechanism when all of the bits for reused
objects come from the reuse pack (in order to ensure that all reused
deltas can find their base objects in the same pack).

To encourage the pack selection process to prefer one pack over another
(the pack to be preferred is the one a caller would like to later use as
a reuse pack), introduce the concept of a "preferred pack". When
provided, the MIDX code will always prefer an object found in a
preferred pack over any other.

No format changes are required to store the preferred pack, since it
will be able to be inferred with a corresponding MIDX bitmap, by looking
up the pack associated with the object in the first bit position (this
ordering is described in detail in a subsequent commit).

[1]: the ordering is specified by MIDX internals; for our purposes we
can consider the "lowest ordered" pack to be "the one with the
most-recent mtime.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-01 13:07:37 -07:00
René Scharfe
ca56dadb4b use CALLOC_ARRAY
Add and apply a semantic patch for converting code that open-codes
CALLOC_ARRAY to use it instead.  It shortens the code and infers the
element size automatically.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-13 16:00:09 -08:00
Junio C Hamano
660dd97a62 Merge branch 'ds/chunked-file-api'
The common code to deal with "chunked file format" that is shared
by the multi-pack-index and commit-graph files have been factored
out, to help codepaths for both filetypes to become more robust.

* ds/chunked-file-api:
  commit-graph.c: display correct number of chunks when writing
  chunk-format: add technical docs
  chunk-format: restore duplicate chunk checks
  midx: use 64-bit multiplication for chunk sizes
  midx: use chunk-format read API
  commit-graph: use chunk-format read API
  chunk-format: create read chunk API
  midx: use chunk-format API in write_midx_internal()
  midx: drop chunk progress during write
  midx: return success/failure in chunk write methods
  midx: add num_large_offsets to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add entries to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: rename pack_info to write_midx_context
  commit-graph: use chunk-format write API
  chunk-format: create chunk format write API
  commit-graph: anonymize data in chunk_write_fn
2021-03-01 14:02:57 -08:00
Junio C Hamano
11875561bf Merge branch 'ds/chunked-file-api' into tb/reverse-midx
* ds/chunked-file-api:
  commit-graph.c: display correct number of chunks when writing
  chunk-format: add technical docs
  chunk-format: restore duplicate chunk checks
  midx: use 64-bit multiplication for chunk sizes
  midx: use chunk-format read API
  commit-graph: use chunk-format read API
  chunk-format: create read chunk API
  midx: use chunk-format API in write_midx_internal()
  midx: drop chunk progress during write
  midx: return success/failure in chunk write methods
  midx: add num_large_offsets to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add entries to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: rename pack_info to write_midx_context
  commit-graph: use chunk-format write API
  chunk-format: create chunk format write API
  commit-graph: anonymize data in chunk_write_fn
2021-02-24 15:26:14 -08:00
Derrick Stolee
329fac3a36 midx: use 64-bit multiplication for chunk sizes
When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.

Other possible overflows were discovered by evaluating each
multiplication in midx.c and ensuring that at least one side of the
operator was of type size_t or off_t.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
6ab3b8b8b8 midx: use chunk-format read API
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
63a8f0e9b9 midx: use chunk-format API in write_midx_internal()
The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
c1442410d8 midx: drop chunk progress during write
Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the process of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
0ccd713cb6 midx: return success/failure in chunk write methods
Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents with some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match chunk_write_fn type exactly, so we
   cannot share code with commit-graph.c

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
980f525c3c midx: add num_large_offsets to write_midx_context
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
7a3ada1192 midx: add pack_perm to write_midx_context
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
31bda9a237 midx: add entries to write_midx_context
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data in
the context before those methods can match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
b4d941420b midx: use context in write_midx_pack_names()
In an effort to align the write_midx_internal() to use the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Derrick Stolee
577dc49696 midx: rename pack_info to write_midx_context
In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_info" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-18 13:38:16 -08:00
Junio C Hamano
58e2ce9112 Merge branch 'ma/more-opaque-lock-file'
Code clean-up.

* ma/more-opaque-lock-file:
  read-cache: try not to peek into `struct {lock_,temp}file`
  refs/files-backend: don't peek into `struct lock_file`
  midx: don't peek into `struct lock_file`
  commit-graph: don't peek into `struct lock_file`
  builtin/gc: don't peek into `struct lock_file`
2021-01-25 14:19:17 -08:00
Martin Ågren
acd7160201 midx: don't peek into struct lock_file
Similar to the previous commits, avoid peeking into the `struct
lock_file`. Use the lock file API instead.

The two functions we're calling here double-check that the tempfile is
indeed "active", which is arguably overkill considering how we took the
lock on the line immediately above. More importantly, this future-proofs
us against, e.g., other code appearing between these two lines or the
lock file and/or tempfile internals changing.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-06 13:53:32 -08:00
Martin Ågren
bc62692757 hash-lookup: rename from sha1-lookup
Change all remnants of "sha1" in hash-lookup.c and .h and rename them to
reflect that we're not just able to handle SHA-1 these days.

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-04 13:01:55 -08:00
Junio C Hamano
6bac6a1ef9 Merge branch 'tb/idx-midx-race-fix'
Processes that access packdata while the .idx file gets removed
(e.g. while repacking) did not fail or fall back gracefully as they
could.

* tb/idx-midx-race-fix:
  midx.c: protect against disappearing packs
  packfile.c: protect against disappearing indexes
2020-12-08 15:11:18 -08:00
Junio C Hamano
455e8d18f8 Merge branch 'rs/hashwrite-be64'
Code simplification.

* rs/hashwrite-be64:
  pack-write: use hashwrite_be64()
  midx: use hashwrite_be64()
  csum-file: add hashwrite_be64()
2020-11-25 15:24:52 -08:00
Taylor Blau
506ec2fbda midx.c: protect against disappearing packs
When a packed object is stored in a multi-pack index, but that pack has
racily gone away, the MIDX code simply calls die(), when it could be
returning an error to the caller, which would in turn lead to
re-scanning the pack directory.

A pack can racily disappear, for example, due to a simultaneous 'git
repack -ad',

You can also reproduce this with two terminals, where one is running:

    git init
    while true; do
      git commit -q --allow-empty -m foo
      git repack -ad
      git multi-pack-index write
    done

(in effect, constantly writing new MIDXs), and the other is running:

    obj=$(git rev-parse HEAD)
    while true; do
      echo $obj | git cat-file --batch-check='%(objectsize:disk)' || break
    done

That will sometimes hit the error preparing packfile from
multi-pack-index message, which this patch fixes.

Right now, that path to discovering a missing pack looks something like
'find_pack_entry()' calling 'fill_midx_entry()' and eventually making
its way to call 'nth_midxed_pack_entry()'.

'nth_midxed_pack_entry()' already checks 'is_pack_valid()' and
propagates an error if the pack is invalid. So, this works if the pack
has gone away between calling 'prepare_midx_pack()' and before calling
'is_pack_valid()', but not if it disappears before then.

Catch the case where the pack has already disappeared before
'prepare_midx_pack()' by returning an error in that case, too.

Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-25 13:15:56 -08:00
René Scharfe
ef1b853c15 midx: use hashwrite_be64()
Call hashwrite_be64() to write 64-bit values instead of open-coding it
using hashwrite_be32() and sizeof.  This shortens the code and makes its
intent clearer.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-12 09:40:08 -08:00
Junio C Hamano
52b8c8c716 Merge branch 'ds/maintenance-part-2'
"git maintenance", an extended big brother of "git gc", continues
to evolve.

* ds/maintenance-part-2:
  maintenance: add incremental-repack auto condition
  maintenance: auto-size incremental-repack batch
  maintenance: add incremental-repack task
  midx: use start_delayed_progress()
  midx: enable core.multiPackIndex by default
  maintenance: create auto condition for loose-objects
  maintenance: add loose-objects task
  maintenance: add prefetch task
2020-10-27 15:09:47 -07:00
Derrick Stolee
efdd2f0d4c midx: use start_delayed_progress()
Now that the multi-pack-index may be written as part of auto maintenance
at the end of a command, reduce the progress output when the operations
are quick. Use start_delayed_progress() instead of start_progress().

Update t5319-multi-pack-index.sh to use GIT_PROGRESS_DELAY=0 now that
the progress indicators are conditional.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Derrick Stolee
18e449f86b midx: enable core.multiPackIndex by default
The core.multiPackIndex setting has been around since c4d25228eb
(config: create core.multiPackIndex setting, 2018-07-12), but has been
disabled by default. If a user wishes to use the multi-pack-index
feature, then they must enable this config and run 'git multi-pack-index
write'.

The multi-pack-index feature is relatively stable now, so make the
config option true by default. For users that do not use a
multi-pack-index, the only extra cost will be a file lookup to see if a
multi-pack-index file exists (once per process, per object directory).

Also, this config option will be referenced by an upcoming
"incremental-repack" task in the maintenance builtin, so move the config
option into the repository settings struct. Note that if
GIT_TEST_MULTI_PACK_INDEX=1, then we want to ignore the config option
and treat core.multiPackIndex as enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-25 10:53:04 -07:00
Junio C Hamano
9b8074427b Merge branch 'rs/misc-cleanups'
Misc cleanups.

* rs/misc-cleanups:
  pack-bitmap-write: use hashwrite_be32() in write_hash_cache()
  midx: use hashwrite_u8() in write_midx_header()
  fast-import: use write_pack_header()
2020-09-18 17:58:00 -07:00
Junio C Hamano
a31677dde3 Merge branch 'tb/repack-clearing-midx'
When a packfile is removed by "git repack", multi-pack-index gets
cleared; the code was taught to do so less aggressively by first
checking if the midx actually refers to a pack that no longer
exists.

* tb/repack-clearing-midx:
  midx: traverse the local MIDX first
  builtin/repack.c: invalidate MIDX only when necessary
2020-09-09 13:53:06 -07:00
René Scharfe
014f1447f0 midx: use hashwrite_u8() in write_midx_header()
Emit byte-sized values using hashwrite_u8() instead of buffering them
locally first.  The hashwrite functions already do their own buffering,
so this double-buffering does not reduce the number of system calls.
Getting rid of it shortens and simplifies the code a bit.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-09-06 13:40:40 -07:00
Taylor Blau
59552fb3e2 midx: traverse the local MIDX first
When a repository has an alternate object directory configured, callers
can traverse through each alternate's MIDX by walking the '->next'
pointer.

But, when 'prepare_multi_pack_index_one()' loads multiple MIDXs, it
places the new ones at the front of this pointer chain, not at the end.
This can be confusing for callers such as 'git repack -ad', causing test
failures like in t7700.6 with 'GIT_TEST_MULTI_PACK_INDEX=1'.

The occurs when dropping a pack known to the local MIDX with alternates
configured that have their own MIDX. Since the alternate's MIDX is
returned via 'get_multi_pack_index()', 'midx_contains_pack()' returns
true (which is correct, since it traverses through the '->next' pointer
to find the MIDX in the chain that does contain the requested object).
But, we call 'clear_midx_file()' on 'the_repository', which drops the
MIDX at the path of the first MIDX in the chain, which (in the case of
t7700.6 is the one in the alternate).

This patch addresses that by:

  - placing the local MIDX first in the chain when calling
    'prepare_multi_pack_index_one()', and

  - introducing a new 'get_local_multi_pack_index()', which explicitly
    returns the repository-local MIDX, if any.

Don't impose an additional order on the MIDX's '->next' pointer beyond
that the first item in the chain must be local if one exists so that we
avoid a quadratic insertion.

Likewise, use 'get_local_multi_pack_index()' in
'remove_redundant_pack()' to fix the formerly broken t7700.6 when run
with 'GIT_TEST_MULTI_PACK_INDEX=1'.

Finally, note that the MIDX ordering invariant is only preserved by the
insertion order in 'prepare_packed_git()', which traverses through the
ODB's '->next' pointer, meaning we visit the local object store first.
This fragility makes this an undesirable long-term solution if more
callers are added, but it is acceptable for now since this is the only
caller.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-28 14:07:09 -07:00
Junio C Hamano
d8488b9e86 Merge branch 'rs/more-buffered-io'
Use more buffered I/O where we used to call many small write(2)s.

* rs/more-buffered-io:
  upload-pack: use buffered I/O to talk to rev-list
  midx: use buffered I/O to talk to pack-objects
  connected: use buffered I/O to talk to rev-list
2020-08-24 14:54:31 -07:00
Junio C Hamano
ff20794402 Merge branch 'jk/unleak-fixes'
Fix some incorrect UNLEAK() annotations.

* jk/unleak-fixes:
  ls-remote: simplify UNLEAK() usage
  stop calling UNLEAK() before die()
2020-08-24 14:54:30 -07:00
Junio C Hamano
9e8c7542cb Merge branch 'ds/midx-repack-to-batch-size'
The "--batch-size" option of "git multi-pack-index repack" command
is now used to specify that very small packfiles are collected into
one until the total size roughly exceeds it.

* ds/midx-repack-to-batch-size:
  multi-pack-index: repack batches below --batch-size
2020-08-24 14:54:28 -07:00
Derrick Stolee
d96075428a multi-pack-index: use hash version byte
Similar to the commit-graph format, the multi-pack-index format has a
byte in the header intended to track the hash version used to write the
file. This allows one to interpret the hash length without having the
context of the repository config specifying the hash length. This was
not modified as part of the SHA-256 work because the hash length was
automatically up-shifted due to that config.

Since we have this byte available, we can make the file formats more
obviously incompatible instead of relying on other context from the
repository.

Add a new oid_version() method in midx.c similar to the one in
commit-graph.c. This is specifically made separate from that
implementation to avoid artificially linking the formats.

The test impact requires a few more things than the corresponding change
in the commit-graph format. Specifically, 'test-tool read-midx' was not
writing anything about this header value to output. Since the value
available in 'struct multi_pack_index' is hash_len instead of a version
value, we output "20" or "32" instead of "1" or "2".

Since we want a user to not have their Git commands fail if their
multi-pack-index has the incorrect hash version compared to the
repository's hash version, we relax the die() to an error() in
load_multi_pack_index(). This has some effect on 'git multi-pack-index
verify' as we need to check that a failed parse of a file that exists is
actually a verify error. For that test that checks the hash version
matches, we change the corrupted byte from "2" to "3" to ensure the test
fails for both hash algorithms.

Helped-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 16:45:20 -07:00
René Scharfe
6af3b00abc midx: use buffered I/O to talk to pack-objects
Like f0bca72dc7 (send-pack: use buffered I/O to talk to pack-objects,
2016-06-08), significantly reduce the number of system calls and
simplify the code for sending object IDs to pack-objects by using
stdio's buffering.

Helped-by: Chris Torek <chris.torek@gmail.com>
Helped-by: Johannes Sixt <j6t@kdbg.org>
Encouraged-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-17 10:29:39 -07:00
Jeff King
d5e1961c19 stop calling UNLEAK() before die()
The point of UNLEAK() is to make a reference to a variable that is about
to go out of scope so that leak-checkers will consider it to be
not-leaked. Doing so right before die() is therefore pointless; even
though we are about to exit the program, the variable will still be on
the stack and accessible to leak-checkers.

These annotations aren't really hurting anything, but they clutter the
code and set a bad example of how to use UNLEAK().

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-13 11:04:58 -07:00
Derrick Stolee
1eb22c7dd8 multi-pack-index: repack batches below --batch-size
The --batch-size=<size> option of 'git multi-pack-index repack' is
intended to limit the amount of work done by the repack. In the case of
a large repository, this command should repack a number of small
pack-files but leave the large pack-files alone. Most often, the
repository has one large pack-file from a 'git clone' operation and
number of smaller pack-files from incremental 'git fetch' operations.

The issue with '--batch-size' is that it also _prevents_ the repack from
happening if the expected size of the resulting pack-file is too small.
This was intended as a way to avoid frequent churn of small pack-files,
but it has mostly caused confusion when a repository is of "medium"
size. That is, not enormous like the Windows OS repository, but also not
so small that this incremental repack isn't valuable.

The solution presented here is to collect pack-files for repack if their
expected size is smaller than the batch-size parameter until either the
total expected size exceeds the batch-size or all pack-files are
considered. If there are at least two pack-files, then these are
combined to a new pack-file whose size should not be too much larger
than the batch-size.

This new strategy should succeed in keeping the number of pack-files
small in these "medium" size repositories. The concern about churn is
likely not interesting, as the real control over that is the frequency
in which the repack command is run.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Reviewed-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-08-11 14:05:26 -07:00
Jeff King
c972bf4cf5 strvec: convert remaining callers away from argv_array name
We eventually want to drop the argv_array name and just use strvec
consistently. There's no particular reason we have to do it all at once,
or care about interactions between converted and unconverted bits.
Because of our preprocessor compat layer, the names are interchangeable
to the compiler (so even a definition and declaration using different
names is OK).

This patch converts all of the remaining files, as the resulting diff is
reasonably sized.

The conversion was done purely mechanically with:

  git ls-files '*.c' '*.h' |
  xargs perl -i -pe '
    s/ARGV_ARRAY/STRVEC/g;
    s/argv_array/strvec/g;
  '

We'll deal with any indentation/style fallouts separately.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28 15:02:18 -07:00
Derrick Stolee
3ce4ca0a56 multi-pack-index: respect repack.packKeptObjects=false
When selecting a batch of pack-files to repack in the "git
multi-pack-index repack" command, Git should respect the
repack.packKeptObjects config option. When false, this option says that
the pack-files with an associated ".keep" file should not be repacked.
This config value is "false" by default.

There are two cases for selecting a batch of objects. The first is the
case where the input batch-size is zero, which specifies "repack
everything". The second is with a non-zero batch size, which selects
pack-files using a greedy selection criteria. Both of these cases are
updated and tested.

Reported-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-10 09:50:55 -07:00
Son Luong Ngoc
e11d86de13 midx: teach "git multi-pack-index repack" honor "git repack" configurations
When the "repack" subcommand of "git multi-pack-index" command
creates new packfile(s), it does not call the "git repack"
command but instead directly calls the "git pack-objects"
command, and the configuration variables meant for the "git
repack" command, like "repack.usedaeltabaseoffset", are ignored.

Check the configuration variables used by "git repack" ourselves
in "git multi-index-pack" and pass the corresponding options to
underlying "git pack-objects".

Note that `repack.writeBitmaps` configuration is ignored, as the
pack bitmap facility is useful only with a single packfile.

Signed-off-by: Son Luong Ngoc <sluongng@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-05-10 09:50:53 -07:00
Junio C Hamano
29d74275c4 Merge branch 'ds/multi-pack-index'
The multi-pack-index left mmapped file descriptors open when it
does not have to.

* ds/multi-pack-index:
  multi-pack-index: close file descriptor after mmap
2020-05-01 13:39:55 -07:00
Derrick Stolee
6c7ff7cf7f multi-pack-index: close file descriptor after mmap
The multi-pack-index subsystem was not closing its file descriptor
after memory-mapping the file contents. After this mmap() succeeds,
there is no need to keep the file descriptor open.  In fact, there
is signficant reason to close it so we do not run out of
descriptors.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-24 13:09:49 -07:00
Damien Robert
796d61cdc0 midx.c: fix an integer underflow
When verifying a midx index with 0 objects, the
    m->num_objects - 1
underflows and wraps around to 4294967295.

Fix this both by checking that the midx contains at least one oid,
and also that we don't write any midx when there is no packfiles.

Update the tests to check that `git multi-pack-index write` does
not write an midx when there is no objects, and another to check
that `git multi-pack-index verify` warns when it verifies an midx with no
objects. For this last test, use t5319/no-objects.midx which was
generated by an older version of git.

Signed-off-by: Damien Robert <damien.olivier.robert+git@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-28 16:50:40 -07:00
Jeff King
0763671b8e nth_packed_object_oid(): use customary integer return
Our nth_packed_object_sha1() function returns NULL for error. So when we
wrapped it with nth_packed_object_oid(), we kept the same semantics. But
it's a bit funny, because the caller actually passes in an out
parameter, and the pointer we return is just that same struct they
passed to us (or NULL).

It's not too terrible, but it does make the interface a little
non-idiomatic. Let's switch to our usual "0 for success, negative for
error" return value. Most callers either don't check it, or are
trivially converted. The one that requires the biggest change is
actually improved, as we can ditch an extra aliased pointer variable.

Since we are changing the interface in a subtle way that the compiler
wouldn't catch, let's also change the name to catch any topics in
flight. We can drop the 'o' and make it nth_packed_object_id(). That's
slightly shorter, but also less redundant since the 'o' stands for
"object" already.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-02-24 12:55:42 -08:00
William Baker
64d80e7d52 midx: honor the MIDX_PROGRESS flag in midx_repack
Update midx_repack to only display progress when
the MIDX_PROGRESS flag is set.

Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23 12:05:06 +09:00
William Baker
ad60096d1c midx: honor the MIDX_PROGRESS flag in verify_midx_file
Update verify_midx_file to only display progress when
the MIDX_PROGRESS flag is set.

Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23 12:05:05 +09:00
William Baker
8dc18f8937 midx: add progress to expire_midx_packs
Add progress to expire_midx_packs.  Progress is
displayed when the MIDX_PROGRESS flag is set.

Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23 12:05:05 +09:00
William Baker
840cef0c70 midx: add progress to write_midx_file
Add progress to write_midx_file.  Progress is displayed
when the MIDX_PROGRESS flag is set.

Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23 12:05:05 +09:00
William Baker
efbc3aee08 midx: add MIDX_PROGRESS flag
Add the MIDX_PROGRESS flag and update the
write|verify|expire|repack functions in midx.h
to accept a flags parameter.  The MIDX_PROGRESS
flag indicates whether the caller of the function
would like progress information to be displayed.
This patch only changes the method prototypes
and does not change the functionality. The
functionality change will be handled by a later patch.

Signed-off-by: William Baker <William.Baker@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-23 12:05:05 +09:00
brian m. carlson
aaa95dfa05 midx: switch to using the_hash_algo
Instead of hard-coding the hash size, use the_hash_algo to look up the
hash size at runtime.  Remove the #define constant which was used to
hold the hash length, since writing the expression with the_hash_algo
provide enough documentary value on its own.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-08-19 15:05:00 -07:00
Derrick Stolee
ce1e4a105b midx: implement midx_repack()
To repack with a non-zero batch-size, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, compute their expected size, and add the packs to a list
if they are smaller than the given batch-size. Stop when the total
expected size is at least the batch size.

If the batch size is zero, select all packs in the multi-pack-index.

Finally, collect the objects from the multi-pack-index that are in
the selected packs and send them to 'git pack-objects'. Write a new
multi-pack-index that includes the new pack.

Using a batch size of zero is very similar to a standard 'git repack'
command, except that we do not delete the old packs and instead rely
on the new multi-pack-index to prevent new processes from reading the
old packs. This does not disrupt other Git processes that are currently
reading the old packs based on the old multi-pack-index.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the actual size of the
objects instead of the size of the pack-files. This allows repacking
a large pack-file that has very few referencd objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. The "expected size" version provides similar
behavior, but could skip a pack-file if the average object size is
much larger than the actual size of the referenced objects, or
can create a large pack if the actual size of the referenced objects
is larger than the expected size.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11 10:34:40 -07:00
Derrick Stolee
2af890bb28 multi-pack-index: prepare 'repack' subcommand
In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The subcommand will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

In addition, we hard-code the modified times of the packs in
the pack directory to ensure the list of packs sorted by modified
time matches the order if sorted by size (ascending). This will
be important in a future test.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11 10:34:40 -07:00
Derrick Stolee
19575c7c8e multi-pack-index: implement 'expire' subcommand
The 'git multi-pack-index expire' subcommand looks at the existing
mult-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-fils with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire subcommand. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11 10:34:40 -07:00
Derrick Stolee
d01bf2e609 midx: refactor permutation logic and pack sorting
In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-11 10:34:40 -07:00