2013-09-15 15:33:20 +00:00
|
|
|
#include "builtin.h"
|
2017-06-14 18:07:36 +00:00
|
|
|
#include "config.h"
|
2013-09-15 15:33:20 +00:00
|
|
|
#include "dir.h"
|
2023-03-21 06:26:03 +00:00
|
|
|
#include "environment.h"
|
2023-03-21 06:25:54 +00:00
|
|
|
#include "gettext.h"
|
2023-02-24 00:09:27 +00:00
|
|
|
#include "hex.h"
|
2013-09-15 15:33:20 +00:00
|
|
|
#include "parse-options.h"
|
2023-05-16 06:33:59 +00:00
|
|
|
#include "path.h"
|
2013-09-15 15:33:20 +00:00
|
|
|
#include "run-command.h"
|
2023-04-22 20:17:11 +00:00
|
|
|
#include "server-info.h"
|
2013-09-15 15:33:20 +00:00
|
|
|
#include "strbuf.h"
|
|
|
|
#include "string-list.h"
|
2020-07-28 20:23:39 +00:00
|
|
|
#include "strvec.h"
|
2018-07-12 19:39:40 +00:00
|
|
|
#include "midx.h"
|
2018-08-08 22:34:06 +00:00
|
|
|
#include "packfile.h"
|
2020-03-24 01:07:52 +00:00
|
|
|
#include "prune-packed.h"
|
2023-05-16 06:34:06 +00:00
|
|
|
#include "object-store-ll.h"
|
2019-06-25 13:40:31 +00:00
|
|
|
#include "promisor-remote.h"
|
2020-04-30 19:48:50 +00:00
|
|
|
#include "shallow.h"
|
2021-01-12 08:21:59 +00:00
|
|
|
#include "pack.h"
|
2021-10-01 22:38:10 +00:00
|
|
|
#include "pack-bitmap.h"
|
|
|
|
#include "refs.h"
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
#include "list-objects-filter-options.h"
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2022-05-20 23:18:03 +00:00
|
|
|
#define ALL_INTO_ONE 1
|
|
|
|
#define LOOSEN_UNREACHABLE 2
|
|
|
|
#define PACK_CRUFT 4
|
|
|
|
|
2022-05-20 23:18:08 +00:00
|
|
|
#define DELETE_PACK 1
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
#define RETAIN_PACK 2
|
2022-05-20 23:18:08 +00:00
|
|
|
|
2022-05-20 23:18:03 +00:00
|
|
|
static int pack_everything;
|
2013-09-15 15:33:20 +00:00
|
|
|
static int delta_base_offset = 1;
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
static int pack_kept_objects = -1;
|
2019-03-14 09:12:54 +00:00
|
|
|
static int write_bitmaps = -1;
|
2018-08-16 06:13:10 +00:00
|
|
|
static int use_delta_islands;
|
2022-03-14 07:42:51 +00:00
|
|
|
static int run_update_server_info = 1;
|
repack: avoid loosening promisor objects in partial clones
When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all objects unused during this invocation. Unfortunately,
this includes promisor objects.
Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place. For example, in a newly-cloned partial repo that
filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: In my copy
of the linux.git, the object directory peaked 26GB of more disk usage.
In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened.
For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit via trace2 thus allowing inspecting the debug events after the
process is finished. This new event is used on the added regression
test.
Lastly, add a new perf test to evaluate the performance impact
made by this changes (tested on git.git):
Test HEAD^ HEAD
----------------------------------------------------------
5600.3: gc 134.38(41.93+90.95) 7.80(6.72+1.35) -94.2%
For a bigger repository, such as linux.git, the improvement is
even bigger:
Test HEAD^ HEAD
-------------------------------------------------------------------
5600.3: gc 6833.00(918.07+3162.74) 268.79(227.02+39.18) -96.1%
These improvements are particular big because every object in the
newly-cloned partial repository is a promisor object.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-21 19:32:12 +00:00
|
|
|
static char *packdir, *packtmp_name, *packtmp;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
|
|
|
static const char *const git_repack_usage[] = {
|
2015-01-13 07:44:47 +00:00
|
|
|
N_("git repack [<options>]"),
|
2013-09-15 15:33:20 +00:00
|
|
|
NULL
|
|
|
|
};
|
|
|
|
|
2016-12-28 22:45:42 +00:00
|
|
|
static const char incremental_bitmap_conflict_error[] = N_(
|
|
|
|
"Incremental repacks are incompatible with bitmap indexes. Use\n"
|
2022-06-17 10:03:09 +00:00
|
|
|
"--no-write-bitmap-index or disable the pack.writeBitmaps configuration."
|
2016-12-28 22:45:42 +00:00
|
|
|
);
|
|
|
|
|
2022-05-20 23:18:06 +00:00
|
|
|
struct pack_objects_args {
|
|
|
|
const char *window;
|
|
|
|
const char *window_memory;
|
|
|
|
const char *depth;
|
|
|
|
const char *threads;
|
2023-10-03 00:44:29 +00:00
|
|
|
unsigned long max_pack_size;
|
2022-05-20 23:18:06 +00:00
|
|
|
int no_reuse_delta;
|
|
|
|
int no_reuse_object;
|
|
|
|
int quiet;
|
|
|
|
int local;
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
struct list_objects_filter_options filter_options;
|
2022-05-20 23:18:06 +00:00
|
|
|
};
|
2016-12-28 22:45:42 +00:00
|
|
|
|
config: add ctx arg to config_fn_t
Add a new "const struct config_context *ctx" arg to config_fn_t to hold
additional information about the config iteration operation.
config_context has a "struct key_value_info kvi" member that holds
metadata about the config source being read (e.g. what kind of config
source it is, the filename, etc). In this series, we're only interested
in .kvi, so we could have just used "struct key_value_info" as an arg,
but config_context makes it possible to add/adjust members in the future
without changing the config_fn_t signature. We could also consider other
ways of organizing the args (e.g. moving the config name and value into
config_context or key_value_info), but in my experiments, the
incremental benefit doesn't justify the added complexity (e.g. a
config_fn_t will sometimes invoke another config_fn_t but with a
different config value).
In subsequent commits, the .kvi member will replace the global "struct
config_reader" in config.c, making config iteration a global-free
operation. It requires much more work for the machinery to provide
meaningful values of .kvi, so for now, merely change the signature and
call sites, pass NULL as a placeholder value, and don't rely on the arg
in any meaningful way.
Most of the changes are performed by
contrib/coccinelle/config_fn_ctx.pending.cocci, which, for every
config_fn_t:
- Modifies the signature to accept "const struct config_context *ctx"
- Passes "ctx" to any inner config_fn_t, if needed
- Adds UNUSED attributes to "ctx", if needed
Most config_fn_t instances are easily identified by seeing if they are
called by the various config functions. Most of the remaining ones are
manually named in the .cocci patch. Manual cleanups are still needed,
but the majority of it is trivial; it's either adjusting config_fn_t
that the .cocci patch didn't catch, or adding forward declarations of
"struct config_context ctx" to make the signatures make sense.
The non-trivial changes are in cases where we are invoking a config_fn_t
outside of config machinery, and we now need to decide what value of
"ctx" to pass. These cases are:
- trace2/tr2_cfg.c:tr2_cfg_set_fl()
This is indirectly called by git_config_set() so that the trace2
machinery can notice the new config values and update its settings
using the tr2 config parsing function, i.e. tr2_cfg_cb().
- builtin/checkout.c:checkout_main()
This calls git_xmerge_config() as a shorthand for parsing a CLI arg.
This might be worth refactoring away in the future, since
git_xmerge_config() can call git_default_config(), which can do much
more than just parsing.
Handle them by creating a KVI_INIT macro that initializes "struct
key_value_info" to a reasonable default, and use that to construct the
"ctx" arg.
Signed-off-by: Glen Choo <chooglen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-28 19:26:22 +00:00
|
|
|
static int repack_config(const char *var, const char *value,
|
|
|
|
const struct config_context *ctx, void *cb)
|
2013-09-15 15:33:20 +00:00
|
|
|
{
|
2022-05-20 23:18:06 +00:00
|
|
|
struct pack_objects_args *cruft_po_args = cb;
|
2013-09-15 15:33:20 +00:00
|
|
|
if (!strcmp(var, "repack.usedeltabaseoffset")) {
|
|
|
|
delta_base_offset = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
if (!strcmp(var, "repack.packkeptobjects")) {
|
|
|
|
pack_kept_objects = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2014-06-10 20:20:30 +00:00
|
|
|
if (!strcmp(var, "repack.writebitmaps") ||
|
|
|
|
!strcmp(var, "pack.writebitmaps")) {
|
2014-06-10 20:10:07 +00:00
|
|
|
write_bitmaps = git_config_bool(var, value);
|
2014-06-10 20:09:23 +00:00
|
|
|
return 0;
|
|
|
|
}
|
2018-08-16 06:13:10 +00:00
|
|
|
if (!strcmp(var, "repack.usedeltaislands")) {
|
|
|
|
use_delta_islands = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2022-03-14 07:42:51 +00:00
|
|
|
if (strcmp(var, "repack.updateserverinfo") == 0) {
|
|
|
|
run_update_server_info = git_config_bool(var, value);
|
|
|
|
return 0;
|
|
|
|
}
|
2022-05-20 23:18:06 +00:00
|
|
|
if (!strcmp(var, "repack.cruftwindow"))
|
|
|
|
return git_config_string(&cruft_po_args->window, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftwindowmemory"))
|
|
|
|
return git_config_string(&cruft_po_args->window_memory, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftdepth"))
|
|
|
|
return git_config_string(&cruft_po_args->depth, var, value);
|
|
|
|
if (!strcmp(var, "repack.cruftthreads"))
|
|
|
|
return git_config_string(&cruft_po_args->threads, var, value);
|
config: add ctx arg to config_fn_t
Add a new "const struct config_context *ctx" arg to config_fn_t to hold
additional information about the config iteration operation.
config_context has a "struct key_value_info kvi" member that holds
metadata about the config source being read (e.g. what kind of config
source it is, the filename, etc). In this series, we're only interested
in .kvi, so we could have just used "struct key_value_info" as an arg,
but config_context makes it possible to add/adjust members in the future
without changing the config_fn_t signature. We could also consider other
ways of organizing the args (e.g. moving the config name and value into
config_context or key_value_info), but in my experiments, the
incremental benefit doesn't justify the added complexity (e.g. a
config_fn_t will sometimes invoke another config_fn_t but with a
different config value).
In subsequent commits, the .kvi member will replace the global "struct
config_reader" in config.c, making config iteration a global-free
operation. It requires much more work for the machinery to provide
meaningful values of .kvi, so for now, merely change the signature and
call sites, pass NULL as a placeholder value, and don't rely on the arg
in any meaningful way.
Most of the changes are performed by
contrib/coccinelle/config_fn_ctx.pending.cocci, which, for every
config_fn_t:
- Modifies the signature to accept "const struct config_context *ctx"
- Passes "ctx" to any inner config_fn_t, if needed
- Adds UNUSED attributes to "ctx", if needed
Most config_fn_t instances are easily identified by seeing if they are
called by the various config functions. Most of the remaining ones are
manually named in the .cocci patch. Manual cleanups are still needed,
but the majority of it is trivial; it's either adjusting config_fn_t
that the .cocci patch didn't catch, or adding forward declarations of
"struct config_context ctx" to make the signatures make sense.
The non-trivial changes are in cases where we are invoking a config_fn_t
outside of config machinery, and we now need to decide what value of
"ctx" to pass. These cases are:
- trace2/tr2_cfg.c:tr2_cfg_set_fl()
This is indirectly called by git_config_set() so that the trace2
machinery can notice the new config values and update its settings
using the tr2 config parsing function, i.e. tr2_cfg_cb().
- builtin/checkout.c:checkout_main()
This calls git_xmerge_config() as a shorthand for parsing a CLI arg.
This might be worth refactoring away in the future, since
git_xmerge_config() can call git_default_config(), which can do much
more than just parsing.
Handle them by creating a KVI_INIT macro that initializes "struct
key_value_info" to a reasonable default, and use that to construct the
"ctx" arg.
Signed-off-by: Glen Choo <chooglen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-28 19:26:22 +00:00
|
|
|
return git_default_config(var, value, ctx, cb);
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
struct existing_packs {
|
|
|
|
struct string_list kept_packs;
|
|
|
|
struct string_list non_kept_packs;
|
2023-09-13 19:17:57 +00:00
|
|
|
struct string_list cruft_packs;
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
#define EXISTING_PACKS_INIT { \
|
|
|
|
.kept_packs = STRING_LIST_INIT_DUP, \
|
|
|
|
.non_kept_packs = STRING_LIST_INIT_DUP, \
|
2023-09-13 19:17:57 +00:00
|
|
|
.cruft_packs = STRING_LIST_INIT_DUP, \
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: extract `has_existing_non_kept_packs()`
When there is:
- at least one pre-existing packfile (which is not marked as kept),
- repacking with the `-d` flag, and
- not doing a cruft repack
, then we pass a handful of additional options to the inner
`pack-objects` process, like `--unpack-unreachable`,
`--keep-unreachable`, and `--pack-loose-unreachable`, in addition to
marking any packs we just wrote for promisor remotes as kept in-core
(with `--keep-pack`, as opposed to the presence of a ".keep" file on
disk).
Because we store both cruft and non-cruft packs together in the same
`existing.non_kept_packs` list, it suffices to check its `nr` member to
see if it is zero or not.
But a following change will store cruft- and non-cruft packs separately,
meaning this check would break as a result. Prepare for this by
extracting this part of the check into a new helper function called
`has_existing_non_kept_packs()`.
This patch does not introduce any functional changes, but prepares us to
make a more isolated change in a subsequent patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:54 +00:00
|
|
|
static int has_existing_non_kept_packs(const struct existing_packs *existing)
|
|
|
|
{
|
2023-09-13 19:17:57 +00:00
|
|
|
return existing->non_kept_packs.nr || existing->cruft_packs.nr;
|
builtin/repack.c: extract `has_existing_non_kept_packs()`
When there is:
- at least one pre-existing packfile (which is not marked as kept),
- repacking with the `-d` flag, and
- not doing a cruft repack
, then we pass a handful of additional options to the inner
`pack-objects` process, like `--unpack-unreachable`,
`--keep-unreachable`, and `--pack-loose-unreachable`, in addition to
marking any packs we just wrote for promisor remotes as kept in-core
(with `--keep-pack`, as opposed to the presence of a ".keep" file on
disk).
Because we store both cruft and non-cruft packs together in the same
`existing.non_kept_packs` list, it suffices to check its `nr` member to
see if it is zero or not.
But a following change will store cruft- and non-cruft packs separately,
meaning this check would break as a result. Prepare for this by
extracting this part of the check into a new helper function called
`has_existing_non_kept_packs()`.
This patch does not introduce any functional changes, but prepares us to
make a more isolated change in a subsequent patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:54 +00:00
|
|
|
}
|
|
|
|
|
2023-09-13 19:18:00 +00:00
|
|
|
static void pack_mark_for_deletion(struct string_list_item *item)
|
|
|
|
{
|
|
|
|
item->util = (void*)((uintptr_t)item->util | DELETE_PACK);
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
static void pack_unmark_for_deletion(struct string_list_item *item)
|
|
|
|
{
|
|
|
|
item->util = (void*)((uintptr_t)item->util & ~DELETE_PACK);
|
|
|
|
}
|
|
|
|
|
2023-09-13 19:18:00 +00:00
|
|
|
static int pack_is_marked_for_deletion(struct string_list_item *item)
|
|
|
|
{
|
|
|
|
return (uintptr_t)item->util & DELETE_PACK;
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
static void pack_mark_retained(struct string_list_item *item)
|
|
|
|
{
|
|
|
|
item->util = (void*)((uintptr_t)item->util | RETAIN_PACK);
|
|
|
|
}
|
|
|
|
|
|
|
|
static int pack_is_retained(struct string_list_item *item)
|
|
|
|
{
|
|
|
|
return (uintptr_t)item->util & RETAIN_PACK;
|
|
|
|
}
|
|
|
|
|
2023-09-13 19:17:46 +00:00
|
|
|
static void mark_packs_for_deletion_1(struct string_list *names,
|
|
|
|
struct string_list *list)
|
|
|
|
{
|
|
|
|
struct string_list_item *item;
|
|
|
|
const int hexsz = the_hash_algo->hexsz;
|
|
|
|
|
|
|
|
for_each_string_list_item(item, list) {
|
|
|
|
char *sha1;
|
|
|
|
size_t len = strlen(item->string);
|
|
|
|
if (len < hexsz)
|
|
|
|
continue;
|
|
|
|
sha1 = item->string + len - hexsz;
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
|
|
|
|
if (pack_is_retained(item)) {
|
|
|
|
pack_unmark_for_deletion(item);
|
|
|
|
} else if (!string_list_has_string(names, sha1)) {
|
|
|
|
/*
|
|
|
|
* Mark this pack for deletion, which ensures
|
|
|
|
* that this pack won't be included in a MIDX
|
|
|
|
* (if `--write-midx` was given) and that we
|
|
|
|
* will actually delete this pack (if `-d` was
|
|
|
|
* given).
|
|
|
|
*/
|
2023-09-13 19:18:00 +00:00
|
|
|
pack_mark_for_deletion(item);
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
}
|
2023-09-13 19:17:46 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
static void retain_cruft_pack(struct existing_packs *existing,
|
|
|
|
struct packed_git *cruft)
|
|
|
|
{
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
|
|
|
strbuf_addstr(&buf, pack_basename(cruft));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
|
|
|
item = string_list_lookup(&existing->cruft_packs, buf.buf);
|
|
|
|
if (!item)
|
|
|
|
BUG("could not find cruft pack '%s'", pack_basename(cruft));
|
|
|
|
|
|
|
|
pack_mark_retained(item);
|
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
|
|
|
|
2023-09-13 19:17:46 +00:00
|
|
|
static void mark_packs_for_deletion(struct existing_packs *existing,
|
|
|
|
struct string_list *names)
|
|
|
|
|
|
|
|
{
|
|
|
|
mark_packs_for_deletion_1(names, &existing->non_kept_packs);
|
2023-09-13 19:17:57 +00:00
|
|
|
mark_packs_for_deletion_1(names, &existing->cruft_packs);
|
2023-09-13 19:17:46 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: extract redundant pack cleanup for existing packs
To remove redundant packs at the end of a repacking operation, Git uses
its `remove_redundant_pack()` function in a loop over the set of
pre-existing, non-kept packs.
In a later commit, we will split this list into two, one for
pre-existing cruft pack(s), and another for non-cruft pack(s). Prepare
for this by factoring out the routine to loop over and delete redundant
packs into its own function.
Instead of calling `remove_redundant_pack()` directly, we now will call
`remove_redundant_existing_packs()`, which itself dispatches a call to
`remove_redundant_packs_1()`. Note that the geometric repacking code
will still call `remove_redundant_pack()` directly, but see the previous
commit for more details.
Having `remove_redundant_packs_1()` exist as a separate function may
seem like overkill in this patch. However, a later patch will call
`remove_redundant_packs_1()` once over two separate lists, so this
refactoring sets us up for that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:51 +00:00
|
|
|
static void remove_redundant_pack(const char *dir_name, const char *base_name)
|
|
|
|
{
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
struct multi_pack_index *m = get_local_multi_pack_index(the_repository);
|
|
|
|
strbuf_addf(&buf, "%s.pack", base_name);
|
|
|
|
if (m && midx_contains_pack(m, buf.buf))
|
|
|
|
clear_midx_file(the_repository);
|
|
|
|
strbuf_insertf(&buf, 0, "%s/", dir_name);
|
|
|
|
unlink_pack_path(buf.buf, 1);
|
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void remove_redundant_packs_1(struct string_list *packs)
|
|
|
|
{
|
|
|
|
struct string_list_item *item;
|
|
|
|
for_each_string_list_item(item, packs) {
|
2023-09-13 19:18:00 +00:00
|
|
|
if (!pack_is_marked_for_deletion(item))
|
builtin/repack.c: extract redundant pack cleanup for existing packs
To remove redundant packs at the end of a repacking operation, Git uses
its `remove_redundant_pack()` function in a loop over the set of
pre-existing, non-kept packs.
In a later commit, we will split this list into two, one for
pre-existing cruft pack(s), and another for non-cruft pack(s). Prepare
for this by factoring out the routine to loop over and delete redundant
packs into its own function.
Instead of calling `remove_redundant_pack()` directly, we now will call
`remove_redundant_existing_packs()`, which itself dispatches a call to
`remove_redundant_packs_1()`. Note that the geometric repacking code
will still call `remove_redundant_pack()` directly, but see the previous
commit for more details.
Having `remove_redundant_packs_1()` exist as a separate function may
seem like overkill in this patch. However, a later patch will call
`remove_redundant_packs_1()` once over two separate lists, so this
refactoring sets us up for that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:51 +00:00
|
|
|
continue;
|
|
|
|
remove_redundant_pack(packdir, item->string);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void remove_redundant_existing_packs(struct existing_packs *existing)
|
|
|
|
{
|
|
|
|
remove_redundant_packs_1(&existing->non_kept_packs);
|
2023-09-13 19:17:57 +00:00
|
|
|
remove_redundant_packs_1(&existing->cruft_packs);
|
builtin/repack.c: extract redundant pack cleanup for existing packs
To remove redundant packs at the end of a repacking operation, Git uses
its `remove_redundant_pack()` function in a loop over the set of
pre-existing, non-kept packs.
In a later commit, we will split this list into two, one for
pre-existing cruft pack(s), and another for non-cruft pack(s). Prepare
for this by factoring out the routine to loop over and delete redundant
packs into its own function.
Instead of calling `remove_redundant_pack()` directly, we now will call
`remove_redundant_existing_packs()`, which itself dispatches a call to
`remove_redundant_packs_1()`. Note that the geometric repacking code
will still call `remove_redundant_pack()` directly, but see the previous
commit for more details.
Having `remove_redundant_packs_1()` exist as a separate function may
seem like overkill in this patch. However, a later patch will call
`remove_redundant_packs_1()` once over two separate lists, so this
refactoring sets us up for that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:51 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
static void existing_packs_release(struct existing_packs *existing)
|
|
|
|
{
|
|
|
|
string_list_clear(&existing->kept_packs, 0);
|
|
|
|
string_list_clear(&existing->non_kept_packs, 0);
|
2023-09-13 19:17:57 +00:00
|
|
|
string_list_clear(&existing->cruft_packs, 0);
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
}
|
|
|
|
|
2013-09-15 15:33:20 +00:00
|
|
|
/*
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
* Adds all packs hex strings (pack-$HASH) to either packs->non_kept
|
|
|
|
* or packs->kept based on whether each pack has a corresponding
|
2021-09-29 01:55:12 +00:00
|
|
|
* .keep file or not. Packs without a .keep file are not to be kept
|
|
|
|
* if we are going to pack everything into one file.
|
2013-09-15 15:33:20 +00:00
|
|
|
*/
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
static void collect_pack_filenames(struct existing_packs *existing,
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:10 +00:00
|
|
|
const struct string_list *extra_keep)
|
2013-09-15 15:33:20 +00:00
|
|
|
{
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
struct packed_git *p;
|
builtin/repack.c: only collect fully-formed packs
To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.
Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).
This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.
This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).
But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:
fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'
Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.
There are a couple of things worth noting:
- Since each entry in the `extra_keep` list (which contains the
`--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
suffix from ".pack" to ".idx", and compare that instead.
- Since we use the the `fname_kept_list` to figure out which packs to
delete (with `git repack -d`), we would have previously deleted a
`*.pack` with no index (since the existince of a ".pack" file is
necessary and sufficient to include that pack in the list of
existing non-kept packs).
Now we will leave it alone (since that pack won't appear in the
list). This is far more correct behavior, since we don't want
to race with a pack being staged. Deleting a partially staged pack
is unlikely, however, since the window of time between staging a
pack and moving its .idx file into place is miniscule.
Note that this window does *not* include the time it takes to
receive and index the pack, since the incoming data goes into
"$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
and is thus ignored by collect_pack_filenames().
In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.
Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-07 10:16:17 +00:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
for (p = get_all_packs(the_repository); p; p = p->next) {
|
2018-04-15 15:36:13 +00:00
|
|
|
int i;
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
const char *base;
|
2018-04-15 15:36:13 +00:00
|
|
|
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
if (!p->pack_local)
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:10 +00:00
|
|
|
continue;
|
|
|
|
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
base = pack_basename(p);
|
builtin/repack.c: only collect fully-formed packs
To partition the set of packs based on which ones are "kept" (either
they have a .keep file, or were otherwise marked via the `--keep-pack`
option) and "non-kept" ones (anything else), `git repack` uses its
`collect_pack_filenames()` function.
Ordinarily, we would rely on a convenience function such as
`get_all_packs()` to enumerate and partition the set of packs. But
`collect_pack_filenames()` uses `readdir()` directly to read the
contents of the "$GIT_DIR/objects/pack" directory, and adds each entry
ending in ".pack" to the appropriate list (either kept, or non-kept as
above).
This is subtly racy, since `collect_pack_filenames()` may see a pack
that is not fully staged (i.e., it is missing its ".idx" file).
Ordinarily, this doesn't cause a problem. But it can cause issues when
generating a cruft pack.
This is because `git repack` feeds (among other things) the list of
existing kept packs down to `git pack-objects --cruft` to indicate that
any kept packs will not be removed from the repository (so that the
cruft pack machinery can avoid packing objects that appear in those
packs as cruft).
But `read_cruft_objects()` lists packfiles by calling `get_all_packs()`.
So if a ".pack" file exists (necessary to get that pack to appear to
`collect_pack_filenames()`), but doesn't have a corresponding ".idx"
file (necessary to get that pack to appear via `get_all_packs()`), we'll
complain with:
fatal: could not find pack '.tmp-5841-pack-a6b0150558609c323c496ced21de6f4b66589260.pack'
Fix the above by teaching `collect_pack_filenames()` to only collect
packs with their corresponding `*.idx` files in place, indicating that
those packs have been fully staged.
There are a couple of things worth noting:
- Since each entry in the `extra_keep` list (which contains the
`--keep-pack` names) has a `*.pack` suffix, we'll have to swap the
suffix from ".pack" to ".idx", and compare that instead.
- Since we use the the `fname_kept_list` to figure out which packs to
delete (with `git repack -d`), we would have previously deleted a
`*.pack` with no index (since the existince of a ".pack" file is
necessary and sufficient to include that pack in the list of
existing non-kept packs).
Now we will leave it alone (since that pack won't appear in the
list). This is far more correct behavior, since we don't want
to race with a pack being staged. Deleting a partially staged pack
is unlikely, however, since the window of time between staging a
pack and moving its .idx file into place is miniscule.
Note that this window does *not* include the time it takes to
receive and index the pack, since the incoming data goes into
"$GIT_DIR/objects/tmp_pack_XXXXXX", which does not end in ".pack"
and is thus ignored by collect_pack_filenames().
In the future, this function should probably be rewritten as a callback
to `for_each_file_in_pack_dir()`, but this is the simplest change we
could do in the short-term.
Reported-by: Michael Haggerty <mhagger@github.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-06-07 10:16:17 +00:00
|
|
|
|
2018-04-15 15:36:13 +00:00
|
|
|
for (i = 0; i < extra_keep->nr; i++)
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
if (!fspathcmp(base, extra_keep->items[i].string))
|
2018-04-15 15:36:13 +00:00
|
|
|
break;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, base);
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
2013-09-15 15:33:20 +00:00
|
|
|
|
builtin/repack.c: avoid dir traversal in `collect_pack_filenames()`
When repacking, the function `collect_pack_filenames()` is responsible
for collecting the set of existing packs in the repository, and
partitioning them into "kept" (if the pack has a ".keep" file or was
given via `--keep-pack`) and "nonkept" (otherwise) lists.
This function comes from the original C port of git-repack.sh from back
in a1bbc6c0176 (repack: rewrite the shell script in C, 2013-09-15),
where it first appears as `get_non_kept_pack_filenames()`. At the time,
the implementation was a fairly direct translation from the relevant
portion of git-repack.sh, which looped over the results of
find "$PACKDIR" -type f -name '*.pack'
either ignoring the pack as kept, or adding it to the list of existing
packs.
So the choice to directly translate this function in terms of
`readdir()` in a1bbc6c0176 made sense. At the time, it was possible to
refine the C version in terms of packed_git structs, but was never done.
However, manually enumerating a repository's packs via `readdir()` is
confusing and error-prone. It leads to frustrating inconsistencies
between which packs Git considers to be part of a repository (i.e.,
could be found in the list of packs from `get_all_packs()`), and which
packs `collect_pack_filenames()` considers to meet the same criteria.
This bit us in 73320e49ad (builtin/repack.c: only collect fully-formed
packs, 2023-06-07), and again in the previous commit.
Prevent these issues from biting us in the future by implementing the
`collect_pack_filenames()` function by looping over an array of pointers
to `packed_git` structs, ensuring that we use the same criteria to
determine the set of available packs.
One gotcha here is that we have to ignore non-local packs, since the
original version of `collect_pack_filenames()` only looks at the local
pack directory to collect existing packs.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-07-11 17:32:37 +00:00
|
|
|
if ((extra_keep->nr > 0 && i < extra_keep->nr) || p->pack_keep)
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
string_list_append(&existing->kept_packs, buf.buf);
|
2023-09-13 19:17:57 +00:00
|
|
|
else if (p->is_cruft)
|
|
|
|
string_list_append(&existing->cruft_packs, buf.buf);
|
|
|
|
else
|
|
|
|
string_list_append(&existing->non_kept_packs, buf.buf);
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
2022-05-20 19:01:45 +00:00
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
string_list_sort(&existing->kept_packs);
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
string_list_sort(&existing->non_kept_packs);
|
|
|
|
string_list_sort(&existing->cruft_packs);
|
2013-09-15 15:33:20 +00:00
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
|
|
|
|
2018-08-08 22:34:05 +00:00
|
|
|
static void prepare_pack_objects(struct child_process *cmd,
|
2022-10-24 18:43:03 +00:00
|
|
|
const struct pack_objects_args *args,
|
|
|
|
const char *out)
|
2018-08-08 22:34:05 +00:00
|
|
|
{
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd->args, "pack-objects");
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->window)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--window=%s", args->window);
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->window_memory)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--window-memory=%s", args->window_memory);
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->depth)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--depth=%s", args->depth);
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->threads)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--threads=%s", args->threads);
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->max_pack_size)
|
2023-10-03 00:44:29 +00:00
|
|
|
strvec_pushf(&cmd->args, "--max-pack-size=%lu", args->max_pack_size);
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->no_reuse_delta)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--no-reuse-delta");
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->no_reuse_object)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd->args, "--no-reuse-object");
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->local)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd->args, "--local");
|
2018-08-08 22:34:05 +00:00
|
|
|
if (args->quiet)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd->args, "--quiet");
|
2018-08-08 22:34:05 +00:00
|
|
|
if (delta_base_offset)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd->args, "--delta-base-offset");
|
2022-10-24 18:43:03 +00:00
|
|
|
strvec_push(&cmd->args, out);
|
2018-08-08 22:34:05 +00:00
|
|
|
cmd->git_cmd = 1;
|
|
|
|
cmd->out = -1;
|
|
|
|
}
|
|
|
|
|
2018-08-08 22:34:06 +00:00
|
|
|
/*
|
|
|
|
* Write oid to the given struct child_process's stdin, starting it first if
|
|
|
|
* necessary.
|
|
|
|
*/
|
2023-02-24 06:39:24 +00:00
|
|
|
static int write_oid(const struct object_id *oid,
|
|
|
|
struct packed_git *pack UNUSED,
|
|
|
|
uint32_t pos UNUSED, void *data)
|
2018-08-08 22:34:06 +00:00
|
|
|
{
|
|
|
|
struct child_process *cmd = data;
|
|
|
|
|
|
|
|
if (cmd->in == -1) {
|
|
|
|
if (start_command(cmd))
|
2018-11-10 05:16:10 +00:00
|
|
|
die(_("could not start pack-objects to repack promisor objects"));
|
2018-08-08 22:34:06 +00:00
|
|
|
}
|
|
|
|
|
2024-03-02 19:03:48 +00:00
|
|
|
if (write_in_full(cmd->in, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
|
|
|
|
write_in_full(cmd->in, "\n", 1) < 0)
|
|
|
|
die(_("failed to feed promisor objects to pack-objects"));
|
2018-08-08 22:34:06 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-11-16 18:41:12 +00:00
|
|
|
static struct {
|
|
|
|
const char *name;
|
|
|
|
unsigned optional:1;
|
|
|
|
} exts[] = {
|
|
|
|
{".pack"},
|
packfile: prepare for the existence of '*.rev' files
Specify the format of the on-disk reverse index 'pack-*.rev' file, as
well as prepare the code for the existence of such files.
The reverse index maps from pack relative positions (i.e., an index into
the array of object which is sorted by their offsets within the
packfile) to their position within the 'pack-*.idx' file. Today, this is
done by building up a list of (off_t, uint32_t) tuples for each object
(the off_t corresponding to that object's offset, and the uint32_t
corresponding to its position in the index). To convert between pack and
index position quickly, this array of tuples is radix sorted based on
its offset.
This has two major drawbacks:
First, the in-memory cost scales linearly with the number of objects in
a pack. Each 'struct revindex_entry' is sizeof(off_t) +
sizeof(uint32_t) + padding bytes for a total of 16.
To observe this, force Git to load the reverse index by, for e.g.,
running 'git cat-file --batch-check="%(objectsize:disk)"'. When asking
for a single object in a fresh clone of the kernel, Git needs to
allocate 120+ MB of memory in order to hold the reverse index in memory.
Second, the cost to sort also scales with the size of the pack.
Luckily, this is a linear function since 'load_pack_revindex()' uses a
radix sort, but this cost still must be paid once per pack per process.
As an example, it takes ~60x longer to print the _size_ of an object as
it does to print that entire object's _contents_:
Benchmark #1: git.compile cat-file --batch <obj
Time (mean ± σ): 3.4 ms ± 0.1 ms [User: 3.3 ms, System: 2.1 ms]
Range (min … max): 3.2 ms … 3.7 ms 726 runs
Benchmark #2: git.compile cat-file --batch-check="%(objectsize:disk)" <obj
Time (mean ± σ): 210.3 ms ± 8.9 ms [User: 188.2 ms, System: 23.2 ms]
Range (min … max): 193.7 ms … 224.4 ms 13 runs
Instead, avoid computing and sorting the revindex once per process by
writing it to a file when the pack itself is generated.
The format is relatively straightforward. It contains an array of
uint32_t's, the length of which is equal to the number of objects in the
pack. The ith entry in this table contains the index position of the
ith object in the pack, where "ith object in the pack" is determined by
pack offset.
One thing that the on-disk format does _not_ contain is the full (up to)
eight-byte offset corresponding to each object. This is something that
the in-memory revindex contains (it stores an off_t in 'struct
revindex_entry' along with the same uint32_t that the on-disk format
has). Omit it in the on-disk format, since knowing the index position
for some object is sufficient to get a constant-time lookup in the
pack-*.idx file to ask for an object's offset within the pack.
This trades off between the on-disk size of the 'pack-*.rev' file for
runtime to chase down the offset for some object. Even though the lookup
is constant time, the constant is heavier, since it can potentially
involve two pointer walks in v2 indexes (one to access the 4-byte offset
table, and potentially a second to access the double wide offset table).
Consider trying to map an object's pack offset to a relative position
within that pack. In a cold-cache scenario, more page faults occur while
switching between binary searching through the reverse index and
searching through the *.idx file for an object's offset. Sure enough,
with a cold cache (writing '3' into '/proc/sys/vm/drop_caches' after
'sync'ing), printing out the entire object's contents is still
marginally faster than printing its size:
Benchmark #1: git.compile cat-file --batch-check="%(objectsize:disk)" <obj >/dev/null
Time (mean ± σ): 22.6 ms ± 0.5 ms [User: 2.4 ms, System: 7.9 ms]
Range (min … max): 21.4 ms … 23.5 ms 41 runs
Benchmark #2: git.compile cat-file --batch <obj >/dev/null
Time (mean ± σ): 17.2 ms ± 0.7 ms [User: 2.8 ms, System: 5.5 ms]
Range (min … max): 15.6 ms … 18.2 ms 45 runs
(Numbers taken in the kernel after cheating and using the next patch to
generate a reverse index). There are a couple of approaches to improve
cold cache performance not pursued here:
- We could include the object offsets in the reverse index format.
Predictably, this does result in fewer page faults, but it triples
the size of the file, while simultaneously duplicating a ton of data
already available in the .idx file. (This was the original way I
implemented the format, and it did show
`--batch-check='%(objectsize:disk)'` winning out against `--batch`.)
On the other hand, this increase in size also results in a large
block-cache footprint, which could potentially hurt other workloads.
- We could store the mapping from pack to index position in more
cache-friendly way, like constructing a binary search tree from the
table and writing the values in breadth-first order. This would
result in much better locality, but the price you pay is trading
O(1) lookup in 'pack_pos_to_index()' for an O(log n) one (since you
can no longer directly index the table).
So, neither of these approaches are taken here. (Thankfully, the format
is versioned, so we are free to pursue these in the future.) But, cold
cache performance likely isn't interesting outside of one-off cases like
asking for the size of an object directly. In real-world usage, Git is
often performing many operations in the revindex (i.e., asking about
many objects rather than a single one).
The trade-off is worth it, since we will avoid the vast majority of the
cost of generating the revindex that the extra pointer chase will look
like noise in the following patch's benchmarks.
This patch describes the format and prepares callers (like in
pack-revindex.c) to be able to read *.rev files once they exist. An
implementation of the writer will appear in the next patch, and callers
will gradually begin to start using the writer in the patches that
follow after that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-01-25 23:37:14 +00:00
|
|
|
{".rev", 1},
|
2022-05-20 23:17:35 +00:00
|
|
|
{".mtimes", 1},
|
2020-11-16 18:41:12 +00:00
|
|
|
{".bitmap", 1},
|
|
|
|
{".promisor", 1},
|
2021-09-09 23:24:44 +00:00
|
|
|
{".idx"},
|
2020-11-16 18:41:12 +00:00
|
|
|
};
|
|
|
|
|
2022-10-22 00:21:45 +00:00
|
|
|
struct generated_pack_data {
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
struct tempfile *tempfiles[ARRAY_SIZE(exts)];
|
2022-10-22 00:21:45 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static struct generated_pack_data *populate_pack_exts(const char *name)
|
2020-11-16 18:41:17 +00:00
|
|
|
{
|
|
|
|
struct stat statbuf;
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
2022-10-22 00:21:45 +00:00
|
|
|
struct generated_pack_data *data = xcalloc(1, sizeof(*data));
|
2020-11-16 18:41:17 +00:00
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < ARRAY_SIZE(exts); i++) {
|
|
|
|
strbuf_reset(&path);
|
|
|
|
strbuf_addf(&path, "%s-%s%s", packtmp, name, exts[i].name);
|
|
|
|
|
|
|
|
if (stat(path.buf, &statbuf))
|
|
|
|
continue;
|
|
|
|
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
data->tempfiles[i] = register_tempfile(path.buf);
|
2020-11-16 18:41:17 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
strbuf_release(&path);
|
2022-10-22 00:21:45 +00:00
|
|
|
return data;
|
2020-11-16 18:41:17 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: avoid making cruft packs preferred
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.
But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.
In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.
If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.
Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 21:54:19 +00:00
|
|
|
static int has_pack_ext(const struct generated_pack_data *data,
|
|
|
|
const char *ext)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
for (i = 0; i < ARRAY_SIZE(exts); i++) {
|
|
|
|
if (strcmp(exts[i].name, ext))
|
|
|
|
continue;
|
|
|
|
return !!data->tempfiles[i];
|
|
|
|
}
|
|
|
|
BUG("unknown pack extension: '%s'", ext);
|
|
|
|
}
|
|
|
|
|
2018-08-08 22:34:06 +00:00
|
|
|
static void repack_promisor_objects(const struct pack_objects_args *args,
|
|
|
|
struct string_list *names)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
FILE *out;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
|
2022-10-24 18:43:03 +00:00
|
|
|
prepare_pack_objects(&cmd, args, packtmp);
|
2018-08-08 22:34:06 +00:00
|
|
|
cmd.in = -1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* NEEDSWORK: Giving pack-objects only the OIDs without any ordering
|
|
|
|
* hints may result in suboptimal deltas in the resulting pack. See if
|
|
|
|
* the OIDs can be sent with fake paths such that pack-objects can use a
|
|
|
|
* {type -> existing pack order} ordering when computing deltas instead
|
|
|
|
* of a {type -> size} ordering, which may produce better deltas.
|
|
|
|
*/
|
|
|
|
for_each_packed_object(write_oid, &cmd,
|
|
|
|
FOR_EACH_OBJECT_PROMISOR_ONLY);
|
|
|
|
|
2021-10-28 20:25:48 +00:00
|
|
|
if (cmd.in == -1) {
|
2018-08-08 22:34:06 +00:00
|
|
|
/* No packed objects; cmd was never started */
|
2021-10-28 20:25:48 +00:00
|
|
|
child_process_clear(&cmd);
|
2018-08-08 22:34:06 +00:00
|
|
|
return;
|
2021-10-28 20:25:48 +00:00
|
|
|
}
|
2018-08-08 22:34:06 +00:00
|
|
|
|
|
|
|
close(cmd.in);
|
|
|
|
|
|
|
|
out = xfdopen(cmd.out, "r");
|
|
|
|
while (strbuf_getline_lf(&line, out) != EOF) {
|
2020-11-16 18:41:17 +00:00
|
|
|
struct string_list_item *item;
|
2018-08-08 22:34:06 +00:00
|
|
|
char *promisor_name;
|
2021-01-12 08:21:59 +00:00
|
|
|
|
2018-10-15 00:01:50 +00:00
|
|
|
if (line.len != the_hash_algo->hexsz)
|
2019-01-04 21:33:31 +00:00
|
|
|
die(_("repack: Expecting full hex object ID lines only from pack-objects."));
|
2020-11-16 18:41:17 +00:00
|
|
|
item = string_list_append(names, line.buf);
|
2018-08-08 22:34:06 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* pack-objects creates the .pack and .idx files, but not the
|
|
|
|
* .promisor file. Create the .promisor file, which is empty.
|
2019-10-15 00:12:31 +00:00
|
|
|
*
|
|
|
|
* NEEDSWORK: fetch-pack sometimes generates non-empty
|
|
|
|
* .promisor files containing the ref names and associated
|
|
|
|
* hashes at the point of generation of the corresponding
|
|
|
|
* packfile, but this would not preserve their contents. Maybe
|
|
|
|
* concatenate the contents of all .promisor files instead of
|
|
|
|
* just creating a new empty file.
|
2018-08-08 22:34:06 +00:00
|
|
|
*/
|
|
|
|
promisor_name = mkpathdup("%s-%s.promisor", packtmp,
|
|
|
|
line.buf);
|
2021-01-12 08:21:59 +00:00
|
|
|
write_promisor_file(promisor_name, NULL, 0);
|
2020-11-16 18:41:17 +00:00
|
|
|
|
2022-10-22 00:21:45 +00:00
|
|
|
item->util = populate_pack_exts(item->string);
|
2020-11-16 18:41:17 +00:00
|
|
|
|
2018-08-08 22:34:06 +00:00
|
|
|
free(promisor_name);
|
|
|
|
}
|
|
|
|
fclose(out);
|
|
|
|
if (finish_command(&cmd))
|
2018-11-10 05:16:10 +00:00
|
|
|
die(_("could not finish pack-objects to repack promisor objects"));
|
2018-08-08 22:34:06 +00:00
|
|
|
}
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
struct pack_geometry {
|
|
|
|
struct packed_git **pack;
|
|
|
|
uint32_t pack_nr, pack_alloc;
|
|
|
|
uint32_t split;
|
2023-08-09 20:32:42 +00:00
|
|
|
|
|
|
|
int split_factor;
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
static uint32_t geometry_pack_weight(struct packed_git *p)
|
|
|
|
{
|
|
|
|
if (open_pack_index(p))
|
|
|
|
die(_("cannot open index for %s"), p->pack_name);
|
|
|
|
return p->num_objects;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int geometry_cmp(const void *va, const void *vb)
|
|
|
|
{
|
|
|
|
uint32_t aw = geometry_pack_weight(*(struct packed_git **)va),
|
|
|
|
bw = geometry_pack_weight(*(struct packed_git **)vb);
|
|
|
|
|
|
|
|
if (aw < bw)
|
|
|
|
return -1;
|
|
|
|
if (aw > bw)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
static void init_pack_geometry(struct pack_geometry *geometry,
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
struct existing_packs *existing,
|
2023-04-14 06:02:08 +00:00
|
|
|
const struct pack_objects_args *args)
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
{
|
|
|
|
struct packed_git *p;
|
2022-05-20 19:01:45 +00:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
|
|
|
|
for (p = get_all_packs(the_repository); p; p = p->next) {
|
2023-04-14 06:02:08 +00:00
|
|
|
if (args->local && !p->pack_local)
|
|
|
|
/*
|
|
|
|
* When asked to only repack local packfiles we skip
|
|
|
|
* over any packfiles that are borrowed from alternate
|
|
|
|
* object directories.
|
|
|
|
*/
|
|
|
|
continue;
|
|
|
|
|
2022-05-20 19:01:45 +00:00
|
|
|
if (!pack_kept_objects) {
|
|
|
|
/*
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
* Any pack that has its pack_keep bit set will
|
|
|
|
* appear in existing->kept_packs below, but
|
|
|
|
* this saves us from doing a more expensive
|
|
|
|
* check.
|
2022-05-20 19:01:45 +00:00
|
|
|
*/
|
|
|
|
if (p->pack_keep)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
* The pack may be kept via the --keep-pack
|
|
|
|
* option; check 'existing->kept_packs' to
|
|
|
|
* determine whether to ignore it.
|
2022-05-20 19:01:45 +00:00
|
|
|
*/
|
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
if (string_list_has_string(&existing->kept_packs, buf.buf))
|
2022-05-20 19:01:45 +00:00
|
|
|
continue;
|
|
|
|
}
|
2022-05-20 23:18:03 +00:00
|
|
|
if (p->is_cruft)
|
|
|
|
continue;
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
|
|
|
|
ALLOC_GROW(geometry->pack,
|
|
|
|
geometry->pack_nr + 1,
|
|
|
|
geometry->pack_alloc);
|
|
|
|
|
|
|
|
geometry->pack[geometry->pack_nr] = p;
|
|
|
|
geometry->pack_nr++;
|
|
|
|
}
|
|
|
|
|
|
|
|
QSORT(geometry->pack, geometry->pack_nr, geometry_cmp);
|
2022-05-20 19:01:45 +00:00
|
|
|
strbuf_release(&buf);
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
}
|
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
static void split_pack_geometry(struct pack_geometry *geometry)
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
{
|
|
|
|
uint32_t i;
|
|
|
|
uint32_t split;
|
|
|
|
off_t total_size = 0;
|
|
|
|
|
2021-03-05 15:21:37 +00:00
|
|
|
if (!geometry->pack_nr) {
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
geometry->split = geometry->pack_nr;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* First, count the number of packs (in descending order of size) which
|
|
|
|
* already form a geometric progression.
|
|
|
|
*/
|
|
|
|
for (i = geometry->pack_nr - 1; i > 0; i--) {
|
|
|
|
struct packed_git *ours = geometry->pack[i];
|
|
|
|
struct packed_git *prev = geometry->pack[i - 1];
|
2021-03-05 15:21:56 +00:00
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (unsigned_mult_overflows(geometry->split_factor,
|
|
|
|
geometry_pack_weight(prev)))
|
2021-03-05 15:21:56 +00:00
|
|
|
die(_("pack %s too large to consider in geometric "
|
|
|
|
"progression"),
|
|
|
|
prev->pack_name);
|
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry_pack_weight(ours) <
|
|
|
|
geometry->split_factor * geometry_pack_weight(prev))
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2021-03-05 15:21:50 +00:00
|
|
|
split = i;
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
if (split) {
|
|
|
|
/*
|
|
|
|
* Move the split one to the right, since the top element in the
|
|
|
|
* last-compared pair can't be in the progression. Only do this
|
|
|
|
* when we split in the middle of the array (otherwise if we got
|
|
|
|
* to the end, then the split is in the right place).
|
|
|
|
*/
|
|
|
|
split++;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Then, anything to the left of 'split' must be in a new pack. But,
|
|
|
|
* creating that new pack may cause packs in the heavy half to no longer
|
|
|
|
* form a geometric progression.
|
|
|
|
*
|
|
|
|
* Compute an expected size of the new pack, and then determine how many
|
|
|
|
* packs in the heavy half need to be joined into it (if any) to restore
|
|
|
|
* the geometric progression.
|
|
|
|
*/
|
2021-03-05 15:21:56 +00:00
|
|
|
for (i = 0; i < split; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
|
|
|
|
if (unsigned_add_overflows(total_size, geometry_pack_weight(p)))
|
|
|
|
die(_("pack %s too large to roll up"), p->pack_name);
|
|
|
|
total_size += geometry_pack_weight(p);
|
|
|
|
}
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
for (i = split; i < geometry->pack_nr; i++) {
|
|
|
|
struct packed_git *ours = geometry->pack[i];
|
2021-03-05 15:21:56 +00:00
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (unsigned_mult_overflows(geometry->split_factor,
|
|
|
|
total_size))
|
2021-03-05 15:21:56 +00:00
|
|
|
die(_("pack %s too large to roll up"), ours->pack_name);
|
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry_pack_weight(ours) <
|
|
|
|
geometry->split_factor * total_size) {
|
2021-03-05 15:21:56 +00:00
|
|
|
if (unsigned_add_overflows(total_size,
|
|
|
|
geometry_pack_weight(ours)))
|
|
|
|
die(_("pack %s too large to roll up"),
|
|
|
|
ours->pack_name);
|
|
|
|
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
split++;
|
|
|
|
total_size += geometry_pack_weight(ours);
|
|
|
|
} else
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
geometry->split = split;
|
|
|
|
}
|
|
|
|
|
2023-04-14 06:01:36 +00:00
|
|
|
static struct packed_git *get_preferred_pack(struct pack_geometry *geometry)
|
2021-09-29 01:55:20 +00:00
|
|
|
{
|
2023-04-14 06:01:36 +00:00
|
|
|
uint32_t i;
|
|
|
|
|
2021-09-29 01:55:20 +00:00
|
|
|
if (!geometry) {
|
|
|
|
/*
|
|
|
|
* No geometry means either an all-into-one repack (in which
|
|
|
|
* case there is only one pack left and it is the largest) or an
|
|
|
|
* incremental one.
|
|
|
|
*
|
|
|
|
* If repacking incrementally, then we could check the size of
|
|
|
|
* all packs to determine which should be preferred, but leave
|
|
|
|
* this for later.
|
|
|
|
*/
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
if (geometry->split == geometry->pack_nr)
|
|
|
|
return NULL;
|
2023-04-14 06:01:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The preferred pack is the largest pack above the split line. In
|
|
|
|
* other words, it is the largest pack that does not get rolled up in
|
|
|
|
* the geometric repack.
|
|
|
|
*/
|
|
|
|
for (i = geometry->pack_nr; i > geometry->split; i--)
|
|
|
|
/*
|
|
|
|
* A pack that is not local would never be included in a
|
|
|
|
* multi-pack index. We thus skip over any non-local packs.
|
|
|
|
*/
|
|
|
|
if (geometry->pack[i - 1]->pack_local)
|
|
|
|
return geometry->pack[i - 1];
|
|
|
|
|
|
|
|
return NULL;
|
2021-09-29 01:55:20 +00:00
|
|
|
}
|
|
|
|
|
2023-09-13 19:17:49 +00:00
|
|
|
static void geometry_remove_redundant_packs(struct pack_geometry *geometry,
|
|
|
|
struct string_list *names,
|
|
|
|
struct existing_packs *existing)
|
|
|
|
{
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
uint32_t i;
|
|
|
|
|
|
|
|
for (i = 0; i < geometry->split; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
if (string_list_has_string(names, hash_to_hex(p->hash)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
|
|
|
if ((p->pack_keep) ||
|
|
|
|
(string_list_has_string(&existing->kept_packs, buf.buf)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
remove_redundant_pack(packdir, buf.buf);
|
|
|
|
}
|
|
|
|
|
|
|
|
strbuf_release(&buf);
|
|
|
|
}
|
|
|
|
|
repack: free geometry struct
When the program is ending, we call clear_pack_geometry() to free any
resources in the pack_geometry struct. But the struct itself is
allocated on the heap, and leak-checkers will complain about the
resulting small leak.
This one was marked by Coverity as a "new" leak, though it has existed
since 0fabafd0b9 (builtin/repack.c: add '--geometric' option,
2021-02-22). This might be because recent unrelated changes in the file
confused it about what is new and what is not. But regardless, it is
worth addressing.
We can fix it easily by free-ing the struct. We'll convert our "clear"
function to "free", since the allocation happens in the matching init()
function (though since there is only one call to each, and the struct is
local to this file, it's mostly academic).
Another option would be to put the struct on the stack rather than the
heap. However, this gets tricky, as we check the pointer against NULL in
several places to decide whether we're in geometric mode.
Signed-off-by: Jeff King <peff@peff.net>
Acked-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-08 18:50:23 +00:00
|
|
|
static void free_pack_geometry(struct pack_geometry *geometry)
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
{
|
|
|
|
if (!geometry)
|
|
|
|
return;
|
|
|
|
|
|
|
|
free(geometry->pack);
|
|
|
|
}
|
|
|
|
|
2021-10-01 22:38:10 +00:00
|
|
|
struct midx_snapshot_ref_data {
|
|
|
|
struct tempfile *f;
|
|
|
|
struct oidset seen;
|
|
|
|
int preferred;
|
|
|
|
};
|
|
|
|
|
2022-08-25 17:09:48 +00:00
|
|
|
static int midx_snapshot_ref_one(const char *refname UNUSED,
|
2021-10-01 22:38:10 +00:00
|
|
|
const struct object_id *oid,
|
2022-08-25 17:09:48 +00:00
|
|
|
int flag UNUSED, void *_data)
|
2021-10-01 22:38:10 +00:00
|
|
|
{
|
|
|
|
struct midx_snapshot_ref_data *data = _data;
|
|
|
|
struct object_id peeled;
|
|
|
|
|
|
|
|
if (!peel_iterated_oid(oid, &peeled))
|
|
|
|
oid = &peeled;
|
|
|
|
|
|
|
|
if (oidset_insert(&data->seen, oid))
|
|
|
|
return 0; /* already seen */
|
|
|
|
|
|
|
|
if (oid_object_info(the_repository, oid, NULL) != OBJ_COMMIT)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
fprintf(data->f->fp, "%s%s\n", data->preferred ? "+" : "",
|
|
|
|
oid_to_hex(oid));
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void midx_snapshot_refs(struct tempfile *f)
|
|
|
|
{
|
|
|
|
struct midx_snapshot_ref_data data;
|
|
|
|
const struct string_list *preferred = bitmap_preferred_tips(the_repository);
|
|
|
|
|
|
|
|
data.f = f;
|
|
|
|
data.preferred = 0;
|
|
|
|
oidset_init(&data.seen, 0);
|
|
|
|
|
|
|
|
if (!fdopen_tempfile(f, "w"))
|
|
|
|
die(_("could not open tempfile %s for writing"),
|
|
|
|
get_tempfile_path(f));
|
|
|
|
|
|
|
|
if (preferred) {
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
|
|
|
data.preferred = 1;
|
|
|
|
for_each_string_list_item(item, preferred)
|
|
|
|
for_each_ref_in(item->string, midx_snapshot_ref_one, &data);
|
|
|
|
data.preferred = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
for_each_ref(midx_snapshot_ref_one, &data);
|
|
|
|
|
|
|
|
if (close_tempfile_gently(f)) {
|
|
|
|
int save_errno = errno;
|
|
|
|
delete_tempfile(&f);
|
|
|
|
errno = save_errno;
|
|
|
|
die_errno(_("could not close refs snapshot tempfile"));
|
|
|
|
}
|
|
|
|
|
|
|
|
oidset_clear(&data.seen);
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
static void midx_included_packs(struct string_list *include,
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
struct existing_packs *existing,
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
struct string_list *names,
|
|
|
|
struct pack_geometry *geometry)
|
|
|
|
{
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
for_each_string_list_item(item, &existing->kept_packs)
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
|
|
|
for_each_string_list_item(item, names)
|
|
|
|
string_list_insert(include, xstrfmt("pack-%s.idx", item->string));
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry->split_factor) {
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
uint32_t i;
|
|
|
|
for (i = geometry->split; i < geometry->pack_nr; i++) {
|
|
|
|
struct packed_git *p = geometry->pack[i];
|
|
|
|
|
2023-04-14 06:01:40 +00:00
|
|
|
/*
|
|
|
|
* The multi-pack index never refers to packfiles part
|
|
|
|
* of an alternate object database, so we skip these.
|
|
|
|
* While git-multi-pack-index(1) would silently ignore
|
|
|
|
* them anyway, this allows us to skip executing the
|
|
|
|
* command completely when we have only non-local
|
|
|
|
* packfiles.
|
|
|
|
*/
|
|
|
|
if (!p->pack_local)
|
|
|
|
continue;
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
strbuf_addstr(&buf, ".idx");
|
|
|
|
|
|
|
|
string_list_insert(include, strbuf_detach(&buf, NULL));
|
|
|
|
}
|
|
|
|
} else {
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
for_each_string_list_item(item, &existing->non_kept_packs) {
|
2023-09-13 19:18:00 +00:00
|
|
|
if (pack_is_marked_for_deletion(item))
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
continue;
|
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
|
|
|
}
|
|
|
|
}
|
2023-09-13 19:17:57 +00:00
|
|
|
|
2023-09-13 19:18:03 +00:00
|
|
|
for_each_string_list_item(item, &existing->cruft_packs) {
|
|
|
|
/*
|
|
|
|
* When doing a --geometric repack, there is no need to check
|
|
|
|
* for deleted packs, since we're by definition not doing an
|
|
|
|
* ALL_INTO_ONE repack (hence no packs will be deleted).
|
|
|
|
* Otherwise we must check for and exclude any packs which are
|
|
|
|
* enqueued for deletion.
|
|
|
|
*
|
|
|
|
* So we could omit the conditional below in the --geometric
|
|
|
|
* case, but doing so is unnecessary since no packs are marked
|
|
|
|
* as pending deletion (since we only call
|
|
|
|
* `mark_packs_for_deletion()` when doing an all-into-one
|
|
|
|
* repack).
|
|
|
|
*/
|
|
|
|
if (pack_is_marked_for_deletion(item))
|
|
|
|
continue;
|
|
|
|
string_list_insert(include, xstrfmt("%s.idx", item->string));
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int write_midx_included_packs(struct string_list *include,
|
2021-09-29 01:55:20 +00:00
|
|
|
struct pack_geometry *geometry,
|
builtin/repack.c: avoid making cruft packs preferred
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.
But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.
In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.
If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.
Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 21:54:19 +00:00
|
|
|
struct string_list *names,
|
2021-10-01 22:38:10 +00:00
|
|
|
const char *refs_snapshot,
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
int show_progress, int write_bitmaps)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct string_list_item *item;
|
2023-04-14 06:01:36 +00:00
|
|
|
struct packed_git *preferred = get_preferred_pack(geometry);
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
FILE *in;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!include->nr)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
cmd.in = -1;
|
|
|
|
cmd.git_cmd = 1;
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "multi-pack-index");
|
|
|
|
strvec_pushl(&cmd.args, "write", "--stdin-packs", NULL);
|
|
|
|
|
|
|
|
if (show_progress)
|
|
|
|
strvec_push(&cmd.args, "--progress");
|
|
|
|
else
|
|
|
|
strvec_push(&cmd.args, "--no-progress");
|
|
|
|
|
|
|
|
if (write_bitmaps)
|
|
|
|
strvec_push(&cmd.args, "--bitmap");
|
|
|
|
|
2023-04-14 06:01:36 +00:00
|
|
|
if (preferred)
|
2021-09-29 01:55:20 +00:00
|
|
|
strvec_pushf(&cmd.args, "--preferred-pack=%s",
|
2023-04-14 06:01:36 +00:00
|
|
|
pack_basename(preferred));
|
builtin/repack.c: avoid making cruft packs preferred
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.
But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.
In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.
If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.
Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 21:54:19 +00:00
|
|
|
else if (names->nr) {
|
|
|
|
/* The largest pack was repacked, meaning that either
|
|
|
|
* one or two packs exist depending on whether the
|
|
|
|
* repository has a cruft pack or not.
|
|
|
|
*
|
|
|
|
* Select the non-cruft one as preferred to encourage
|
|
|
|
* pack-reuse among packs containing reachable objects
|
|
|
|
* over unreachable ones.
|
|
|
|
*
|
|
|
|
* (Note we could write multiple packs here if
|
|
|
|
* `--max-pack-size` was given, but any one of them
|
|
|
|
* will suffice, so pick the first one.)
|
|
|
|
*/
|
|
|
|
for_each_string_list_item(item, names) {
|
|
|
|
struct generated_pack_data *data = item->util;
|
|
|
|
if (has_pack_ext(data, ".mtimes"))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
strvec_pushf(&cmd.args, "--preferred-pack=pack-%s.pack",
|
|
|
|
item->string);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* No packs were kept, and no packs were written. The
|
|
|
|
* only thing remaining are .keep packs (unless
|
|
|
|
* --pack-kept-objects was given).
|
|
|
|
*
|
|
|
|
* Set the `--preferred-pack` arbitrarily here.
|
|
|
|
*/
|
|
|
|
;
|
|
|
|
}
|
2021-09-29 01:55:20 +00:00
|
|
|
|
2021-10-01 22:38:10 +00:00
|
|
|
if (refs_snapshot)
|
|
|
|
strvec_pushf(&cmd.args, "--refs-snapshot=%s", refs_snapshot);
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
in = xfdopen(cmd.in, "w");
|
|
|
|
for_each_string_list_item(item, include)
|
|
|
|
fprintf(in, "%s\n", item->string);
|
|
|
|
fclose(in);
|
|
|
|
|
|
|
|
return finish_command(&cmd);
|
|
|
|
}
|
|
|
|
|
2022-10-18 02:45:12 +00:00
|
|
|
static void remove_redundant_bitmaps(struct string_list *include,
|
|
|
|
const char *packdir)
|
|
|
|
{
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
|
|
|
struct string_list_item *item;
|
|
|
|
size_t packdir_len;
|
|
|
|
|
|
|
|
strbuf_addstr(&path, packdir);
|
|
|
|
strbuf_addch(&path, '/');
|
|
|
|
packdir_len = path.len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove any pack bitmaps corresponding to packs which are now
|
|
|
|
* included in the MIDX.
|
|
|
|
*/
|
|
|
|
for_each_string_list_item(item, include) {
|
|
|
|
strbuf_addstr(&path, item->string);
|
|
|
|
strbuf_strip_suffix(&path, ".idx");
|
|
|
|
strbuf_addstr(&path, ".bitmap");
|
|
|
|
|
|
|
|
if (unlink(path.buf) && errno != ENOENT)
|
|
|
|
warning_errno(_("could not remove stale bitmap: %s"),
|
|
|
|
path.buf);
|
|
|
|
|
|
|
|
strbuf_setlen(&path, packdir_len);
|
|
|
|
}
|
|
|
|
strbuf_release(&path);
|
|
|
|
}
|
|
|
|
|
2023-10-02 16:54:58 +00:00
|
|
|
static int finish_pack_objects_cmd(struct child_process *cmd,
|
|
|
|
struct string_list *names,
|
|
|
|
int local)
|
|
|
|
{
|
|
|
|
FILE *out;
|
|
|
|
struct strbuf line = STRBUF_INIT;
|
|
|
|
|
|
|
|
out = xfdopen(cmd->out, "r");
|
|
|
|
while (strbuf_getline_lf(&line, out) != EOF) {
|
|
|
|
struct string_list_item *item;
|
|
|
|
|
|
|
|
if (line.len != the_hash_algo->hexsz)
|
|
|
|
die(_("repack: Expecting full hex object ID lines only "
|
|
|
|
"from pack-objects."));
|
|
|
|
/*
|
|
|
|
* Avoid putting packs written outside of the repository in the
|
|
|
|
* list of names.
|
|
|
|
*/
|
|
|
|
if (local) {
|
|
|
|
item = string_list_append(names, line.buf);
|
|
|
|
item->util = populate_pack_exts(line.buf);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
fclose(out);
|
|
|
|
|
|
|
|
strbuf_release(&line);
|
|
|
|
|
|
|
|
return finish_command(cmd);
|
|
|
|
}
|
|
|
|
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
static int write_filtered_pack(const struct pack_objects_args *args,
|
|
|
|
const char *destination,
|
|
|
|
const char *pack_prefix,
|
|
|
|
struct existing_packs *existing,
|
|
|
|
struct string_list *names)
|
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct string_list_item *item;
|
|
|
|
FILE *in;
|
|
|
|
int ret;
|
|
|
|
const char *caret;
|
|
|
|
const char *scratch;
|
|
|
|
int local = skip_prefix(destination, packdir, &scratch);
|
|
|
|
|
|
|
|
prepare_pack_objects(&cmd, args, destination);
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "--stdin-packs");
|
|
|
|
|
|
|
|
if (!pack_kept_objects)
|
|
|
|
strvec_push(&cmd.args, "--honor-pack-keep");
|
|
|
|
for_each_string_list_item(item, &existing->kept_packs)
|
|
|
|
strvec_pushf(&cmd.args, "--keep-pack=%s", item->string);
|
|
|
|
|
|
|
|
cmd.in = -1;
|
|
|
|
|
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here 'names' contains only the pack(s) that were just
|
|
|
|
* written, which is exactly the packs we want to keep. Also
|
|
|
|
* 'existing_kept_packs' already contains the packs in
|
|
|
|
* 'keep_pack_list'.
|
|
|
|
*/
|
|
|
|
in = xfdopen(cmd.in, "w");
|
|
|
|
for_each_string_list_item(item, names)
|
|
|
|
fprintf(in, "^%s-%s.pack\n", pack_prefix, item->string);
|
|
|
|
for_each_string_list_item(item, &existing->non_kept_packs)
|
|
|
|
fprintf(in, "%s.pack\n", item->string);
|
|
|
|
for_each_string_list_item(item, &existing->cruft_packs)
|
|
|
|
fprintf(in, "%s.pack\n", item->string);
|
|
|
|
caret = pack_kept_objects ? "" : "^";
|
|
|
|
for_each_string_list_item(item, &existing->kept_packs)
|
|
|
|
fprintf(in, "%s%s.pack\n", caret, item->string);
|
|
|
|
fclose(in);
|
|
|
|
|
|
|
|
return finish_pack_objects_cmd(&cmd, names, local);
|
|
|
|
}
|
|
|
|
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
static int existing_cruft_pack_cmp(const void *va, const void *vb)
|
|
|
|
{
|
|
|
|
struct packed_git *a = *(struct packed_git **)va;
|
|
|
|
struct packed_git *b = *(struct packed_git **)vb;
|
|
|
|
|
|
|
|
if (a->pack_size < b->pack_size)
|
|
|
|
return -1;
|
|
|
|
if (a->pack_size > b->pack_size)
|
|
|
|
return 1;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void collapse_small_cruft_packs(FILE *in, size_t max_size,
|
|
|
|
struct existing_packs *existing)
|
|
|
|
{
|
|
|
|
struct packed_git **existing_cruft, *p;
|
|
|
|
struct strbuf buf = STRBUF_INIT;
|
|
|
|
size_t total_size = 0;
|
|
|
|
size_t existing_cruft_nr = 0;
|
|
|
|
size_t i;
|
|
|
|
|
|
|
|
ALLOC_ARRAY(existing_cruft, existing->cruft_packs.nr);
|
|
|
|
|
|
|
|
for (p = get_all_packs(the_repository); p; p = p->next) {
|
|
|
|
if (!(p->is_cruft && p->pack_local))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
strbuf_reset(&buf);
|
|
|
|
strbuf_addstr(&buf, pack_basename(p));
|
|
|
|
strbuf_strip_suffix(&buf, ".pack");
|
|
|
|
|
|
|
|
if (!string_list_has_string(&existing->cruft_packs, buf.buf))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (existing_cruft_nr >= existing->cruft_packs.nr)
|
|
|
|
BUG("too many cruft packs (found %"PRIuMAX", but knew "
|
|
|
|
"of %"PRIuMAX")",
|
|
|
|
(uintmax_t)existing_cruft_nr + 1,
|
|
|
|
(uintmax_t)existing->cruft_packs.nr);
|
|
|
|
existing_cruft[existing_cruft_nr++] = p;
|
|
|
|
}
|
|
|
|
|
|
|
|
QSORT(existing_cruft, existing_cruft_nr, existing_cruft_pack_cmp);
|
|
|
|
|
|
|
|
for (i = 0; i < existing_cruft_nr; i++) {
|
|
|
|
size_t proposed;
|
|
|
|
|
|
|
|
p = existing_cruft[i];
|
|
|
|
proposed = st_add(total_size, p->pack_size);
|
|
|
|
|
|
|
|
if (proposed <= max_size) {
|
|
|
|
total_size = proposed;
|
|
|
|
fprintf(in, "-%s\n", pack_basename(p));
|
|
|
|
} else {
|
|
|
|
retain_cruft_pack(existing, p);
|
|
|
|
fprintf(in, "%s\n", pack_basename(p));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (i = 0; i < existing->non_kept_packs.nr; i++)
|
|
|
|
fprintf(in, "-%s.pack\n",
|
|
|
|
existing->non_kept_packs.items[i].string);
|
|
|
|
|
|
|
|
strbuf_release(&buf);
|
2023-10-07 17:20:31 +00:00
|
|
|
free(existing_cruft);
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
}
|
|
|
|
|
2022-05-20 23:18:03 +00:00
|
|
|
static int write_cruft_pack(const struct pack_objects_args *args,
|
2022-10-24 18:43:09 +00:00
|
|
|
const char *destination,
|
2022-05-20 23:18:03 +00:00
|
|
|
const char *pack_prefix,
|
2022-10-24 18:43:06 +00:00
|
|
|
const char *cruft_expiration,
|
2022-05-20 23:18:03 +00:00
|
|
|
struct string_list *names,
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
struct existing_packs *existing)
|
2022-05-20 23:18:03 +00:00
|
|
|
{
|
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
|
|
|
struct string_list_item *item;
|
2023-10-02 16:54:58 +00:00
|
|
|
FILE *in;
|
2022-05-20 23:18:03 +00:00
|
|
|
int ret;
|
2022-10-24 18:43:09 +00:00
|
|
|
const char *scratch;
|
|
|
|
int local = skip_prefix(destination, packdir, &scratch);
|
2022-05-20 23:18:03 +00:00
|
|
|
|
2022-10-24 18:43:09 +00:00
|
|
|
prepare_pack_objects(&cmd, args, destination);
|
2022-05-20 23:18:03 +00:00
|
|
|
|
|
|
|
strvec_push(&cmd.args, "--cruft");
|
|
|
|
if (cruft_expiration)
|
|
|
|
strvec_pushf(&cmd.args, "--cruft-expiration=%s",
|
|
|
|
cruft_expiration);
|
|
|
|
|
|
|
|
strvec_push(&cmd.args, "--honor-pack-keep");
|
|
|
|
strvec_push(&cmd.args, "--non-empty");
|
|
|
|
|
|
|
|
cmd.in = -1;
|
|
|
|
|
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* names has a confusing double use: it both provides the list
|
|
|
|
* of just-written new packs, and accepts the name of the cruft
|
|
|
|
* pack we are writing.
|
|
|
|
*
|
|
|
|
* By the time it is read here, it contains only the pack(s)
|
|
|
|
* that were just written, which is exactly the set of packs we
|
|
|
|
* want to consider kept.
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
*
|
|
|
|
* If `--expire-to` is given, the double-use served by `names`
|
|
|
|
* ensures that the pack written to `--expire-to` excludes any
|
|
|
|
* objects contained in the cruft pack.
|
2022-05-20 23:18:03 +00:00
|
|
|
*/
|
|
|
|
in = xfdopen(cmd.in, "w");
|
|
|
|
for_each_string_list_item(item, names)
|
|
|
|
fprintf(in, "%s-%s.pack\n", pack_prefix, item->string);
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
if (args->max_pack_size && !cruft_expiration) {
|
|
|
|
collapse_small_cruft_packs(in, args->max_pack_size, existing);
|
|
|
|
} else {
|
|
|
|
for_each_string_list_item(item, &existing->non_kept_packs)
|
|
|
|
fprintf(in, "-%s.pack\n", item->string);
|
|
|
|
for_each_string_list_item(item, &existing->cruft_packs)
|
|
|
|
fprintf(in, "-%s.pack\n", item->string);
|
|
|
|
}
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
for_each_string_list_item(item, &existing->kept_packs)
|
2022-05-20 23:18:03 +00:00
|
|
|
fprintf(in, "%s.pack\n", item->string);
|
|
|
|
fclose(in);
|
|
|
|
|
2023-10-02 16:54:58 +00:00
|
|
|
return finish_pack_objects_cmd(&cmd, names, local);
|
2022-05-20 23:18:03 +00:00
|
|
|
}
|
|
|
|
|
2023-10-02 16:54:59 +00:00
|
|
|
static const char *find_pack_prefix(const char *packdir, const char *packtmp)
|
|
|
|
{
|
|
|
|
const char *pack_prefix;
|
|
|
|
if (!skip_prefix(packtmp, packdir, &pack_prefix))
|
|
|
|
die(_("pack prefix %s does not begin with objdir %s"),
|
|
|
|
packtmp, packdir);
|
|
|
|
if (*pack_prefix == '/')
|
|
|
|
pack_prefix++;
|
|
|
|
return pack_prefix;
|
|
|
|
}
|
|
|
|
|
2013-09-15 15:33:20 +00:00
|
|
|
int cmd_repack(int argc, const char **argv, const char *prefix)
|
|
|
|
{
|
2014-08-19 19:09:35 +00:00
|
|
|
struct child_process cmd = CHILD_PROCESS_INIT;
|
2013-09-15 15:33:20 +00:00
|
|
|
struct string_list_item *item;
|
|
|
|
struct string_list names = STRING_LIST_INIT_DUP;
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
struct existing_packs existing = EXISTING_PACKS_INIT;
|
2023-08-09 20:32:42 +00:00
|
|
|
struct pack_geometry geometry = { 0 };
|
2021-10-01 22:38:10 +00:00
|
|
|
struct tempfile *refs_snapshot = NULL;
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
int i, ext, ret;
|
2021-12-20 14:48:11 +00:00
|
|
|
int show_progress;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
|
|
|
/* variables to be filled by option parsing */
|
|
|
|
int delete_redundant = 0;
|
2014-01-23 01:28:30 +00:00
|
|
|
const char *unpack_unreachable = NULL;
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 04:36:28 +00:00
|
|
|
int keep_unreachable = 0;
|
2018-04-15 15:36:13 +00:00
|
|
|
struct string_list keep_pack_list = STRING_LIST_INIT_NODUP;
|
2018-08-08 22:34:05 +00:00
|
|
|
struct pack_objects_args po_args = {NULL};
|
2022-05-20 23:18:06 +00:00
|
|
|
struct pack_objects_args cruft_po_args = {NULL};
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
int write_midx = 0;
|
2022-10-24 18:43:06 +00:00
|
|
|
const char *cruft_expiration = NULL;
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
const char *expire_to = NULL;
|
2023-10-02 16:55:03 +00:00
|
|
|
const char *filter_to = NULL;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
|
|
|
struct option builtin_repack_options[] = {
|
|
|
|
OPT_BIT('a', NULL, &pack_everything,
|
|
|
|
N_("pack everything in a single pack"), ALL_INTO_ONE),
|
|
|
|
OPT_BIT('A', NULL, &pack_everything,
|
|
|
|
N_("same as -a, and turn unreachable objects loose"),
|
|
|
|
LOOSEN_UNREACHABLE | ALL_INTO_ONE),
|
2022-05-20 23:18:03 +00:00
|
|
|
OPT_BIT(0, "cruft", &pack_everything,
|
|
|
|
N_("same as -a, pack unreachable cruft objects separately"),
|
|
|
|
PACK_CRUFT),
|
|
|
|
OPT_STRING(0, "cruft-expiration", &cruft_expiration, N_("approxidate"),
|
2023-04-18 20:40:35 +00:00
|
|
|
N_("with --cruft, expire objects older than this")),
|
builtin/repack.c: implement support for `--max-cruft-size`
Cruft packs are an alternative mechanism for storing a collection of
unreachable objects whose mtimes are recent enough to avoid being
pruned out of the repository.
When cruft packs were first introduced back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20) and
a7d493833f (builtin/pack-objects.c: --cruft with expiration,
2022-05-20), the recommended workflow consisted of:
- Repacking periodically, either by packing anything loose in the
repository (via `git repack -d`) or producing a geometric sequence
of packs (via `git repack --geometric=<d> -d`).
- Every so often, splitting the repository into two packs, one cruft
to store the unreachable objects, and another non-cruft pack to
store the reachable objects.
Repositories may (out of band with the above) choose periodically to
prune out some unreachable objects which have aged out of the grace
period by generating a pack with `--cruft-expiration=<approxidate>`.
This allowed repositories to maintain relatively few packs on average,
and quarantine unreachable objects together in a cruft pack, avoiding
the pitfalls of holding unreachable objects as loose while they age out
(for more, see some of the details in 3d89a8c118
(Documentation/technical: add cruft-packs.txt, 2022-05-20)).
This all works, but can be costly from an I/O-perspective when
frequently repacking a repository that has many unreachable objects.
This problem is exacerbated when those unreachable objects are rarely
(if every) pruned.
Since there is at most one cruft pack in the above scheme, each time we
update the cruft pack it must be rewritten from scratch. Because much of
the pack is reused, this is a relatively inexpensive operation from a
CPU-perspective, but is very costly in terms of I/O since we end up
rewriting basically the same pack (plus any new unreachable objects that
have entered the repository since the last time a cruft pack was
generated).
At the time, we decided against implementing more robust support for
multiple cruft packs. This patch implements that support which we were
lacking.
Introduce a new option `--max-cruft-size` which allows repositories to
accumulate cruft packs up to a given size, after which point a new
generation of cruft packs can accumulate until it reaches the maximum
size, and so on. To generate a new cruft pack, the process works like
so:
- Sort a list of any existing cruft packs in ascending order of pack
size.
- Starting from the beginning of the list, group cruft packs together
while the accumulated size is smaller than the maximum specified
pack size.
- Combine the objects in these cruft packs together into a new cruft
pack, along with any other unreachable objects which have since
entered the repository.
Once a cruft pack grows beyond the size specified via `--max-cruft-size`
the pack is effectively frozen. This limits the I/O churn up to a
quadratic function of the value specified by the `--max-cruft-size`
option, instead of behaving quadratically in the number of total
unreachable objects.
When pruning unreachable objects, we bypass the new code paths which
combine small cruft packs together, and instead start from scratch,
passing in the appropriate `--max-pack-size` down to `pack-objects`,
putting it in charge of keeping the resulting set of cruft packs sized
correctly.
This may seem like further I/O churn, but in practice it isn't so bad.
We could prune old cruft packs for whom all or most objects are removed,
and then generate a new cruft pack with just the remaining set of
objects. But this additional complexity buys us relatively little,
because most objects end up being pruned anyway, so the I/O churn is
well contained.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 00:44:32 +00:00
|
|
|
OPT_MAGNITUDE(0, "max-cruft-size", &cruft_po_args.max_pack_size,
|
|
|
|
N_("with --cruft, limit the size of new cruft packs")),
|
2013-09-15 15:33:20 +00:00
|
|
|
OPT_BOOL('d', NULL, &delete_redundant,
|
|
|
|
N_("remove redundant packs, and run git-prune-packed")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_BOOL('f', NULL, &po_args.no_reuse_delta,
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("pass --no-reuse-delta to git-pack-objects")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_BOOL('F', NULL, &po_args.no_reuse_object,
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("pass --no-reuse-object to git-pack-objects")),
|
2022-03-14 07:42:46 +00:00
|
|
|
OPT_NEGBIT('n', NULL, &run_update_server_info,
|
|
|
|
N_("do not run git-update-server-info"), 1),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT__QUIET(&po_args.quiet, N_("be quiet")),
|
|
|
|
OPT_BOOL('l', "local", &po_args.local,
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("pass --local to git-pack-objects")),
|
2014-06-10 20:10:07 +00:00
|
|
|
OPT_BOOL('b', "write-bitmap-index", &write_bitmaps,
|
2013-12-21 14:00:31 +00:00
|
|
|
N_("write bitmap index")),
|
2018-08-16 06:13:10 +00:00
|
|
|
OPT_BOOL('i', "delta-islands", &use_delta_islands,
|
|
|
|
N_("pass --delta-islands to git-pack-objects")),
|
2013-09-15 15:33:20 +00:00
|
|
|
OPT_STRING(0, "unpack-unreachable", &unpack_unreachable, N_("approxidate"),
|
|
|
|
N_("with -A, do not loosen objects older than this")),
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 04:36:28 +00:00
|
|
|
OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
|
|
|
|
N_("with -a, repack unreachable objects")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_STRING(0, "window", &po_args.window, N_("n"),
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("size of the window used for delta compression")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_STRING(0, "window-memory", &po_args.window_memory, N_("bytes"),
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("same as the above, but limit memory size instead of entries count")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_STRING(0, "depth", &po_args.depth, N_("n"),
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("limits the maximum delta depth")),
|
2018-08-08 22:34:05 +00:00
|
|
|
OPT_STRING(0, "threads", &po_args.threads, N_("n"),
|
2017-04-26 23:09:25 +00:00
|
|
|
N_("limits the maximum number of threads")),
|
2023-10-03 00:44:29 +00:00
|
|
|
OPT_MAGNITUDE(0, "max-pack-size", &po_args.max_pack_size,
|
2013-09-15 15:33:20 +00:00
|
|
|
N_("maximum size of each packfile")),
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
OPT_PARSE_LIST_OBJECTS_FILTER(&po_args.filter_options),
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
|
|
|
|
N_("repack objects in packs marked with .keep")),
|
2018-04-15 15:36:13 +00:00
|
|
|
OPT_STRING_LIST(0, "keep-pack", &keep_pack_list, N_("name"),
|
|
|
|
N_("do not repack this pack")),
|
2023-08-09 20:32:42 +00:00
|
|
|
OPT_INTEGER('g', "geometric", &geometry.split_factor,
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
N_("find a geometric progression with factor <N>")),
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
OPT_BOOL('m', "write-midx", &write_midx,
|
|
|
|
N_("write a multi-pack index of the resulting packs")),
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
OPT_STRING(0, "expire-to", &expire_to, N_("dir"),
|
|
|
|
N_("pack prefix to store a pack containing pruned objects")),
|
2023-10-02 16:55:03 +00:00
|
|
|
OPT_STRING(0, "filter-to", &filter_to, N_("dir"),
|
|
|
|
N_("pack prefix to store a pack containing filtered out objects")),
|
2013-09-15 15:33:20 +00:00
|
|
|
OPT_END()
|
|
|
|
};
|
|
|
|
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
list_objects_filter_init(&po_args.filter_options);
|
|
|
|
|
2022-05-20 23:18:06 +00:00
|
|
|
git_config(repack_config, &cruft_po_args);
|
2013-09-15 15:33:20 +00:00
|
|
|
|
|
|
|
argc = parse_options(argc, argv, prefix, builtin_repack_options,
|
|
|
|
git_repack_usage, 0);
|
|
|
|
|
2015-06-23 10:54:11 +00:00
|
|
|
if (delete_redundant && repository_format_precious_objects)
|
|
|
|
die(_("cannot delete packs in a precious-objects repo"));
|
|
|
|
|
2023-12-06 11:51:56 +00:00
|
|
|
die_for_incompatible_opt3(unpack_unreachable || (pack_everything & LOOSEN_UNREACHABLE), "-A",
|
|
|
|
keep_unreachable, "-k/--keep-unreachable",
|
|
|
|
pack_everything & PACK_CRUFT, "--cruft");
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 04:36:28 +00:00
|
|
|
|
2023-12-06 11:51:56 +00:00
|
|
|
if (pack_everything & PACK_CRUFT)
|
2022-05-20 23:18:03 +00:00
|
|
|
pack_everything |= ALL_INTO_ONE;
|
|
|
|
|
2019-06-29 19:13:59 +00:00
|
|
|
if (write_bitmaps < 0) {
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
if (!write_midx &&
|
|
|
|
(!(pack_everything & ALL_INTO_ONE) || !is_bare_repository()))
|
2019-07-31 05:39:27 +00:00
|
|
|
write_bitmaps = 0;
|
2021-08-31 20:52:43 +00:00
|
|
|
} else if (write_bitmaps &&
|
|
|
|
git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0) &&
|
|
|
|
git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0)) {
|
|
|
|
write_bitmaps = 0;
|
2019-06-29 19:13:59 +00:00
|
|
|
}
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
if (pack_kept_objects < 0)
|
2021-12-20 14:48:10 +00:00
|
|
|
pack_kept_objects = write_bitmaps > 0 && !write_midx;
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
if (write_bitmaps && !(pack_everything & ALL_INTO_ONE) && !write_midx)
|
2016-12-28 22:45:42 +00:00
|
|
|
die(_(incremental_bitmap_conflict_error));
|
|
|
|
|
2023-04-14 06:02:12 +00:00
|
|
|
if (write_bitmaps && po_args.local && has_alt_odb(the_repository)) {
|
|
|
|
/*
|
|
|
|
* When asked to do a local repack, but we have
|
|
|
|
* packfiles that are inherited from an alternate, then
|
|
|
|
* we cannot guarantee that the multi-pack-index would
|
|
|
|
* have full coverage of all objects. We thus disable
|
|
|
|
* writing bitmaps in that case.
|
|
|
|
*/
|
|
|
|
warning(_("disabling bitmap writing, as some objects are not being packed"));
|
|
|
|
write_bitmaps = 0;
|
|
|
|
}
|
|
|
|
|
2021-10-01 22:38:10 +00:00
|
|
|
if (write_midx && write_bitmaps) {
|
|
|
|
struct strbuf path = STRBUF_INIT;
|
|
|
|
|
|
|
|
strbuf_addf(&path, "%s/%s_XXXXXX", get_object_directory(),
|
|
|
|
"bitmap-ref-tips");
|
|
|
|
|
|
|
|
refs_snapshot = xmks_tempfile(path.buf);
|
|
|
|
midx_snapshot_refs(refs_snapshot);
|
|
|
|
|
|
|
|
strbuf_release(&path);
|
|
|
|
}
|
|
|
|
|
2022-05-20 19:01:45 +00:00
|
|
|
packdir = mkpathdup("%s/pack", get_object_directory());
|
|
|
|
packtmp_name = xstrfmt(".tmp-%d-pack", (int)getpid());
|
|
|
|
packtmp = mkpathdup("%s/%s", packdir, packtmp_name);
|
|
|
|
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
collect_pack_filenames(&existing, &keep_pack_list);
|
2022-05-20 19:01:45 +00:00
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry.split_factor) {
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
if (pack_everything)
|
2022-01-05 20:02:16 +00:00
|
|
|
die(_("options '%s' and '%s' cannot be used together"), "--geometric", "-A/-a");
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
init_pack_geometry(&geometry, &existing, &po_args);
|
2023-08-09 20:32:42 +00:00
|
|
|
split_pack_geometry(&geometry);
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
}
|
|
|
|
|
2022-10-24 18:43:03 +00:00
|
|
|
prepare_pack_objects(&cmd, &po_args, packtmp);
|
2018-08-08 22:34:05 +00:00
|
|
|
|
2021-12-20 14:48:11 +00:00
|
|
|
show_progress = !po_args.quiet && isatty(2);
|
|
|
|
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--keep-true-parents");
|
repack: add `repack.packKeptObjects` config var
The git-repack command always passes `--honor-pack-keep`
to pack-objects. This has traditionally been a good thing,
as we do not want to duplicate those objects in a new pack,
and we are not going to delete the old pack.
However, when bitmaps are in use, it is important for a full
repack to include all reachable objects, even if they may be
duplicated in a .keep pack. Otherwise, we cannot generate
the bitmaps, as the on-disk format requires the set of
objects in the pack to be fully closed.
Even if the repository does not generally have .keep files,
a simultaneous push could cause a race condition in which a
.keep file exists at the moment of a repack. The repack may
try to include those objects in one of two situations:
1. The pushed .keep pack contains objects that were
already in the repository (e.g., blobs due to a revert of
an old commit).
2. Receive-pack updates the refs, making the objects
reachable, but before it removes the .keep file, the
repack runs.
In either case, we may prefer to duplicate some objects in
the new, full pack, and let the next repack (after the .keep
file is cleaned up) take care of removing them.
This patch introduces both a command-line and config option
to disable the `--honor-pack-keep` option. By default, it
is triggered when pack.writeBitmaps (or `--write-bitmap-index`
is turned on), but specifying it explicitly can override the
behavior (e.g., in cases where you prefer .keep files to
bitmaps, but only when they are present).
Note that this option just disables the pack-objects
behavior. We still leave packs with a .keep in place, as we
do not necessarily know that we have duplicated all of their
objects.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-03-03 20:04:20 +00:00
|
|
|
if (!pack_kept_objects)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--honor-pack-keep");
|
2018-04-15 15:36:13 +00:00
|
|
|
for (i = 0; i < keep_pack_list.nr; i++)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd.args, "--keep-pack=%s",
|
strvec: fix indentation in renamed calls
Code which split an argv_array call across multiple lines, like:
argv_array_pushl(&args, "one argument",
"another argument", "and more",
NULL);
was recently mechanically renamed to use strvec, which results in
mis-matched indentation like:
strvec_pushl(&args, "one argument",
"another argument", "and more",
NULL);
Let's fix these up to align the arguments with the opening paren. I did
this manually by sifting through the results of:
git jump grep 'strvec_.*,$'
and liberally applying my editor's auto-format. Most of the changes are
of the form shown above, though I also normalized a few that had
originally used a single-tab indentation (rather than our usual style of
aligning with the open paren). I also rewrapped a couple of obvious
cases (e.g., where previously too-long lines became short enough to fit
on one), but I wasn't aggressive about it. In cases broken to three or
more lines, the grouping of arguments is sometimes meaningful, and it
wasn't worth my time or reviewer time to ponder each case individually.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28 20:26:31 +00:00
|
|
|
keep_pack_list.items[i].string);
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--non-empty");
|
2023-08-09 20:32:42 +00:00
|
|
|
if (!geometry.split_factor) {
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
/*
|
2021-03-05 15:22:02 +00:00
|
|
|
* We need to grab all reachable objects, including those that
|
|
|
|
* are reachable from reflogs and the index.
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
*
|
2021-03-05 15:22:02 +00:00
|
|
|
* When repacking into a geometric progression of packs,
|
|
|
|
* however, we ask 'git pack-objects --stdin-packs', and it is
|
|
|
|
* not about packing objects based on reachability but about
|
|
|
|
* repacking all the objects in specified packs and loose ones
|
|
|
|
* (indeed, --stdin-packs is incompatible with these options).
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
*/
|
|
|
|
strvec_push(&cmd.args, "--all");
|
|
|
|
strvec_push(&cmd.args, "--reflog");
|
|
|
|
strvec_push(&cmd.args, "--indexed-objects");
|
|
|
|
}
|
2023-03-28 13:58:53 +00:00
|
|
|
if (repo_has_promisor_remote(the_repository))
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--exclude-promisor-objects");
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
if (!write_midx) {
|
|
|
|
if (write_bitmaps > 0)
|
|
|
|
strvec_push(&cmd.args, "--write-bitmap-index");
|
|
|
|
else if (write_bitmaps < 0)
|
|
|
|
strvec_push(&cmd.args, "--write-bitmap-index-quiet");
|
|
|
|
}
|
2018-08-16 06:13:10 +00:00
|
|
|
if (use_delta_islands)
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--delta-islands");
|
2013-09-15 15:33:20 +00:00
|
|
|
|
builtin/repack.c: keep track of existing packs unconditionally
In order to be able to write a multi-pack index during repacking, `git
repack` must keep track of which packs it wants to write into the MIDX.
This set is the union of existing packs which will not be deleted,
new pack(s) generated as a result of the repack, and .keep packs.
Prior to this patch, `git repack` populated the list of existing packs
only when repacking all-into-one (i.e., with `-A` or `-a`), but we will
soon need to know this list when repacking when writing a MIDX without
a-i-o.
Populate the list of existing packs unconditionally, and guard removing
packs from that list only when repacking a-i-o.
Additionally, keep track of filenames of kept packs separately, since
this, too, will be used in an upcoming patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:10 +00:00
|
|
|
if (pack_everything & ALL_INTO_ONE) {
|
2018-08-08 22:34:06 +00:00
|
|
|
repack_promisor_objects(&po_args, &names);
|
|
|
|
|
builtin/repack.c: extract `has_existing_non_kept_packs()`
When there is:
- at least one pre-existing packfile (which is not marked as kept),
- repacking with the `-d` flag, and
- not doing a cruft repack
, then we pass a handful of additional options to the inner
`pack-objects` process, like `--unpack-unreachable`,
`--keep-unreachable`, and `--pack-loose-unreachable`, in addition to
marking any packs we just wrote for promisor remotes as kept in-core
(with `--keep-pack`, as opposed to the presence of a ".keep" file on
disk).
Because we store both cruft and non-cruft packs together in the same
`existing.non_kept_packs` list, it suffices to check its `nr` member to
see if it is zero or not.
But a following change will store cruft- and non-cruft packs separately,
meaning this check would break as a result. Prepare for this by
extracting this part of the check into a new helper function called
`has_existing_non_kept_packs()`.
This patch does not introduce any functional changes, but prepares us to
make a more isolated change in a subsequent patch.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:54 +00:00
|
|
|
if (has_existing_non_kept_packs(&existing) &&
|
|
|
|
delete_redundant &&
|
2022-05-20 23:18:03 +00:00
|
|
|
!(pack_everything & PACK_CRUFT)) {
|
repack: avoid loosening promisor objects in partial clones
When `git repack -A -d` is run in a partial clone, `pack-objects`
is invoked twice: once to repack all promisor objects, and once to
repack all non-promisor objects. The latter `pack-objects` invocation
is with --exclude-promisor-objects and --unpack-unreachable, which
loosens all objects unused during this invocation. Unfortunately,
this includes promisor objects.
Because the -d argument to `git repack` subsequently deletes all loose
objects also in packs, these just-loosened promisor objects will be
immediately deleted. However, this extra disk churn is unnecessary in
the first place. For example, in a newly-cloned partial repo that
filters all blob objects (e.g. `--filter=blob:none`), `repack` ends up
unpacking all trees and commits into the filesystem because every
object, in this particular case, is a promisor object. Depending on
the repo size, this increases the disk usage considerably: In my copy
of the linux.git, the object directory peaked 26GB of more disk usage.
In order to avoid this extra disk churn, pass the names of the promisor
packfiles as --keep-pack arguments to the second invocation of
`pack-objects`. This informs `pack-objects` that the promisor objects
are already in a safe packfile and, therefore, do not need to be
loosened.
For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the
process which prevents us from inspecting the object directory.
Instead, let's teach `pack-objects` to count loosened objects and
emit via trace2 thus allowing inspecting the debug events after the
process is finished. This new event is used on the added regression
test.
Lastly, add a new perf test to evaluate the performance impact
made by this changes (tested on git.git):
Test HEAD^ HEAD
----------------------------------------------------------
5600.3: gc 134.38(41.93+90.95) 7.80(6.72+1.35) -94.2%
For a bigger repository, such as linux.git, the improvement is
even bigger:
Test HEAD^ HEAD
-------------------------------------------------------------------
5600.3: gc 6833.00(918.07+3162.74) 268.79(227.02+39.18) -96.1%
These improvements are particular big because every object in the
newly-cloned partial repository is a promisor object.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Helped-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Rafael Silva <rafaeloliveira.cs@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-04-21 19:32:12 +00:00
|
|
|
for_each_string_list_item(item, &names) {
|
|
|
|
strvec_pushf(&cmd.args, "--keep-pack=%s-%s.pack",
|
|
|
|
packtmp_name, item->string);
|
|
|
|
}
|
2015-03-20 18:43:13 +00:00
|
|
|
if (unpack_unreachable) {
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_pushf(&cmd.args,
|
strvec: fix indentation in renamed calls
Code which split an argv_array call across multiple lines, like:
argv_array_pushl(&args, "one argument",
"another argument", "and more",
NULL);
was recently mechanically renamed to use strvec, which results in
mis-matched indentation like:
strvec_pushl(&args, "one argument",
"another argument", "and more",
NULL);
Let's fix these up to align the arguments with the opening paren. I did
this manually by sifting through the results of:
git jump grep 'strvec_.*,$'
and liberally applying my editor's auto-format. Most of the changes are
of the form shown above, though I also normalized a few that had
originally used a single-tab indentation (rather than our usual style of
aligning with the open paren). I also rewrapped a couple of obvious
cases (e.g., where previously too-long lines became short enough to fit
on one), but I wasn't aggressive about it. In cases broken to three or
more lines, the grouping of arguments is sometimes meaningful, and it
wasn't worth my time or reviewer time to ponder each case individually.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28 20:26:31 +00:00
|
|
|
"--unpack-unreachable=%s",
|
|
|
|
unpack_unreachable);
|
2015-03-20 18:43:13 +00:00
|
|
|
} else if (pack_everything & LOOSEN_UNREACHABLE) {
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args,
|
strvec: fix indentation in renamed calls
Code which split an argv_array call across multiple lines, like:
argv_array_pushl(&args, "one argument",
"another argument", "and more",
NULL);
was recently mechanically renamed to use strvec, which results in
mis-matched indentation like:
strvec_pushl(&args, "one argument",
"another argument", "and more",
NULL);
Let's fix these up to align the arguments with the opening paren. I did
this manually by sifting through the results of:
git jump grep 'strvec_.*,$'
and liberally applying my editor's auto-format. Most of the changes are
of the form shown above, though I also normalized a few that had
originally used a single-tab indentation (rather than our usual style of
aligning with the open paren). I also rewrapped a couple of obvious
cases (e.g., where previously too-long lines became short enough to fit
on one), but I wasn't aggressive about it. In cases broken to three or
more lines, the grouping of arguments is sometimes meaningful, and it
wasn't worth my time or reviewer time to ponder each case individually.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-07-28 20:26:31 +00:00
|
|
|
"--unpack-unreachable");
|
repack: add --keep-unreachable option
The usual way to do a full repack (and what is done by
git-gc) is to run "repack -Ad --unpack-unreachable=<when>",
which will loosen any unreachable objects newer than
"<when>", and drop any older ones.
This is a safer alternative to "repack -ad", because
"<when>" becomes a grace period during which we will not
drop any new objects that are about to be referenced.
However, it isn't perfectly safe. It's always possible that
a process is about to reference an old object. Even if that
process were to take care to update the timestamp on the
object, there is no atomicity with a simultaneously running
"repack" process.
So while unlikely, there is a small race wherein we may drop
an object that is in the process of being referenced. If you
do automated repacking on a large number of active
repositories, you may hit it eventually, and the result is a
corrupted repository.
It would be nice to fix that race in the long run, but it's
complicated. In the meantime, there is a much simpler
strategy for automated repository maintenance: do not drop
objects at all. We already have a "--keep-unreachable"
option in pack-objects; we just need to plumb it through
from git-repack.
Note that this _isn't_ plumbed through from git-gc, so at
this point it's strictly a tool for people doing their own
advanced repository maintenance strategy.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-06-13 04:36:28 +00:00
|
|
|
} else if (keep_unreachable) {
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--keep-unreachable");
|
|
|
|
strvec_push(&cmd.args, "--pack-loose-unreachable");
|
2015-03-20 18:43:13 +00:00
|
|
|
}
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
2023-08-09 20:32:42 +00:00
|
|
|
} else if (geometry.split_factor) {
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
strvec_push(&cmd.args, "--stdin-packs");
|
|
|
|
strvec_push(&cmd.args, "--unpacked");
|
2013-09-15 15:33:20 +00:00
|
|
|
} else {
|
2020-07-28 20:24:27 +00:00
|
|
|
strvec_push(&cmd.args, "--unpacked");
|
|
|
|
strvec_push(&cmd.args, "--incremental");
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
|
|
|
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
if (po_args.filter_options.choice)
|
|
|
|
strvec_pushf(&cmd.args, "--filter=%s",
|
|
|
|
expand_list_objects_filter_spec(&po_args.filter_options));
|
2023-10-02 16:55:03 +00:00
|
|
|
else if (filter_to)
|
|
|
|
die(_("option '%s' can only be used along with '%s'"), "--filter-to", "--filter");
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry.split_factor)
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
cmd.in = -1;
|
|
|
|
else
|
|
|
|
cmd.no_stdin = 1;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
|
|
|
ret = start_command(&cmd);
|
|
|
|
if (ret)
|
2023-02-06 23:07:42 +00:00
|
|
|
goto cleanup;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2023-08-09 20:32:42 +00:00
|
|
|
if (geometry.split_factor) {
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
FILE *in = xfdopen(cmd.in, "w");
|
|
|
|
/*
|
|
|
|
* The resulting pack should contain all objects in packs that
|
|
|
|
* are going to be rolled up, but exclude objects in packs which
|
|
|
|
* are being left alone.
|
|
|
|
*/
|
2023-08-09 20:32:42 +00:00
|
|
|
for (i = 0; i < geometry.split; i++)
|
|
|
|
fprintf(in, "%s\n", pack_basename(geometry.pack[i]));
|
|
|
|
for (i = geometry.split; i < geometry.pack_nr; i++)
|
|
|
|
fprintf(in, "^%s\n", pack_basename(geometry.pack[i]));
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
fclose(in);
|
|
|
|
}
|
|
|
|
|
2023-10-02 16:54:58 +00:00
|
|
|
ret = finish_pack_objects_cmd(&cmd, &names, 1);
|
2013-09-15 15:33:20 +00:00
|
|
|
if (ret)
|
2023-02-06 23:07:42 +00:00
|
|
|
goto cleanup;
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2018-08-08 22:34:05 +00:00
|
|
|
if (!names.nr && !po_args.quiet)
|
2018-11-10 05:16:10 +00:00
|
|
|
printf_ln(_("Nothing new to pack."));
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2022-05-20 23:18:03 +00:00
|
|
|
if (pack_everything & PACK_CRUFT) {
|
2023-10-02 16:54:59 +00:00
|
|
|
const char *pack_prefix = find_pack_prefix(packdir, packtmp);
|
2022-05-20 23:18:03 +00:00
|
|
|
|
2022-05-20 23:18:06 +00:00
|
|
|
if (!cruft_po_args.window)
|
|
|
|
cruft_po_args.window = po_args.window;
|
|
|
|
if (!cruft_po_args.window_memory)
|
|
|
|
cruft_po_args.window_memory = po_args.window_memory;
|
|
|
|
if (!cruft_po_args.depth)
|
|
|
|
cruft_po_args.depth = po_args.depth;
|
|
|
|
if (!cruft_po_args.threads)
|
|
|
|
cruft_po_args.threads = po_args.threads;
|
builtin/pack-objects.c: support `--max-pack-size` with `--cruft`
When pack-objects learned the `--cruft` option back in b757353676
(builtin/pack-objects.c: --cruft without expiration, 2022-05-20), we
explicitly forbade `--cruft` with `--max-pack-size`.
At the time, there was no specific rationale given in the patch for not
supporting the `--max-pack-size` option with `--cruft`. (As best I can
remember, it's because we were trying to push users towards only ever
having a single cruft pack, but I cannot be sure).
However, `--max-pack-size` is flexible enough that it already works with
`--cruft` and can shard unreachable objects across multiple cruft packs,
creating separate ".mtimes" files as appropriate. In fact, the
`--max-pack-size` option worked with `--cruft` as far back as
b757353676!
This is because we overwrite the `written_list`, and pass down the
appropriate length, i.e. the number of objects written in each pack
shard.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-08-28 22:49:07 +00:00
|
|
|
if (!cruft_po_args.max_pack_size)
|
|
|
|
cruft_po_args.max_pack_size = po_args.max_pack_size;
|
2022-05-20 23:18:06 +00:00
|
|
|
|
|
|
|
cruft_po_args.local = po_args.local;
|
|
|
|
cruft_po_args.quiet = po_args.quiet;
|
|
|
|
|
2022-10-24 18:43:09 +00:00
|
|
|
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
|
2022-10-24 18:43:06 +00:00
|
|
|
cruft_expiration, &names,
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
&existing);
|
2022-05-20 23:18:03 +00:00
|
|
|
if (ret)
|
2023-02-06 23:07:42 +00:00
|
|
|
goto cleanup;
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
|
|
|
|
if (delete_redundant && expire_to) {
|
|
|
|
/*
|
|
|
|
* If `--expire-to` is given with `-d`, it's possible
|
|
|
|
* that we're about to prune some objects. With cruft
|
|
|
|
* packs, pruning is implicit: any objects from existing
|
|
|
|
* packs that weren't picked up by new packs are removed
|
|
|
|
* when their packs are deleted.
|
|
|
|
*
|
|
|
|
* Generate an additional cruft pack, with one twist:
|
|
|
|
* `names` now includes the name of the cruft pack
|
|
|
|
* written in the previous step. So the contents of
|
|
|
|
* _this_ cruft pack exclude everything contained in the
|
|
|
|
* existing cruft pack (that is, all of the unreachable
|
|
|
|
* objects which are no older than
|
|
|
|
* `--cruft-expiration`).
|
|
|
|
*
|
|
|
|
* To make this work, cruft_expiration must become NULL
|
|
|
|
* so that this cruft pack doesn't actually prune any
|
|
|
|
* objects. If it were non-NULL, this call would always
|
|
|
|
* generate an empty pack (since every object not in the
|
|
|
|
* cruft pack generated above will have an mtime older
|
|
|
|
* than the expiration).
|
|
|
|
*/
|
|
|
|
ret = write_cruft_pack(&cruft_po_args, expire_to,
|
|
|
|
pack_prefix,
|
|
|
|
NULL,
|
|
|
|
&names,
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
&existing);
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
if (ret)
|
2023-02-06 23:07:42 +00:00
|
|
|
goto cleanup;
|
builtin/repack.c: implement `--expire-to` for storing pruned objects
When pruning objects with `--cruft`, `git repack` offers some
flexibility when selecting the set of which objects are pruned via the
`--cruft-expiration` option.
This is useful for expiring objects which are older than the grace
period, making races where to-be-pruned objects become reachable and
then ancestors of freshly pushed objects, leaving the repository in a
corrupt state after pruning substantially less likely [1].
But in practice, such races are impossible to avoid entirely, no matter
how long the grace period is. To prevent this race, it is often
advisable to temporarily put a repository into a read-only state. But in
practice, this is not always practical, and so some middle ground would
be nice.
This patch introduces a new option, `--expire-to`, which teaches `git
repack` to write an additional cruft pack containing just the objects
which were pruned from the repository. The caller can specify a
directory outside of the current repository as the destination for this
second cruft pack.
This makes it possible to prune objects from a repository, while still
holding onto a supplemental copy of them outside of the original
repository. Having this copy on-disk makes it substantially easier to
recover objects when the aforementioned race is encountered.
`--expire-to` is implemented in a somewhat convoluted manner, which is
to take advantage of the fact that the first time `write_cruft_pack()`
is called, it adds the name of the cruft pack to the `names` string
list. That means the second time we call `write_cruft_pack()`, objects
in the previously-written cruft pack will be excluded.
As long as the caller ensures that no objects are expired during the
second pass, this is sufficient to generate a cruft pack containing all
objects which don't appear in any of the new packs written by `git
repack`, including the cruft pack. In other words, all of the objects
which are about to be pruned from the repository.
It is important to note that the destination in `--expire-to` does not
necessarily need to be a Git repository (though it can be) Notably, the
expired packs do not contain all ancestors of expired objects. So if the
source repository contains something like:
<unreachable>
/
C1 --- C2
\
refs/heads/master
where C2 is unreachable, but has a parent (C1) which is reachable, and
C2 would be pruned, then the expiry pack will contain only C2, not C1.
[1]: https://lore.kernel.org/git/20190319001829.GL29661@sigill.intra.peff.net/
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-24 18:43:12 +00:00
|
|
|
}
|
2022-05-20 23:18:03 +00:00
|
|
|
}
|
|
|
|
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
if (po_args.filter_options.choice) {
|
2023-10-02 16:55:03 +00:00
|
|
|
if (!filter_to)
|
|
|
|
filter_to = packtmp;
|
|
|
|
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
ret = write_filtered_pack(&po_args,
|
2023-10-02 16:55:03 +00:00
|
|
|
filter_to,
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
find_pack_prefix(packdir, packtmp),
|
|
|
|
&existing,
|
|
|
|
&names);
|
|
|
|
if (ret)
|
|
|
|
goto cleanup;
|
|
|
|
}
|
|
|
|
|
2022-05-20 19:01:51 +00:00
|
|
|
string_list_sort(&names);
|
|
|
|
|
2019-05-17 18:41:49 +00:00
|
|
|
close_object_store(the_repository->objects);
|
2018-12-15 22:04:01 +00:00
|
|
|
|
2013-09-15 15:33:20 +00:00
|
|
|
/*
|
|
|
|
* Ok we have prepared all new packfiles.
|
|
|
|
*/
|
|
|
|
for_each_string_list_item(item, &names) {
|
2022-10-22 00:21:45 +00:00
|
|
|
struct generated_pack_data *data = item->util;
|
|
|
|
|
2013-12-21 14:00:19 +00:00
|
|
|
for (ext = 0; ext < ARRAY_SIZE(exts); ext++) {
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
char *fname;
|
2018-07-12 19:39:40 +00:00
|
|
|
|
2013-09-15 15:33:20 +00:00
|
|
|
fname = mkpathdup("%s/pack-%s%s",
|
2013-12-21 14:00:23 +00:00
|
|
|
packdir, item->string, exts[ext].name);
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
if (data->tempfiles[ext]) {
|
|
|
|
const char *fname_old = get_tempfile_path(data->tempfiles[ext]);
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
struct stat statbuffer;
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
if (!stat(fname_old, &statbuffer)) {
|
|
|
|
statbuffer.st_mode &= ~(S_IWUSR | S_IWGRP | S_IWOTH);
|
|
|
|
chmod(fname_old, statbuffer.st_mode);
|
|
|
|
}
|
|
|
|
|
repack: use tempfiles for signal cleanup
When git-repack exits due to a signal, it tries to clean up by calling
its remove_temporary_files() function, which walks through the packs dir
looking for ".tmp-$$-pack-*" files to delete (where "$$" is the pid of
the current process).
The biggest problem here is that remove_temporary_files() is not safe to
call in a signal handler. It uses opendir(), which isn't on the POSIX
async-signal-safe list. The details will be platform-specific, but a
likely issue is that it needs to allocate memory; if we receive a signal
while inside malloc(), etc, we'll conflict on the allocator lock and
deadlock with ourselves.
We can fix this by just cleaning up the files directly, without walking
the directory. We already know the complete list of .tmp-* files that
were generated, because we recorded them via populate_pack_exts(). When
we find files there, we can use register_tempfile() to record the
filenames. If we receive a signal, then the tempfile API will clean them
up for us, and it's async-safe and pretty battle-tested.
Note that this is slightly racier than the existing scheme. We don't
record the filenames until pack-objects tells us the hash over stdout.
So during the period between it generating the file and reporting the
hash, we'd fail to clean up. However, that period is very small. During
most of the pack generation process pack-objects is using its own
internal tempfiles. It's only at the very end that it moves them into
the names git-repack expects, and then it immediately reports the name
to us. Given that cleanup like this is best effort (after all, we may
get SIGKILL), this level of race is acceptable.
When we register the tempfiles, we'll record them locally and use the
result to call rename_tempfile(), rather than renaming by hand. This
isn't strictly necessary, as once we've renamed the files they're gone,
and the tempfile API's cleanup unlink() would simply become a pointless
noop. But managing the lifetimes of the tempfile objects is the cleanest
thing to do, and the tempfile pointers naturally fill the same role as
the old booleans.
This patch also fixes another small problem. We only hook signals, and
don't set up an atexit handler. So if we see an error that causes us to
die(), we'll leave the .tmp-* files in place. But since the tempfile API
handles this for us, this is now fixed for free. The new test covers
this by stimulating a failure of pack-objects when generating a cruft
pack. Before this patch, the .tmp-* file for the main pack would have
been left, but now we correctly clean it up.
Two small subtleties on the implementation:
- in the renaming loop, we can stop re-constructing fname_old; we only
use it when we have a tempfile to rename, so we can just ask the
tempfile for its path (which, barring bugs, should be identical)
- when renaming fails, our error message mentions fname_old. But since
a failed rename_tempfile() invalidates the tempfile struct, we'll
lose access to that string. Instead, let's mention the destination
filename, which is what most other callers do.
Reported-by: Jan Pokorný <poki@fnusa.cz>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:54 +00:00
|
|
|
if (rename_tempfile(&data->tempfiles[ext], fname))
|
|
|
|
die_errno(_("renaming pack to '%s' failed"), fname);
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
} else if (!exts[ext].optional)
|
repack: expand error message for missing pack files
If pack-objects tells us it generated pack $hash, we expect to find
.tmp-$$-pack-$hash.pack, .idx, .rev, and so on. Some of these files are
optional, but others are not. For the required ones, we'll bail with an
error if any of them is missing.
The error message is just "missing required file", which is a bit vague.
We should be more clear that it is not the user's fault, but rather that
the sub-pgoram we called is not operating as expected. In practice,
nobody should ever see this message, as it would generally only be
caused by a bug in Git.
It probably doesn't make sense to convert this to a BUG(), though, as
there are other (unlikely) possibilities, such as somebody else racily
deleting the files, filesystem errors causing stat() to fail, and so on.
A nice side effect here is that we stop relying on fname_old in this
code path, which will let us deal with it only in the first part of the
conditional.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-10-22 00:21:50 +00:00
|
|
|
die(_("pack-objects did not write a '%s' file for pack %s-%s"),
|
|
|
|
exts[ext].name, packtmp, item->string);
|
builtin/repack.c: don't move existing packs out of the way
When 'git repack' creates a pack with the same name as any existing
pack, it moves the existing one to 'old-pack-xxx.{pack,idx,...}' and
then renames the new one into place.
Eventually, it would be nice to have 'git repack' allow for writing a
multi-pack index at the critical time (after the new packs have been
written / moved into place, but before the old ones have been deleted).
Guessing that this option might be called '--write-midx', this makes the
following situation (where repacks are issued back-to-back without any
new objects) impossible:
$ git repack -adb
$ git repack -adb --write-midx
In the second repack, the existing packs are overwritten verbatim with
the same rename-to-old sequence. At that point, the current MIDX is
invalidated, since it refers to now-missing packs. So that code wants to
be run after the MIDX is re-written. But (prior to this patch) the new
MIDX can't be written until the new packs are moved into place. So, we
have a circular dependency.
This is all hypothetical, since no code currently exists to write a MIDX
safely during a 'git repack' (the 'GIT_TEST_MULTI_PACK_INDEX' does so
unsafely). Putting hypothetical aside, though: why do we need to rename
existing packs to be prefixed with 'old-' anyway?
This behavior dates all the way back to 2ad47d6 (git-repack: Be
careful when updating the same pack as an existing one., 2006-06-25).
2ad47d6 is mainly concerned about a case where a newly written pack
would have a different structure than its index. This used to be
possible when the pack name was a hash of the set of objects. Under this
naming scheme, two packs that store the same set of objects could differ
in delta selection, object positioning, or both. If this happened, then
any such packs would be unreadable in the instant between copying the
new pack and new index (i.e., either the index or pack will be stale
depending on the order that they were copied).
But since 1190a1a (pack-objects: name pack files after trailer hash,
2013-12-05), this is no longer possible, since pack files are named not
after their logical contents (i.e., the set of objects), but by the
actual checksum of their contents. So, this old- behavior can safely go,
which allows us to avoid our circular dependency above.
In addition to avoiding the circular dependency, this patch also makes
'git repack' a lot simpler, since we don't have to deal with failures
encountered when renaming existing packs to be prefixed with 'old-'.
This patch is mostly limited to removing code paths that deal with the
'old' prefixing, with the exception of files that include the pack's
name in their own filename, like .idx, .bitmap, and related files. The
exception is that we want to continue to trust what pack-objects wrote.
That is, it is not the case that we pretend as if pack-objects didn't
write files identical to ones that already exist, but rather that we
respect what pack-objects wrote as the source of truth. That cuts two
ways:
- If pack-objects produced an identical pack to one that already
exists with a bitmap, but did not produce a bitmap, we remove the
bitmap that already exists. (This behavior is codified in t7700.14).
- If pack-objects produced an identical pack to one that already
exists, we trust the just-written version of the coresponding .idx,
.promisor, and other files over the ones that already exist. This
ensures that we use the most up-to-date versions of this files,
which is safe even in the face of format changes in, say, the .idx
file (which would not be reflected in the .idx file's name).
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-11-17 20:15:16 +00:00
|
|
|
else if (unlink(fname) < 0 && errno != ENOENT)
|
|
|
|
die_errno(_("could not unlink: %s"), fname);
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2015-08-10 09:35:38 +00:00
|
|
|
free(fname);
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
/* End of pack replacement. */
|
|
|
|
|
2023-09-13 19:17:46 +00:00
|
|
|
if (delete_redundant && pack_everything & ALL_INTO_ONE)
|
|
|
|
mark_packs_for_deletion(&existing, &names);
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
|
|
|
|
if (write_midx) {
|
|
|
|
struct string_list include = STRING_LIST_INIT_NODUP;
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
midx_included_packs(&include, &existing, &names, &geometry);
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
|
builtin/repack.c: avoid making cruft packs preferred
When doing a `--geometric` repack, we make sure that the preferred pack
(if writing a MIDX) is the largest pack that we *didn't* repack. That
has the effect of keeping the preferred pack in sync with the pack
containing a majority of the repository's reachable objects.
But if the repository happens to double in size, we'll repack
everything. Here we don't specify any `--preferred-pack`, and instead
let the MIDX code choose.
In the past, that worked fine, since there would only be one pack to
choose from: the one we just wrote. But it's no longer necessarily the
case that there is one pack to choose from. It's possible that the
repository also has a cruft pack, too.
If the cruft pack happens to come earlier in lexical order (and has an
earlier mtime than any non-cruft pack), we'll pick that pack as
preferred. This makes it impossible to reuse chunks of the reachable
pack verbatim from pack-objects, so is sub-optimal.
Luckily, this is a somewhat rare circumstance to be in, since we would
have to repack the entire repository during a `--geometric` repack, and
the cruft pack would have to sort ahead of the pack we just created.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-03 21:54:19 +00:00
|
|
|
ret = write_midx_included_packs(&include, &geometry, &names,
|
2021-10-01 22:38:10 +00:00
|
|
|
refs_snapshot ? get_tempfile_path(refs_snapshot) : NULL,
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
show_progress, write_bitmaps > 0);
|
|
|
|
|
2022-10-18 02:45:12 +00:00
|
|
|
if (!ret && write_bitmaps)
|
|
|
|
remove_redundant_bitmaps(&include, packdir);
|
|
|
|
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
string_list_clear(&include, 0);
|
|
|
|
|
|
|
|
if (ret)
|
2023-02-06 23:07:42 +00:00
|
|
|
goto cleanup;
|
builtin/repack.c: support writing a MIDX while repacking
Teach `git repack` a new `--write-midx` option for callers that wish to
persist a multi-pack index in their repository while repacking.
There are two existing alternatives to this new flag, but they don't
cover our particular use-case. These alternatives are:
- Call 'git multi-pack-index write' after running 'git repack', or
- Set 'GIT_TEST_MULTI_PACK_INDEX=1' in your environment when running
'git repack'.
The former works, but introduces a gap in bitmap coverage between
repacking and writing a new MIDX (since the repack may have deleted a
pack included in the existing MIDX, invalidating it altogether).
Setting the 'GIT_TEST_' environment variable is obviously unsupported.
In fact, even if it were supported officially, it still wouldn't work,
because it generates the MIDX *after* redundant packs have been dropped,
leading to the same issue as above.
Introduce a new option which eliminates this race by teaching `git
repack` to generate the MIDX at the critical point: after the new packs
have been written and moved into place, but before the redundant packs
have been removed.
This option is compatible with `git repack`'s '--bitmap' option (it
changes the interpretation to be: "write a bitmap corresponding to the
MIDX after one has been generated").
There is a little bit of additional noise in the patch below to avoid
repeating ourselves when selecting which packs to delete. Instead of a
single loop as before (where we iterate over 'existing_packs', decide if
a pack is worth deleting, and if so, delete it), we have two loops (the
first where we decide which ones are worth deleting, and the second
where we actually do the deleting). This makes it so we have a single
check we can make consistently when (1) telling the MIDX which packs we
want to exclude, and (2) actually unlinking the redundant packs.
There is also a tiny change to short-circuit the body of
write_midx_included_packs() when no packs remain in the case of an empty
repository. The MIDX code does not handle this, so avoid trying to
generate a MIDX covering zero packs in the first place.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:18 +00:00
|
|
|
}
|
|
|
|
|
2018-08-08 22:34:06 +00:00
|
|
|
reprepare_packed_git(the_repository);
|
|
|
|
|
2013-09-15 15:33:20 +00:00
|
|
|
if (delete_redundant) {
|
2014-09-13 07:28:01 +00:00
|
|
|
int opts = 0;
|
builtin/repack.c: extract redundant pack cleanup for existing packs
To remove redundant packs at the end of a repacking operation, Git uses
its `remove_redundant_pack()` function in a loop over the set of
pre-existing, non-kept packs.
In a later commit, we will split this list into two, one for
pre-existing cruft pack(s), and another for non-cruft pack(s). Prepare
for this by factoring out the routine to loop over and delete redundant
packs into its own function.
Instead of calling `remove_redundant_pack()` directly, we now will call
`remove_redundant_existing_packs()`, which itself dispatches a call to
`remove_redundant_packs_1()`. Note that the geometric repacking code
will still call `remove_redundant_pack()` directly, but see the previous
commit for more details.
Having `remove_redundant_packs_1()` exist as a separate function may
seem like overkill in this patch. However, a later patch will call
`remove_redundant_packs_1()` once over two separate lists, so this
refactoring sets us up for that.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:51 +00:00
|
|
|
remove_redundant_existing_packs(&existing);
|
builtin/repack.c: add '--geometric' option
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its
entire contents regularly
This patch implements a '--geometric=<n>' option in 'git repack'. This
allows the caller to specify that they would like each pack to be at
least a factor times as large as the previous largest pack (by object
count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2,
..., up to Pn. Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n], where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it
along two directions:
1. We assume that there is a cutoff of packs _before starting the
repack_ where everything to the right of that cut-off already forms
a geometric progression (or no cutoff exists and everything must be
repacked).
2. We assume that everything smaller than the cutoff count must be
repacked. This forms our base assumption, but it can also cause
even the "heavy" packs to get repacked, for e.g., if we have 6
packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32',
rolling up the first two packs into a pack with 2 objects. That
breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^' indicates the position of our split). To restore a
progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored. Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too
often while also only creating one new pack at a time. Another wrinkle
is that we assume that loose, indexed, and reflog'd objects are
insignificant, and lump them into any new pack that we create. This can
lead to non-idempotent results.
Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Reviewed-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-02-23 02:25:27 +00:00
|
|
|
|
2023-09-13 19:17:49 +00:00
|
|
|
if (geometry.split_factor)
|
|
|
|
geometry_remove_redundant_packs(&geometry, &names,
|
|
|
|
&existing);
|
2021-12-20 14:48:11 +00:00
|
|
|
if (show_progress)
|
2014-09-13 07:28:01 +00:00
|
|
|
opts |= PRUNE_PACKED_VERBOSE;
|
|
|
|
prune_packed_objects(opts);
|
repack -ad: prune the list of shallow commits
`git repack` can drop unreachable commits without further warning,
making the corresponding entries in `.git/shallow` invalid, which causes
serious problems when deepening the branches.
One scenario where unreachable commits are dropped by `git repack` is
when a `git fetch --prune` (or even a `git fetch` when a ref was
force-pushed in the meantime) can make a commit unreachable that was
reachable before.
Therefore it is not safe to assume that a `git repack -adlf` will keep
unreachable commits alone (under the assumption that they had not been
packed in the first place, which is an assumption at least some of Git's
code seems to make).
This is particularly important to keep in mind when looking at the
`.git/shallow` file: if any commits listed in that file become
unreachable, it is not a problem, but if they go missing, it *is* a
problem. One symptom of this problem is that a deepening fetch may now
fail with
fatal: error in object: unshallow <commit-hash>
To avoid this problem, let's prune the shallow list in `git repack` when
the `-d` option is passed, unless `-A` is passed, too (which would force
the now-unreachable objects to be turned into loose objects instead of
being deleted). Additionally, we also need to take `--keep-reachable`
and `--unpack-unreachable=<date>` into account.
Note: an alternative solution discussed during the review of this patch
was to teach `git fetch` to simply ignore entries in .git/shallow if the
corresponding commits do not exist locally. A quick test, however,
revealed that the .git/shallow file is written during a shallow *clone*,
in which case the commits do not exist, either, but the "shallow" line
*does* need to be sent. Therefore, this approach would be a lot more
finicky than the approach presented by the this patch.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-10-24 15:56:13 +00:00
|
|
|
|
|
|
|
if (!keep_unreachable &&
|
|
|
|
(!(pack_everything & LOOSEN_UNREACHABLE) ||
|
|
|
|
unpack_unreachable) &&
|
|
|
|
is_repository_shallow(the_repository))
|
|
|
|
prune_shallow(PRUNE_QUICK);
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|
|
|
|
|
2022-03-14 07:42:46 +00:00
|
|
|
if (run_update_server_info)
|
2014-09-13 07:28:01 +00:00
|
|
|
update_server_info(0);
|
2018-10-12 17:34:20 +00:00
|
|
|
|
2021-08-31 20:52:43 +00:00
|
|
|
if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX, 0)) {
|
|
|
|
unsigned flags = 0;
|
|
|
|
if (git_env_bool(GIT_TEST_MULTI_PACK_INDEX_WRITE_BITMAP, 0))
|
|
|
|
flags |= MIDX_WRITE_BITMAP | MIDX_WRITE_REV_INDEX;
|
midx: preliminary support for `--refs-snapshot`
To figure out which commits we can write a bitmap for, the multi-pack
index/bitmap code does a reachability traversal, marking any commit
which can be found in the MIDX as eligible to receive a bitmap.
This approach will cause a problem when multi-pack bitmaps are able to
be generated from `git repack`, since the reference tips can change
during the repack. Even though we ignore commits that don't exist in
the MIDX (when doing a scan of the ref tips), it's possible that a
commit in the MIDX reaches something that isn't.
This can happen when a multi-pack index contains some pack which refers
to loose objects (e.g., if a pack was pushed after starting the repack
but before generating the MIDX which depends on an object which is
stored as loose in the repository, and by definition isn't included in
the multi-pack index).
By taking a snapshot of the references before we start repacking, we can
close that race window. In the above scenario (where we have a packed
object pointing at a loose one), we'll either (a) take a snapshot of the
references before seeing the packed one, or (b) take it after, at which
point we can guarantee that the loose object will be packed and included
in the MIDX.
This patch does just that. It writes a temporary "reference snapshot",
which is a list of OIDs that are at the ref tips before writing a
multi-pack bitmap. References that are "preferred" (i.e,. are a suffix
of at least one value of the 'pack.preferBitmapTips' configuration) are
marked with a special '+'.
The format is simple: one line per commit at each tip, with an optional
'+' at the beginning (for preferred references, as described above).
When provided, the reference snapshot is used to drive bitmap selection
instead of the MIDX code doing its own traversal. When it isn't
provided, the usual traversal takes place instead.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-09-29 01:55:07 +00:00
|
|
|
write_midx_file(get_object_directory(), NULL, NULL, flags);
|
2021-08-31 20:52:43 +00:00
|
|
|
}
|
2018-10-12 17:34:20 +00:00
|
|
|
|
2023-02-06 23:07:42 +00:00
|
|
|
cleanup:
|
2022-10-22 00:21:45 +00:00
|
|
|
string_list_clear(&names, 1);
|
builtin/repack.c: extract structure to store existing packs
The repack machinery needs to keep track of which packfiles were present
in the repository at the beginning of a repack, segmented by whether or
not each pack is marked as kept.
The names of these packs are stored in two `string_list`s, corresponding
to kept- and non-kept packs, respectively. As a consequence, many
functions within the repack code need to take both `string_list`s as
arguments, leading to code like this:
ret = write_cruft_pack(&cruft_po_args, packtmp, pack_prefix,
cruft_expiration, &names,
&existing_nonkept_packs, /* <- */
&existing_kept_packs); /* <- */
Wrap up this pair of `string_list`s into a single structure that stores
both. This saves us from having to pass both string lists separately,
and prepares for adding additional fields to this structure.
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-09-13 19:17:35 +00:00
|
|
|
existing_packs_release(&existing);
|
2023-08-09 20:32:42 +00:00
|
|
|
free_pack_geometry(&geometry);
|
repack: add `--filter=<filter-spec>` option
This new option puts the objects specified by `<filter-spec>` into a
separate packfile.
This could be useful if, for example, some blobs take up a lot of
precious space on fast storage while they are rarely accessed. It could
make sense to move them into a separate cheaper, though slower, storage.
It's possible to find which new packfile contains the filtered out
objects using one of the following:
- `git verify-pack -v ...`,
- `test-tool find-pack ...`, which a previous commit added,
- `--filter-to=<dir>`, which a following commit will add to specify
where the pack containing the filtered out objects will be.
This feature is implemented by running `git pack-objects` twice in a
row. The first command is run with `--filter=<filter-spec>`, using the
specified filter. It packs objects while omitting the objects specified
by the filter. Then another `git pack-objects` command is launched using
`--stdin-packs`. We pass it all the previously existing packs into its
stdin, so that it will pack all the objects in the previously existing
packs. But we also pass into its stdin, the pack created by the previous
`git pack-objects --filter=<filter-spec>` command as well as the kept
packs, all prefixed with '^', so that the objects in these packs will be
omitted from the resulting pack. The result is that only the objects
filtered out by the first `git pack-objects` command are in the pack
resulting from the second `git pack-objects` command.
As the interactions with kept packs are a bit tricky, a few related
tests are added.
Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-10-02 16:55:01 +00:00
|
|
|
list_objects_filter_release(&po_args.filter_options);
|
2013-09-15 15:33:20 +00:00
|
|
|
|
2023-02-06 23:07:42 +00:00
|
|
|
return ret;
|
2013-09-15 15:33:20 +00:00
|
|
|
}
|