git/builtin/fsck.c

977 lines
25 KiB
C
Raw Normal View History

#define USE_THE_INDEX_COMPATIBILITY_MACROS
#include "builtin.h"
#include "cache.h"
#include "repository.h"
#include "config.h"
#include "commit.h"
#include "tree.h"
#include "blob.h"
#include "tag.h"
#include "refs.h"
#include "pack.h"
#include "cache-tree.h"
#include "tree-walk.h"
#include "fsck.h"
#include "parse-options.h"
#include "dir.h"
#include "progress.h"
#include "streaming.h"
#include "decorate.h"
#include "packfile.h"
#include "object-store.h"
#include "run-command.h"
#include "worktree.h"
#define REACHABLE 0x0001
#define SEEN 0x0002
#define HAS_OBJ 0x0004
/* This flag is set if something points to this object. */
#define USED 0x0008
static int show_root;
static int show_tags;
static int show_unreachable;
static int include_reflogs = 1;
fsck: default to "git fsck --full" Linus and other git developers from the early days trained their fingers to type the command, every once in a while even without thinking, to check the consistency of the repository back when the lower core part of the git was still being developed. Developers who wanted to make sure that git correctly dealt with packfiles could deliberately trigger their creation and checked them after they were created carefully, but loose objects are the ones that are written by various commands from random codepaths. It made some technical sense to have a mode that checked only loose objects from the debugging point of view for that reason. Even for git developers, there no longer is any reason to type "git fsck" every five minutes these days, worried that some newly created objects might be corrupt due to recent change to git. The reason we did not make "--full" the default is probably we trust our filesystems a bit too much. At least, we trusted filesystems more than we trusted the lower core part of git that was under development. Once a packfile is created and we always use it read-only, there didn't seem to be much point in suspecting that the underlying filesystems or disks may corrupt them in such a way that is not caught by the SHA-1 checksum over the entire packfile and per object checksum. That trust in the filesystems might have been a good tradeoff between fsck performance and reliability on platforms git was initially developed on and for, but it may not be true anymore as we run on many more platforms these days. Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-20 18:46:55 +00:00
static int check_full = 1;
static int connectivity_only;
static int check_strict;
static int keep_cache_objects;
static struct fsck_options fsck_walk_options = FSCK_OPTIONS_DEFAULT;
static struct fsck_options fsck_obj_options = FSCK_OPTIONS_DEFAULT;
static int errors_found;
static int write_lost_and_found;
static int verbose;
static int show_progress = -1;
static int show_dangling = 1;
static int name_objects;
#define ERROR_OBJECT 01
#define ERROR_REACHABLE 02
#define ERROR_PACK 04
#define ERROR_REFS 010
#define ERROR_COMMIT_GRAPH 020
static const char *describe_object(struct object *obj)
{
static struct strbuf bufs[] = {
STRBUF_INIT, STRBUF_INIT, STRBUF_INIT, STRBUF_INIT
};
static int b = 0;
struct strbuf *buf;
char *name = NULL;
if (name_objects)
name = lookup_decoration(fsck_walk_options.object_names, obj);
buf = bufs + b;
b = (b + 1) % ARRAY_SIZE(bufs);
strbuf_reset(buf);
strbuf_addstr(buf, oid_to_hex(&obj->oid));
if (name)
strbuf_addf(buf, " (%s)", name);
return buf->buf;
}
static const char *printable_type(struct object *obj)
{
const char *ret;
fsck: lazily load types under --connectivity-only The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-26 04:12:07 +00:00
if (obj->type == OBJ_NONE) {
enum object_type type = oid_object_info(the_repository,
&obj->oid, NULL);
fsck: lazily load types under --connectivity-only The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-26 04:12:07 +00:00
if (type > 0)
object_as_type(the_repository, obj, type, 0);
fsck: lazily load types under --connectivity-only The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-26 04:12:07 +00:00
}
ret = type_name(obj->type);
if (!ret)
ret = _("unknown");
return ret;
}
static int fsck_config(const char *var, const char *value, void *cb)
{
if (strcmp(var, "fsck.skiplist") == 0) {
const char *path;
struct strbuf sb = STRBUF_INIT;
if (git_config_pathname(&path, var, value))
return 1;
strbuf_addf(&sb, "skiplist=%s", path);
free((char *)path);
fsck_set_msg_types(&fsck_obj_options, sb.buf);
strbuf_release(&sb);
return 0;
}
if (skip_prefix(var, "fsck.", &var)) {
fsck_set_msg_type(&fsck_obj_options, var, value);
return 0;
}
return git_default_config(var, value, cb);
}
static int objerror(struct object *obj, const char *err)
{
errors_found |= ERROR_OBJECT;
/* TRANSLATORS: e.g. error in tree 01bfda: <more explanation> */
fprintf_ln(stderr, _("error in %s %s: %s"),
printable_type(obj), describe_object(obj), err);
return -1;
}
static int fsck_error_func(struct fsck_options *o,
struct object *obj, int type, const char *message)
{
switch (type) {
case FSCK_WARN:
/* TRANSLATORS: e.g. warning in tree 01bfda: <more explanation> */
fprintf_ln(stderr, _("warning in %s %s: %s"),
printable_type(obj), describe_object(obj), message);
return 0;
case FSCK_ERROR:
/* TRANSLATORS: e.g. error in tree 01bfda: <more explanation> */
fprintf_ln(stderr, _("error in %s %s: %s"),
printable_type(obj), describe_object(obj), message);
return 1;
default:
BUG("%d (FSCK_IGNORE?) should never trigger this callback", type);
}
}
static struct object_array pending;
static int mark_object(struct object *obj, int type, void *data, struct fsck_options *options)
{
struct object *parent = data;
/*
* The only case data is NULL or type is OBJ_ANY is when
* mark_object_reachable() calls us. All the callers of
* that function has non-NULL obj hence ...
*/
if (!obj) {
/* ... these references to parent->fld are safe here */
printf_ln(_("broken link from %7s %s"),
printable_type(parent), describe_object(parent));
printf_ln(_("broken link from %7s %s"),
(type == OBJ_ANY ? _("unknown") : type_name(type)),
_("unknown"));
errors_found |= ERROR_REACHABLE;
return 1;
}
if (type != OBJ_ANY && obj->type != type)
/* ... and the reference to parent is safe here */
objerror(parent, _("wrong object type in link"));
if (obj->flags & REACHABLE)
return 0;
obj->flags |= REACHABLE;
if (is_promisor_object(&obj->oid))
/*
* Further recursion does not need to be performed on this
* object since it is a promisor object (so it does not need to
* be added to "pending").
*/
return 0;
if (!(obj->flags & HAS_OBJ)) {
if (parent && !has_object_file(&obj->oid)) {
printf_ln(_("broken link from %7s %s\n"
" to %7s %s"),
printable_type(parent),
describe_object(parent),
printable_type(obj),
describe_object(obj));
errors_found |= ERROR_REACHABLE;
}
return 1;
}
add_object_array(obj, NULL, &pending);
return 0;
}
static void mark_object_reachable(struct object *obj)
{
mark_object(obj, OBJ_ANY, NULL, NULL);
}
static int traverse_one_object(struct object *obj)
{
int result = fsck_walk(obj, obj, &fsck_walk_options);
if (obj->type == OBJ_TREE) {
struct tree *tree = (struct tree *)obj;
free_tree_buffer(tree);
}
return result;
}
static int traverse_reachable(void)
{
struct progress *progress = NULL;
unsigned int nr = 0;
int result = 0;
if (show_progress)
progress: simplify "delayed" progress API We used to expose the full power of the delayed progress API to the callers, so that they can specify, not just the message to show and expected total amount of work that is used to compute the percentage of work performed so far, the percent-threshold parameter P and the delay-seconds parameter N. The progress meter starts to show at N seconds into the operation only if we have not yet completed P per-cent of the total work. Most callers used either (0%, 2s) or (50%, 1s) as (P, N), but there are oddballs that chose more random-looking values like 95%. For a smoother workload, (50%, 1s) would allow us to start showing the progress meter earlier than (0%, 2s), while keeping the chance of not showing progress meter for long running operation the same as the latter. For a task that would take 2s or more to complete, it is likely that less than half of it would complete within the first second, if the workload is smooth. But for a spiky workload whose earlier part is easier, such a setting is likely to fail to show the progress meter entirely and (0%, 2s) is more appropriate. But that is merely a theory. Realistically, it is of dubious value to ask each codepath to carefully consider smoothness of their workload and specify their own setting by passing two extra parameters. Let's simplify the API by dropping both parameters and have everybody use (0%, 2s). Oh, by the way, the percent-threshold parameter and the structure member were consistently misspelled, which also is now fixed ;-) Helped-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-19 17:39:41 +00:00
progress = start_delayed_progress(_("Checking connectivity"), 0);
while (pending.nr) {
object_array: add and use `object_array_pop()` In a couple of places, we pop objects off an object array `foo` by decreasing `foo.nr`. We access `foo.nr` in many places, but most if not all other times we do so read-only, e.g., as we iterate over the array. But when we change `foo.nr` behind the array's back, it feels a bit nasty and looks like it might leak memory. Leaks happen if the popped element has an allocated `name` or `path`. At the moment, that is not the case. Still, 1) the object array might gain more fields that want to be freed, 2) a code path where we pop might start using names or paths, 3) one of these code paths might be copied to somewhere where we do, and 4) using a dedicated function for popping is conceptually cleaner. Introduce and use `object_array_pop()` instead. Release memory in the new function. Document that popping an object leaves the associated elements in limbo. The converted places were identified by grepping for "\.nr\>" and looking for "--". Make the new function return NULL on an empty array. This is consistent with `pop_commit()` and allows the following: while ((o = object_array_pop(&foo)) != NULL) { // do something } But as noted above, we don't need to go out of our way to avoid reading `foo.nr`. This is probably more readable: while (foo.nr) { ... o = object_array_pop(&foo); // do something } The name of `object_array_pop()` does not quite align with `add_object_array()`. That is unfortunate. On the other hand, it matches `object_array_clear()`. Arguably it's `add_...` that is the odd one out, since it reads like it's used to "add" an "object array". For that reason, side with `object_array_clear()`. Signed-off-by: Martin Ågren <martin.agren@gmail.com> Reviewed-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-09-22 23:34:53 +00:00
result |= traverse_one_object(object_array_pop(&pending));
display_progress(progress, ++nr);
}
stop_progress(&progress);
return !!result;
}
static int mark_used(struct object *obj, int type, void *data, struct fsck_options *options)
{
if (!obj)
return 1;
obj->flags |= USED;
return 0;
}
fsck: always compute USED flags for unreachable objects The --connectivity-only option avoids opening every object, and instead just marks reachable objects with a flag and compares this to the set of all objects. This strategy is discussed in more detail in 3e3f8bd608 (fsck: prepare dummy objects for --connectivity-check, 2017-01-17). This means that we report _every_ unreachable object as dangling. Whereas in a full fsck, we'd have actually opened and parsed each of those unreachable objects, marking their child objects with the USED flag, to mean "this was mentioned by another object". And thus we can report only the tip of an unreachable segment of the object graph as dangling. You can see this difference with a trivial example: tree=$(git hash-object -t tree -w /dev/null) one=$(echo one | git commit-tree $tree) two=$(echo two | git commit-tree -p $one $tree) Running `git fsck` will report only $two as dangling, but with --connectivity-only, both commits (and the tree) are reported. Likewise, using --lost-found would write all three objects. We can make --connectivity-only work like the normal case by taking a separate pass over the unreachable objects, parsing them and marking objects they refer to as USED. That still avoids parsing any blobs, though we do pay the cost to access any unreachable commits and trees (which may or may not be noticeable, depending on how many you have). If neither --dangling nor --lost-found is in effect, then we can skip this step entirely, just like we do now. That makes "--connectivity-only --no-dangling" just as fast as the current "--connectivity-only". I.e., we do the correct thing always, but you can still tweak the options to make it faster if you don't care about dangling objects. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-05 04:47:39 +00:00
static void mark_unreachable_referents(const struct object_id *oid)
{
struct fsck_options options = FSCK_OPTIONS_DEFAULT;
struct object *obj = lookup_object(the_repository, oid);
fsck: always compute USED flags for unreachable objects The --connectivity-only option avoids opening every object, and instead just marks reachable objects with a flag and compares this to the set of all objects. This strategy is discussed in more detail in 3e3f8bd608 (fsck: prepare dummy objects for --connectivity-check, 2017-01-17). This means that we report _every_ unreachable object as dangling. Whereas in a full fsck, we'd have actually opened and parsed each of those unreachable objects, marking their child objects with the USED flag, to mean "this was mentioned by another object". And thus we can report only the tip of an unreachable segment of the object graph as dangling. You can see this difference with a trivial example: tree=$(git hash-object -t tree -w /dev/null) one=$(echo one | git commit-tree $tree) two=$(echo two | git commit-tree -p $one $tree) Running `git fsck` will report only $two as dangling, but with --connectivity-only, both commits (and the tree) are reported. Likewise, using --lost-found would write all three objects. We can make --connectivity-only work like the normal case by taking a separate pass over the unreachable objects, parsing them and marking objects they refer to as USED. That still avoids parsing any blobs, though we do pay the cost to access any unreachable commits and trees (which may or may not be noticeable, depending on how many you have). If neither --dangling nor --lost-found is in effect, then we can skip this step entirely, just like we do now. That makes "--connectivity-only --no-dangling" just as fast as the current "--connectivity-only". I.e., we do the correct thing always, but you can still tweak the options to make it faster if you don't care about dangling objects. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-05 04:47:39 +00:00
if (!obj || !(obj->flags & HAS_OBJ))
return; /* not part of our original set */
if (obj->flags & REACHABLE)
return; /* reachable objects already traversed */
/*
* Avoid passing OBJ_NONE to fsck_walk, which will parse the object
* (and we want to avoid parsing blobs).
*/
if (obj->type == OBJ_NONE) {
enum object_type type = oid_object_info(the_repository,
&obj->oid, NULL);
if (type > 0)
object_as_type(the_repository, obj, type, 0);
}
options.walk = mark_used;
fsck_walk(obj, NULL, &options);
}
static int mark_loose_unreachable_referents(const struct object_id *oid,
const char *path,
void *data)
{
mark_unreachable_referents(oid);
return 0;
}
static int mark_packed_unreachable_referents(const struct object_id *oid,
struct packed_git *pack,
uint32_t pos,
void *data)
{
mark_unreachable_referents(oid);
return 0;
}
/*
* Check a single reachable object
*/
static void check_reachable_object(struct object *obj)
{
/*
* We obviously want the object to be parsed,
* except if it was in a pack-file and we didn't
* do a full fsck
*/
if (!(obj->flags & HAS_OBJ)) {
if (is_promisor_object(&obj->oid))
return;
if (has_object_pack(&obj->oid))
return; /* it is in pack - forget about it */
printf_ln(_("missing %s %s"), printable_type(obj),
describe_object(obj));
errors_found |= ERROR_REACHABLE;
return;
}
}
/*
* Check a single unreachable object
*/
static void check_unreachable_object(struct object *obj)
{
/*
* Missing unreachable object? Ignore it. It's not like
* we miss it (since it can't be reached), nor do we want
* to complain about it being unreachable (since it does
* not exist).
*/
if (!(obj->flags & HAS_OBJ))
return;
/*
* Unreachable object that exists? Show it if asked to,
* since this is something that is prunable.
*/
if (show_unreachable) {
printf_ln(_("unreachable %s %s"), printable_type(obj),
describe_object(obj));
return;
}
/*
* "!USED" means that nothing at all points to it, including
* other unreachable objects. In other words, it's the "tip"
* of some set of unreachable objects, usually a commit that
* got dropped.
*
* Such starting points are more interesting than some random
* set of unreachable objects, so we show them even if the user
* hasn't asked for _all_ unreachable objects. If you have
* deleted a branch by mistake, this is a prime candidate to
* start looking at, for example.
*/
if (!(obj->flags & USED)) {
if (show_dangling)
printf_ln(_("dangling %s %s"), printable_type(obj),
describe_object(obj));
if (write_lost_and_found) {
char *filename = git_pathdup("lost-found/%s/%s",
obj->type == OBJ_COMMIT ? "commit" : "other",
describe_object(obj));
FILE *f;
if (safe_create_leading_directories_const(filename)) {
error(_("could not create lost-found"));
free(filename);
return;
}
f = xfopen(filename, "w");
if (obj->type == OBJ_BLOB) {
if (stream_blob_to_fd(fileno(f), &obj->oid, NULL, 1))
die_errno(_("could not write '%s'"), filename);
} else
fprintf(f, "%s\n", describe_object(obj));
if (fclose(f))
die_errno(_("could not finish '%s'"),
filename);
free(filename);
}
return;
}
/*
* Otherwise? It's there, it's unreachable, and some other unreachable
* object points to it. Ignore it - it's not interesting, and we showed
* all the interesting cases above.
*/
}
static void check_object(struct object *obj)
{
if (verbose)
fprintf_ln(stderr, _("Checking %s"), describe_object(obj));
if (obj->flags & REACHABLE)
check_reachable_object(obj);
else
check_unreachable_object(obj);
}
static void check_connectivity(void)
{
int i, max;
/* Traverse the pending reachable objects */
traverse_reachable();
fsck: always compute USED flags for unreachable objects The --connectivity-only option avoids opening every object, and instead just marks reachable objects with a flag and compares this to the set of all objects. This strategy is discussed in more detail in 3e3f8bd608 (fsck: prepare dummy objects for --connectivity-check, 2017-01-17). This means that we report _every_ unreachable object as dangling. Whereas in a full fsck, we'd have actually opened and parsed each of those unreachable objects, marking their child objects with the USED flag, to mean "this was mentioned by another object". And thus we can report only the tip of an unreachable segment of the object graph as dangling. You can see this difference with a trivial example: tree=$(git hash-object -t tree -w /dev/null) one=$(echo one | git commit-tree $tree) two=$(echo two | git commit-tree -p $one $tree) Running `git fsck` will report only $two as dangling, but with --connectivity-only, both commits (and the tree) are reported. Likewise, using --lost-found would write all three objects. We can make --connectivity-only work like the normal case by taking a separate pass over the unreachable objects, parsing them and marking objects they refer to as USED. That still avoids parsing any blobs, though we do pay the cost to access any unreachable commits and trees (which may or may not be noticeable, depending on how many you have). If neither --dangling nor --lost-found is in effect, then we can skip this step entirely, just like we do now. That makes "--connectivity-only --no-dangling" just as fast as the current "--connectivity-only". I.e., we do the correct thing always, but you can still tweak the options to make it faster if you don't care about dangling objects. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-05 04:47:39 +00:00
/*
* With --connectivity-only, we won't have actually opened and marked
* unreachable objects with USED. Do that now to make --dangling, etc
* accurate.
*/
if (connectivity_only && (show_dangling || write_lost_and_found)) {
/*
* Even though we already have a "struct object" for each of
* these in memory, we must not iterate over the internal
* object hash as we do below. Our loop would potentially
* resize the hash, making our iteration invalid.
*
* Instead, we'll just go back to the source list of objects,
* and ignore any that weren't present in our earlier
* traversal.
*/
for_each_loose_object(mark_loose_unreachable_referents, NULL, 0);
for_each_packed_object(mark_packed_unreachable_referents, NULL, 0);
}
/* Look up all the requirements, warn about missing objects.. */
max = get_max_object_index();
if (verbose)
fprintf_ln(stderr, _("Checking connectivity (%d objects)"), max);
for (i = 0; i < max; i++) {
struct object *obj = get_indexed_object(i);
if (obj)
check_object(obj);
}
}
static int fsck_obj(struct object *obj, void *buffer, unsigned long size)
{
int err;
if (obj->flags & SEEN)
return 0;
obj->flags |= SEEN;
if (verbose)
fprintf_ln(stderr, _("Checking %s %s"),
printable_type(obj), describe_object(obj));
if (fsck_walk(obj, NULL, &fsck_obj_options))
objerror(obj, _("broken links"));
err = fsck_object(obj, buffer, size, &fsck_obj_options);
if (err)
goto out;
if (obj->type == OBJ_COMMIT) {
struct commit *commit = (struct commit *) obj;
if (!commit->parents && show_root)
printf_ln(_("root %s"),
describe_object(&commit->object));
}
if (obj->type == OBJ_TAG) {
struct tag *tag = (struct tag *) obj;
if (show_tags && tag->tagged) {
printf_ln(_("tagged %s %s (%s) in %s"),
printable_type(tag->tagged),
describe_object(tag->tagged),
tag->tag,
describe_object(&tag->object));
}
}
out:
if (obj->type == OBJ_TREE)
free_tree_buffer((struct tree *)obj);
if (obj->type == OBJ_COMMIT)
free_commit_buffer(the_repository->parsed_objects,
(struct commit *)obj);
return err;
}
static int fsck_obj_buffer(const struct object_id *oid, enum object_type type,
unsigned long size, void *buffer, int *eaten)
{
/*
* Note, buffer may be NULL if type is OBJ_BLOB. See
* verify_packfile(), data_valid variable for details.
*/
struct object *obj;
obj = parse_object_buffer(the_repository, oid, type, size, buffer,
eaten);
if (!obj) {
errors_found |= ERROR_OBJECT;
return error(_("%s: object corrupt or missing"),
oid_to_hex(oid));
}
obj->flags &= ~(REACHABLE | SEEN);
obj->flags |= HAS_OBJ;
return fsck_obj(obj, buffer, size);
}
static int default_refs;
static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
timestamp_t timestamp)
{
struct object *obj;
if (!is_null_oid(oid)) {
obj = lookup_object(the_repository, oid);
if (obj && (obj->flags & HAS_OBJ)) {
if (timestamp && name_objects)
add_decoration(fsck_walk_options.object_names,
obj,
xstrfmt("%s@{%"PRItime"}", refname, timestamp));
obj->flags |= USED;
mark_object_reachable(obj);
} else if (!is_promisor_object(oid)) {
error(_("%s: invalid reflog entry %s"),
refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
}
}
}
static int fsck_handle_reflog_ent(struct object_id *ooid, struct object_id *noid,
const char *email, timestamp_t timestamp, int tz,
const char *message, void *cb_data)
{
const char *refname = cb_data;
if (verbose)
fprintf_ln(stderr, _("Checking reflog %s->%s"),
oid_to_hex(ooid), oid_to_hex(noid));
fsck_handle_reflog_oid(refname, ooid, 0);
fsck_handle_reflog_oid(refname, noid, timestamp);
return 0;
}
static int fsck_handle_reflog(const char *logname, const struct object_id *oid,
int flag, void *cb_data)
{
struct strbuf refname = STRBUF_INIT;
strbuf_worktree_ref(cb_data, &refname, logname);
for_each_reflog_ent(refname.buf, fsck_handle_reflog_ent, refname.buf);
strbuf_release(&refname);
return 0;
}
static int fsck_handle_ref(const char *refname, const struct object_id *oid,
int flag, void *cb_data)
{
struct object *obj;
obj = parse_object(the_repository, oid);
if (!obj) {
if (is_promisor_object(oid)) {
/*
* Increment default_refs anyway, because this is a
* valid ref.
*/
default_refs++;
return 0;
}
error(_("%s: invalid sha1 pointer %s"),
refname, oid_to_hex(oid));
errors_found |= ERROR_REACHABLE;
/* We'll continue with the rest despite the error.. */
return 0;
}
if (obj->type != OBJ_COMMIT && is_branch(refname)) {
error(_("%s: not a commit"), refname);
errors_found |= ERROR_REFS;
}
default_refs++;
obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(refname));
mark_object_reachable(obj);
return 0;
}
static int fsck_head_link(const char *head_ref_name,
const char **head_points_at,
struct object_id *head_oid);
static void get_default_heads(void)
{
struct worktree **worktrees, **p;
const char *head_points_at;
struct object_id head_oid;
for_each_rawref(fsck_handle_ref, NULL);
worktrees = get_worktrees(0);
for (p = worktrees; *p; p++) {
struct worktree *wt = *p;
struct strbuf ref = STRBUF_INIT;
strbuf_worktree_ref(wt, &ref, "HEAD");
fsck_head_link(ref.buf, &head_points_at, &head_oid);
if (head_points_at && !is_null_oid(&head_oid))
fsck_handle_ref(ref.buf, &head_oid, 0, NULL);
strbuf_release(&ref);
if (include_reflogs)
refs_for_each_reflog(get_worktree_ref_store(wt),
fsck_handle_reflog, wt);
}
free_worktrees(worktrees);
/*
* Not having any default heads isn't really fatal, but
* it does mean that "--unreachable" no longer makes any
* sense (since in this case everything will obviously
* be unreachable by definition.
*
* Showing dangling objects is valid, though (as those
* dangling objects are likely lost heads).
*
* So we just print a warning about it, and clear the
* "show_unreachable" flag.
*/
if (!default_refs) {
fprintf_ln(stderr, _("notice: No default references"));
show_unreachable = 0;
}
}
static int fsck_loose(const struct object_id *oid, const char *path, void *data)
{
struct object *obj;
enum object_type type;
unsigned long size;
void *contents;
int eaten;
if (read_loose_object(path, oid, &type, &size, &contents) < 0) {
errors_found |= ERROR_OBJECT;
error(_("%s: object corrupt or missing: %s"),
oid_to_hex(oid), path);
return 0; /* keep checking other objects */
}
if (!contents && type != OBJ_BLOB)
BUG("read_loose_object streamed a non-blob");
obj = parse_object_buffer(the_repository, oid, type, size,
contents, &eaten);
if (!obj) {
errors_found |= ERROR_OBJECT;
error(_("%s: object could not be parsed: %s"),
oid_to_hex(oid), path);
if (!eaten)
free(contents);
return 0; /* keep checking other objects */
}
obj->flags &= ~(REACHABLE | SEEN);
obj->flags |= HAS_OBJ;
if (fsck_obj(obj, contents, size))
errors_found |= ERROR_OBJECT;
if (!eaten)
free(contents);
return 0; /* keep checking other objects, even if we saw an error */
}
static int fsck_cruft(const char *basename, const char *path, void *data)
{
if (!starts_with(basename, "tmp_obj_"))
fprintf_ln(stderr, _("bad sha1 file: %s"), path);
return 0;
}
static int fsck_subdir(unsigned int nr, const char *path, void *progress)
{
display_progress(progress, nr + 1);
return 0;
}
static void fsck_object_dir(const char *path)
{
struct progress *progress = NULL;
if (verbose)
fprintf_ln(stderr, _("Checking object directory"));
if (show_progress)
progress = start_progress(_("Checking object directories"), 256);
for_each_loose_file_in_objdir(path, fsck_loose, fsck_cruft, fsck_subdir,
progress);
display_progress(progress, 256);
stop_progress(&progress);
}
static int fsck_head_link(const char *head_ref_name,
const char **head_points_at,
struct object_id *head_oid)
{
int null_is_error = 0;
if (verbose)
fprintf_ln(stderr, _("Checking %s link"), head_ref_name);
*head_points_at = resolve_ref_unsafe(head_ref_name, 0, head_oid, NULL);
if (!*head_points_at) {
errors_found |= ERROR_REFS;
return error(_("invalid %s"), head_ref_name);
}
if (!strcmp(*head_points_at, head_ref_name))
/* detached HEAD */
null_is_error = 1;
else if (!starts_with(*head_points_at, "refs/heads/")) {
errors_found |= ERROR_REFS;
return error(_("%s points to something strange (%s)"),
head_ref_name, *head_points_at);
}
if (is_null_oid(head_oid)) {
if (null_is_error) {
errors_found |= ERROR_REFS;
return error(_("%s: detached HEAD points at nothing"),
head_ref_name);
}
fprintf_ln(stderr,
_("notice: %s points to an unborn branch (%s)"),
head_ref_name, *head_points_at + 11);
}
return 0;
}
static int fsck_cache_tree(struct cache_tree *it)
{
int i;
int err = 0;
if (verbose)
fprintf_ln(stderr, _("Checking cache tree"));
if (0 <= it->entry_count) {
struct object *obj = parse_object(the_repository, &it->oid);
if (!obj) {
error(_("%s: invalid sha1 pointer in cache-tree"),
oid_to_hex(&it->oid));
errors_found |= ERROR_REFS;
return 1;
}
obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(":"));
mark_object_reachable(obj);
if (obj->type != OBJ_TREE)
err |= objerror(obj, _("non-tree in cache-tree"));
}
for (i = 0; i < it->subtree_nr; i++)
err |= fsck_cache_tree(it->down[i]->cache_tree);
return err;
}
static void mark_object_for_connectivity(const struct object_id *oid)
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
{
struct object *obj = lookup_unknown_object(oid);
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
obj->flags |= HAS_OBJ;
}
static int mark_loose_for_connectivity(const struct object_id *oid,
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
const char *path,
void *data)
{
mark_object_for_connectivity(oid);
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
return 0;
}
static int mark_packed_for_connectivity(const struct object_id *oid,
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
struct packed_git *pack,
uint32_t pos,
void *data)
{
mark_object_for_connectivity(oid);
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
return 0;
}
static char const * const fsck_usage[] = {
N_("git fsck [<options>] [<object>...]"),
NULL
};
static struct option fsck_opts[] = {
OPT__VERBOSE(&verbose, N_("be verbose")),
OPT_BOOL(0, "unreachable", &show_unreachable, N_("show unreachable objects")),
OPT_BOOL(0, "dangling", &show_dangling, N_("show dangling objects")),
OPT_BOOL(0, "tags", &show_tags, N_("report tags")),
OPT_BOOL(0, "root", &show_root, N_("report root nodes")),
OPT_BOOL(0, "cache", &keep_cache_objects, N_("make index objects head nodes")),
OPT_BOOL(0, "reflogs", &include_reflogs, N_("make reflogs head nodes (default)")),
OPT_BOOL(0, "full", &check_full, N_("also consider packs and alternate objects")),
OPT_BOOL(0, "connectivity-only", &connectivity_only, N_("check only connectivity")),
OPT_BOOL(0, "strict", &check_strict, N_("enable more strict checking")),
OPT_BOOL(0, "lost-found", &write_lost_and_found,
N_("write dangling objects in .git/lost-found")),
OPT_BOOL(0, "progress", &show_progress, N_("show progress")),
OPT_BOOL(0, "name-objects", &name_objects, N_("show verbose names for reachable objects")),
OPT_END(),
};
int cmd_fsck(int argc, const char **argv, const char *prefix)
{
int i;
struct object_directory *odb;
sha1_file: support lazily fetching missing objects Teach sha1_file to fetch objects from the remote configured in extensions.partialclone whenever an object is requested but missing. The fetching of objects can be suppressed through a global variable. This is used by fsck and index-pack. However, by default, such fetching is not suppressed. This is meant as a temporary measure to ensure that all Git commands work in such a situation. Future patches will update some commands to either tolerate missing objects (without fetching them) or be more efficient in fetching them. In order to determine the code changes in sha1_file.c necessary, I investigated the following: (1) functions in sha1_file.c that take in a hash, without the user regarding how the object is stored (loose or packed) (2) functions in packfile.c (because I need to check callers that know about the loose/packed distinction and operate on both differently, and ensure that they can handle the concept of objects that are neither loose nor packed) (1) is handled by the modification to sha1_object_info_extended(). For (2), I looked at for_each_packed_object and others. For for_each_packed_object, the callers either already work or are fixed in this patch: - reachable - only to find recent objects - builtin/fsck - already knows about missing objects - builtin/cat-file - warning message added in this commit Callers of the other functions do not need to be changed: - parse_pack_index - http - indirectly from http_get_info_packs - find_pack_entry_one - this searches a single pack that is provided as an argument; the caller already knows (through other means) that the sought object is in a specific pack - find_sha1_pack - fast-import - appears to be an optimization to not store a file if it is already in a pack - http-walker - to search through a struct alt_base - http-push - to search through remote packs - has_sha1_pack - builtin/fsck - already knows about promisor objects - builtin/count-objects - informational purposes only (check if loose object is also packed) - builtin/prune-packed - check if object to be pruned is packed (if not, don't prune it) - revision - used to exclude packed objects if requested by user - diff - just for optimization Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-12-08 15:27:14 +00:00
/* fsck knows how to handle missing promisor objects */
fetch_if_missing = 0;
errors_found = 0;
read_replace_refs = 0;
argc = parse_options(argc, argv, prefix, fsck_opts, fsck_usage, 0);
fsck_walk_options.walk = mark_object;
fsck_obj_options.walk = mark_used;
fsck_obj_options.error_func = fsck_error_func;
if (check_strict)
fsck_obj_options.strict = 1;
if (show_progress == -1)
show_progress = isatty(2);
if (verbose)
show_progress = 0;
if (write_lost_and_found) {
check_full = 1;
include_reflogs = 0;
}
if (name_objects)
fsck_walk_options.object_names =
xcalloc(1, sizeof(struct decoration));
git_config(fsck_config, NULL);
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
if (connectivity_only) {
for_each_loose_object(mark_loose_for_connectivity, NULL, 0);
for_each_packed_object(mark_packed_for_connectivity, NULL, 0);
} else {
prepare_alt_odb(the_repository);
sha1-file: use an object_directory for the main object dir Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 14:50:39 +00:00
for (odb = the_repository->objects->odb; odb; odb = odb->next)
fsck_object_dir(odb->path);
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
if (check_full) {
struct packed_git *p;
uint32_t total = 0, count = 0;
struct progress *progress = NULL;
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
if (show_progress) {
for (p = get_all_packs(the_repository); p;
p = p->next) {
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
if (open_pack_index(p))
continue;
total += p->num_objects;
}
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
progress = start_progress(_("Checking objects"), total);
}
for (p = get_all_packs(the_repository); p;
p = p->next) {
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
/* verify gives error messages itself */
if (verify_pack(the_repository,
p, fsck_obj_buffer,
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
progress, count))
errors_found |= ERROR_PACK;
count += p->num_objects;
}
fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-01-17 21:32:57 +00:00
stop_progress(&progress);
}
if (fsck_finish(&fsck_obj_options))
errors_found |= ERROR_OBJECT;
}
for (i = 0; i < argc; i++) {
const char *arg = argv[i];
struct object_id oid;
if (!get_oid(arg, &oid)) {
struct object *obj = lookup_object(the_repository,
&oid);
if (!obj || !(obj->flags & HAS_OBJ)) {
if (is_promisor_object(&oid))
continue;
error(_("%s: object missing"), oid_to_hex(&oid));
errors_found |= ERROR_OBJECT;
continue;
}
obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj, xstrdup(arg));
mark_object_reachable(obj);
continue;
}
error(_("invalid parameter: expected sha1, got '%s'"), arg);
errors_found |= ERROR_OBJECT;
}
/*
* If we've not been given any explicit head information, do the
* default ones from .git/refs. We also consider the index file
* in this case (ie this implies --cache).
*/
if (!argc) {
get_default_heads();
keep_cache_objects = 1;
}
if (keep_cache_objects) {
verify_index_checksum = 1;
verify_ce_order = 1;
read_cache();
for (i = 0; i < active_nr; i++) {
unsigned int mode;
struct blob *blob;
struct object *obj;
mode = active_cache[i]->ce_mode;
if (S_ISGITLINK(mode))
continue;
blob = lookup_blob(the_repository,
&active_cache[i]->oid);
if (!blob)
continue;
obj = &blob->object;
obj->flags |= USED;
if (name_objects)
add_decoration(fsck_walk_options.object_names,
obj,
xstrfmt(":%s", active_cache[i]->name));
mark_object_reachable(obj);
}
if (active_cache_tree)
fsck_cache_tree(active_cache_tree);
}
check_connectivity();
if (!git_config_get_bool("core.commitgraph", &i) && i) {
struct child_process commit_graph_verify = CHILD_PROCESS_INIT;
const char *verify_argv[] = { "commit-graph", "verify", NULL, NULL, NULL };
prepare_alt_odb(the_repository);
sha1-file: use an object_directory for the main object dir Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 14:50:39 +00:00
for (odb = the_repository->objects->odb; odb; odb = odb->next) {
child_process_init(&commit_graph_verify);
commit_graph_verify.argv = verify_argv;
commit_graph_verify.git_cmd = 1;
verify_argv[2] = "--object-dir";
verify_argv[3] = odb->path;
if (run_command(&commit_graph_verify))
errors_found |= ERROR_COMMIT_GRAPH;
}
}
if (!git_config_get_bool("core.multipackindex", &i) && i) {
struct child_process midx_verify = CHILD_PROCESS_INIT;
const char *midx_argv[] = { "multi-pack-index", "verify", NULL, NULL, NULL };
prepare_alt_odb(the_repository);
sha1-file: use an object_directory for the main object dir Our handling of alternate object directories is needlessly different from the main object directory. As a result, many places in the code basically look like this: do_something(r->objects->objdir); for (odb = r->objects->alt_odb_list; odb; odb = odb->next) do_something(odb->path); That gets annoying when do_something() is non-trivial, and we've resorted to gross hacks like creating fake alternates (see find_short_object_filename()). Instead, let's give each raw_object_store a unified list of object_directory structs. The first will be the main store, and everything after is an alternate. Very few callers even care about the distinction, and can just loop over the whole list (and those who care can just treat the first element differently). A few observations: - we don't need r->objects->objectdir anymore, and can just mechanically convert that to r->objects->odb->path - object_directory's path field needs to become a real pointer rather than a FLEX_ARRAY, in order to fill it with expand_base_dir() - we'll call prepare_alt_odb() earlier in many functions (i.e., outside of the loop). This may result in us calling it even when our function would be satisfied looking only at the main odb. But this doesn't matter in practice. It's not a very expensive operation in the first place, and in the majority of cases it will be a noop. We call it already (and cache its results) in prepare_packed_git(), and we'll generally check packs before loose objects. So essentially every program is going to call it immediately once per program. Arguably we should just prepare_alt_odb() immediately upon setting up the repository's object directory, which would save us sprinkling calls throughout the code base (and forgetting to do so has been a source of subtle bugs in the past). But I've stopped short of that here, since there are already a lot of other moving parts in this patch. - Most call sites just get shorter. The check_and_freshen() functions are an exception, because they have entry points to handle local and nonlocal directories separately. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-12 14:50:39 +00:00
for (odb = the_repository->objects->odb; odb; odb = odb->next) {
child_process_init(&midx_verify);
midx_verify.argv = midx_argv;
midx_verify.git_cmd = 1;
midx_argv[2] = "--object-dir";
midx_argv[3] = odb->path;
if (run_command(&midx_verify))
errors_found |= ERROR_COMMIT_GRAPH;
}
}
return errors_found;
}