From 1ada11ee62658ff4366d93b97995eea055827171 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Mon, 16 Jan 2017 16:24:03 -0500 Subject: [PATCH 1/9] t1450: clean up sub-objects in duplicate-entry test This test creates a multi-level set of trees, but its cleanup routine only removes the top-level tree. After the test finishes, the inner tree and the blob it points to remain, making the inner tree dangling. A later test ("cleaned up") verifies that we've removed any cruft and "git fsck" output is clean. This passes only because of a bug in git-fsck which fails to notice dangling trees. In preparation for fixing the bug, let's teach this earlier test to clean up after itself correctly. We have to remove the inner tree (and therefore the blob, too, which becomes dangling after removing that tree). Since the setup code happens inside a subshell, we can't just set a variable for each object. However, we can stuff all of the sha1s into the $T output variable, which is not used for anything except cleanup. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- t/t1450-fsck.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index ee7d4736db..6eef8b28ed 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -189,14 +189,16 @@ test_expect_success 'commit with NUL in header' ' ' test_expect_success 'tree object with duplicate entries' ' - test_when_finished "remove_object \$T" && + test_when_finished "for i in \$T; do remove_object \$i; done" && T=$( GIT_INDEX_FILE=test-index && export GIT_INDEX_FILE && rm -f test-index && >x && git add x && + git rev-parse :x && T=$(git write-tree) && + echo $T && ( git cat-file tree $T && git cat-file tree $T From b4584e4f665f59f51572f479db6baf1a1cdbc03a Mon Sep 17 00:00:00 2001 From: Jeff King Date: Mon, 16 Jan 2017 16:25:35 -0500 Subject: [PATCH 2/9] fsck: report trees as dangling After checking connectivity, fsck looks through the list of any objects we've seen mentioned, and reports unreachable and un-"used" ones as dangling. However, it skips any object which is not marked as "parsed", as that is an object that we _don't_ have (but that somebody mentioned). Since 6e454b9a3 (clear parsed flag when we free tree buffers, 2013-06-05), that flag can't be relied on, and the correct method is to check the HAS_OBJ flag. The cleanup in that commit missed this callsite, though. As a result, we would generally fail to report dangling trees. We never noticed because there were no tests in this area (for trees or otherwise). Let's add some. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 2 +- t/t1450-fsck.sh | 27 +++++++++++++++++++++++++++ 2 files changed, 28 insertions(+), 1 deletion(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index f01b81eebf..3e67203f9c 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -225,7 +225,7 @@ static void check_unreachable_object(struct object *obj) * to complain about it being unreachable (since it does * not exist). */ - if (!obj->parsed) + if (!(obj->flags & HAS_OBJ)) return; /* diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 6eef8b28ed..e88ec7747b 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -559,4 +559,31 @@ test_expect_success 'fsck --name-objects' ' ) ' +# for each of type, we have one version which is referenced by another object +# (and so while unreachable, not dangling), and another variant which really is +# dangling. +test_expect_success 'fsck notices dangling objects' ' + git init dangling && + ( + cd dangling && + blob=$(echo not-dangling | git hash-object -w --stdin) && + dblob=$(echo dangling | git hash-object -w --stdin) && + tree=$(printf "100644 blob %s\t%s\n" $blob one | git mktree) && + dtree=$(printf "100644 blob %s\t%s\n" $blob two | git mktree) && + commit=$(git commit-tree $tree) && + dcommit=$(git commit-tree -p $commit $tree) && + + cat >expect <<-EOF && + dangling blob $dblob + dangling commit $dcommit + dangling tree $dtree + EOF + + git fsck >actual && + # the output order is non-deterministic, as it comes from a hash + sort actual.sorted && + test_cmp expect actual.sorted + ) +' + test_done From 3e3f8bd608e6682a2ad4f6cef52ed8ec45b8ad59 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 17 Jan 2017 16:32:57 -0500 Subject: [PATCH 3/9] fsck: prepare dummy objects for --connectivity-check Normally fsck makes a pass over all objects to check their integrity, and then follows up with a reachability check to make sure we have all of the referenced objects (and to know which ones are dangling). The latter checks for the HAS_OBJ flag in obj->flags to see if we found the object in the first pass. Commit 02976bf85 (fsck: introduce `git fsck --connectivity-only`, 2015-06-22) taught fsck to skip the initial pass, and to fallback to has_sha1_file() instead of the HAS_OBJ check. However, it converted only one HAS_OBJ check to use has_sha1_file(). But there are many other places in builtin/fsck.c that assume that the flag is set (or that lookup_object() will return an object at all). This leads to several bugs with --connectivity-only: 1. mark_object() will not queue objects for examination, so recursively following links from commits to trees, etc, did nothing. I.e., we were checking the reachability of hardly anything at all. 2. When a set of heads is given on the command-line, we use lookup_object() to see if they exist. But without the initial pass, we assume nothing exists. 3. When loading reflog entries, we do a similar lookup_object() check, and complain that the reflog is broken if the object doesn't exist in our hash. So in short, --connectivity-only is broken pretty badly, and will claim that your repository is fine when it's not. Presumably nobody noticed for a few reasons. One is that the embedded test does not actually test the recursive nature of the reachability check. All of the missing objects are still in the index, and we directly check items from the index. This patch modifies the test to delete the index, which shows off breakage (1). Another is that --connectivity-only just skips the initial pass for loose objects. So on a real repository, the packed objects were still checked correctly. But on the flipside, it means that "git fsck --connectivity-only" still checks the sha1 of all of the packed objects, nullifying its original purpose of being a faster git-fsck. And of course the final problem is that the bug only shows up when there _is_ corruption, which is rare. So anybody running "git fsck --connectivity-only" proactively would assume it was being thorough, when it was not. One possibility for fixing this is to find all of the spots that rely on HAS_OBJ and tweak them for the connectivity-only case. But besides the risk that we might miss a spot (and I found three already, corresponding to the three bugs above), there are other parts of fsck that _can't_ work without a full list of objects. E.g., the list of dangling objects. Instead, let's make the connectivity-only case look more like the normal case. Rather than skip the initial pass completely, we'll do an abbreviated one that sets up the HAS_OBJ flag for each object, without actually loading the object data. That's simple and fast, and we don't have to care about the connectivity_only flag in the rest of the code at all. While we're at it, let's make sure we treat loose and packed objects the same (i.e., setting up dummy objects for both and skipping the actual sha1 check). That makes the connectivity-only check actually fast on a real repo (40 seconds versus 180 seconds on my copy of linux.git). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 122 ++++++++++++++++++++++++++++++++++++++---------- t/t1450-fsck.sh | 29 +++++++++++- 2 files changed, 125 insertions(+), 26 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 3e67203f9c..75e836e2fd 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -205,8 +205,6 @@ static void check_reachable_object(struct object *obj) if (!(obj->flags & HAS_OBJ)) { if (has_sha1_pack(obj->oid.hash)) return; /* it is in pack - forget about it */ - if (connectivity_only && has_object_file(&obj->oid)) - return; printf("missing %s %s\n", typename(obj->type), describe_object(obj)); errors_found |= ERROR_REACHABLE; @@ -584,6 +582,79 @@ static int fsck_cache_tree(struct cache_tree *it) return err; } +static void mark_object_for_connectivity(const unsigned char *sha1) +{ + struct object *obj = lookup_object(sha1); + + /* + * Setting the object type here isn't strictly necessary for a + * connectivity check. In most cases, our walk will expect a certain + * type (e.g., a tree referencing a blob) and will use lookup_blob() to + * assign the type. But doing it here has two advantages: + * + * 1. When the fsck_walk code looks at objects that _don't_ come from + * links (e.g., the tip of a ref), it may complain about the + * "unknown object type". + * + * 2. This serves as a nice cross-check that the graph links are + * sane. So --connectivity-only does not check that the bits of + * blobs are not corrupted, but it _does_ check that 100644 tree + * entries point to blobs, and so forth. + * + * Unfortunately we can't just use parse_object() here, because the + * whole point of --connectivity-only is to avoid reading the object + * data more than necessary. + */ + if (!obj || obj->type == OBJ_NONE) { + enum object_type type = sha1_object_info(sha1, NULL); + switch (type) { + case OBJ_BAD: + error("%s: unable to read object type", + sha1_to_hex(sha1)); + break; + case OBJ_COMMIT: + obj = (struct object *)lookup_commit(sha1); + break; + case OBJ_TREE: + obj = (struct object *)lookup_tree(sha1); + break; + case OBJ_BLOB: + obj = (struct object *)lookup_blob(sha1); + break; + case OBJ_TAG: + obj = (struct object *)lookup_tag(sha1); + break; + default: + error("%s: unknown object type %d", + sha1_to_hex(sha1), type); + } + + if (!obj || obj->type == OBJ_NONE) { + errors_found |= ERROR_OBJECT; + return; + } + } + + obj->flags |= HAS_OBJ; +} + +static int mark_loose_for_connectivity(const unsigned char *sha1, + const char *path, + void *data) +{ + mark_object_for_connectivity(sha1); + return 0; +} + +static int mark_packed_for_connectivity(const unsigned char *sha1, + struct packed_git *pack, + uint32_t pos, + void *data) +{ + mark_object_for_connectivity(sha1); + return 0; +} + static char const * const fsck_usage[] = { N_("git fsck [] [...]"), NULL @@ -640,38 +711,41 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) git_config(fsck_config, NULL); fsck_head_link(); - if (!connectivity_only) { + if (connectivity_only) { + for_each_loose_object(mark_loose_for_connectivity, NULL, 0); + for_each_packed_object(mark_packed_for_connectivity, NULL, 0); + } else { fsck_object_dir(get_object_directory()); prepare_alt_odb(); for (alt = alt_odb_list; alt; alt = alt->next) fsck_object_dir(alt->path); - } - if (check_full) { - struct packed_git *p; - uint32_t total = 0, count = 0; - struct progress *progress = NULL; + if (check_full) { + struct packed_git *p; + uint32_t total = 0, count = 0; + struct progress *progress = NULL; - prepare_packed_git(); + prepare_packed_git(); - if (show_progress) { - for (p = packed_git; p; p = p->next) { - if (open_pack_index(p)) - continue; - total += p->num_objects; + if (show_progress) { + for (p = packed_git; p; p = p->next) { + if (open_pack_index(p)) + continue; + total += p->num_objects; + } + + progress = start_progress(_("Checking objects"), total); } - - progress = start_progress(_("Checking objects"), total); + for (p = packed_git; p; p = p->next) { + /* verify gives error messages itself */ + if (verify_pack(p, fsck_obj_buffer, + progress, count)) + errors_found |= ERROR_PACK; + count += p->num_objects; + } + stop_progress(&progress); } - for (p = packed_git; p; p = p->next) { - /* verify gives error messages itself */ - if (verify_pack(p, fsck_obj_buffer, - progress, count)) - errors_found |= ERROR_PACK; - count += p->num_objects; - } - stop_progress(&progress); } heads = 0; diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index e88ec7747b..4d1c3ba664 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -523,9 +523,21 @@ test_expect_success 'fsck --connectivity-only' ' touch empty && git add empty && test_commit empty && + + # Drop the index now; we want to be sure that we + # recursively notice the broken objects + # because they are reachable from refs, not because + # they are in the index. + rm -f .git/index && + + # corrupt the blob, but in a way that we can still identify + # its type. That lets us see that --connectivity-only is + # not actually looking at the contents, but leaves it + # free to examine the type if it chooses. empty=.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 && - rm -f $empty && - echo invalid >$empty && + blob=$(echo unrelated | git hash-object -w --stdin) && + mv $(sha1_file $blob) $empty && + test_must_fail git fsck --strict && git fsck --strict --connectivity-only && tree=$(git rev-parse HEAD:) && @@ -537,6 +549,19 @@ test_expect_success 'fsck --connectivity-only' ' ) ' +test_expect_success 'fsck --connectivity-only with explicit head' ' + rm -rf connectivity-only && + git init connectivity-only && + ( + cd connectivity-only && + test_commit foo && + rm -f .git/index && + tree=$(git rev-parse HEAD^{tree}) && + remove_object $(git rev-parse HEAD:foo.t) && + test_must_fail git fsck --connectivity-only $tree + ) +' + remove_loose_object () { sha1="$(git rev-parse "$1")" && remainder=${sha1#??} && From c6c7b16d23a4cb6af26acee865c2ade1a3822bef Mon Sep 17 00:00:00 2001 From: Jeff King Date: Mon, 16 Jan 2017 16:33:29 -0500 Subject: [PATCH 4/9] fsck: tighten error-checks of "git fsck " Instead of checking reachability from the refs, you can ask fsck to check from a particular set of heads. However, the error checking here is quite lax. In particular: 1. It claims lookup_object() will report an error, which is not true. It only does a hash lookup, and the user has no clue that their argument was skipped. 2. When either the name or sha1 cannot be resolved, we continue to exit with a successful error code, even though we didn't check what the user asked us to. This patch fixes both of these cases. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 7 +++++-- t/t1450-fsck.sh | 5 +++++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 75e836e2fd..bacc899a32 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -755,9 +755,11 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) if (!get_sha1(arg, sha1)) { struct object *obj = lookup_object(sha1); - /* Error is printed by lookup_object(). */ - if (!obj) + if (!obj) { + error("%s: object missing", sha1_to_hex(sha1)); + errors_found |= ERROR_OBJECT; continue; + } obj->used = 1; if (name_objects) @@ -768,6 +770,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) continue; } error("invalid parameter: expected sha1, got '%s'", arg); + errors_found |= ERROR_OBJECT; } /* diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 4d1c3ba664..6b6db62c4e 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -611,4 +611,9 @@ test_expect_success 'fsck notices dangling objects' ' ) ' +test_expect_success 'fsck $name notices bogus $name' ' + test_must_fail git fsck bogus && + test_must_fail git fsck $_z40 +' + test_done From c3271a0e4715eb9c3f03dde4fdda23f50cc176c3 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Mon, 16 Jan 2017 16:34:21 -0500 Subject: [PATCH 5/9] fsck: do not fallback "git fsck " to "git fsck" Since fsck tries to continue as much as it can after seeing an error, we still do the reachability check even if some heads we were given on the command-line are bogus. But if _none_ of the heads is is valid, we fallback to checking all refs and the index, which is not what the user asked for at all. Instead of checking "heads", the number of successful heads we got, check "argc" (which we know only has non-options in it, because parse_options removed the others). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 2 +- t/t1450-fsck.sh | 11 +++++++++++ 2 files changed, 12 insertions(+), 1 deletion(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index bacc899a32..3d9ee310d2 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -778,7 +778,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) * default ones from .git/refs. We also consider the index file * in this case (ie this implies --cache). */ - if (!heads) { + if (!argc) { get_default_heads(); keep_cache_objects = 1; } diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 6b6db62c4e..509d69c90e 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -616,4 +616,15 @@ test_expect_success 'fsck $name notices bogus $name' ' test_must_fail git fsck $_z40 ' +test_expect_success 'bogus head does not fallback to all heads' ' + # set up a case that will cause a reachability complaint + echo to-be-deleted >foo && + git add foo && + blob=$(git rev-parse :foo) && + test_when_finished "git rm --cached foo" && + remove_object $blob && + test_must_fail git fsck $_z40 >out 2>&1 && + ! grep $blob out +' + test_done From c2d17b3b6ef42b20b9da294a7a920491b907c503 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Mon, 16 Jan 2017 16:34:57 -0500 Subject: [PATCH 6/9] fsck: check HAS_OBJ more consistently There are two spots that call lookup_object() and assume that a non-NULL result means we have the object: 1. When we're checking the objects given to us by the user on the command line. 2. When we're checking if a reflog entry is valid. This generally follows fsck's mental model that we will have looked at and loaded a "struct object" for each object in the repository. But it misses one case: if another object _mentioned_ an object, but we didn't actually parse it or verify that it exists, it will still have a struct. It's not clear if this is a triggerable bug or not. Certainly the later parts of the reachability check need to be careful of this, and do so by checking the HAS_OBJ flag. But both of these steps happen before we start traversing, so probably we won't have followed any links yet. Still, it's easy enough to be defensive here. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 3d9ee310d2..57f529b41b 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -398,7 +398,7 @@ static void fsck_handle_reflog_sha1(const char *refname, unsigned char *sha1, if (!is_null_sha1(sha1)) { obj = lookup_object(sha1); - if (obj) { + if (obj && (obj->flags & HAS_OBJ)) { if (timestamp && name_objects) add_decoration(fsck_walk_options.object_names, obj, @@ -755,7 +755,7 @@ int cmd_fsck(int argc, const char **argv, const char *prefix) if (!get_sha1(arg, sha1)) { struct object *obj = lookup_object(sha1); - if (!obj) { + if (!obj || !(obj->flags & HAS_OBJ)) { error("%s: object missing", sha1_to_hex(sha1)); errors_found |= ERROR_OBJECT; continue; From c20d4d702f13e6bd4e4c8757989bed62a75e2cfa Mon Sep 17 00:00:00 2001 From: Jeff King Date: Tue, 24 Jan 2017 08:27:49 -0500 Subject: [PATCH 7/9] t1450: use "mv -f" within loose object directory The loose objects are created with mode 0444. That doesn't prevent them being overwritten by rename(), but some versions of "mv" will be extra careful and prompt the user, even without "-i". Reportedly macOS does this, at least in the Travis builds. The prompt reads from /dev/null, defaulting to "no", and the object isn't moved. Then to make matters even more interesting, it still returns "0" and the rest of the test proceeds, but with a broken setup. We can work around it by using "mv -f" to override the prompt. This should work as it's already used in t5504 for the same purpose. Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- t/t1450-fsck.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh index 509d69c90e..d9cd99f2cb 100755 --- a/t/t1450-fsck.sh +++ b/t/t1450-fsck.sh @@ -536,7 +536,7 @@ test_expect_success 'fsck --connectivity-only' ' # free to examine the type if it chooses. empty=.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 && blob=$(echo unrelated | git hash-object -w --stdin) && - mv $(sha1_file $blob) $empty && + mv -f $(sha1_file $blob) $empty && test_must_fail git fsck --strict && git fsck --strict --connectivity-only && From 97ca7ca8ba3acbc7166fb7ff40819696ed20e8c6 Mon Sep 17 00:00:00 2001 From: Jeff King Date: Wed, 25 Jan 2017 23:11:00 -0500 Subject: [PATCH 8/9] fsck: move typename() printing to its own function When an object has a problem, we mention its type. But we do so by feeding the result of typename() directly to fprintf(). This is potentially dangerous because typename() can return NULL for some type values (like OBJ_NONE). It's doubtful that this can be triggered in practice with the current code, so this is probably not fixing a bug. But it future-proofs us against modifications that make things like OBJ_NONE more likely (and gives future patches a central point to handle them). Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 29 ++++++++++++++++++++--------- 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 57f529b41b..3d5ced2d3a 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -56,6 +56,17 @@ static const char *describe_object(struct object *obj) return buf.buf; } +static const char *printable_type(struct object *obj) +{ + const char *ret; + + ret = typename(obj->type); + if (!ret) + ret = "unknown"; + + return ret; +} + static int fsck_config(const char *var, const char *value, void *cb) { if (strcmp(var, "fsck.skiplist") == 0) { @@ -83,7 +94,7 @@ static void objreport(struct object *obj, const char *msg_type, const char *err) { fprintf(stderr, "%s in %s %s: %s\n", - msg_type, typename(obj->type), describe_object(obj), err); + msg_type, printable_type(obj), describe_object(obj), err); } static int objerror(struct object *obj, const char *err) @@ -114,7 +125,7 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt if (!obj) { /* ... these references to parent->fld are safe here */ printf("broken link from %7s %s\n", - typename(parent->type), describe_object(parent)); + printable_type(parent), describe_object(parent)); printf("broken link from %7s %s\n", (type == OBJ_ANY ? "unknown" : typename(type)), "unknown"); errors_found |= ERROR_REACHABLE; @@ -131,9 +142,9 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt if (!(obj->flags & HAS_OBJ)) { if (parent && !has_object_file(&obj->oid)) { printf("broken link from %7s %s\n", - typename(parent->type), describe_object(parent)); + printable_type(parent), describe_object(parent)); printf(" to %7s %s\n", - typename(obj->type), describe_object(obj)); + printable_type(obj), describe_object(obj)); errors_found |= ERROR_REACHABLE; } return 1; @@ -205,7 +216,7 @@ static void check_reachable_object(struct object *obj) if (!(obj->flags & HAS_OBJ)) { if (has_sha1_pack(obj->oid.hash)) return; /* it is in pack - forget about it */ - printf("missing %s %s\n", typename(obj->type), + printf("missing %s %s\n", printable_type(obj), describe_object(obj)); errors_found |= ERROR_REACHABLE; return; @@ -231,7 +242,7 @@ static void check_unreachable_object(struct object *obj) * since this is something that is prunable. */ if (show_unreachable) { - printf("unreachable %s %s\n", typename(obj->type), + printf("unreachable %s %s\n", printable_type(obj), describe_object(obj)); return; } @@ -250,7 +261,7 @@ static void check_unreachable_object(struct object *obj) */ if (!obj->used) { if (show_dangling) - printf("dangling %s %s\n", typename(obj->type), + printf("dangling %s %s\n", printable_type(obj), describe_object(obj)); if (write_lost_and_found) { char *filename = git_pathdup("lost-found/%s/%s", @@ -324,7 +335,7 @@ static int fsck_obj(struct object *obj) if (verbose) fprintf(stderr, "Checking %s %s\n", - typename(obj->type), describe_object(obj)); + printable_type(obj), describe_object(obj)); if (fsck_walk(obj, NULL, &fsck_obj_options)) objerror(obj, "broken links"); @@ -350,7 +361,7 @@ static int fsck_obj(struct object *obj) struct tag *tag = (struct tag *) obj; if (show_tags && tag->tagged) { - printf("tagged %s %s", typename(tag->tagged->type), + printf("tagged %s %s", printable_type(tag->tagged), describe_object(tag->tagged)); printf(" (%s) in %s\n", tag->tag, describe_object(&tag->object)); From a2b22854bd5f252cd036636091a1d30141c35bce Mon Sep 17 00:00:00 2001 From: Jeff King Date: Wed, 25 Jan 2017 23:12:07 -0500 Subject: [PATCH 9/9] fsck: lazily load types under --connectivity-only The recent fixes to "fsck --connectivity-only" load all of the objects with their correct types. This keeps the connectivity-only code path close to the regular one, but it also introduces some unnecessary inefficiency. While getting the type of an object is cheap compared to actually opening and parsing the object (as the non-connectivity-only case would do), it's still not free. For reachable non-blob objects, we end up having to parse them later anyway (to see what they point to), making our type lookup here redundant. For unreachable objects, we might never hit them at all in the reachability traversal, making the lookup completely wasted. And in some cases, we might have quite a few unreachable objects (e.g., when alternates are used for shared object storage between repositories, it's normal for there to be objects reachable from other repositories but not the one running fsck). The comment in mark_object_for_connectivity() claims two benefits to getting the type up front: 1. We need to know the types during fsck_walk(). (And not explicitly mentioned, but we also need them when printing the types of broken or dangling commits). We can address this by lazy-loading the types as necessary. Most objects never need this lazy-load at all, because they fall into one of these categories: a. Reachable from our tips, and are coerced into the correct type as we traverse (e.g., a parent link will call lookup_commit(), which converts OBJ_NONE to OBJ_COMMIT). b. Unreachable, but not at the tip of a chunk of unreachable history. We only mention the tips as "dangling", so an unreachable commit which links to hundreds of other objects needs only report the type of the tip commit. 2. It serves as a cross-check that the coercion in (1a) is correct (i.e., we'll complain about a parent link that points to a blob). But we get most of this for free already, because right after coercing, we'll parse any non-blob objects. So we'd notice then if we expected a commit and got a blob. The one exception is when we expect a blob, in which case we never actually read the object contents. So this is a slight weakening, but given that the whole point of --connectivity-only is to sacrifice some data integrity checks for speed, this seems like an acceptable tradeoff. Here are before and after timings for an extreme case with ~5M reachable objects and another ~12M unreachable (it's the torvalds/linux repository on GitHub, connected to shared storage for all of the other kernel forks): [before] $ time git fsck --no-dangling --connectivity-only real 3m4.323s user 1m25.121s sys 1m38.710s [after] $ time git fsck --no-dangling --connectivity-only real 0m51.497s user 0m49.575s sys 0m1.776s Signed-off-by: Jeff King Signed-off-by: Junio C Hamano --- builtin/fsck.c | 58 ++++++-------------------------------------------- fsck.c | 4 ++++ 2 files changed, 11 insertions(+), 51 deletions(-) diff --git a/builtin/fsck.c b/builtin/fsck.c index 3d5ced2d3a..140357b6fc 100644 --- a/builtin/fsck.c +++ b/builtin/fsck.c @@ -60,6 +60,12 @@ static const char *printable_type(struct object *obj) { const char *ret; + if (obj->type == OBJ_NONE) { + enum object_type type = sha1_object_info(obj->oid.hash, NULL); + if (type > 0) + object_as_type(obj, type, 0); + } + ret = typename(obj->type); if (!ret) ret = "unknown"; @@ -595,57 +601,7 @@ static int fsck_cache_tree(struct cache_tree *it) static void mark_object_for_connectivity(const unsigned char *sha1) { - struct object *obj = lookup_object(sha1); - - /* - * Setting the object type here isn't strictly necessary for a - * connectivity check. In most cases, our walk will expect a certain - * type (e.g., a tree referencing a blob) and will use lookup_blob() to - * assign the type. But doing it here has two advantages: - * - * 1. When the fsck_walk code looks at objects that _don't_ come from - * links (e.g., the tip of a ref), it may complain about the - * "unknown object type". - * - * 2. This serves as a nice cross-check that the graph links are - * sane. So --connectivity-only does not check that the bits of - * blobs are not corrupted, but it _does_ check that 100644 tree - * entries point to blobs, and so forth. - * - * Unfortunately we can't just use parse_object() here, because the - * whole point of --connectivity-only is to avoid reading the object - * data more than necessary. - */ - if (!obj || obj->type == OBJ_NONE) { - enum object_type type = sha1_object_info(sha1, NULL); - switch (type) { - case OBJ_BAD: - error("%s: unable to read object type", - sha1_to_hex(sha1)); - break; - case OBJ_COMMIT: - obj = (struct object *)lookup_commit(sha1); - break; - case OBJ_TREE: - obj = (struct object *)lookup_tree(sha1); - break; - case OBJ_BLOB: - obj = (struct object *)lookup_blob(sha1); - break; - case OBJ_TAG: - obj = (struct object *)lookup_tag(sha1); - break; - default: - error("%s: unknown object type %d", - sha1_to_hex(sha1), type); - } - - if (!obj || obj->type == OBJ_NONE) { - errors_found |= ERROR_OBJECT; - return; - } - } - + struct object *obj = lookup_unknown_object(sha1); obj->flags |= HAS_OBJ; } diff --git a/fsck.c b/fsck.c index 4a3069e204..939792752b 100644 --- a/fsck.c +++ b/fsck.c @@ -458,6 +458,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options) { if (!obj) return -1; + + if (obj->type == OBJ_NONE) + parse_object(obj->oid.hash); + switch (obj->type) { case OBJ_BLOB: return 0;