git/contrib
Patrick Steinhardt 57db2a094d refs: introduce reftable backend
Due to scalability issues, Shawn Pearce has originally proposed a new
"reftable" format more than six years ago [1]. Initially, this new
format was implemented in JGit with promising results. Around two years
ago, we have then added the "reftable" library to the Git codebase via
a4bbd13be3 (Merge branch 'hn/reftable', 2021-12-15). With this we have
landed all the low-level code to read and write reftables. Notably
missing though was the integration of this low-level code into the Git
code base in the form of a new ref backend that ties all of this
together.

This gap is now finally closed by introducing a new "reftable" backend
into the Git codebase. This new backend promises to bring some notable
improvements to Git repositories:

  - It becomes possible to do truly atomic writes where either all refs
    are committed to disk or none are. This was not possible with the
    "files" backend because ref updates were split across multiple loose
    files.

  - The disk space required to store many refs is reduced, both compared
    to loose refs and packed-refs. This is enabled both by the reftable
    format being a binary format, which is more compact, and by prefix
    compression.

  - We can ignore filesystem-specific behaviour as ref names are not
    encoded via paths anymore. This means there is no need to handle
    case sensitivity on Windows systems or Unicode precomposition on
    macOS.

  - There is no need to rewrite the complete refdb anymore every time a
    ref is being deleted like it was the case for packed-refs. This
    means that ref deletions are now constant time instead of scaling
    linearly with the number of refs.

  - We can ignore file/directory conflicts so that it becomes possible
    to store both "refs/heads/foo" and "refs/heads/foo/bar".

  - Due to this property we can retain reflogs for deleted refs. We have
    previously been deleting reflogs together with their refs to avoid
    file/directory conflicts, which is not necessary anymore.

  - We can properly enumerate all refs. With the "files" backend it is
    not easily possible to distinguish between refs and non-refs because
    they may live side by side in the gitdir.

Not all of these improvements are realized with the current "reftable"
backend implementation. At this point, the new backend is supposed to be
a drop-in replacement for the "files" backend that is used by basically
all Git repositories nowadays. It strives for 1:1 compatibility, which
means that a user can expect the same behaviour regardless of whether
they use the "reftable" backend or the "files" backend for most of the
part.

Most notably, this means we artificially limit the capabilities of the
"reftable" backend to match the limits of the "files" backend. It is not
possible to create refs that would end up with file/directory conflicts,
we do not retain reflogs, we perform stricter-than-necessary checks.
This is done intentionally due to two main reasons:

  - It makes it significantly easier to land the "reftable" backend as
    tests behave the same. It would be tough to argue for each and every
    single test that doesn't pass with the "reftable" backend.

  - It ensures compatibility between repositories that use the "files"
    backend and repositories that use the "reftable" backend. Like this,
    hosters can migrate their repositories to use the "reftable" backend
    without causing issues for clients that use the "files" backend in
    their clones.

It is expected that these artificial limitations may eventually go away
in the long term.

Performance-wise things very much depend on the actual workload. The
following benchmarks compare the "files" and "reftable" backends in the
current version:

  - Creating N refs in separate transactions shows that the "files"
    backend is ~50% faster. This is not surprising given that creating a
    ref only requires us to create a single loose ref. The "reftable"
    backend will also perform auto compaction on updates. In real-world
    workloads we would likely also want to perform pack loose refs,
    which would likely change the picture.

        Benchmark 1: update-ref: create refs sequentially (refformat = files, refcount = 1)
          Time (mean ± σ):       2.1 ms ±   0.3 ms    [User: 0.6 ms, System: 1.7 ms]
          Range (min … max):     1.8 ms …   4.3 ms    133 runs

        Benchmark 2: update-ref: create refs sequentially (refformat = reftable, refcount = 1)
          Time (mean ± σ):       2.7 ms ±   0.1 ms    [User: 0.6 ms, System: 2.2 ms]
          Range (min … max):     2.4 ms …   2.9 ms    132 runs

        Benchmark 3: update-ref: create refs sequentially (refformat = files, refcount = 1000)
          Time (mean ± σ):      1.975 s ±  0.006 s    [User: 0.437 s, System: 1.535 s]
          Range (min … max):    1.969 s …  1.980 s    3 runs

        Benchmark 4: update-ref: create refs sequentially (refformat = reftable, refcount = 1000)
          Time (mean ± σ):      2.611 s ±  0.013 s    [User: 0.782 s, System: 1.825 s]
          Range (min … max):    2.597 s …  2.622 s    3 runs

        Benchmark 5: update-ref: create refs sequentially (refformat = files, refcount = 100000)
          Time (mean ± σ):     198.442 s ±  0.241 s    [User: 43.051 s, System: 155.250 s]
          Range (min … max):   198.189 s … 198.670 s    3 runs

        Benchmark 6: update-ref: create refs sequentially (refformat = reftable, refcount = 100000)
          Time (mean ± σ):     294.509 s ±  4.269 s    [User: 104.046 s, System: 190.326 s]
          Range (min … max):   290.223 s … 298.761 s    3 runs

  - Creating N refs in a single transaction shows that the "files"
    backend is significantly slower once we start to write many refs.
    The "reftable" backend only needs to update two files, whereas the
    "files" backend needs to write one file per ref.

        Benchmark 1: update-ref: create many refs (refformat = files, refcount = 1)
          Time (mean ± σ):       1.9 ms ±   0.1 ms    [User: 0.4 ms, System: 1.4 ms]
          Range (min … max):     1.8 ms …   2.6 ms    151 runs

        Benchmark 2: update-ref: create many refs (refformat = reftable, refcount = 1)
          Time (mean ± σ):       2.5 ms ±   0.1 ms    [User: 0.7 ms, System: 1.7 ms]
          Range (min … max):     2.4 ms …   3.4 ms    148 runs

        Benchmark 3: update-ref: create many refs (refformat = files, refcount = 1000)
          Time (mean ± σ):     152.5 ms ±   5.2 ms    [User: 19.1 ms, System: 133.1 ms]
          Range (min … max):   148.5 ms … 167.8 ms    15 runs

        Benchmark 4: update-ref: create many refs (refformat = reftable, refcount = 1000)
          Time (mean ± σ):      58.0 ms ±   2.5 ms    [User: 28.4 ms, System: 29.4 ms]
          Range (min … max):    56.3 ms …  72.9 ms    40 runs

        Benchmark 5: update-ref: create many refs (refformat = files, refcount = 1000000)
          Time (mean ± σ):     152.752 s ±  0.710 s    [User: 20.315 s, System: 131.310 s]
          Range (min … max):   152.165 s … 153.542 s    3 runs

        Benchmark 6: update-ref: create many refs (refformat = reftable, refcount = 1000000)
          Time (mean ± σ):     51.912 s ±  0.127 s    [User: 26.483 s, System: 25.424 s]
          Range (min … max):   51.769 s … 52.012 s    3 runs

  - Deleting a ref in a fully-packed repository shows that the "files"
    backend scales with the number of refs. The "reftable" backend has
    constant-time deletions.

        Benchmark 1: update-ref: delete ref (refformat = files, refcount = 1)
          Time (mean ± σ):       1.7 ms ±   0.1 ms    [User: 0.4 ms, System: 1.2 ms]
          Range (min … max):     1.6 ms …   2.1 ms    316 runs

        Benchmark 2: update-ref: delete ref (refformat = reftable, refcount = 1)
          Time (mean ± σ):       1.8 ms ±   0.1 ms    [User: 0.4 ms, System: 1.3 ms]
          Range (min … max):     1.7 ms …   2.1 ms    294 runs

        Benchmark 3: update-ref: delete ref (refformat = files, refcount = 1000)
          Time (mean ± σ):       2.0 ms ±   0.1 ms    [User: 0.5 ms, System: 1.4 ms]
          Range (min … max):     1.9 ms …   2.5 ms    287 runs

        Benchmark 4: update-ref: delete ref (refformat = reftable, refcount = 1000)
          Time (mean ± σ):       1.9 ms ±   0.1 ms    [User: 0.5 ms, System: 1.3 ms]
          Range (min … max):     1.8 ms …   2.1 ms    217 runs

        Benchmark 5: update-ref: delete ref (refformat = files, refcount = 1000000)
          Time (mean ± σ):     229.8 ms ±   7.9 ms    [User: 182.6 ms, System: 46.8 ms]
          Range (min … max):   224.6 ms … 245.2 ms    6 runs

        Benchmark 6: update-ref: delete ref (refformat = reftable, refcount = 1000000)
          Time (mean ± σ):       2.0 ms ±   0.0 ms    [User: 0.6 ms, System: 1.3 ms]
          Range (min … max):     2.0 ms …   2.1 ms    3 runs

  - Listing all refs shows no significant advantage for either of the
    backends. The "files" backend is a bit faster, but not by a
    significant margin. When repositories are not packed the "reftable"
    backend outperforms the "files" backend because the "reftable"
    backend performs auto-compaction.

        Benchmark 1: show-ref: print all refs (refformat = files, refcount = 1, packed = true)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   2.0 ms    1729 runs

        Benchmark 2: show-ref: print all refs (refformat = reftable, refcount = 1, packed = true)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   1.8 ms    1816 runs

        Benchmark 3: show-ref: print all refs (refformat = files, refcount = 1000, packed = true)
          Time (mean ± σ):       4.3 ms ±   0.1 ms    [User: 0.9 ms, System: 3.3 ms]
          Range (min … max):     4.1 ms …   4.6 ms    645 runs

        Benchmark 4: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = true)
          Time (mean ± σ):       4.5 ms ±   0.2 ms    [User: 1.0 ms, System: 3.3 ms]
          Range (min … max):     4.2 ms …   5.9 ms    643 runs

        Benchmark 5: show-ref: print all refs (refformat = files, refcount = 1000000, packed = true)
          Time (mean ± σ):      2.537 s ±  0.034 s    [User: 0.488 s, System: 2.048 s]
          Range (min … max):    2.511 s …  2.627 s    10 runs

        Benchmark 6: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = true)
          Time (mean ± σ):      2.712 s ±  0.017 s    [User: 0.653 s, System: 2.059 s]
          Range (min … max):    2.692 s …  2.752 s    10 runs

        Benchmark 7: show-ref: print all refs (refformat = files, refcount = 1, packed = false)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   1.9 ms    1834 runs

        Benchmark 8: show-ref: print all refs (refformat = reftable, refcount = 1, packed = false)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.4 ms …   2.0 ms    1840 runs

        Benchmark 9: show-ref: print all refs (refformat = files, refcount = 1000, packed = false)
          Time (mean ± σ):      13.8 ms ±   0.2 ms    [User: 2.8 ms, System: 10.8 ms]
          Range (min … max):    13.3 ms …  14.5 ms    208 runs

        Benchmark 10: show-ref: print all refs (refformat = reftable, refcount = 1000, packed = false)
          Time (mean ± σ):       4.5 ms ±   0.2 ms    [User: 1.2 ms, System: 3.3 ms]
          Range (min … max):     4.3 ms …   6.2 ms    624 runs

        Benchmark 11: show-ref: print all refs (refformat = files, refcount = 1000000, packed = false)
          Time (mean ± σ):     12.127 s ±  0.129 s    [User: 2.675 s, System: 9.451 s]
          Range (min … max):   11.965 s … 12.370 s    10 runs

        Benchmark 12: show-ref: print all refs (refformat = reftable, refcount = 1000000, packed = false)
          Time (mean ± σ):      2.799 s ±  0.022 s    [User: 0.735 s, System: 2.063 s]
          Range (min … max):    2.769 s …  2.836 s    10 runs

  - Printing a single ref shows no real difference between the "files"
    and "reftable" backends.

        Benchmark 1: show-ref: print single ref (refformat = files, refcount = 1)
          Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 0.4 ms, System: 1.0 ms]
          Range (min … max):     1.4 ms …   1.8 ms    1779 runs

        Benchmark 2: show-ref: print single ref (refformat = reftable, refcount = 1)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.4 ms …   2.5 ms    1753 runs

        Benchmark 3: show-ref: print single ref (refformat = files, refcount = 1000)
          Time (mean ± σ):       1.5 ms ±   0.1 ms    [User: 0.3 ms, System: 1.1 ms]
          Range (min … max):     1.4 ms …   1.9 ms    1840 runs

        Benchmark 4: show-ref: print single ref (refformat = reftable, refcount = 1000)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   2.0 ms    1831 runs

        Benchmark 5: show-ref: print single ref (refformat = files, refcount = 1000000)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   2.1 ms    1848 runs

        Benchmark 6: show-ref: print single ref (refformat = reftable, refcount = 1000000)
          Time (mean ± σ):       1.6 ms ±   0.1 ms    [User: 0.4 ms, System: 1.1 ms]
          Range (min … max):     1.5 ms …   2.1 ms    1762 runs

So overall, performance depends on the usecases. Except for many
sequential writes the "reftable" backend is roughly on par or
significantly faster than the "files" backend though. Given that the
"files" backend has received 18 years of optimizations by now this can
be seen as a win. Furthermore, we can expect that the "reftable" backend
will grow faster over time when attention turns more towards
optimizations.

The complete test suite passes, except for those tests explicitly marked
to require the REFFILES prerequisite. Some tests in t0610 are marked as
failing because they depend on still-in-flight bug fixes. Tests can be
run with the new backend by setting the GIT_TEST_DEFAULT_REF_FORMAT
environment variable to "reftable".

There is a single known conceptual incompatibility with the dumb HTTP
transport. As "info/refs" SHOULD NOT contain the HEAD reference, and
because the "HEAD" file is not valid anymore, it is impossible for the
remote client to figure out the default branch without changing the
protocol. This shortcoming needs to be handled in a subsequent patch
series.

As the reftable library has already been introduced a while ago, this
commit message will not go into the details of how exactly the on-disk
format works. Please refer to our preexisting technical documentation at
Documentation/technical/reftable for this.

[1]: https://public-inbox.org/git/CAJo=hJtyof=HRy=2sLP0ng0uZ4=S-DpZ5dR1aF+VHVETKG20OQ@mail.gmail.com/

Original-idea-by: Shawn Pearce <spearce@spearce.org>
Based-on-patch-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-07 08:28:37 -08:00
..
buildsystems Merge branch 'js/doc-unit-tests-with-cmake' 2023-12-09 16:37:47 -08:00
coccinelle config: pass kvi to die_bad_number() 2023-06-28 14:06:40 -07:00
completion Merge branch 'pb/complete-log-more' 2024-02-02 11:31:50 -08:00
contacts git-contacts: also recognise "Reported-by:" 2017-07-27 09:42:55 -07:00
credential Merge branch 'mh/credential-erase-improvements-more' 2023-08-28 09:51:16 -07:00
diff-highlight perl: bump the required Perl version to 5.8.1 from 5.8.0 2023-11-17 07:26:32 +09:00
emacs git{,-blame}.el: remove old bitrotting Emacs code 2018-04-16 17:25:49 +09:00
examples Merge branch 'bw/c-plus-plus' into ds/lazy-load-trees 2018-04-11 10:46:32 +09:00
fast-import import-tars: ignore the global PAX header 2020-03-24 14:39:47 -07:00
git-jump git-jump: admit to passing merge mode args to ls-files 2023-10-05 12:55:38 -07:00
git-shell-commands
hg-to-git hg-to-git: make it compatible with both python3 and python2 2019-09-18 12:03:05 -07:00
hooks multimail: stop shipping a copy 2021-06-11 13:35:19 +09:00
long-running-filter docs: warn about possible '=' in clean/smudge filter process values 2016-12-06 11:29:52 -08:00
mw-to-git Merge branch 'tz/send-email-negatable-options' 2023-12-09 16:37:51 -08:00
persistent-https docs/config: mention protocol implications of url.insteadOf 2017-06-01 10:07:10 +09:00
remote-helpers contrib: git-remote-{bzr,hg} placeholders don't need Python 2017-03-03 11:09:34 -08:00
stats
subtree subtree: fix split processing with multiple subtrees present 2024-01-25 10:56:34 -08:00
thunderbird-patch-inline contrib/thunderbird-patch-inline/appp.sh: use the $( ... ) construct for command substitution 2015-12-27 15:33:13 -08:00
update-unicode unicode_width.h: rename to use dash in file name 2018-04-11 18:11:00 +09:00
vscode vscode: improve tab size and wrapping 2022-06-27 15:37:44 -07:00
workdir refs: introduce reftable backend 2024-02-07 08:28:37 -08:00
coverage-diff.sh contrib: add coverage-diff script 2018-10-10 10:11:35 +09:00
git-resurrect.sh contrib/git-resurrect.sh: use hash-agnostic OID pattern 2020-10-08 11:48:56 -07:00
README doc: fix some typos, grammar and wording issues 2023-10-05 12:55:38 -07:00
remotes2config.sh
rerere-train.sh contrib/rerere-train: avoid useless gpg sign in training 2022-07-19 11:24:08 -07:00

Contributed Software

Although these pieces are available as part of the official git
source tree, they are in somewhat different status.  The
intention is to keep interesting tools around git here, maybe
even experimental ones, to give users an easier access to them,
and to give tools wider exposure, so that they can be improved
faster.

I am not expecting to touch these myself that much.  As far as
my day-to-day operation is concerned, these subdirectories are
owned by their respective primary authors.  I am willing to help
if users of these components and the contrib/ subtree "owners"
have technical/design issues to resolve, but the initiative to
fix and/or enhance things _must_ be on the side of the subtree
owners.  IOW, I won't be actively looking for bugs and rooms for
enhancements in them as the git maintainer -- I may only do so
just as one of the users when I want to scratch my own itch.  If
you have patches to things in contrib/ area, the patch should be
first sent to the primary author, and then the primary author
should ack and forward it to me (git pull request is nicer).
This is the same way as how I have been treating gitk, and to a
lesser degree various foreign SCM interfaces, so you know the
drill.

I expect things that start their life in the contrib/ area
to graduate out of contrib/ once they mature, either by becoming
projects on their own, or moving to the toplevel directory.  On
the other hand, I expect I'll be proposing removal of disused
and inactive ones from time to time.

If you have new things to add to this area, please first propose
it on the git mailing list, and after a list discussion proves
there is general interest (it does not have to be a
list-wide consensus for a tool targeted to a relatively narrow
audience -- for example I do not work with projects whose
upstream is svn, so I have no use for git-svn myself, but it is
of general interest for people who need to interoperate with SVN
repositories in a way git-svn works better than git-svnimport),
submit a patch to create a subdirectory of contrib/ and put your
stuff there.

-jc