Commit graph

20 commits

Author SHA1 Message Date
Junio C Hamano f424d7c33d Merge branch 'ps/reftable-styles'
Code clean-up in various reftable code paths.

* ps/reftable-styles:
  reftable/record: improve semantics when initializing records
  reftable/merged: refactor initialization of iterators
  reftable/merged: refactor seeking of records
  reftable/stack: use `size_t` to track stack length
  reftable/stack: use `size_t` to track stack slices during compaction
  reftable/stack: index segments with `size_t`
  reftable/stack: fix parameter validation when compacting range
  reftable: introduce macros to allocate arrays
  reftable: introduce macros to grow arrays
2024-02-12 13:16:10 -08:00
Patrick Steinhardt b4ff12c8ee reftable: introduce macros to allocate arrays
Similar to the preceding commit, let's carry over macros to allocate
arrays with `REFTABLE_ALLOC_ARRAY()` and `REFTABLE_CALLOC_ARRAY()`. This
requires us to change the signature of `reftable_calloc()`, which only
takes a single argument right now and thus puts the burden on the caller
to calculate the final array's size. This is a net improvement though as
it means that we can now provide proper overflow checks when multiplying
the array size with the member size.

Convert callsites of `reftable_calloc()` to the new signature and start
using the new macros where possible.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-06 12:10:08 -08:00
Patrick Steinhardt e7485601ca reftable/writer: fix writing multi-level indices
When finishing a section we will potentially write an index that makes
it more efficient to look up relevant blocks. The index records written
will encode, for each block of the indexed section, what the offset of
that block is as well as the last key of that block. Thus, the reader
would iterate through the index records to find the first key larger or
equal to the wanted key and then use the encoded offset to look up the
desired block.

When there are a lot of blocks to index though we may end up writing
multiple index blocks, too. To not require a linear search across all
index blocks we instead end up writing a multi-level index. Instead of
referring to the block we are after, an index record may point to
another index block. The reader will then access the highest-level index
and follow down the chain of index blocks until it hits the sought-after
block.

It has been observed though that it is impossible to seek ref records of
the last ref block when using a multi-level index. While the multi-level
index exists and looks fine for most of the part, the highest-level
index was missing an index record pointing to the last block of the next
index. Thus, every additional level made more refs become unseekable at
the end of the ref section.

The root cause is that we are not flushing the last block of the current
level once done writing the level. Consequently, it wasn't recorded in
the blocks that need to be indexed by the next-higher level and thus we
forgot about it.

Fix this bug by flushing blocks after we have written all index records.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-02-01 11:11:32 -08:00
John Cai 1df18a1c9a reftable: honor core.fsync
While the reffiles backend honors configured fsync settings, the
reftable backend does not. Address this by fsyncing reftable files using
the write-or-die api's fsync_component() in two places: when we
add additional entries into the table, and when we close the reftable
writer.

This commits adds a flush function pointer as a new member of
reftable_writer because we are not sure that the first argument to the
*write function pointer always contains a file descriptor. In the case of
strbuf_add_void, the first argument is a buffer. This way, we can pass
in a corresponding flush function that knows how to flush depending on
which writer is being used.

This patch does not contain tests as they will need to wait for another
patch to start to exercise the reftable backend. At that point, the
tests will be added to observe that fsyncs are happening when the
reftable is in use.

Signed-off-by: John Cai <johncai86@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-23 13:45:27 -08:00
Junio C Hamano 481d69dd63 Merge branch 'ps/reftable-fixes-and-optims'
More fixes and optimizations to the reftable backend.

* ps/reftable-fixes-and-optims:
  reftable/merged: transfer ownership of records when iterating
  reftable/merged: really reuse buffers to compute record keys
  reftable/record: store "val2" hashes as static arrays
  reftable/record: store "val1" hashes as static arrays
  reftable/record: constify some parts of the interface
  reftable/writer: fix index corruption when writing multiple indices
  reftable/stack: do not auto-compact twice in `reftable_stack_add()`
  reftable/stack: do not overwrite errors when compacting
2024-01-16 10:11:57 -08:00
Junio C Hamano 492ee03f60 Merge branch 'en/header-cleanup'
Remove unused header "#include".

* en/header-cleanup:
  treewide: remove unnecessary includes in source files
  treewide: add direct includes currently only pulled in transitively
  trace2/tr2_tls.h: remove unnecessary include
  submodule-config.h: remove unnecessary include
  pkt-line.h: remove unnecessary include
  line-log.h: remove unnecessary include
  http.h: remove unnecessary include
  fsmonitor--daemon.h: remove unnecessary includes
  blame.h: remove unnecessary includes
  archive.h: remove unnecessary include
  treewide: remove unnecessary includes in source files
  treewide: remove unnecessary includes from header files
2024-01-08 14:05:15 -08:00
Patrick Steinhardt b31e3cc620 reftable/record: store "val2" hashes as static arrays
Similar to the preceding commit, convert ref records of type "val2" to
store their object IDs in static arrays instead of allocating them for
every single record.

We're using the same benchmark as in the preceding commit, with `git
show-ref --quiet` in a repository with ~350k refs. This time around
though the effects aren't this huge. Before:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 1,419,040 allocs, 1,418,847 frees, 62,153,868 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,163 bytes in 193 blocks
      total heap usage: 1,410,148 allocs, 1,409,955 frees, 61,976,068 bytes allocated

This is because "val2"-type records are typically only stored for peeled
tags, and the number of annotated tags in the benchmark repository is
rather low. Still, it can be seen that this change leads to a reduction
of allocations overall, even if only a small one.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-03 09:54:21 -08:00
Patrick Steinhardt 7af607c58d reftable/record: store "val1" hashes as static arrays
When reading ref records of type "val1", we store its object ID in an
allocated array. This results in an additional allocation for every
single ref record we read, which is rather inefficient especially when
iterating over refs.

Refactor the code to instead use an embedded array of `GIT_MAX_RAWSZ`
bytes. While this means that `struct ref_record` is bigger now, we
typically do not store all refs in an array anyway and instead only
handle a limited number of records at the same point in time.

Using `git show-ref --quiet` in a repository with ~350k refs this leads
to a significant drop in allocations. Before:

    HEAP SUMMARY:
        in use at exit: 21,098 bytes in 192 blocks
      total heap usage: 2,116,683 allocs, 2,116,491 frees, 76,098,060 bytes allocated

After:

    HEAP SUMMARY:
        in use at exit: 21,098 bytes in 192 blocks
      total heap usage: 1,419,031 allocs, 1,418,839 frees, 62,145,036 bytes allocated

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-03 09:54:20 -08:00
Patrick Steinhardt ddac965965 reftable/writer: fix index corruption when writing multiple indices
Each reftable may contain multiple types of blocks for refs, objects and
reflog records, where each of these may have an index that makes it more
efficient to find the records. It was observed that the index for log
records can become corrupted under certain circumstances, where the
first entry of the index points into the object index instead of to the
log records.

As it turns out, this corruption can occur whenever we write a log index
as well as at least one additional index. Writing records and their index
is basically a two-step process:

  1. We write all blocks for the corresponding record. Each block that
     gets written is added to a list of blocks to index.

  2. Once all blocks were written we finish the section. If at least two
     blocks have been added to the list of blocks to index then we will
     now write the index for those blocks and flush it, as well.

When we have a very large number of blocks then we may decide to write a
multi-level index, which is why we also keep track of the list of the
index blocks in the same way as we previously kept track of the blocks
to index.

Now when we have finished writing all index blocks we clear the index
and flush the last block to disk. This is done in the wrong order though
because flushing the block to disk will re-add it to the list of blocks
to be indexed. The result is that the next section we are about to write
will have an entry in the list of blocks to index that points to the
last block of the preceding section's index, which will corrupt the log
index.

Fix this corruption by clearing the index after having written the last
block.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-01-03 09:54:20 -08:00
Elijah Newren eea0e59ffb treewide: remove unnecessary includes in source files
Each of these were checked with
   gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).

...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file.  These cases were:
  * builtin/credential-cache.c
  * builtin/pull.c
  * builtin/send-pack.c

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-26 12:04:31 -08:00
Patrick Steinhardt 9abda98149 reftable/stack: fix use of unseeded randomness
When writing a new reftable stack, Git will first create the stack with
a random suffix so that concurrent updates will not try to write to the
same file. This random suffix is computed via a call to rand(3P). But we
never seed the function via srand(3P), which means that the suffix is in
fact always the same.

Fix this bug by using `git_rand()` instead, which does not need to be
initialized. While this function is likely going to be slower depending
on the platform, this slowness should not matter in practice as we only
use it when writing a new reftable stack.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2023-12-11 07:23:16 -08:00
Han-Wen Nienhuys 73a4c188b7 reftable: rename writer_stats to reftable_writer_stats
This function is part of the reftable API, so it should use the
reftable_ prefix

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-23 13:36:26 -08:00
Han-Wen Nienhuys 3c443a02a9 reftable: add test for length of disambiguating prefix
The ID => ref map is trimming object IDs to a disambiguating prefix.
Check that we are computing their length correctly.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-23 13:36:26 -08:00
Han-Wen Nienhuys b4007fcc6f reftable: ensure that obj_id_len is >= 2 on writing
When writing the same hash many times, we might decide to use a
length-1 object ID prefix for the ObjectID => ref table, which is out
of spec.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-23 13:36:26 -08:00
Han-Wen Nienhuys 1407659110 reftable: add a test that verifies that writing empty keys fails
Empty keys can only be written as ref records with empty names. The
log record has a logical timestamp in the key, so the key is never
empty.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-23 13:36:26 -08:00
Junio C Hamano 34230514b8 Merge branch 'hn/reftable-coverity-fixes'
Problems identified by Coverity in the reftable code have been
corrected.

* hn/reftable-coverity-fixes:
  reftable: add print functions to the record types
  reftable: make reftable_record a tagged union
  reftable: remove outdated file reftable.c
  reftable: implement record equality generically
  reftable: make reftable-record.h function signatures const correct
  reftable: handle null refnames in reftable_ref_record_equal
  reftable: drop stray printf in readwrite_test
  reftable: order unittests by complexity
  reftable: all xxx_free() functions accept NULL arguments
  reftable: fix resource warning
  reftable: ignore remove() return value in stack_test.c
  reftable: check reftable_stack_auto_compact() return value
  reftable: fix resource leak blocksource.c
  reftable: fix resource leak in block.c error path
  reftable: fix OOB stack write in print functions
2022-02-16 15:14:28 -08:00
Han-Wen Nienhuys 6322511148 reftable: drop stray printf in readwrite_test
Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-20 11:31:53 -08:00
Han-Wen Nienhuys 24d4d38c0b reftable: fix resource leak in block.c error path
Add test coverage for corrupt zlib data. Fix memory leaks demonstrated by
unittest.

This problem was discovered by a Coverity scan.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-20 11:31:52 -08:00
Han-Wen Nienhuys 0dd44584ab reftable: signal overflow
reflog entries have unbounded size. In theory, each log ('g') block in reftable
can have an arbitrary size, so the format allows for arbitrarily sized reflog
messages. However, in the implementation, we are not scaling the log blocks up
with the message, and writing a large message fails.

This triggers a failure for reftable in t7006-pager.sh.

Until this is fixed more structurally, report an error from within the reftable
library for easier debugging.

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-23 12:28:34 -08:00
Han-Wen Nienhuys ffc97f1a9e reftable: reftable file level tests
With support for reading and writing files in place, we can construct files (in
memory) and attempt to read them back.

Because some sections of the format are optional (eg. indices, log entries), we
have to exercise this code using multiple sizes of input data

Signed-off-by: Han-Wen Nienhuys <hanwen@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-08 10:45:48 -07:00