Commit graph

2085 commits

Author SHA1 Message Date
Junio C Hamano c276c21da6 Merge branch 'ds/sparse-sparse-checkout'
"sparse-checkout" learns to work well with the sparse-index
feature.

* ds/sparse-sparse-checkout:
  sparse-checkout: integrate with sparse index
  p2000: add test for 'git sparse-checkout [add|set]'
  sparse-index: complete partial expansion
  sparse-index: partially expand directories
  sparse-checkout: --no-sparse-index needs a full index
  cache-tree: implement cache_tree_find_path()
  sparse-index: introduce partially-sparse indexes
  sparse-index: create expand_index()
  t1092: stress test 'git sparse-checkout set'
  t1092: refactor 'sparse-index contents' test
2022-06-03 14:30:35 -07:00
Junio C Hamano 83937e9592 Merge branch 'ns/batch-fsync'
Introduce a filesystem-dependent mechanism to optimize the way the
bits for many loose object files are ensured to hit the disk
platter.

* ns/batch-fsync:
  core.fsyncmethod: performance tests for batch mode
  t/perf: add iteration setup mechanism to perf-lib
  core.fsyncmethod: tests for batch mode
  test-lib-functions: add parsing helpers for ls-files and ls-tree
  core.fsync: use batch mode and sync loose objects by default on Windows
  unpack-objects: use the bulk-checkin infrastructure
  update-index: use the bulk-checkin infrastructure
  builtin/add: add ODB transaction around add_files_to_cache
  cache-tree: use ODB transaction around writing a tree
  core.fsyncmethod: batched disk flushes for loose-objects
  bulk-checkin: rebrand plug/unplug APIs as 'odb transactions'
  bulk-checkin: rename 'state' variable and separate 'plugged' boolean
2022-06-03 14:30:34 -07:00
Derrick Stolee 9fadb373dd sparse-index: introduce partially-sparse indexes
A future change will present a temporary, in-memory mode where the index
can both contain sparse directory entries but also not be completely
collapsed to the smallest possible sparse directories. This will be
necessary for modifying the sparse-checkout definition while using a
sparse index.

For now, convert the single-bit member 'sparse_index' in 'struct
index_state' to be a an 'enum sparse_index_mode' with three modes:

* INDEX_EXPANDED (0): No sparse directories exist. This is always the
  case for repositories that do not use cone-mode sparse-checkout.

* INDEX_COLLAPSED: Sparse directories may exist. Files outside the
  sparse-checkout cone are reduced to sparse directory entries whenever
  possible.

* INDEX_PARTIALLY_SPARSE: Sparse directories may exist. Some file
  entries outside the sparse-checkout cone may exist. Running
  convert_to_sparse() may further reduce those files to sparse directory
  entries.

The main reason to store this extra information is to allow
convert_to_sparse() to short-circuit when the index is already in
INDEX_EXPANDED mode but to actually do the necessary work when in
INDEX_PARTIALLY_SPARSE mode.

The INDEX_PARTIALLY_SPARSE mode will be used in an upcoming change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-05-23 11:08:21 -07:00
Junio C Hamano 8b28e2e2e4 Merge branch 'ds/midx-normalize-pathname-before-comparison'
The path taken by "git multi-pack-index" command from the end user
was compared with path internally prepared by the tool withut first
normalizing, which lead to duplicated paths not being noticed,
which has been corrected.

* ds/midx-normalize-pathname-before-comparison:
  cache: use const char * for get_object_directory()
  multi-pack-index: use --object-dir real path
  midx: use real paths in lookup_multi_pack_index()
2022-05-04 09:51:29 -07:00
Derrick Stolee 11f9e8de3d cache: use const char * for get_object_directory()
The get_object_directory() method returns the exact string stored at
the_repository->objects->odb->path. The return type of "char *" implies
that the caller must keep track of the buffer and free() it when
complete. This causes significant problems later when the ODB is
accessed.

Use "const char *" as the return type to avoid this confusion. There are
no current callers that care about the non-const definition.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-25 11:31:13 -07:00
Neeraj Singh 8a94d83349 core.fsync: use batch mode and sync loose objects by default on Windows
Git for Windows has defaulted to core.fsyncObjectFiles=true since
September 2017. We turn on syncing of loose object files with batch mode
in upstream Git so that we can get broad coverage of the new code
upstream.

We don't actually do fsyncs in the most of the test suite, since
GIT_TEST_FSYNC is set to 0. However, we do exercise all of the
surrounding batch mode code since GIT_TEST_FSYNC merely makes the
maybe_fsync wrapper always appear to succeed.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06 13:13:26 -07:00
Neeraj Singh c0f4752ed2 core.fsyncmethod: batched disk flushes for loose-objects
When adding many objects to a repo with `core.fsync=loose-object`,
the cost of fsync'ing each object file can become prohibitive.

One major source of the cost of fsync is the implied flush of the
hardware writeback cache within the disk drive. This commit introduces
a new `core.fsyncMethod=batch` option that batches up hardware flushes.
It hooks into the bulk-checkin odb-transaction functionality, takes
advantage of tmp-objdir, and uses the writeout-only support code.

When the new mode is enabled, we do the following for each new object:
1a. Create the object in a tmp-objdir.
2a. Issue a pagecache writeback request and wait for it to complete.

At the end of the entire transaction when unplugging bulk checkin:
1b. Issue an fsync against a dummy file to flush the log and hardware
   writeback cache, which should by now have seen the tmp-objdir writes.
2b. Rename all of the tmp-objdir files to their final names.
3b. When updating the index and/or refs, we assume that Git will issue
   another fsync internal to that operation. This is not the default
   today, but the user now has the option of syncing the index and there
   is a separate patch series to implement syncing of refs.

On a filesystem with a singular journal that is updated during name
operations (e.g. create, link, rename, etc), such as NTFS, HFS+, or XFS
we would expect the fsync to trigger a journal writeout so that this
sequence is enough to ensure that the user's data is durable by the time
the git command returns. This sequence also ensures that no object files
appear in the main object store unless they are fsync-durable.

Batch mode is only enabled if core.fsync includes loose-objects. If
the legacy core.fsyncObjectFiles setting is enabled, but core.fsync does
not include loose-objects, we will use file-by-file fsyncing.

In step (1a) of the sequence, the tmp-objdir is created lazily to avoid
work if no loose objects are ever added to the ODB. We use a tmp-objdir
to maintain the invariant that no loose-objects are visible in the main
ODB unless they are properly fsync-durable. This is important since
future ODB operations that try to create an object with specific
contents will silently drop the new data if an object with the target
hash exists without checking that the loose-object contents match the
hash. Only a full git-fsck would restore the ODB to a functional state
where dataloss doesn't occur.

In step (1b) of the sequence, we issue a fsync against a dummy file
created specifically for the purpose. This method has a little higher
cost than using one of the input object files, but makes adding new
callers of this mechanism easier, since we don't need to figure out
which object file is "last" or risk sharing violations by caching the fd
of the last object file.

_Performance numbers_:

Linux - Hyper-V VM running Kernel 5.11 (Ubuntu 20.04) on a fast SSD.
Mac - macOS 11.5.1 running on a Mac mini on a 1TB Apple SSD.
Windows - Same host as Linux, a preview version of Windows 11.

Adding 500 files to the repo with 'git add' Times reported in seconds.

object file syncing | Linux | Mac   | Windows
--------------------|-------|-------|--------
           disabled | 0.06  |  0.35 | 0.61
              fsync | 1.88  | 11.18 | 2.47
              batch | 0.15  |  0.41 | 1.53

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-04-06 13:13:01 -07:00
Junio C Hamano fca85986bb Merge branch 'ns/core-fsyncmethod' into ns/batch-fsync
* ns/core-fsyncmethod:
  configure.ac: fix HAVE_SYNC_FILE_RANGE definition
  core.fsyncmethod: correctly camel-case warning message
  core.fsync: fix incorrect expression for default configuration
  core.fsync: documentation and user-friendly aggregate options
  core.fsync: new option to harden the index
  core.fsync: add configuration parsing
  core.fsync: introduce granular fsync control infrastructure
  core.fsyncmethod: add writeout-only mode
  wrapper: make inclusion of Windows csprng header tightly scoped
2022-04-06 13:01:54 -07:00
Junio C Hamano 439c1e6d5d Merge branch 'jh/builtin-fsmonitor-part2'
Built-in fsmonitor (part 2).

* jh/builtin-fsmonitor-part2: (30 commits)
  t7527: test status with untracked-cache and fsmonitor--daemon
  fsmonitor: force update index after large responses
  fsmonitor--daemon: use a cookie file to sync with file system
  fsmonitor--daemon: periodically truncate list of modified files
  t/perf/p7519: add fsmonitor--daemon test cases
  t/perf/p7519: speed up test on Windows
  t/perf/p7519: fix coding style
  t/helper/test-chmtime: skip directories on Windows
  t/perf: avoid copying builtin fsmonitor files into test repo
  t7527: create test for fsmonitor--daemon
  t/helper/fsmonitor-client: create IPC client to talk to FSMonitor Daemon
  help: include fsmonitor--daemon feature flag in version info
  fsmonitor--daemon: implement handle_client callback
  compat/fsmonitor/fsm-listen-darwin: implement FSEvent listener on MacOS
  compat/fsmonitor/fsm-listen-darwin: add MacOS header files for FSEvent
  compat/fsmonitor/fsm-listen-win32: implement FSMonitor backend on Windows
  fsmonitor--daemon: create token-based changed path cache
  fsmonitor--daemon: define token-ids
  fsmonitor--daemon: add pathname classification
  fsmonitor--daemon: implement 'start' command
  ...
2022-04-04 10:56:24 -07:00
Junio C Hamano 27dd460799 Merge branch 'ns/core-fsyncmethod'
A couple of fix-up to a topic that is now in 'master'.

* ns/core-fsyncmethod:
  core.fsyncmethod: correctly camel-case warning message
  core.fsync: fix incorrect expression for default configuration
2022-04-04 10:56:22 -07:00
Neeraj Singh e5ec440c98 core.fsync: fix incorrect expression for default configuration
Commit b9f5d035 (core.fsync: documentation and user-friendly
aggregate options, 2022-03-15) introduced an incorrect value for
FSYNC_COMPONENTS_DEFAULT. We need an AND-NOT rather than OR-NOT.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-29 16:04:16 -07:00
Junio C Hamano 6e1a8952e9 Merge branch 'ps/fsync-refs'
Updates to refs traditionally weren't fsync'ed, but we can
configure using core.fsync variable to do so.

* ps/fsync-refs:
  core.fsync: new option to harden references
2022-03-25 16:38:25 -07:00
Junio C Hamano eb804cd405 Merge branch 'ns/core-fsyncmethod'
Replace core.fsyncObjectFiles with two new configuration variables,
core.fsync and core.fsyncMethod.

* ns/core-fsyncmethod:
  core.fsync: documentation and user-friendly aggregate options
  core.fsync: new option to harden the index
  core.fsync: add configuration parsing
  core.fsync: introduce granular fsync control infrastructure
  core.fsyncmethod: add writeout-only mode
  wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-25 16:38:24 -07:00
Jeff Hostetler 1e0ea5c431 fsmonitor: config settings are repository-specific
Move fsmonitor config settings to a new and opaque
`struct fsmonitor_settings` structure.  Add a lazily-loaded pointer
to this into `struct repo_settings`

Create an `enum fsmonitor_mode` type in `struct fsmonitor_settings` to
represent the state of fsmonitor.  This lets us represent which, if
any, fsmonitor provider (hook or IPC) is enabled.

Create `fsm_settings__get_*()` getters to lazily look up fsmonitor-
related config settings.

Get rid of the `core_fsmonitor` global variable.  Move the code to
lookup the existing `core.fsmonitor` config value into the fsmonitor
settings.

Create a hook pathname variable in `struct fsmonitor-settings` and
only set it when in hook mode.

Extend the definition of `core.fsmonitor` to be either a boolean
or a hook pathname.  When true, the builtin FSMonitor is used.
When false or unset, no FSMonitor (neither builtin nor hook) is
used.

The existing `core_fsmonitor` global variable was used to store the
pathname to the fsmonitor hook *and* it was used as a boolean to see
if fsmonitor was enabled.  This dual usage and global visibility leads
to confusion when we add the IPC-based provider.  So lets hide the
details in fsmonitor-settings.c and let it decide which provider to
use in the case of multiple settings.  This avoids cluttering up
repo-settings.c with these private details.

A future commit in builtin-fsmonitor series will add the ability to
disqualify worktrees for various reasons, such as being mounted from a
remote volume, where fsmonitor should not be started.  Having the
config settings hidden in fsmonitor-settings.c allows such worktree
restrictions to override the config values used.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-25 16:04:15 -07:00
Junio C Hamano 430883a70c Merge branch 'ab/object-file-api-updates'
Object-file API shuffling.

* ab/object-file-api-updates:
  object-file API: pass an enum to read_object_with_reference()
  object-file.c: add a literal version of write_object_file_prepare()
  object-file API: have hash_object_file() take "enum object_type"
  object API: rename hash_object_file_literally() to write_*()
  object-file API: split up and simplify check_object_signature()
  object API users + docs: check <0, not !0 with check_object_signature()
  object API docs: move check_object_signature() docs to cache.h
  object API: correct "buf" v.s. "map" mismatch in *.c and *.h
  object-file API: have write_object_file() take "enum object_type"
  object-file API: add a format_object_header() function
  object-file API: return "void", not "int" from hash_object_file()
  object-file.c: split up declaration of unrelated variables
2022-03-16 17:53:08 -07:00
Patrick Steinhardt bc22d845c4 core.fsync: new option to harden references
When writing both loose and packed references to disk we first create a
lockfile, write the updated values into that lockfile, and on commit we
rename the file into place. According to filesystem developers, this
behaviour is broken because applications should always sync data to disk
before doing the final rename to ensure data consistency [1][2][3]. If
applications fail to do this correctly, a hard crash of the machine can
easily result in corrupted on-disk data.

This kind of corruption can in fact be easily observed with Git when the
machine hard-resets shortly after writing references to disk. On
machines with ext4, this will likely lead to the "empty files" problem:
the file has been renamed, but its data has not been synced to disk. The
result is that the reference is corrupt, and in the worst case this can
lead to data loss.

Implement a new option to harden references so that users and admins can
avoid this scenario by syncing locked loose and packed references to
disk before we rename them into place.

[1]: https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
[2]: https://btrfs.wiki.kernel.org/index.php/FAQ (What are the crash guarantees of overwrite-by-rename)
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/ext4.rst (see auto_da_alloc)

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-15 13:30:58 -07:00
Junio C Hamano 0099792400 Merge branch 'ns/core-fsyncmethod' into ps/fsync-refs
* ns/core-fsyncmethod:
  core.fsync: documentation and user-friendly aggregate options
  core.fsync: new option to harden the index
  core.fsync: add configuration parsing
  core.fsync: introduce granular fsync control infrastructure
  core.fsyncmethod: add writeout-only mode
  wrapper: make inclusion of Windows csprng header tightly scoped
2022-03-15 13:30:37 -07:00
Neeraj Singh b9f5d0358d core.fsync: documentation and user-friendly aggregate options
This commit adds aggregate options for the core.fsync setting that are
more user-friendly. These options are specified in terms of 'levels of
safety', indicating which Git operations are considered to be sync
points for durability.

The new documentation is also included here in its entirety for ease of
review.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-15 12:32:55 -07:00
Neeraj Singh ba95e96d4c core.fsync: new option to harden the index
This commit introduces the new ability for the user to harden
the index. In the event of a system crash, the index must be
durable for the user to actually find a file that has been added
to the repo and then deleted from the working tree.

We use the presence of the COMMIT_LOCK flag and absence of the
alternate_index_output as a proxy for determining whether we're
updating the persistent index of the repo or some temporary
index. We don't sync these temporary indexes.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
Neeraj Singh 020406eaa5 core.fsync: introduce granular fsync control infrastructure
This commit introduces the infrastructure for the core.fsync
configuration knob. The repository components we want to sync
are identified by flags so that we can turn on or off syncing
for specific components.

If core.fsyncObjectFiles is set and the core.fsync configuration
also includes FSYNC_COMPONENT_LOOSE_OBJECT, we will fsync any
loose objects. This picks the strictest data integrity behavior
if core.fsync and core.fsyncObjectFiles are set to conflicting values.

This change introduces the currently unused fsync_component
helper, which will be used by a later patch that adds fsyncing to
the refs backend.

Actual configuration and documentation of the fsync components
list are in other patches in the series to separate review of
the underlying mechanism from the policy of how it's configured.

Helped-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
Neeraj Singh abf38abec2 core.fsyncmethod: add writeout-only mode
This commit introduces the `core.fsyncMethod` configuration
knob, which can currently be set to `fsync` or `writeout-only`.

The new writeout-only mode attempts to tell the operating system to
flush its in-memory page cache to the storage hardware without issuing a
CACHE_FLUSH command to the storage controller.

Writeout-only fsync is significantly faster than a vanilla fsync on
common hardware, since data is written to a disk-side cache rather than
all the way to a durable medium. Later changes in this patch series will
take advantage of this primitive to implement batching of hardware
flushes.

When git_fsync is called with FSYNC_WRITEOUT_ONLY, it may fail and the
caller is expected to do an ordinary fsync as needed.

On Apple platforms, the fsync system call does not issue a CACHE_FLUSH
directive to the storage controller. This change updates fsync to do
fcntl(F_FULLFSYNC) to make fsync actually durable. We maintain parity
with existing behavior on Apple platforms by setting the default value
of the new core.fsyncMethod option.

Signed-off-by: Neeraj Singh <neerajsi@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-10 15:10:22 -08:00
Junio C Hamano 82386b4496 Merge branch 'en/present-despite-skipped'
In sparse-checkouts, files mis-marked as missing from the working tree
could lead to later problems.  Such files were hard to discover, and
harder to correct.  Automatically detecting and correcting the marking
of such files has been added to avoid these problems.

* en/present-despite-skipped:
  repo_read_index: add config to expect files outside sparse patterns
  Accelerate clear_skip_worktree_from_present_files() by caching
  Update documentation related to sparsity and the skip-worktree bit
  repo_read_index: clear SKIP_WORKTREE bit from files present in worktree
  unpack-trees: fix accidental loss of user changes
  t1011: add testcase demonstrating accidental loss of user modifications
2022-03-09 13:38:23 -08:00
Elijah Newren ecc7c8841d repo_read_index: add config to expect files outside sparse patterns
Typically with sparse checkouts, we expect files outside the sparsity
patterns to be marked as SKIP_WORKTREE and be missing from the working
tree.  Sometimes this expectation would be violated however; including
in cases such as:
  * users grabbing files from elsewhere and writing them to the worktree
    (perhaps by editing a cached copy in an editor, copying/renaming, or
     even untarring)
  * various git commands having incomplete or no support for the
    SKIP_WORKTREE bit[1,2]
  * users attempting to "abort" a sparse-checkout operation with a
    not-so-early Ctrl+C (updating $GIT_DIR/info/sparse-checkout and the
    working tree is not atomic)[3].
When the SKIP_WORKTREE bit in the index did not reflect the presence of
the file in the working tree, it traditionally caused confusion and was
difficult to detect and recover from.  So, in a sparse checkout, since
af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14), Git automatically clears the SKIP_WORKTREE
bit at index read time for entries corresponding to files that are
present in the working tree.

There is another workflow, however, where it is expected that paths
outside the sparsity patterns appear to exist in the working tree and
that they do not lose the SKIP_WORKTREE bit, at least until they get
modified.  A Git-aware virtual file system[4] takes advantage of its
position as a file system driver to expose all files in the working
tree, fetch them on demand using partial clone on access, and tell Git
to pay attention to them on demand by updating the sparse checkout
pattern on writes.  This means that commands like "git status" only have
to examine files that have potentially been modified, whereas commands
like "ls" are able to show the entire codebase without requiring manual
updates to the sparse checkout pattern.

Thus since af6a51875a, Git with such Git-aware virtual file systems
unsets the SKIP_WORKTREE bit for all files and commands like "git
status" have to fetch and examine them all.

Introduce a configuration setting sparse.expectFilesOutsideOfPatterns to
allow limiting the tracked set of files to a small set once again.  A
Git-aware virtual file system or other application that wants to
maintain files outside of the sparse checkout can set this in a
repository to instruct Git not to check for the presence of
SKIP_WORKTREE files.  The setting defaults to false, so most users of
sparse checkout will still get the benefit of an automatically updating
index to recover from the variety of difficult issues detailed in
af6a51875a for paths with SKIP_WORKTREE set despite the path being
present.

[1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/
[2] The three long paragraphs in the middle of
    https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/
[3] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/
[4] such as the vfsd described in
https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-03-01 23:37:48 -08:00
Ævar Arnfjörð Bjarmason 6aea6baeb3 object-file API: pass an enum to read_object_with_reference()
Change the read_object_with_reference() function to take an "enum
object_type". It was not prepared to handle an arbitrary "const
char *type", as it was itself calling type_from_string().

Let's change the only caller that passes in user data to use
type_from_string(), and convert the rest to use e.g. "OBJ_TREE"
instead of "tree_type".

The "cat-file" caller is not on the codepath that
handles"--allow-unknown", so the type_from_string() there is safe. Its
use of type_from_string() doesn't functionally differ from that of the
pre-image.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25 17:16:32 -08:00
Ævar Arnfjörð Bjarmason 44439c1c58 object-file API: have hash_object_file() take "enum object_type"
Change the hash_object_file() function to take an "enum
object_type".

Since a preceding commit all of its callers are passing either
"{commit,tree,blob,tag}_type", or the result of a call to type_name(),
the parse_object() caller that would pass NULL is now using
stream_object_signature().

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25 17:16:32 -08:00
Ævar Arnfjörð Bjarmason 0f156dbb04 object-file API: split up and simplify check_object_signature()
Split up the check_object_signature() function into that non-streaming
version (it accepts an already filled "buf"), and a new
stream_object_signature() which will retrieve the object from storage,
and hash it on-the-fly.

All of the callers of check_object_signature() were effectively
calling two different functions, if we go by cyclomatic
complexity. I.e. they'd either take the early "if (map)" branch and
return early, or not. This has been the case since the "if (map)"
condition was added in 090ea12671 (parse_object: avoid putting whole
blob in core, 2012-03-07).

We can then further simplify the resulting check_object_signature()
function since only one caller wanted to pass a non-NULL "buf" and a
non-NULL "real_oidp". That "read_loose_object()" codepath used by "git
fsck" can instead use hash_object_file() followed by oideq().

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25 17:16:31 -08:00
Ævar Arnfjörð Bjarmason ee213de22d object API users + docs: check <0, not !0 with check_object_signature()
Change those users of the object API that misused
check_object_signature() by assuming it returned any non-zero when the
OID didn't match the expected value to check <0 instead. In practice
all of this code worked before, but it wasn't consistent with rest of
the users of the API.

Let's also clarify what the <0 return value means in API docs.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25 17:16:31 -08:00
Ævar Arnfjörð Bjarmason cdcaaec9a6 object API docs: move check_object_signature() docs to cache.h
Move the API documentation for check_object_signature() to cache.h,
where its prototype is declared. This is in preparation for adding a
companion function.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-25 17:16:31 -08:00
Junio C Hamano 0a01df08c0 Merge branch 'ab/date-mode-release'
Plug (some) memory leaks around parse_date_format().

* ab/date-mode-release:
  date API: add and use a date_mode_release()
  date API: add basic API docs
  date API: provide and use a DATE_MODE_INIT
  date API: create a date.h, split from cache.h
  cache.h: remove always unused show_date_human() declaration
2022-02-25 15:47:36 -08:00
Junio C Hamano 00e38ba6d8 Merge branch 'ab/auto-detect-zlib-compress2'
The build procedure has been taught to notice older version of zlib
and enable our replacement uncompress2() automatically.

* ab/auto-detect-zlib-compress2:
  compat: auto-detect if zlib has uncompress2()
2022-02-16 15:14:30 -08:00
Ævar Arnfjörð Bjarmason 88c7b4c3c8 date API: create a date.h, split from cache.h
Move the declaration of the date.c functions from cache.h, and adjust
the relevant users to include the new date.h header.

The show_ident_date() function belonged in pretty.h (it's defined in
pretty.c), its two users outside of pretty.c didn't strictly need to
include pretty.h, as they get it indirectly, but let's add it to them
anyway.

Similarly, the change to "builtin/{fast-import,show-branch,tag}.c"
isn't needed as far as the compiler is concerned, but since they all
use the "DATE_MODE()" macro we now define in date.h, let's have them
include it.

We could simply include this new header in "cache.h", but as this
change shows these functions weren't common enough to warrant
including in it in the first place. By moving them out of cache.h
changes to this API will no longer cause a (mostly) full re-build of
the project when "make" is run.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16 09:40:00 -08:00
Ævar Arnfjörð Bjarmason f6c71f81f9 cache.h: remove always unused show_date_human() declaration
There has never been a show_date_human() function on the "master"
branch in git.git. This declaration was added in b841d4ff43 (Add
`human` format to test-tool, 2019-01-28).

A look at the ML history reveals that it was leftover cruft from an
earlier version of that commit[1].

1. https://lore.kernel.org/git/20190118061805.19086-5-ischis2@cox.net/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-02-16 09:40:00 -08:00
Junio C Hamano ee52b35e50 Merge branch 'ms/update-index-racy'
"git update-index --refresh" has been taught to deal better with
racy timestamps (just like "git status" already does).

* ms/update-index-racy:
  update-index: refresh should rewrite index in case of racy timestamps
  t7508: add tests capturing racy timestamp handling
  t7508: fix bogus mtime verification
  test-lib: introduce API for verifying file mtime
2022-02-05 09:42:32 -08:00
Junio C Hamano 008028a910 Merge branch 'ab/cat-file'
Assorted updates to "git cat-file", especially "-h".

* ab/cat-file:
  cat-file: s/_/-/ in typo'd usage_msg_optf() message
  cat-file: don't whitespace-pad "(...)" in SYNOPSIS and usage output
  cat-file: use GET_OID_ONLY_TO_DIE in --(textconv|filters)
  object-name.c: don't have GET_OID_ONLY_TO_DIE imply *_QUIETLY
  cat-file: correct and improve usage information
  cat-file: fix remaining usage bugs
  cat-file: make --batch-all-objects a CMDMODE
  cat-file: move "usage" variable to cmd_cat_file()
  cat-file docs: fix SYNOPSIS and "-h" output
  parse-options API: add a usage_msg_optf()
  cat-file tests: test messaging on bad objects/paths
  cat-file tests: test bad usage
2022-02-05 09:42:31 -08:00
Ævar Arnfjörð Bjarmason 07564773c2 compat: auto-detect if zlib has uncompress2()
We have a copy of uncompress2() implementation in compat/ so that we
can build with an older version of zlib that lack the function, and
the build procedure selects if it is used via the NO_UNCOMPRESS2
$(MAKE) variable.  This is yet another "annoying" knob the porters
need to tweak on platforms that are not common enough to have the
default set in the config.mak.uname file.

Attempt to instead ask the system header <zlib.h> to decide if we
need the compatibility implementation.  This is a deviation from the
way we have been handling the "compatiblity" features so far, and if
it can be done cleanly enough, it could work as a model for features
that need compatibility definition we discover in the future.  With
that goal in mind, avoid expedient but ugly hacks, like shoving the
code that is conditionally compiled into an unrelated .c file, which
may not work in future cases---instead, take an approach that uses a
file that is independently compiled and stands on its own.

Compile and link compat/zlib-uncompress2.c file unconditionally, but
conditionally hide the implementation behind #if/#endif when zlib
version is 1.2.9 or newer, and unconditionally archive the resulting
object file in the libgit.a to be picked up by the linker.

There are a few things to note in the shape of the code base after
this change:

 - We no longer use NO_UNCOMPRESS2 knob; if the system header
   <zlib.h> claims a version that is more cent than the library
   actually is, this would break, but it is easy to add it back when
   we find such a system.

 - The object file compat/zlib-uncompress2.o is always compiled and
   archived in libgit.a, just like a few other compat/ object files
   already are.

 - The inclusion of <zlib.h> is done in <git-compat-util.h>; we used
   to do so from <cache.h> which includes <git-compat-util.h> as the
   first thing it does, so from the *.c codes, there is no practical
   change.

 - Until objects in libgit.a that is already used gains a reference
   to the function, the reftable code will be the only one that
   wants it, so libgit.a on the linker command line needs to appear
   once more at the end to satisify the mutual dependency.

 - Beat found a trick used by OpenSSL to avoid making the
   conditionally-compiled object truly empty (apparently because
   they had to deal with compilers that do not want to see an
   effectively empty input file).  Our compat/zlib-uncompress2.c
   file borrows the same trick for portabilty.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Helped-by: Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-26 09:05:55 -08:00
Junio C Hamano 453cef7455 Merge branch 'ma/header-dup-cleanup'
Code clean-up.

* ma/header-dup-cleanup:
  cache.h: drop duplicate `ensure_full_index()` declaration
2022-01-12 15:11:43 -08:00
Martin Ågren 97d6fb5a1f cache.h: drop duplicate ensure_full_index() declaration
There are two identical declarations of `ensure_full_index()` in
cache.h.

Commit 3964fc2aae ("sparse-index: add guard to ensure full index",
2021-03-30) provided an empty implementation of `ensure_full_index()`,
declaring it in a new file sparse-index.h. When commit 4300f8442a
("sparse-index: implement ensure_full_index()", 2021-03-30) fleshed out
the implementation, it added an identical declaration to cache.h.

Then 118a2e8bde ("cache: move ensure_full_index() to cache.h",
2021-04-01) favored having the declaration in cache.h. Because of the
double declaration, at that point we could have just dropped the one in
sparse-index.h, but instead it got moved to cache.h.

As a result, cache.h contains the exact same function declaration twice.
Drop the one under "/* Name hashing */", in favor of the one under
"/* Initialize and use the cache information */".

Signed-off-by: Martin Ågren <martin.agren@gmail.com>
Acked-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-10 11:30:33 -08:00
Marc Strapetz 2ede073fd2 update-index: refresh should rewrite index in case of racy timestamps
'git update-index --refresh' and '--really-refresh' should force writing
of the index file if racy timestamps have been encountered, as
'git status' already does [1].

Note that calling 'git update-index --refresh' still does not guarantee
that there will be no more racy timestamps afterwards (the same holds
true for 'git status'):

- calling 'git update-index --refresh' immediately after touching and
  adding a file may still leave racy timestamps if all three operations
  occur within the racy-tolerance (usually 1 second unless USE_NSEC has
  been defined)

- calling 'git update-index --refresh' for timestamps which are set into
  the future will leave them racy

To guarantee that such racy timestamps will be resolved would require to
wait until the system clock has passed beyond these timestamps and only
then write the index file. Especially for future timestamps, this does
not seem feasible because of possibly long delays/hangs.

[1] https://lore.kernel.org/git/d3dd805c-7c1d-30a9-6574-a7bfcb7fc013@syntevo.com/

Signed-off-by: Marc Strapetz <marc.strapetz@syntevo.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2022-01-07 12:37:31 -08:00
Junio C Hamano da81d473fc Merge branch 'en/keep-cwd'
Many git commands that deal with working tree files try to remove a
directory that becomes empty (i.e. "git switch" from a branch that
has the directory to another branch that does not would attempt
remove all files in the directory and the directory itself).  This
drops users into an unfamiliar situation if the command was run in
a subdirectory that becomes subject to removal due to the command.
The commands have been taught to keep an empty directory if it is
the directory they were started in to avoid surprising users.

* en/keep-cwd:
  t2501: simplify the tests since we can now assume desired behavior
  dir: new flag to remove_dir_recurse() to spare the original_cwd
  dir: avoid incidentally removing the original_cwd in remove_path()
  stash: do not attempt to remove startup_info->original_cwd
  rebase: do not attempt to remove startup_info->original_cwd
  clean: do not attempt to remove startup_info->original_cwd
  symlinks: do not include startup_info->original_cwd in dir removal
  unpack-trees: add special cwd handling
  unpack-trees: refuse to remove startup_info->original_cwd
  setup: introduce startup_info->original_cwd
  t2501: add various tests for removing the current working directory
2022-01-05 14:01:28 -08:00
Ævar Arnfjörð Bjarmason 245b948815 cat-file: use GET_OID_ONLY_TO_DIE in --(textconv|filters)
Change the cat_one_file() logic that calls get_oid_with_context()
under --textconv and --filters to use the GET_OID_ONLY_TO_DIE flag,
thus improving the error messaging emitted when e.g. <path> is missing
but <rev> is not.

To service the "cat-file" use-case we need to introduce a new
"GET_OID_REQUIRE_PATH" flag, otherwise it would exit early as soon as
a valid "HEAD" was resolved, but in the "cat-file" case being changed
we always need a valid revision and path.

This arguably makes the "<bad rev>:<bad path>" and "<bad
rev>:<good (in HEAD) path>" use cases worse, as we won't quote the
<path> component at the user anymore, but let's just use the existing
logic "git log" et al use for now. We can improve the messaging for
those cases as a follow-up for all callers.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-30 13:05:29 -08:00
Junio C Hamano 63a2e8b41e Merge branch 'ew/test-wo-fsync'
Allow running our tests while disabling fsync.

* ew/test-wo-fsync:
  tests: disable fsync everywhere
2021-12-15 09:39:52 -08:00
Junio C Hamano f0850875fd Merge branch 'vd/sparse-reset'
Various operating modes of "git reset" have been made to work
better with the sparse index.

* vd/sparse-reset:
  unpack-trees: improve performance of next_cache_entry
  reset: make --mixed sparse-aware
  reset: make sparse-aware (except --mixed)
  reset: integrate with sparse index
  reset: expand test coverage for sparse checkouts
  sparse-index: update command for expand/collapse test
  reset: preserve skip-worktree bit in mixed reset
  reset: rename is_missing to !is_in_reset_tree
2021-12-10 14:35:12 -08:00
Elijah Newren e6f8861bd4 setup: introduce startup_info->original_cwd
Removing the current working directory causes all subsequent git
commands run from that directory to get confused and fail with a message
about being unable to read the current working directory:

    $ git status
    fatal: Unable to read current working directory: No such file or directory

Non-git commands likely have similar warnings or even errors, e.g.

    $ bash -c 'echo hello'
    shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
    hello

This confuses end users, particularly since the command they get the
error from is not the one that caused the problem; the problem came from
the side-effect of some previous command.

We would like to avoid removing the current working directory of our
parent process; towards this end, introduce a new variable,
startup_info->original_cwd, that tracks the current working directory
that we inherited from our parent process.  For convenience of later
comparisons, we prefer that this new variable store a path relative to
the toplevel working directory (thus much like 'prefix'), except without
the trailing slash.

Subsequent commits will make use of this new variable.

Acked-by: Derrick Stolee <stolee@gmail.com>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-12-09 13:33:12 -08:00
Victoria Dye 20ec2d034c reset: make sparse-aware (except --mixed)
Remove `ensure_full_index` guard on `prime_cache_tree` and update
`prime_cache_tree_rec` to correctly reconstruct sparse directory entries in
the cache tree. While processing a tree's entries, `prime_cache_tree_rec`
must determine whether a directory entry is sparse or not by searching for
it in the index (*without* expanding the index). If a matching sparse
directory index entry is found, no subtrees are added to the cache tree
entry and the entry count is set to 1 (representing the sparse directory
itself). Otherwise, the tree is assumed to not be sparse and its subtrees
are recursively added to the cache tree.

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Victoria Dye <vdye@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-29 12:51:26 -08:00
Jeff King 9b591b9403 strbuf_addftime(): handle "%s" manually
The strftime() function has a non-standard "%s" extension, which prints
the number of seconds since the epoch. But the "struct tm" we get has
already been adjusted for a particular time zone; going back to an epoch
time requires knowing that zone offset. Since strftime() doesn't take
such an argument, round-tripping to a "struct tm" and back to the "%s"
format may produce the wrong value (off by tz_offset seconds).

Since we're already passing in the zone offset courtesy of c3fbf81a85
(strbuf: let strbuf_addftime handle %z and %Z itself, 2017-06-15), we
can use that same value to adjust our epoch seconds accordingly.

Note that the description above makes it sound like strftime()'s "%s" is
useless (and really, the issue is shared by mktime(), which is what
strftime() would use under the hood). But it gets the two cases for
which it's designed correct:

  - the result of gmtime() will have a zero offset, so no adjustment is
    necessary

  - the result of localtime() will be offset by the local zone offset,
    and mktime() and strftime() are defined to assume this offset when
    converting back (there's actually some magic here; some
    implementations record this in the "struct tm", but we can't
    portably access or manipulate it. But they somehow "know" whether a
    "struct tm" is from gmtime() or localtime()).

This latter point means that "format-local:%s" actually works correctly
already, because in that case we rely on the system routines due to
6eced3ec5e (date: use localtime() for "-local" time formats,
2017-06-15). Our problem comes when trying to show times in the author's
zone, as the system routines provide no mechanism for converting in
non-local zones. So in those cases we have a "struct tm" that came from
gmtime(), but has been manipulated according to our offset.

The tests cover the broken round-trip by formatting "%s" for a time in a
non-system timezone. We use the made-up "+1234" here, which has two
advantages. One, we know it won't ever be the real system zone (and so
we're actually testing a case that would break). And two, since it has a
minute component, we're testing the full decoding of the +HHMM zone into
a number of seconds. Likewise, we test the "-1234" variant to make sure
there aren't any sign mistakes.

There's one final test, which covers "format-local:%s". As noted, this
already passes, but it's important to check that we didn't regress this
case. In particular, the caller in show_date() is relying on localtime()
to have done the zone adjustment, independent of any tz_offset we
compute ourselves. These should match up, since our local_tzoffset() is
likewise built around localtime(). But it would be easy for a caller to
forget to pass in a correct tz_offset to strbuf_addftime(). Fortunately
show_date() does this correctly (it has to because of the existing
handling of %z), and the test continues to pass. So this one is just
future-proofing against a change in our assumptions.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-11-04 12:38:09 -07:00
Eric Wong 412e4caee3 tests: disable fsync everywhere
The "GIT_TEST_FSYNC" environment variable now exists for
disabling fsync() even on packfiles and other "critical" data.

Running "make test -j8 NO_SVN_TESTS=1" on a noisy 8-core system
on an HDD, test runtime drops from ~4 minutes down to ~3 minutes.
Using "GIT_TEST_FSYNC=1" re-enables fsync() for comparison
purposes.

SVN interopability tests are minimally affected since SVN will
still use fsync in various places.

This will also be useful for 3rd-party tools which create
throwaway git repositories of temporary data, but remains
undocumented for end users.

Signed-off-by: Eric Wong <e@80x24.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-29 10:22:40 -07:00
Junio C Hamano 2c428e4205 Merge branch 'ab/fix-commit-error-message-upon-unwritable-object-store'
"git commit" gave duplicated error message when the object store
was unwritable, which has been corrected.

* ab/fix-commit-error-message-upon-unwritable-object-store:
  commit: fix duplication regression in permission error output
  unwritable tests: assert exact error output
2021-10-25 16:06:57 -07:00
Junio C Hamano 061a21d36d Merge branch 'ab/fsck-unexpected-type'
"git fsck" has been taught to report mismatch between expected and
actual types of an object better.

* ab/fsck-unexpected-type:
  fsck: report invalid object type-path combinations
  fsck: don't hard die on invalid object types
  object-file.c: stop dying in parse_loose_header()
  object-file.c: return ULHR_TOO_LONG on "header too long"
  object-file.c: use "enum" return type for unpack_loose_header()
  object-file.c: simplify unpack_loose_short_header()
  object-file.c: make parse_loose_header_extended() public
  object-file.c: return -1, not "status" from unpack_loose_header()
  object-file.c: don't set "typep" when returning non-zero
  cat-file tests: test for current --allow-unknown-type behavior
  cat-file tests: add corrupt loose object test
  cat-file tests: test for missing/bogus object with -t, -s and -p
  cat-file tests: move bogus_* variable declarations earlier
  fsck tests: test for garbage appended to a loose object
  fsck tests: test current hash/type mismatch behavior
  fsck tests: refactor one test to use a sub-repo
  fsck tests: add test for fsck-ing an unknown type
2021-10-25 16:06:56 -07:00
Ævar Arnfjörð Bjarmason 4ef91a2d79 commit: fix duplication regression in permission error output
Fix a regression in the error output emitted when .git/objects can't
be written to. Before 9c4d6c0297 (cache-tree: Write updated
cache-tree after commit, 2014-07-13) we'd emit only one "insufficient
permission" error, now we'll do so again.

The cause is rather straightforward, we've got WRITE_TREE_SILENT for
the use-case of wanting to prepare an index silently, quieting any
permission etc. error output. Then when we attempt to update to
that (possibly broken) index we'll run into the same errors again.

But with 9c4d6c0297 the gap between the cache-tree API and the object
store wasn't closed in terms of asking write_object_file() to be
silent. I.e. post-9c4d6c0297b the first call is to prepare_index(),
and after that we'll call prepare_to_commit(). We only want verbose
error output from the latter.

So let's add and use that facility with a corresponding HASH_SILENT
flag, its only user is cache-tree.c's update_one(), which will set it
if its "WRITE_TREE_SILENT" flag is set.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-10-12 11:16:59 -07:00
Junio C Hamano 404c4a5462 Merge branch 'ab/designated-initializers'
Code clean-up.

* ab/designated-initializers:
  cbtree.h: define cb_init() in terms of CBTREE_INIT
  *.h: move some *_INIT to designated initializers
  *.h _INIT macros: don't specify fields equal to 0
  *.[ch] *_INIT macros: use { 0 } for a "zero out" idiom
  submodule-config.h: remove unused SUBMODULE_INIT macro
2021-10-11 10:21:48 -07:00