Reduce memory usage during "diff --quiet" in a worktree with too
many stat-unmatched paths.
* jk/diff-memuse-optim-with-stat-unmatch:
diff: discard blob data from stat-unmatched pairs
When performing a tree-level diff against the working tree, we may find
that our index stat information is dirty, so we queue a filepair to be
examined later. If the actual content hasn't changed, we call this a
stat-unmatch; the stat information was out of date, but there's no
actual diff. Normally diffcore_std() would detect and remove these
identical filepairs via diffcore_skip_stat_unmatch(). However, when
"--quiet" is used, we want to stop the diff as soon as we see any
changes, so we check for stat-unmatches immediately in diff_change().
That check may require us to actually load the file contents into the
pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
then no big deal; we'd likely load the contents later anyway to generate
a patch, do rename detection, etc, so we want to hold on to it. But if
it is a stat-unmatch, then we have no more use for that data; the whole
point is that we're going discard the pair. However, we never free the
allocated diff_filespec data.
In most cases, keeping that data isn't a problem. We don't expect a lot
of stat-unmatch entries, and since we're using --quiet, we'd quit as
soon as we saw such a real change anyway. However, there are extreme
cases where it makes a big difference:
1. We'd generally mmap() the working tree half of the pair. And since
the OS may limit the total number of maps, we can run afoul of this
in large repositories. E.g.:
$ cd linux
$ git ls-files | wc -l
67959
$ sysctl vm.max_map_count
vm.max_map_count = 65530
$ git ls-files | xargs touch ;# everything is stat-dirty!
$ git diff --quiet
fatal: mmap failed: Cannot allocate memory
It should be unusual to have so many files stat-dirty, but it's
possible if you've just run a script like "sed -i" or similar.
After this patch, the above correctly exits with code 0.
2. Even if you don't hit mmap limits, the index half of the pair will
have been pulled from the object database into heap memory. Again
in a clone of linux.git, running:
$ git ls-files | head -n 10000 | xargs touch
$ git diff --quiet
peaks at 145MB heap before this patch, and 94MB after.
This patch solves the problem by freeing any diff_filespec data we
picked up during the "--quiet" stat-unmatch check in diff_changes.
Nobody is going to need that data later, so there's no point holding on
to it. There are a few things to note:
- we could skip queueing the pair entirely, which could in theory save
a little work. But there's not much to save, as we need a
diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
And since we cache the result of the stat-unmatch checks, a later
call to diffcore_skip_stat_unmatch() call will quickly skip over
them. The diffcore code also counts up the number of stat-unmatched
pairs as it removes them. It's doubtful any callers would care about
that in combination with --quiet, but we'd have to reimplement the
logic here to be on the safe side. So it's not really worth the
trouble.
- I didn't write a test, because we always produce the correct output
unless we run up against system mmap limits, which are both
unportable and expensive to test against. Measuring peak heap
would be interesting, but our perf suite isn't yet capable of that.
- note that diff without "--quiet" does not suffer from the same
problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
entries and drop them immediately, so we're not carrying their data
around.
- you _can_ still trigger the mmap limit problem if you truly have
that many files with actual changes. But it's rather unlikely. The
stat-unmatch check avoids loading the file contents if the sizes
don't match, so you'd need a pretty trivial change in every single
file. Likewise, inexact rename detection might load the data for
many files all at once. But you'd need not just 64k changes, but
that many deletions and additions. The most likely candidate is
perhaps break-detection, which would load the data for all pairs and
keep it around for the content-level diff. But again, you'd need 64k
actually changed files in the first place.
So it's still possible to trigger this case, but it seems like "I
accidentally made all my files stat-dirty" is the most likely case
in the real world.
Reported-by: Jan Christoph Uhde <Jan@UhdeJc.com>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `diff.relative` boolean option set to `true` shows only changes in
the current directory/value specified by the `path` argument of the
`relative` option and shows pathnames relative to the aforementioned
directory.
Teach `--no-relative` to override earlier `--relative`
Add for git-format-patch(1) options documentation `--relative` and
`--no-relative`
Signed-off-by: Laurent Arnoud <laurent@spkdev.net>
Acked-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git diff" in a partial clone learned to avoid lazy loading blob
objects in more casese when they are not needed.
* jt/avoid-prefetch-when-able-in-diff:
diff: restrict when prefetching occurs
diff: refactor object read
diff: make diff_populate_filespec_options struct
promisor-remote: accept 0 as oid_nr in function
Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
optimized "diff" by prefetching blobs in a partial clone, but there are
some cases wherein blobs do not need to be prefetched. In these cases,
any command that uses the diff machinery will unnecessarily fetch blobs.
diffcore_std() may read blobs when it calls the following functions:
(1) diffcore_skip_stat_unmatch() (controlled by the config variable
diff.autorefreshindex)
(2) diffcore_break() and diffcore_merge_broken() (for break-rewrite
detection)
(3) diffcore_rename() (for rename detection)
(4) diffcore_pickaxe() (for detecting addition/deletion of specified
string)
Instead of always prefetching blobs, teach diffcore_skip_stat_unmatch(),
diffcore_break(), and diffcore_rename() to prefetch blobs upon the first
read of a missing object. This covers (1), (2), and (3): to cover the
rest, teach diffcore_std() to prefetch if the output type is one that
includes blob data (and hence blob data will be required later anyway),
or if it knows that (4) will be run.
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Refactor the object reads in diff_populate_filespec() to have the first
object read not be in an if/else branch, because in a future patch, a
retry will be added to that first object read.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The behavior of diff_populate_filespec() currently can be customized
through a bitflag, but a subsequent patch requires it to support a
non-boolean option. Replace the bitflag with an options struct.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are 3 callers to promisor_remote_get_direct() that first check if
the number of objects to be fetched is equal to 0. Fold that check into
promisor_remote_get_direct(), and in doing so, be explicit as to what
promisor_remote_get_direct() does if oid_nr is 0 (it returns 0, success,
immediately).
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Now that we have the codebase wired up to pass any additional metadata
to filters, let's collect the additional metadata that we'd like to
pass.
The two main places we pass this metadata are checkouts and archives.
In these two situations, reading HEAD isn't a valid option, since HEAD
isn't updated for checkouts until after the working tree is written and
archives can accept an arbitrary tree. In other situations, HEAD will
usually reflect the refname of the branch in current use.
We pass a smaller amount of data in other cases, such as git cat-file,
where we can really only logically know about the blob.
This commit updates only the parts of the checkout code where we don't
use unpack_trees. That function and callers of it will be handled in a
future commit.
In the archive code, we leak a small amount of memory, since nothing we
pass in the archiver argument structure is freed.
Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are a variety of situations where a filter process can make use of
some additional metadata. For example, some people find the ident
filter too limiting and would like to include the commit or the branch
in their smudged files. This information isn't available during
checkout as HEAD hasn't been updated at that point, and it wouldn't be
available in archives either.
Let's add a way to pass this metadata down to the filter. We pass the
blob we're operating on, the treeish (preferring the commit over the
tree if one exists), and the ref we're operating on. Note that we won't
pass this information in all cases, such as when renormalizing or when
we're performing diffs, since it doesn't make sense in those cases.
The data we currently get from the filter process looks like the
following:
command=smudge
pathname=git.c
0000
With this change, we'll get data more like this:
command=smudge
pathname=git.c
refname=refs/tags/v2.25.1
treeish=c522f061d551c9bb8684a7c3859b2ece4499b56b
blob=7be7ad34bd053884ec48923706e70c81719a8660
0000
There are a couple things to note about this approach. For operations
like checkout, treeish will always be a commit, since we cannot check
out individual trees, but for other operations, like archive, we can end
up operating on only a particular tree, so we'll provide only a tree as
the treeish. Similar comments apply for refname, since there are a
variety of cases in which we won't have a ref.
This commit wires up the code to print this information, but doesn't
pass any of it at this point. In a future commit, we'll have various
code paths pass the actual useful data down.
Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Some codepaths were given a repository instance as a parameter to
work in the repository, but passed the_repository instance to its
callees, which has been cleaned up (somewhat).
* mt/use-passed-repo-more-in-funcs:
sha1-file: allow check_object_signature() to handle any repo
sha1-file: pass git_hash_algo to hash_object_file()
sha1-file: pass git_hash_algo to write_object_file_prepare()
streaming: allow open_istream() to handle any repo
pack-check: use given repo's hash_algo at verify_packfile()
cache-tree: use given repo's hash_algo at verify_one()
diff: make diff_populate_filespec() honor its repo argument
We parse diff.wsErrorHighlight in git_diff_ui_config(), meaning that it
doesn't take effect for plumbing commands, only for porcelains like
git-diff itself. This is mildly annoying as it means scripts like
add--interactive, which produce a user-visible diff with color, don't
respect the option.
We could teach that script to parse the config and pass it along as
--ws-error-highlight to the diff plumbing. But there's a simpler
solution.
It should be reasonably safe for plumbing to respect this option, as it
only kicks in when color is otherwise enabled. And anybody parsing
colorized output must already deal with the fact that color.diff.* may
change the exact output they see; those options have been part of
git_diff_basic_config() since its inception in 9a1805a872 (add a "basic"
diff config callback, 2008-01-04).
So we can just move it to the "basic" config, which fixes
add--interactive, along with any other script in the same boat, with a
very low risk of hurting any plumbing users.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
diff_populate_filespec() takes a struct repository argument but it
doesn't get passed down to read_object_file().
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The beginning of rewriting "git add -i" in C.
* js/builtin-add-i:
built-in add -i: implement the `help` command
built-in add -i: use color in the main loop
built-in add -i: support `?` (prompt help)
built-in add -i: show unique prefixes of the commands
built-in add -i: implement the main loop
built-in add -i: color the header in the `status` command
built-in add -i: implement the `status` command
diff: export diffstat interface
Start to implement a built-in version of `git add --interactive`
Make the diffstat interface (namely, the diffstat_t struct and
compute_diffstat) no longer be internal to diff.c and allow it to be used
by other parts of git.
This is helpful for code that may want to easily extract information
from files using the diff machinery, while flushing it differently from
how the show_* functions used by diff_flush() do it. One example is the
builtin implementation of git-add--interactive's status.
Signed-off-by: Daniel Ferreira <bnmvco@gmail.com>
Signed-off-by: Slavica Đukić <slawica92@hotmail.com>
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Preparation for SHA-256 upgrade continues.
* bc/object-id-part17: (26 commits)
midx: switch to using the_hash_algo
builtin/show-index: replace sha1_to_hex
rerere: replace sha1_to_hex
builtin/receive-pack: replace sha1_to_hex
builtin/index-pack: replace sha1_to_hex
packfile: replace sha1_to_hex
wt-status: convert struct wt_status to object_id
cache: remove null_sha1
builtin/worktree: switch null_sha1 to null_oid
builtin/repack: write object IDs of the proper length
pack-write: use hash_to_hex when writing checksums
sequencer: convert to use the_hash_algo
bisect: switch to using the_hash_algo
sha1-lookup: switch hard-coded constants to the_hash_algo
config: use the_hash_algo in abbrev comparison
combine-diff: replace GIT_SHA1_HEXSZ with the_hash_algo
bundle: switch to use the_hash_algo
connected: switch GIT_SHA1_HEXSZ to the_hash_algo
show-index: switch hard-coded constants to the_hash_algo
blame: remove needless comparison with GIT_SHA1_HEXSZ
...
Since these macros already take a `keyvar' pointer of a known type,
we can rely on OFFSETOF_VAR to get the correct offset without
relying on non-portable `__typeof__' and `offsetof'.
Argument order is also rearranged, so `keyvar' and `member' are
sequential as they are used as: `keyvar->member'
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
While we cannot rely on a `__typeof__' operator being portable
to use with `offsetof'; we can calculate the pointer offset
using an existing pointer and the address of a member using
pointer arithmetic for compilers without `__typeof__'.
This allows us to simplify usage of hashmap iterator macros
by not having to specify a type when a pointer of that type
is already given.
In the future, list iterator macros (e.g. list_for_each_entry)
may also be implemented using OFFSETOF_VAR to save hackers the
trouble of using container_of/list_entry macros and without
relying on non-portable `__typeof__'.
v3: use `__typeof__' to avoid clang warnings
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`hashmap_free_entries' behaves like `container_of' and passes
the offset of the hashmap_entry struct to the internal
`hashmap_free_' function, allowing the function to free any
struct pointer regardless of where the hashmap_entry field
is located.
`hashmap_free' no longer takes any arguments aside from
the hashmap itself.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Another step in eliminating the requirement of hashmap_entry
being the first member of a struct.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using `container_of' can be verbose and choosing names for
intermediate "struct hashmap_entry" pointers is a hard problem.
So introduce "*_entry" APIs inspired by similar linked-list
APIs in the Linux kernel.
Unfortunately, `__typeof__' is not portable C, so we need an
extra parameter to specify the type.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is a step towards removing the requirement for
hashmap_entry being the first field of a struct.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "const void *" as the compiler
now detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "void *" as the compiler now
detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This is less error-prone than "const void *" as the compiler
now detects invalid types being passed.
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Otherwise, the hashmap_entry.next field appears to remain
uninitialized, which can lead to problems when
add_lines_to_move_detection calls hashmap_add.
I found this through manual inspection when converting
hashmap_add callers to take "struct hashmap_entry *".
Signed-off-by: Eric Wong <e@80x24.org>
Reviewed-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Teach the lazy clone machinery that there can be more than one
promisor remote and consult them in order when downloading missing
objects on demand.
* cc/multi-promisor:
Move core_partial_clone_filter_default to promisor-remote.c
Move repository_format_partial_clone to promisor-remote.c
Remove fetch-object.{c,h} in favor of promisor-remote.{c,h}
remote: add promisor and partial clone config to the doc
partial-clone: add multiple remotes in the doc
t0410: test fetching from many promisor remotes
builtin/fetch: remove unique promisor remote limitation
promisor-remote: parse remote.*.partialclonefilter
Use promisor_remote_get_direct() and has_promisor_remote()
promisor-remote: use repository_format_partial_clone
promisor-remote: add promisor_remote_reinit()
promisor-remote: implement promisor_remote_get_direct()
Add initial support for many promisor remotes
fetch-object: make functions return an error code
t0410: remove pipes after git commands
On-demand object fetching in lazy clone incorrectly tried to fetch
commits from submodule projects, while still working in the
superproject, which has been corrected.
* jt/diff-lazy-fetch-submodule-fix:
diff: skip GITLINK when lazy fetching missing objs
In 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08),
diff was taught to batch the fetching of missing objects when operating
on a partial clone, but was not taught to refrain from fetching
GITLINKs. Teach diff to check if an object is a GITLINK before including
it in the set to be fetched.
(As stated in the commit message of that commit, unpack-trees was also
taught a similar thing prior, but unpack-trees correctly checks for
GITLINK before including objects in the set to be fetched.)
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Convert the two separate patch-id implementations to use the_hash_algo
in their implementation.
Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The line count in the outer diff's hunk headers of a range diff is not
all that interesting. It merely shows how far along the inner diff
are on both sides. That number is of no use for human readers, and
range-diffs are not meant to be machine readable.
In a subsequent commit we're going to add some more contextual
information such as the filename corresponding to the diff to the hunk
headers. Remove the unnecessary information, and just keep the "@@"
to indicate that a new hunk of the outer diff is starting.
Signed-off-by: Thomas Gummerer <t.gummerer@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When running an external diff from, say, a diff tool, it is safe to
assume that we want to write the files in question. On Windows, that
means that there cannot be any other process holding an open handle to
said files, or even just a mapped region.
So let's make sure that `git diff` itself is not holding any open handle
to the files in question.
In fact, we will just release the file pair right away, as the external
diff uses the files we just wrote, so we do not need to hold the file
contents in memory anymore.
This fixes https://github.com/git-for-windows/git/issues/1315
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Instead of using the repository_format_partial_clone global
and fetch_objects() directly, let's use has_promisor_remote()
and promisor_remote_get_direct().
This way all the configured promisor remotes will be taken
into account, not only the one specified by
extensions.partialClone.
Also when cloning or fetching using a partial clone filter,
remote.origin.promisor will be set to "true" instead of
setting extensions.partialClone to "origin". This makes it
possible to use many promisor remote just by fetching from
them.
Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "--base" option of "format-patch" computed the patch-ids for
prerequisite patches in an unstable way, which has been updated to
compute in a way that is compatible with "git patch-id --stable".
* sb/format-patch-base-patch-id-fix:
format-patch: make --base patch-id output stable
format-patch: inform user that patch-id generation is unstable
Fix two typos introduced by the following commits:
+ 31fba9d3b4 (diff-parseopt: convert --[src|dst]-prefix, 2019-03-24)
+ ed8b4132c8 (remote-curl: mark all error messages for translation,
2019-03-05)
Signed-off-by: Jiang Xin <worldhello.net@gmail.com>
Acked-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A brown-paper-bag bugfix to a change already in 'master'.
* nd/diff-parseopt:
parse-options: check empty value in OPT_INTEGER and OPT_ABBREV
diff-parseopt: restore -U (no argument) behavior
diff-parseopt: correct variable types that are used by parseopt
Before d473e2e0e8 (diff.c: convert -U|--unified, 2019-01-27), -U and
--unified are implemented with a custom parser opt_arg() in diff.c. I
didn't check this code carefully and not realize that it's the
equivalent of PARSE_OPT_NONEG | PARSE_OPT_OPTARG.
In other words, if -U is specified without any argument, the option
should be accepted, and the default value should be used. Without
PARSE_OPT_OPTARG, parse_options() will reject this case and cause a
regression.
Reported-by: Bryan Turner <bturner@atlassian.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We weren't flushing the context each time we processed a hunk in the
patch-id generation code in diff.c, but we were doing that when we
generated "stable" patch-ids with the 'patch-id' tool. Let's port that
similar logic over from patch-id.c into diff.c so we can get the same
hash when we're generating patch-ids for 'format-patch --base=' types of
command invocations.
Cc: Xiaolong Ye <xiaolong.ye@intel.com>
Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
While running "git diff" in a lazy clone, we can upfront know which
missing blobs we will need, instead of waiting for the on-demand
machinery to discover them one by one. Aim to achieve better
performance by batching the request for these promised blobs.
* jt/batch-fetch-blobs-in-diff:
diff: batch fetching of missing blobs
sha1-file: support OBJECT_INFO_FOR_PREFETCH
When running a command like "git show" or "git diff" in a partial clone,
batch all missing blobs to be fetched as one request.
This is similar to c0c578b33c ("unpack-trees: batch fetching of missing
blobs", 2017-12-08), but for another command.
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
diff_opt_parse() is a heavy hammer to just set diff filter. But it's
the only way because of the diff_status_letters[] mapping. Add a new
API to set diff filter and use it in git-am. diff_opt_parse()'s only
remaining call site in revision.c will be gone soon and having it here
just because of git-am does not make sense.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This option is added in commit b73bcbac4a (diff: allow
--no-color-moved-ws - 2018-11-23) in pw/diff-color-moved-ws-fix. To ease
merge conflict resolution, re-implement the option handling here so that
the conflict could be resolved by taking this side of change.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>