development/git - HydraGit

mirror of https://github.com/git/git synced 2024-11-05 18:59:29 +00:00

Author	SHA1	Message	Date
Junio C Hamano	afe8a9070b	tree-wide: apply equals-null.cocci Signed-off-by: Junio C Hamano <gitster@pobox.com>	2022-05-02 09:50:37 -07:00
brian m. carlson	14228447c9	hash: provide per-algorithm null OIDs Up until recently, object IDs did not have an algorithm member, only a hash. Consequently, it was possible to share one null (all-zeros) object ID among all hash algorithms. Now that we're going to be handling objects from multiple hash algorithms, it's important to make sure that all object IDs have a correct algorithm field. Introduce a per-algorithm null OID, and add it to struct hash_algo. Introduce a wrapper function as well, and use it everywhere we used to use the null_oid constant. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-04-27 16:31:39 +09:00
René Scharfe	ca56dadb4b	use CALLOC_ARRAY Add and apply a semantic patch for converting code that open-codes CALLOC_ARRAY to use it instead. It shortens the code and infers the element size automatically. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2021-03-13 16:00:09 -08:00
Junio C Hamano	bf0a430f70	Merge branch 'en/strmap' A specialization of hashmap that uses a string as key has been introduced. Hopefully it will see wider use over time. * en/strmap: shortlog: use strset from strmap.h Use new HASHMAP_INIT macro to simplify hashmap initialization strmap: take advantage of FLEXPTR_ALLOC_STR when relevant strmap: enable allocations to come from a mem_pool strmap: add a strset sub-type strmap: split create_entry() out of strmap_put() strmap: add functions facilitating use as a string->int map strmap: enable faster clearing and reusing of strmaps strmap: add more utility functions strmap: new utility functions hashmap: provide deallocation function names hashmap: introduce a new hashmap_partial_clear() hashmap: allow re-use after hashmap_free() hashmap: adjust spacing to fix argument alignment hashmap: add usage documentation explaining hashmap_free[_entries]()	2020-11-21 15:14:38 -08:00
Elijah Newren	6da1a25814	hashmap: provide deallocation function names hashmap_free(), hashmap_free_entries(), and hashmap_free_() have existed for a while, but aren't necessarily the clearest names, especially with hashmap_partial_clear() being added to the mix and lazy-initialization now being supported. Peff suggested we adopt the following names[1]: - hashmap_clear() - remove all entries and de-allocate any hashmap-specific data, but be ready for reuse - hashmap_clear_and_free() - ditto, but free the entries themselves - hashmap_partial_clear() - remove all entries but don't deallocate table - hashmap_partial_clear_and_free() - ditto, but free the entries This patch provides the new names and converts all existing callers over to the new naming scheme. [1] https://lore.kernel.org/git/20201030125059.GA3277724@coredump.intra.peff.net/ Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-02 12:15:50 -08:00
Philippe Blain	3af31e8786	blame: simplify 'setup_blame_bloom_data' interface The penultimate commit moved the initialization of 'sb.path' in 'builtin/blame.c::cmd_blame' before the call to 'blame.c::setup_blame_bloom_data'. Since 'cmd_blame' is the only caller of 'setup_blame_bloom_data', it is now unnecessary for 'setup_blame_bloom_data' to receive 'path' as a separate argument, as 'sb.path' is already initialized. Remove this argument from setup_blame_bloom_data's interface and use the 'path' field of the 'sb' 'struct blame_scoreboard' instead. Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-01 15:54:15 -08:00
Philippe Blain	88894aaeea	blame: simplify 'setup_scoreboard' interface The previous commit moved the initialization of 'sb.path' in 'builtin/blame.c::cmd_blame' before the call to 'blame.c::setup_scoreboard'. Since 'cmd_blame' is the only caller of 'setup_scoreboard', it is now unnecessary for 'setup_scoreboard' to receive 'path' as a separate argument, as 'sb.path' is already initialized. Remove this argument from setup_scoreboard's interface and use the 'path' field of the 'sb' 'struct blame_scoreboard' instead. Signed-off-by: Philippe Blain <levraiphilippeblain@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-11-01 15:54:15 -08:00
René Scharfe	db7d07f610	blame: handle deref_tag() returning NULL Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-10-12 12:25:14 -07:00
Junio C Hamano	288ed98bf7	Merge branch 'tb/bloom-improvements' "git commit-graph write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters option. * tb/bloom-improvements: commit-graph: introduce 'commitGraph.maxNewFilters' builtin/commit-graph.c: introduce '--max-new-filters=<n>' commit-graph: rename 'split_commit_graph_opts' bloom: encode out-of-bounds filters as non-empty bloom/diff: properly short-circuit on max_changes bloom: use provided 'struct bloom_filter_settings' bloom: split 'get_bloom_filter()' in two commit-graph.c: store maximum changed paths commit-graph: respect 'commitGraph.readChangedPaths' t/helper/test-read-graph.c: prepare repo settings commit-graph: pass a 'struct repository *' in more places t4216: use an '&&'-chain commit-graph: introduce 'get_bloom_filter_settings()'	2020-09-29 14:01:20 -07:00
Junio C Hamano	e1dd499513	Merge branch 'ea/blame-use-oideq' Code cleanup. * ea/blame-use-oideq: blame.c: replace instance of !oidcmp for oideq	2020-09-18 17:58:01 -07:00
Taylor Blau	312cff5207	bloom: split 'get_bloom_filter()' in two 'get_bloom_filter' takes a flag to control whether it will compute a Bloom filter if the requested one is missing. In the next patch, we'll add yet another parameter to this method, which would force all but one caller to specify an extra 'NULL' parameter at the end. Instead of doing this, split 'get_bloom_filter' into two functions: 'get_bloom_filter' and 'get_or_compute_bloom_filter'. The former only looks up a Bloom filter (and does not compute one if it's missing, thus dropping the 'compute_if_not_present' flag). The latter does compute missing Bloom filters, with an additional parameter to store whether or not it needed to do so. This simplifies many call-sites, since the majority of existing callers to 'get_bloom_filter' do not want missing Bloom filters to be computed (so they can drop the parameter entirely and use the simpler version of the function). While we're at it, instrument the new 'get_or_compute_bloom_filter()' with counters in the 'write_commit_graph_context' struct which store the number of filters that we did and didn't compute, as well as filters that were truncated. It would be nice to drop the 'compute_if_not_present' flag entirely, since all remaining callers of 'get_or_compute_bloom_filter' pass it as '1', but this will change in a future patch and hence cannot be removed. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-17 09:31:25 -07:00
Taylor Blau	4f3644056a	commit-graph: introduce 'get_bloom_filter_settings()' Many places in the code often need a pointer to the commit-graph's 'struct bloom_filter_settings', in which case they often take the value from the top-most commit-graph. In the non-split case, this works as expected. In the split case, however, things get a little tricky. Not all layers in a chain of incremental commit-graphs are required to themselves have Bloom data, and so whether or not some part of the code uses Bloom filters depends entirely on whether or not the top-most level of the commit-graph chain has Bloom filters. This has been the behavior since Bloom filters were introduced, and has been codified into the tests since `a759bfa9ee` (t4216: add end to end tests for git log with Bloom filters, 2020-04-06). In fact, t4216.130 requires that Bloom filters are not used in exactly the case described earlier. There is no reason that this needs to be the case, since it is perfectly valid for commits in an earlier layer to have Bloom filters when commits in a newer layer do not. Since Bloom settings are guaranteed in practice to be the same for any layer in a chain that has Bloom data, it is sufficient to traverse the '->base_graph' pointer until either (1) a non-null 'struct bloom_filter_settings ' is found, or (2) until we are at the root of the commit-graph chain. Introduce a 'get_bloom_filter_settings()' function that does just this, and use it instead of purely dereferencing the top-most graph's '->bloom_filter_settings' pointer. While we're at it, add an additional test in t5324 to guard against code in the commit-graph writing machinery that doesn't correctly handle a NULL 'struct bloom_filter '. Co-authored-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-09 12:51:48 -07:00
Edmundo Carmona Antoranz	1302badd16	blame.c: replace instance of !oidcmp for oideq `0906ac2b` (blame: use changed-path Bloom filters, 2020-04-16) introduced a call to oidcmp() that should have been oideq(), which was introduced in `14438c44` (introduce hasheq() and oideq(), 2018-08-28). Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-09-08 15:57:26 -07:00
Jeff King	c2ebaa27d6	blame: only coalesce lines that are adjacent in result After blame has finished but before we produce any output, we coalesce groups of lines that were adjacent in the original suspect (which may have been split apart by lines in intermediate commits which went away). However, this can cause incorrect output if the lines are not also adjacent in the result. For instance, the case in t8003 has: ABC DEF which becomes ABC SPLIT DEF Blaming only lines 1 and 3 in the result yields two blame groups (one for each line) that were adjacent in the original. That's enough for us to coalesce them into a single group, but that loses information: our output routines assume they're adjacent in the result as well, and we output: <oid> 1) ABC <oid> 2) SPLIT This is nonsense for two reasons: - we were asked about line 3, not line 2; we should not output the SPLIT line at all - commit <oid> did not touch the SPLIT line at all! We found the correct blame for line 3, but the bug is actually in the output stage, which is showing the wrong line number and content from the final file. We can fix this by only coalescing when both the suspect and result lines are adjacent. That fixes this bug, but keeps coalescing in cases where want it (e.g., the existing test in t8003 where SPLIT goes away, and the lines really are adjacent in the result). Reported-by: Nuthan Munaiah <nm6061@rit.edu> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-08-13 10:09:38 -07:00
Abhishek Kumar	c49c82aa4c	commit: move members graph_pos, generation to a slab We remove members `graph_pos` and `generation` from the struct commit. The default assignments in init_commit_node() are no longer valid, which is fine as the slab helpers return appropriate default values and the assignments are removed. We will replace existing use of commit->generation and commit->graph_pos by commit_graph_data_slab helpers using `contrib/coccinelle/commit.cocci'. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-06-17 14:37:30 -07:00
Jeff King	fe88f9f91f	blame: drop unused parameter from maybe_changed_path We don't use the "parent" parameter at all (probably because the bloom filter for a commit is always defined against a single parent anyway). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-04-23 14:37:03 -07:00
Derrick Stolee	0906ac2b54	blame: use changed-path Bloom filters The changed-path Bloom filters help reduce the amount of tree parsing required during history queries. Before calculating a diff, we can ask the filter if a path changed between a commit and its first parent. If the filter says "no" then we can move on without parsing trees. If the filter says "maybe" then we parse trees to discover if the answer is actually "yes" or "no". When computing a blame, there is a section in find_origin() that computes a diff between a commit and one of its parents. When this is the first parent, we can check the Bloom filters before calling diff_tree_oid(). In order to make this work with the blame machinery, we need to initialize a struct bloom_key with the initial path. But also, we need to add more keys to a list if a rename is detected. We then check to see if _any_ of these keys answer "maybe" in the diff. During development, I purposefully left out this "add a new key when a rename is detected" to see if the test suite would catch my error. That is how I discovered the issues with GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change. With that change, we can feel some confidence in the coverage of this change. If a user requests copy detection using "git blame -C", then there are more places where the set of "important" files can expand. I do not know enough about how this happens in the blame machinery. Thus, the Bloom filter integration is explicitly disabled in this mode. A later change could expand the bloom_key data with an appropriate call (or calls) to add_bloom_key(). If we did not disable this mode, then the following tests would fail: t8003-blame-corner-cases.sh t8011-blame-split-file.sh Generally, this is a performance enhancement and should not change the behavior of 'git blame' in any way. If a repo has a commit-graph file with computed changed-path Bloom filters, then they should notice improved performance for their 'git blame' commands. Here are some example timings that I found by blaming some paths in the Linux kernel repository: git blame arch/x86/kernel/topology.c >/dev/null Before: 0.83s After: 0.24s git blame kernel/time/time.c >/dev/null Before: 0.72s After: 0.24s git blame tools/perf/ui/stdio/hist.c >/dev/null Before: 0.27s After: 0.11s I specifically looked for "deep" paths that were also edited many times. As a counterpoint, the MAINTAINERS file was edited many times but is located in the root tree. This means that the cost of computing a diff relative to the pathspec is very small. Here are the timings for that command: git blame MAINTAINERS >/dev/null Before: 20.1s After: 18.0s These timings are the best of five. The worst-case runs were on the order of 2.5 minutes for both cases. Note that the MAINTAINERS file has 18,740 lines across 17,000+ commits. This happens to be one of the cases where this change provides the least improvement. The lack of improvement for the MAINTAINERS file and the relatively modest improvement for the other examples can be easily explained. The blame machinery needs to compute line-level diffs to determine which lines were changed by each commit. That makes up a large proportion of the computation time, and this change does not attempt to improve on that section of the algorithm. The MAINTAINERS file is large and changed often, so it takes time to determine which lines were updated by which commit. In contrast, the code files are much smaller, and it takes longer to comute the line-by-line diff for a single patch on the Linux mailing lists. Outside of the "-C" integration, I believe there is little more to gain from the changed-path Bloom filters for 'git blame' after this patch. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2020-04-16 15:38:06 -07:00
Junio C Hamano	5efabc7ed9	Merge branch 'ew/hashmap' Code clean-up of the hashmap API, both users and implementation. * ew/hashmap: hashmap_entry: remove first member requirement from docs hashmap: remove type arg from hashmap_{get,put,remove}_entry OFFSETOF_VAR macro to simplify hashmap iterators hashmap: introduce hashmap_free_entries hashmap: hashmap_{put,remove} return hashmap_entry * hashmap: use _entry APIs for iteration hashmap_cmp_fn takes hashmap_entry params hashmap_get{,_from_hash} return "struct hashmap_entry " hashmap: use _entry APIs to wrap container_of hashmap_get_next returns "struct hashmap_entry " introduce container_of macro hashmap_put takes "struct hashmap_entry " hashmap_remove takes "const struct hashmap_entry " hashmap_get takes "const struct hashmap_entry " hashmap_add takes "struct hashmap_entry " hashmap_get_next takes "const struct hashmap_entry " hashmap_entry_init takes "struct hashmap_entry " packfile: use hashmap_entry in delta_base_cache_entry coccicheck: detect hashmap_entry.hash assignment diff: use hashmap_entry_init on moved_entry.ent	2019-10-15 13:48:02 +09:00
Eric Wong	404ab78e39	hashmap: remove type arg from hashmap_{get,put,remove}_entry Since these macros already take a `keyvar' pointer of a known type, we can rely on OFFSETOF_VAR to get the correct offset without relying on non-portable `__typeof__' and `offsetof'. Argument order is also rearranged, so `keyvar' and `member' are sequential as they are used as: `keyvar->member' Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:12 +09:00
Eric Wong	23dee69f53	OFFSETOF_VAR macro to simplify hashmap iterators While we cannot rely on a `__typeof__' operator being portable to use with `offsetof'; we can calculate the pointer offset using an existing pointer and the address of a member using pointer arithmetic for compilers without `__typeof__'. This allows us to simplify usage of hashmap iterator macros by not having to specify a type when a pointer of that type is already given. In the future, list iterator macros (e.g. list_for_each_entry) may also be implemented using OFFSETOF_VAR to save hackers the trouble of using container_of/list_entry macros and without relying on non-portable `__typeof__'. v3: use `__typeof__' to avoid clang warnings Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:11 +09:00
Eric Wong	c8e424c9c9	hashmap: introduce hashmap_free_entries `hashmap_free_entries' behaves like `container_of' and passes the offset of the hashmap_entry struct to the internal `hashmap_free_' function, allowing the function to free any struct pointer regardless of where the hashmap_entry field is located. `hashmap_free' no longer takes any arguments aside from the hashmap itself. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:11 +09:00
Eric Wong	87571c3f71	hashmap: use *_entry APIs for iteration Inspired by list_for_each_entry in the Linux kernel. Once again, these are somewhat compromised usability-wise by compilers lacking __typeof__ support. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:11 +09:00
Eric Wong	f23a465132	hashmap_get{,_from_hash} return "struct hashmap_entry *" Update callers to use hashmap_get_entry, hashmap_get_entry_from_hash or container_of as appropriate. This is another step towards eliminating the requirement of hashmap_entry being the first field in a struct. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:11 +09:00
Eric Wong	28ee794128	hashmap_remove takes "const struct hashmap_entry " This is less error-prone than "const void " as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:10 +09:00
Eric Wong	b6c5241606	hashmap_get takes "const struct hashmap_entry " This is less error-prone than "const void " as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:10 +09:00
Eric Wong	b94e5c1df6	hashmap_add takes "struct hashmap_entry " This is less error-prone than "void " as the compiler now detects invalid types being passed. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:10 +09:00
Eric Wong	d22245a2e3	hashmap_entry_init takes "struct hashmap_entry " C compilers do type checking to make life easier for us. So rely on that and update all hashmap_entry_init callers to take "struct hashmap_entry " to avoid future bugs while improving safety and readability. Signed-off-by: Eric Wong <e@80x24.org> Reviewed-by: Derrick Stolee <stolee@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-10-07 10:20:09 +09:00
brian m. carlson	fee49308a1	blame: remove needless comparison with GIT_SHA1_HEXSZ When faking a working tree commit, we read in lines from MERGE_HEAD into a strbuf. Because the strbuf is NUL-terminated and get_oid_hex will fail if it unexpectedly encounters a NUL, the check for the length of the line is unnecessary. There is no optimization benefit from this case, either, since on failure we call die. Remove this check, since it is no longer needed. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-08-19 15:04:57 -07:00
Junio C Hamano	1eb0a12ec3	Merge branch 'nd/tree-walk-with-repo' The tree-walk API learned to pass an in-core repository instance throughout more codepaths. * nd/tree-walk-with-repo: t7814: do not generate same commits in different repos Use the right 'struct repository' instead of the_repository match-trees.c: remove the_repo from shift_tree*() tree-walk.c: remove the_repo from get_tree_entry_follow_symlinks() tree-walk.c: remove the_repo from get_tree_entry() tree-walk.c: remove the_repo from fill_tree_descriptor() sha1-file.c: remove the_repo from read_object_with_reference()	2019-07-19 11:30:21 -07:00
Junio C Hamano	209f075593	Merge branch 'br/blame-ignore' "git blame" learned to "ignore" commits in the history, whose effects (as well as their presence) get ignored. * br/blame-ignore: t8014: remove unnecessary braces blame: drop some unused function parameters blame: add a test to cover blame_coalesce() blame: use the fingerprint heuristic to match ignored lines blame: add a fingerprint heuristic to match ignored lines blame: optionally track line fingerprints during fill_blame_origin() blame: add config options for the output of ignored or unblamable lines blame: add the ability to ignore commits and their changes blame: use a helper function in blame_chunk() Move oidset_parse_file() to oidset.c fsck: rename and touch up init_skiplist()	2019-07-19 11:30:20 -07:00
Jeff King	07a54dc307	blame: drop some unused function parameters These unused parameters were introduced recently as part of the br/blame-ignore topic. I assume they are not indicative of bugs, but are just leftovers from the development process (they were introduced by the series but not used in any of its iterations). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-28 10:13:56 -07:00
Nguyễn Thái Ngọc Duy	50ddb089ff	tree-walk.c: remove the_repo from get_tree_entry() Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-27 12:45:17 -07:00
Barret Rhoden	a07a97760c	blame: use the fingerprint heuristic to match ignored lines This commit integrates the fuzzy fingerprint heuristic into guess_line_blames(). We actually make two passes. The first pass uses the fuzzy algorithm to find a match within the current diff chunk. If that fails, the second pass searches the entire parent file for the best match. For an example of scanning the entire parent for a match, consider: commit-a 30) #include <sys/header_a.h> commit-b 31) #include <header_b.h> commit-c 32) #include <header_c.h> Then commit X alphabetizes them: commit-X 30) #include <header_b.h> commit-X 31) #include <header_c.h> commit-X 32) #include <sys/header_a.h> If we just check the parent's chunk (i.e. the first pass), we'd get: commit-b 30) #include <header_b.h> commit-c 31) #include <header_c.h> commit-X 32) #include <sys/header_a.h> That's because commit X actually consists of two chunks: one chunk is removing sys/header_a.h, then some context, and the second chunk is adding sys/header_a.h. If we scan the entire parent file, we get: commit-b 30) #include <header_b.h> commit-c 31) #include <header_c.h> commit-a 32) #include <sys/header_a.h> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 13:38:09 -07:00
Michael Platings	1d028dc682	blame: add a fingerprint heuristic to match ignored lines This algorithm will replace the heuristic used to identify lines from ignored commits with one that finds likely candidate lines in the parent's version of the file. The actual replacement occurs in an upcoming commit. The old heuristic simply assigned lines in the target to the same line number (plus offset) in the parent. The new function uses a fingerprinting algorithm to detect similarity between lines. The new heuristic is designed to accurately match changes made mechanically by formatting tools such as clang-format and clang-tidy. These tools make changes such as breaking up lines to fit within a character limit or changing identifiers to fit with a naming convention. The heuristic is not intended to match more extensive refactoring changes and may give misleading results in such cases. In most cases formatting tools preserve line ordering, so the heuristic is optimised for such cases. (Some types of changes do reorder lines e.g. sorting keep the line content identical, the git blame -M option can already be used to address this). The reason that it is advantageous to rely on ordering is due to source code repeating the same character sequences often e.g. declaring an identifier on one line and using that identifier on several subsequent lines. This means that lines can look very similar to each other which presents a problem when doing fuzzy matching. Relying on ordering gives us extra clues to point towards the true match. The heuristic operates on a single diff chunk change at a time. It creates a “fingerprint” for each line on each side of the change. Fingerprints are described in detail in the comment for `struct fingerprint`, but essentially are a multiset of the character pairs in a line. The heuristic first identifies the line in the target entry whose fingerprint is most clearly matched to a line fingerprint in the parent entry. Where fingerprints match identically, the position of the lines is used as a tie-break. The heuristic locks in the best match, and subtracts the fingerprint of the line in the target entry from the fingerprint of the line in the parent entry to prevent other lines being matched on the same parts of that line. It then repeats the process recursively on the section of the chunk before the match, and then the section of the chunk after the match. Here's an example of the difference the fingerprinting makes. Consider a file with two commits: commit-a 1) void func_1(void x, void y); commit-b 2) void func_2(void x, void y); After a commit 'X', we have: commit-X 1) void func_1(void x, commit-X 2) void y); commit-X 3) void func_2(void x, commit-X 4) void y); When we blame-ignored with the old algorithm, we get: commit-a 1) void func_1(void x, commit-b 2) void y); commit-X 3) void func_2(void x, commit-X 4) void y); Where commit-b is blamed for 2 instead of 3. With the fingerprint algorithm, we get: commit-a 1) void func_1(void x, commit-a 2) void y); commit-b 3) void func_2(void x, commit-b 4) void y); Note line 2 could be matched with either commit-a or commit-b as it is equally similar to both lines, but is matched with commit-a because its position as a fraction of the new line range is more similar to commit-a as a fraction of the old line range. Line 4 is also equally similar to both lines, but as it appears after line 3 which will be matched first it cannot be matched with an earlier line. For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains example parent and target files and the line numbers in the parent that must be matched. Signed-off-by: Michael Platings <michael@platin.gs> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-06-20 13:38:08 -07:00
Barret Rhoden	1fc73384ba	blame: optionally track line fingerprints during fill_blame_origin() fill_blame_origin() is a convenient place to store data that we will use throughout the lifetime of a blame_origin. Some heuristics for ignoring commits during a blame session can make use of this storage. In particular, we will calculate a fingerprint for each line of a file for blame_origins involved in an ignored commit. In this commit, we only calculate the line_starts, reusing the existing code from the scoreboard's line_starts. In an upcoming commit, we will actually compute the fingerprints. This feature will be used when we attempt to pass blame entries to parents when we "ignore" a commit. Most uses of fill_blame_origin() will not require this feature, hence the flag parameter. Multiple calls to fill_blame_origin() are idempotent, and any of them can request the creation of the fingerprints structure. Suggested-by: Michael Platings <michael@platin.gs> Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-05-16 11:36:23 +09:00
Barret Rhoden	8934ac8c92	blame: add config options for the output of ignored or unblamable lines When ignoring commits, the commit that is blamed might not be responsible for the change, due to the inaccuracy of our heuristic. Users might want to know when a particular line has a potentially inaccurate blame. Furthermore, guess_line_blames() may fail to find any parent commit for a given line touched by an ignored commit. Those 'unblamable' lines remain blamed on an ignored commit. Users might want to know if a line is unblamable so that they do not spend time investigating a commit they know is uninteresting. This patch adds two config options to mark these two types of lines in the output of blame. The first option can identify ignored lines by specifying blame.markIgnoredLines. When this option is set, each blame line that was blamed on a commit other than the ignored commit is marked with a '?'. For example: 278b6158d6fdb (Barret Rhoden 2016-04-11 13:57:54 -0400 26) appears as: ?278b6158d6fd (Barret Rhoden 2016-04-11 13:57:54 -0400 26) where the '?' is placed before the commit, and the hash has one fewer characters. Sometimes we are unable to even guess at what ancestor commit touched a line. These lines are 'unblamable.' The second option, blame.markUnblamableLines, will mark the line with ''. For example, say we ignore e5e8d36d04cbe, yet we are unable to blame this line on another commit: e5e8d36d04cbe (Barret Rhoden 2016-04-11 13:57:54 -0400 26) appears as: e5e8d36d04cb (Barret Rhoden 2016-04-11 13:57:54 -0400 26) When these config options are used together, every line touched by an ignored commit will be marked with either a '?' or a '*'. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-05-16 11:36:23 +09:00
Barret Rhoden	ae3f36dea1	blame: add the ability to ignore commits and their changes Commits that make formatting changes or function renames are often not interesting when blaming a file. A user may deem such a commit as 'not interesting' and want to ignore and its changes it when assigning blame. For example, say a file has the following git history / rev-list: ---O---A---X---B---C---D---Y---E---F Commits X and Y both touch a particular line, and the other commits do not: X: "Take a third parameter" -MyFunc(1, 2); +MyFunc(1, 2, 3); Y: "Remove camelcase" -MyFunc(1, 2, 3); +my_func(1, 2, 3); git-blame will blame Y for the change. I'd like to be able to ignore Y: both the existence of the commit as well as any changes it made. This differs from -S rev-list, which specifies the list of commits to process for the blame. We would still process Y, but just don't let the blame 'stick.' This patch adds the ability for users to ignore a revision with --ignore-rev=rev, which may be repeated. They can specify a set of files of full object names of revs, e.g. SHA-1 hashes, one per line. A single file may be specified with the blame.ignoreRevFile config option or with --ignore-rev-file=file. Both the config option and the command line option may be repeated multiple times. An empty file name "" will clear the list of revs from previously processed files. Config options are processed before command line options. For a typical use case, projects will maintain the file containing revisions for commits that perform mass reformatting, and their users have the option to ignore all of the commits in that file. Additionally, a user can use the --ignore-rev option for one-off investigation. To go back to the example above, X was a substantive change to the function, but not the change the user is interested in. The user inspected X, but wanted to find the previous change to that line - perhaps a commit that introduced that function call. To make this work, we can't simply remove all ignored commits from the rev-list. We need to diff the changes introduced by Y so that we can ignore them. We let the blames get passed to Y, just like when processing normally. When Y is the target, we make sure that Y does not keep any blames. Any changes that Y is responsible for get passed to its parent. Note we make one pass through all of the scapegoats (parents) to attempt to pass blame normally; we don't know if we need to ignore the commit until we've checked all of the parents. The blame_entry will get passed up the tree until we find a commit that has a diff chunk that affects those lines. One issue is that the ignored commit did make some change, and there is no general solution to finding the line in the parent commit that corresponds to a given line in the ignored commit. That makes it hard to attribute a particular line within an ignored commit's diff correctly. For example, the parent of an ignored commit has this, say at line 11: commit-a 11) #include "a.h" commit-b 12) #include "b.h" Commit X, which we will ignore, swaps these lines: commit-X 11) #include "b.h" commit-X 12) #include "a.h" We can pass that blame entry to the parent, but line 11 will be attributed to commit A, even though "include b.h" came from commit B. The blame mechanism will be looking at the parent's view of the file at line number 11. ignore_blame_entry() is set up to allow alternative algorithms for guessing per-line blames. Any line that is not attributed to the parent will continue to be blamed on the ignored commit as if that commit was not ignored. Upcoming patches have the ability to detect these lines and mark them in the blame output. The existing algorithm is simple: blame each line on the corresponding line in the parent's diff chunk. Any lines beyond that stay with the target. For example, the parent of an ignored commit has this, say at line 11: commit-a 11) void new_func_1(void x, void y); commit-b 12) void new_func_2(void x, void y); commit-c 13) some_line_c commit-d 14) some_line_d After a commit 'X', we have: commit-X 11) void new_func_1(void x, commit-X 12) void y); commit-X 13) void new_func_2(void x, commit-X 14) void y); commit-c 15) some_line_c commit-d 16) some_line_d Commit X nets two additionally lines: 13 and 14. The current guess_line_blames() algorithm will not attribute these to the parent, whose diff chunk is only two lines - not four. When we ignore with the current algorithm, we get: commit-a 11) void new_func_1(void x, commit-b 12) void y); commit-X 13) void new_func_2(void x, commit-X 14) void y); commit-c 15) some_line_c commit-d 16) some_line_d Note that line 12 was blamed on B, though B was the commit for new_func_2(), not new_func_1(). Even when guess_line_blames() finds a line in the parent, it may still be incorrect. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-05-16 11:36:23 +09:00
Barret Rhoden	55f808fbc5	blame: use a helper function in blame_chunk() The same code for splitting a blame_entry at a particular line was used twice in blame_chunk(), and I'll use the helper again in an upcoming patch. Signed-off-by: Barret Rhoden <brho@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-05-16 11:36:23 +09:00
Junio C Hamano	96379f043f	Merge branch 'en/merge-directory-renames' "git merge-recursive" backend recently learned a new heuristics to infer file movement based on how other files in the same directory moved. As this is inherently less robust heuristics than the one based on the content similarity of the file itself (rather than based on what its neighbours are doing), it sometimes gives an outcome unexpected by the end users. This has been toned down to leave the renamed paths in higher/conflicted stages in the index so that the user can examine and confirm the result. * en/merge-directory-renames: merge-recursive: switch directory rename detection default merge-recursive: give callers of handle_content_merge() access to contents merge-recursive: track information associated with directory renames t6043: fix copied test description to match its purpose merge-recursive: switch from (oid,mode) pairs to a diff_filespec merge-recursive: cleanup handle_rename_* function signatures merge-recursive: track branch where rename occurred in rename struct merge-recursive: remove ren[12]_other fields from rename_conflict_info merge-recursive: shrink rename_conflict_info merge-recursive: move some struct declarations together merge-recursive: use 'ci' for rename_conflict_info variable name merge-recursive: rename locals 'o' and 'a' to 'obuf' and 'abuf' merge-recursive: rename diff_filespec 'one' to 'o' merge-recursive: rename merge_options argument from 'o' to 'opt' Use 'unsigned short' for mode, like diff_filespec does	2019-05-09 00:37:22 +09:00
Junio C Hamano	4d8c4da950	Merge branch 'dk/blame-keep-origin-blob' Performance fix around "git blame", especially in a linear history (which is the norm we should optimize for). * dk/blame-keep-origin-blob: blame.c: don't drop origin blobs as eagerly	2019-04-25 16:41:17 +09:00
Elijah Newren	5ec1e72823	Use 'unsigned short' for mode, like diff_filespec does struct diff_filespec defines mode to be an 'unsigned short'. Several other places in the API which we'd like to interact with using a diff_filespec used a plain unsigned (or unsigned int). This caused problems when taking addresses, so switch to unsigned short. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-04-08 16:02:07 +09:00
David Kastrup	f892014943	blame.c: don't drop origin blobs as eagerly When a parent blob already has chunks queued up for blaming, dropping the blob at the end of one blame step will cause it to get reloaded right away, doubling the amount of I/O and unpacking when processing a linear history. Keeping such parent blobs in memory seems like a reasonable optimization that should incur additional memory pressure mostly when processing the merges from old branches. Signed-off-by: David Kastrup <dak@gnu.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-04-03 16:45:26 +09:00
Junio C Hamano	4e021dc28e	Merge branch 'wh/author-committer-ident-config' Four new configuration variables {author,committer}.{name,email} have been introduced to override user.{name,email} in more specific cases. * wh/author-committer-ident-config: config: allow giving separate author and committer idents	2019-03-07 09:59:53 +09:00
William Hubbs	39ab4d0951	config: allow giving separate author and committer idents The author.email, author.name, committer.email and committer.name settings are analogous to the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables, but for the git config system. This allows them to be set separately for each repository. Git supports setting different authorship and committer information with environment variables. However, environment variables are set in the shell, so if different authorship and committer information is needed for different repositories an external tool is required. This adds support to git config for author.email, author.name, committer.email and committer.name settings so this information can be set per repository. Also, it generalizes the fmt_ident function so it can handle author vs committer identification. Signed-off-by: William Hubbs <williamh@gentoo.org> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-02-04 12:18:13 -08:00
Nguyễn Thái Ngọc Duy	e1ff0a32e4	read-cache.c: kill read_index() read_index() shares the same problem as hold_locked_index(): it assumes $GIT_DIR/index. Move all call sites to repo_read_index() instead. read_index_preload() and read_index_unmerged() are also killed as a consequence. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2019-01-14 12:13:04 -08:00
Nguyễn Thái Ngọc Duy	fb998eae6c	blame.c: remove implicit dependency the_repository Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-11-12 14:50:05 +09:00
Junio C Hamano	11877b9ebe	Merge branch 'nd/the-index' Various codepaths in the core-ish part learn to work on an arbitrary in-core index structure, not necessarily the default instance "the_index". * nd/the-index: (23 commits) revision.c: reduce implicit dependency the_repository revision.c: remove implicit dependency on the_index ws.c: remove implicit dependency on the_index tree-diff.c: remove implicit dependency on the_index submodule.c: remove implicit dependency on the_index line-range.c: remove implicit dependency on the_index userdiff.c: remove implicit dependency on the_index rerere.c: remove implicit dependency on the_index sha1-file.c: remove implicit dependency on the_index patch-ids.c: remove implicit dependency on the_index merge.c: remove implicit dependency on the_index merge-blobs.c: remove implicit dependency on the_index ll-merge.c: remove implicit dependency on the_index diff-lib.c: remove implicit dependency on the_index read-cache.c: remove implicit dependency on the_index diff.c: remove implicit dependency on the_index grep.c: remove implicit dependency on the_index diff.c: remove the_index dependency in textconv() functions blame.c: rename "repo" argument to "r" combine-diff.c: remove implicit dependency on the_index ...	2018-10-19 13:34:02 +09:00
Nguyễn Thái Ngọc Duy	e675765235	diff.c: remove implicit dependency on the_index A new variant repo_diff_setup() is added that takes 'struct repository *' and diff_setup() becomes a thin macro around it that is protected by NO_THE_REPOSITORY_COMPATIBILITY_MACROS, similar to NO_THE_INDEX_.... The plan is these macros will always be defined for all library files and the macros are only accessible in builtin/ Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-09-21 09:48:10 -07:00
Nguyễn Thái Ngọc Duy	6afaf80785	diff.c: remove the_index dependency in textconv() functions Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-09-21 09:48:10 -07:00
Nguyễn Thái Ngọc Duy	a470beea39	blame.c: rename "repo" argument to "r" The current naming convention for 'struct repository *' is 'r' for temporary variables or arguments. I did not notice this. Since we're updating blame.c again in the next patch, let's fix this. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2018-09-21 09:48:10 -07:00

1 2 3

148 commits