git/t/t8014-blame-ignore-fuzzy.sh
Michael Platings 1d028dc682 blame: add a fingerprint heuristic to match ignored lines
This algorithm will replace the heuristic used to identify lines from
ignored commits with one that finds likely candidate lines in the
parent's version of the file.  The actual replacement occurs in an
upcoming commit.

The old heuristic simply assigned lines in the target to the same line
number (plus offset) in the parent. The new function uses a
fingerprinting algorithm to detect similarity between lines.

The new heuristic is designed to accurately match changes made
mechanically by formatting tools such as clang-format and clang-tidy.
These tools make changes such as breaking up lines to fit within a
character limit or changing identifiers to fit with a naming convention.
The heuristic is not intended to match more extensive refactoring
changes and may give misleading results in such cases.

In most cases formatting tools preserve line ordering, so the heuristic
is optimised for such cases. (Some types of changes do reorder lines,
e.g. sorting, but these keep the line content identical, so the git
blame -M option can already be used to address them.) Relying on
ordering is advantageous because source code often repeats the same
character sequences, e.g. declaring an identifier on one line and using
that identifier on several subsequent lines.  This means that lines can
look very similar to each other, which presents a problem when doing
fuzzy matching. Relying on ordering gives us extra clues to point
towards the true match.

The heuristic operates on a single diff chunk change at a time. It
creates a “fingerprint” for each line on each side of the change.
Fingerprints are described in detail in the comment for `struct
fingerprint`, but essentially are a multiset of the character pairs in a
line. The heuristic first identifies the line in the target entry whose
fingerprint is most clearly matched to a line fingerprint in the parent
entry. Where fingerprints match identically, the position of the lines
is used as a tie-break. The heuristic locks in the best match, and
subtracts the fingerprint of the line in the target entry from the
fingerprint of the line in the parent entry to prevent other lines being
matched on the same parts of that line. It then repeats the process
recursively on the section of the chunk before the match, and then the
section of the chunk after the match.
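
As a rough illustration of the "multiset of character pairs" idea, the
sketch below compares two lines by counting the pairs they share. It is
not the code in blame.c: the helper names `pairs` and `similarity` are
hypothetical, no normalisation of the input is attempted, and the
recursion and subtraction steps described above are left out.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of git.

# Print each adjacent character pair of $1 on its own line, sorted so
# that comm(1) can treat two such lists as multisets.
pairs () {
	printf '%s\n' "$1" |
	awk '{ for (i = 1; i < length($0); i++) print substr($0, i, 2) }' |
	LC_ALL=C sort
}

# Print the number of character pairs shared by $1 and $2, i.e. the
# size of the multiset intersection. A higher number means the two
# lines look more alike.
similarity () {
	a=$(mktemp) && b=$(mktemp) || return 1
	pairs "$1" >"$a"
	pairs "$2" >"$b"
	LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' '
	rm -f "$a" "$b"
}
```

With these helpers, `similarity 'void *y);' 'void func_1(void *x, void *y);'`
and the same call against the `func_2` line both print 8, which shows
why similarity alone cannot separate such candidates and ordering is
needed as an extra clue.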

Here's an example of the difference the fingerprinting makes. Consider
a file with two commits:

        commit-a 1) void func_1(void *x, void *y);
        commit-b 2) void func_2(void *x, void *y);

After a commit 'X', we have:

        commit-X 1) void func_1(void *x,
        commit-X 2)             void *y);
        commit-X 3) void func_2(void *x,
        commit-X 4)             void *y);

When we blame with commit X ignored using the old algorithm, we get:

        commit-a 1) void func_1(void *x,
        commit-b 2)             void *y);
        commit-X 3) void func_2(void *x,
        commit-X 4)             void *y);

Here commit-b is blamed for line 2 instead of line 3.  With the
fingerprint algorithm, we get:

        commit-a 1) void func_1(void *x,
        commit-a 2)             void *y);
        commit-b 3) void func_2(void *x,
        commit-b 4)             void *y);

Note line 2 could be matched with either commit-a or commit-b as it is
equally similar to both lines, but is matched with commit-a because its
position as a fraction of the new line range is closer to the position
of commit-a's line as a fraction of the old line range. Line 4 is also
equally similar to both lines, but as it appears after line 3, which
will be matched first, it cannot be matched with an earlier line.
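
The position tie-break can be worked through with a little arithmetic.
The sketch below is only one way to compare fractional positions: it
assumes each line is represented by its midpoint within its range, which
is a simplification chosen for this example and not necessarily the
formula the implementation uses. The function name `closest_parent_line`
is hypothetical, and the ordering constraint that applies to line 4 is
not modelled.

```shell
#!/bin/sh
# closest_parent_line LINE NEW_COUNT OLD_COUNT
# Print the 1-based parent line whose fractional position is nearest to
# that of target line LINE. Assumption: a line's position is taken as
# its midpoint, (line - 0.5) / count.
closest_parent_line () {
	awk -v i="$1" -v n="$2" -v m="$3" 'BEGIN {
		target = (i - 0.5) / n
		best = 1
		best_d = 2	# larger than any possible distance
		for (j = 1; j <= m; j++) {
			d = (j - 0.5) / m - target
			if (d < 0)
				d = -d
			if (d < best_d) {
				best_d = d
				best = j
			}
		}
		print best
	}'
}
```

For the example above, `closest_parent_line 2 4 2` prints 1: target
line 2 sits at 0.375 of the new range, which is nearer to parent line 1
(0.25) than to parent line 2 (0.75), so commit-a is chosen.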

For many more examples, see t/t8014-blame-ignore-fuzzy.sh which contains
example parent and target files and the line numbers in the parent that
must be matched.
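
The tests compare line numbers by giving each parent line its own commit
whose author name is that line number, then extracting the author column
from the blame output with a sed expression. The snippet below shows
that extraction on two made-up blame lines; the hashes and timestamps
are hypothetical and only the `pick_author` expression is taken from the
test script.

```shell
#!/bin/sh
# The sed expression used by the test script to keep only the author.
pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'

# Hypothetical blame output: one line attributed to the commit authored
# as "1", one attributed to the ignored commit authored as "Final".
printf '%s\n' \
	'^1a2b3c4 (1     2005-04-07 15:13:13 -0700 1) abcd' \
	'5d6e7f80 (Final 2005-04-07 15:14:13 -0700 2) abcd' |
sed -e "$pick_author"
```

This prints `1` and `Final` on separate lines, which is the shape of the
`expectedN` files.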

Signed-off-by: Michael Platings <michael@platin.gs>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-20 13:38:08 -07:00

#!/bin/sh
test_description='git blame ignore fuzzy heuristic'
. ./test-lib.sh
# short circuit until blame has the fuzzy capabilities
test_done
pick_author='s/^[0-9a-f^]* *(\([^ ]*\) .*/\1/'
# Each test is composed of 4 variables:
# titleN - the test name
# aN - the initial content
# bN - the final content
# expectedN - the line numbers from aN that we expect git blame
# on bN to identify, or "Final" if bN itself should
# be identified as the origin of that line.
# We start at test 2 because setup will show as test 1
title2="Regression test for partially overlapping search ranges"
cat <<EOF >a2
1
2
3
abcdef
5
6
7
ijkl
9
10
11
pqrs
13
14
15
wxyz
17
18
19
EOF
cat <<EOF >b2
abcde
ijk
pqr
wxy
EOF
cat <<EOF >expected2
4
8
12
16
EOF
title3="Combine 3 lines into 2"
cat <<EOF >a3
if ((maxgrow==0) ||
( single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow)))
EOF
cat <<EOF >b3
if ((maxgrow == 0) || (single_line_field && (field->dcols < maxgrow)) ||
(!single_line_field && (field->drows < maxgrow))) {
EOF
cat <<EOF >expected3
2
3
EOF
title4="Add curly brackets"
cat <<EOF >a4
if (rows) *rows = field->rows;
if (cols) *cols = field->cols;
if (frow) *frow = field->frow;
if (fcol) *fcol = field->fcol;
EOF
cat <<EOF >b4
if (rows) {
*rows = field->rows;
}
if (cols) {
*cols = field->cols;
}
if (frow) {
*frow = field->frow;
}
if (fcol) {
*fcol = field->fcol;
}
EOF
cat <<EOF >expected4
1
1
Final
2
2
Final
3
3
Final
4
4
Final
EOF
title5="Combine many lines and change case"
cat <<EOF >a5
for(row=0,pBuffer=field->buf;
row<height;
row++,pBuffer+=width )
{
if ((len = (int)( After_End_Of_Data( pBuffer, width ) - pBuffer )) > 0)
{
wmove( win, row, 0 );
waddnstr( win, pBuffer, len );
EOF
cat <<EOF >b5
for (Row = 0, PBuffer = field->buf; Row < Height; Row++, PBuffer += Width) {
if ((Len = (int)(afterEndOfData(PBuffer, Width) - PBuffer)) > 0) {
wmove(win, Row, 0);
waddnstr(win, PBuffer, Len);
EOF
cat <<EOF >expected5
1
5
7
8
EOF
title6="Rename and combine lines"
cat <<EOF >a6
bool need_visual_update = ((form != (FORM *)0) &&
(form->status & _POSTED) &&
(form->current==field));
if (need_visual_update)
Synchronize_Buffer(form);
if (single_line_field)
{
growth = field->cols * amount;
if (field->maxgrow)
growth = Minimum(field->maxgrow - field->dcols,growth);
field->dcols += growth;
if (field->dcols == field->maxgrow)
EOF
cat <<EOF >b6
bool NeedVisualUpdate = ((Form != (FORM *)0) && (Form->status & _POSTED) &&
(Form->current == field));
if (NeedVisualUpdate) {
synchronizeBuffer(Form);
}
if (SingleLineField) {
Growth = field->cols * amount;
if (field->maxgrow) {
Growth = Minimum(field->maxgrow - field->dcols, Growth);
}
field->dcols += Growth;
if (field->dcols == field->maxgrow) {
EOF
cat <<EOF >expected6
1
3
4
5
6
Final
7
8
10
11
12
Final
13
14
EOF
# Both lines match identically so position must be used to tie-break.
title7="Same line twice"
cat <<EOF >a7
abc
abc
EOF
cat <<EOF >b7
abcd
abcd
EOF
cat <<EOF >expected7
1
2
EOF
title8="Enforce line order"
cat <<EOF >a8
abcdef
ghijkl
ab
EOF
cat <<EOF >b8
ghijk
abcd
EOF
cat <<EOF >expected8
2
3
EOF
title9="Expand lines and rename variables"
cat <<EOF >a9
int myFunction(int ArgumentOne, Thing *ArgTwo, Blah XuglyBug) {
Squiggle FabulousResult = squargle(ArgumentOne, *ArgTwo,
XuglyBug) + EwwwGlobalWithAReallyLongNameYepTooLong;
return FabulousResult * 42;
}
EOF
cat <<EOF >b9
int myFunction(int argument_one, Thing *arg_asdfgh,
Blah xugly_bug) {
Squiggle fabulous_result = squargle(argument_one,
*arg_asdfgh, xugly_bug)
+ g_ewww_global_with_a_really_long_name_yep_too_long;
return fabulous_result * 42;
}
EOF
cat <<EOF >expected9
1
1
2
3
3
4
5
EOF
title10="Two close matches versus one less close match"
cat <<EOF >a10
abcdef
abcdef
ghijkl
EOF
cat <<EOF >b10
gh
abcdefx
EOF
cat <<EOF >expected10
Final
2
EOF
# The first line of b matches best with the last line of a, but the overall
# match is better if we match it with the first line of a.
title11="Piggy in the middle"
cat <<EOF >a11
abcdefg
ijklmn
abcdefgh
EOF
cat <<EOF >b11
abcdefghx
ijklm
EOF
cat <<EOF >expected11
1
2
EOF
title12="No trailing newline"
printf "abc\ndef" >a12
printf "abx\nstu" >b12
cat <<EOF >expected12
1
Final
EOF
title13="Reorder includes"
cat <<EOF >a13
#include "c.h"
#include "b.h"
#include "a.h"
#include "e.h"
#include "d.h"
EOF
cat <<EOF >b13
#include "a.h"
#include "b.h"
#include "c.h"
#include "d.h"
#include "e.h"
EOF
cat <<EOF >expected13
3
2
1
5
4
EOF
last_test=13
test_expect_success setup '
	{ for i in $(test_seq 2 $last_test)
	do
		# Append each line in a separate commit to make it easy to
		# check which original line the blame output relates to.
		line_count=0 &&
		{ while IFS= read line
		do
			line_count=$((line_count+1)) &&
			echo "$line" >>"$i" &&
			git add "$i" &&
			test_tick &&
			GIT_AUTHOR_NAME="$line_count" git commit -m "$line_count"
		done } <"a$i"
	done } &&
	{ for i in $(test_seq 2 $last_test)
	do
		# Overwrite the files with the final content.
		cp b$i $i &&
		git add $i
	done } &&
	test_tick &&
	# Commit the final content all at once so it can all be
	# referred to with the same commit ID.
	GIT_AUTHOR_NAME=Final git commit -m Final &&
	IGNOREME=$(git rev-parse HEAD)
'
for i in $(test_seq 2 $last_test); do
	eval title="\$title$i"
	test_expect_success "$title" \
	"git blame -M9 --ignore-rev $IGNOREME $i >output &&
	sed -e \"$pick_author\" output >actual &&
	test_cmp expected$i actual"
done
# This invoked a null pointer dereference when the chunk callback was called
# with a zero length parent chunk and there were no more suspects.
test_expect_success 'Diff chunks with no suspects' '
	test_write_lines xy1 A B C xy1 >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines xy2 A B xy2 C xy2 >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&
	REV_2=$(git rev-parse HEAD) &&

	test_write_lines xy3 A >file &&
	git add file &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&
	REV_3=$(git rev-parse HEAD) &&

	test_write_lines 1 1 >expected &&
	git blame --ignore-rev $REV_2 --ignore-rev $REV_3 file >output &&
	sed -e "$pick_author" output >actual &&
	test_cmp expected actual
'
test_expect_success 'position matching' '
	test_write_lines abc def >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines abc def abc def >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&

	test_write_lines abcx defx abcx defx >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&
	REV_3=$(git rev-parse HEAD) &&

	test_write_lines abcy defy abcx defx >file2 &&
	git add file2 &&
	test_tick &&
	GIT_AUTHOR_NAME=4 git commit -m 4 &&
	REV_4=$(git rev-parse HEAD) &&

	test_write_lines 1 1 2 2 >expected &&
	git blame --ignore-rev $REV_3 --ignore-rev $REV_4 file2 >output &&
	sed -e "$pick_author" output >actual &&
	test_cmp expected actual
'
# This fails if each blame entry is processed independently instead of
# processing each diff change in full.
test_expect_success 'preserve order' '
	test_write_lines bcde >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=1 git commit -m 1 &&

	test_write_lines bcde fghij >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=2 git commit -m 2 &&

	test_write_lines bcde fghij abcd >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=3 git commit -m 3 &&

	test_write_lines abcdx fghijx bcdex >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=4 git commit -m 4 &&
	REV_4=$(git rev-parse HEAD) &&

	test_write_lines abcdx fghijy bcdex >file3 &&
	git add file3 &&
	test_tick &&
	GIT_AUTHOR_NAME=5 git commit -m 5 &&
	REV_5=$(git rev-parse HEAD) &&

	test_write_lines 1 2 3 >expected &&
	git blame --ignore-rev $REV_4 --ignore-rev $REV_5 file3 >output &&
	sed -e "$pick_author" output >actual &&
	test_cmp expected actual
'
test_done