Commit graph

26 commits

Author SHA1 Message Date
René Scharfe 8072766cc6 xdiff: import new 32-bit version of count_masked_bytes()
Import the latest 32-bit implementation of count_masked_bytes() from
Linux (arch/x86/include/asm/word-at-a-time.h).  It's shorter and avoids
overflows and negative numbers.

This fixes test failures on 32-bit, where negative partial results had
been shifted right using the "wrong" method (logical shift right instead
of arithmetic short right).  The compiler is free to chose the method,
so it was only wrong in the sense that it didn't work as intended by us.

Reported-by: Øyvind A. Holm <sunny@sunbase.org>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-23 09:10:17 -07:00
René Scharfe 7e356a9794 xdiff: avoid more compiler warnings with XDL_FAST_HASH on 32-bit machines
Hide literals that can cause compiler warnings for 32-bit architectures in
expressions that evaluate to small numbers there.  Some compilers warn that
0x0001020304050608 won't fit into a 32-bit long, others that shifting right
by 56 bits clears a 32-bit value completely.

The correct values are calculated in the 64-bit case, which is all that matters
in this if-branch.

Reported-by: Øyvind A. Holm <sunny@sunbase.org>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Acked-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-23 09:10:03 -07:00
René Scharfe 9322ce21ee xdiff: avoid compiler warnings with XDL_FAST_HASH on 32-bit machines
Import macro REPEAT_BYTE from Linux (arch/x86/include/asm/word-at-a-time.h)
to avoid 64-bit integer literals, which cause some 32-bit compilers to
print warnings.

Reported-by: Øyvind A. Holm <sunny@sunbase.org>
Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-22 14:39:49 -07:00
Thomas Rast 6f1af028ce xdiff: choose XDL_FAST_HASH code on sizeof(long) instead of __WORDSIZE
Darwin does not define __WORDSIZE, and compiles the 32-bit code path
on 64-bit systems, resulting in a totally broken git.

I could not find an alternative -- other than the platform symbols
(__x86_64__ etc.) -- that does the test in the preprocessor.  However,
we can also just test for the size of a 'long', which is what really
matters here.  Any compiler worth its salt will leave only the branch
relevant for its platform, and indeed on Linux/GCC the numbers don't
change:

 Test                                  tr/darwin-xdl-fast-hash   origin/next              origin/master
 ------------------------------------------------------------------------------------------------------------------
 4000.1: log -3000 (baseline)          0.09(0.07+0.01)           0.09(0.07+0.01) -5.5%*   0.09(0.07+0.01) -4.1%
 4000.2: log --raw -3000 (tree-only)   0.47(0.41+0.05)           0.47(0.40+0.05) -0.5%    0.45(0.38+0.06) -3.5%.
 4000.3: log -p -3000 (Myers)          1.81(1.67+0.12)           1.81(1.67+0.13) +0.3%    1.99(1.84+0.12) +10.2%***
 4000.4: log -p -3000 --histogram      1.79(1.66+0.11)           1.80(1.67+0.11) +0.4%    1.96(1.82+0.10) +9.2%***
 4000.5: log -p -3000 --patience       2.17(2.02+0.13)           2.20(2.04+0.13) +1.3%.   2.33(2.18+0.13) +7.4%***
 ------------------------------------------------------------------------------------------------------------------
 Significance hints:  '.' 0.1  '*' 0.05  '**' 0.01  '***' 0.001

Noticed-by: Brian Gernhardt <brian@gernhardtsoftware.com>
Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-05-01 12:19:06 -07:00
Thomas Rast 6942efcfa9 xdiff: load full words in the inner loop of xdl_hash_record
Redo the hashing loop in xdl_hash_record in a way that loads an entire
'long' at a time, using masking tricks to see when and where we found
the terminating '\n'.

I stole inspiration and code from the posts by Linus Torvalds around

  https://lkml.org/lkml/2012/3/2/452
  https://lkml.org/lkml/2012/3/5/6

His method reads the buffers in sizeof(long) increments, and may thus
overrun it by at most sizeof(long)-1 bytes before it sees the final
newline (or hits the buffer length check).  I considered padding out
all buffers by a suitable amount to "catch" the overrun, but

* this does not work for mmap()'d buffers: if you map 4096+8 bytes
  from a 4096 byte file, accessing the last 8 bytes results in a
  SIGBUS on my machine; and

* it would also be extremely ugly because it intrudes deep into the
  unpacking machinery.

So I adapted it to not read beyond the buffer at all.  Instead, it
reads the final partial word byte-by-byte and strings it together.
Then it can use the same logic as before to finish the hashing.

So far we enable this only on x86_64, where it provides nice speedup
for diff-related work:

  Test                                  origin/next      tr/xdiff-fast-hash
  -----------------------------------------------------------------------------
  4000.1: log -3000 (baseline)          0.07(0.05+0.02)  0.08(0.06+0.02) +14.3%
  4000.2: log --raw -3000 (tree-only)   0.37(0.33+0.04)  0.37(0.32+0.04) +0.0%
  4000.3: log -p -3000 (Myers)          1.75(1.65+0.09)  1.60(1.49+0.10) -8.6%
  4000.4: log -p -3000 --histogram      1.73(1.62+0.09)  1.58(1.49+0.08) -8.7%
  4000.5: log -p -3000 --patience       2.11(2.00+0.10)  1.94(1.80+0.11) -8.1%

Perhaps other platforms could also benefit.  However it does NOT work
on big-endian systems!

[jc: minimum style and compilation fixes]

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-04-09 17:03:25 -07:00
Tay Ray Chuan 739864b1ff xdiff: do away with xdl_mmfile_next()
Given our simple mmfile structure, xdl_mmfile_next() calls are
redundant. Do away with calls to them.

Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-08-03 10:15:16 -07:00
Tay Ray Chuan 86abba8015 xdiff/xprepare: use a smaller sample size for histogram diff
For histogram diff, we can afford a smaller sample size and thus a
poorer estimate of the number of lines, as the hash table (rhash) won't
be filled up/grown. This is safe as the final count of lines (xdf.nrecs)
will be updated correctly anyway by xdl_prepare_ctx().

This gives us a small boost in performance.

Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-12 09:30:00 -07:00
Tay Ray Chuan 1d26b252f1 xdiff/xpatience: factor out fall-back-diff function
This is in preparation for the histogram diff algorithm, which will also
re-use much of the code to call the default Meyers diff algorithm.

Signed-off-by: Tay Ray Chuan <rctay89@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-07-07 09:41:24 -07:00
Jonathan Nieder 349362cc20 xdiff: cast arguments for ctype functions to unsigned char
The ctype functions isspace(), isalnum(), et al take an integer
argument representing an unsigned character, or -1 for EOF.  On
platforms with a signed char, it is unsafe to pass a char to them
without casting it to unsigned char first.

Most of git is already shielded against this by the ctype
implementation in git-compat-util.h, but xdiff, which uses libc
ctype.h, ought to be fixed.

Noticed-by: der Mouse <mouse@Rodents-Montreal.ORG>
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-10-06 10:46:45 -07:00
Dylan Reid b4cf0f1784 xdiff: optimise for no whitespace difference when ignoring whitespace.
In xdl_recmatch, do the memcmp to check if the two lines are equal before
checking if whitespace flags are set.  If the lines are identical, then
there is no need to check if they differ only in whitespace.
This makes the common case (there is no whitespace difference) faster.
It costs the case where lines are the same length and contain
whitespace differences, but the common case is more than 20% faster.

Signed-off-by: Dylan Reid <dgreid@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-07-05 23:27:41 -07:00
Junio C Hamano 3b5ef0e216 xutils: Fix xdl_recmatch() on incomplete lines
Thell Fowler noticed that various "ignore whitespace" options to git diff
do not work well on an incomplete line.

The loop control of the function responsible for these bugs was extremely
difficult to follow.  This patch restructures the loops for three variants
of "ignore whitespace" logic.

The basic idea of the re-written logic is:

 - A loop runs while the characters from both strings we are looking at
   match.  We declare unmatch immediately when we find something that does
   not match and return false from the function.  We break out of the loop
   if we ran out of either side of the string.

   The way we skip spaces inside this loop varies depending on the style
   of ignoring whitespaces.

 - After the above loop breaks, we know that the parts of the strings we
   inspected so far match, ignoring the whitespaces.  The lines can match
   only if the remainder consists of nothing but whitespaces.  This part
   of the logic is shared across all three styles.

The new code is more obvious and should be much easier to follow.

Tested-by: Thell Fowler <git@tbfowler.name>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-23 14:38:43 -07:00
Junio C Hamano 78ed710fcf xutils: Fix hashing an incomplete line with whitespaces at the end
Upon seeing a whitespace, xdl_hash_record_with_whitespace() first skipped
the run of whitespaces (excluding LF) that begins there, ensuring that the
pointer points at the last whitespace character in the run, and assumed
that the next character must be LF at the end of the line.  This does not
work when hashing an incomplete line, which lacks the LF at the end.

Introduce "at_eol" variable that is true when either we are at the end of
line (looking at LF) or at the end of an incomplete line, and use that
instead throughout the code.

Noticed by Thell Fowler.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-08-23 13:17:59 -07:00
Keith Cascio 6d12acefd5 Fix combined use of whitespace ignore options to diff
The code used to misbehave when options to ignore certain whitespaces
(-w -b and --ignore-at-eol) were combined.

Signed-off-by: Keith Cascio <keith@cs.ucla.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-01-19 21:21:58 -08:00
Guido Ostkamp a777e9ca54 Remove unreachable statements
Solaris Workshop Compiler found a few unreachable statements.

Signed-off-by: Guido Ostkamp <git@ostkamp.fastmail.fm>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-15 21:23:47 -08:00
Junio C Hamano a6080a0a44 War on whitespace
This uses "git-apply --whitespace=strip" to fix whitespace errors that have
crept in to our source files over time.  There are a few files that need
to have trailing whitespaces (most notably, test vectors).  The results
still passes the test, and build result in Documentation/ area is unchanged.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-06-07 00:04:01 -07:00
Johannes Schindelin 824f782c3f xdiff/xutils.c(xdl_hash_record): factor out whitespace handling
Since in at least one use case, xdl_hash_record() takes over 15% of the
CPU time, it makes sense to even micro-optimize it. For many cases, no
whitespace special handling is needed, and in these cases we should not
even bother to check for whitespace in _every_ iteration of the loop.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-19 22:17:25 -07:00
Johannes Schindelin 859f9c4581 teach diff machinery about --ignore-space-at-eol
`git diff --ignore-space-at-eol` will ignore whitespace at the
line ends.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-13 21:40:42 -08:00
Johannes Schindelin c7c24889bb diff -b: ignore whitespace at end of line
This is _not_ the same as "treat eol as whitespace", since that would mean
that multiple empty lines would be treated as equal to e.g. a space.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-04 16:30:54 -08:00
Johannes Schindelin 2344d47fba diff: fix 2 whitespace issues
When whitespace or whitespace change was ignored, the function
xdl_recmatch() returned memcmp() style differences, which is wrong,
since it should return 0 on non-match.

Also, there were three horrible off-by-one bugs, even leading to wrong
hashes in the whitespace special handling.

The issue was noticed by Ray Lehtiniemi.

For good measure, this commit adds a test.

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-10-12 09:30:14 -07:00
Johannes Schindelin 0d21efa51c Teach diff about -b and -w flags
This adds -b (--ignore-space-change) and -w (--ignore-all-space) flags to
diff. The main part of the patch is teaching libxdiff about it.

[jc: renamed xdl_line_match() to xdl_recmatch() since the former is used
 for different purposes in xpatchi.c which is in the parts of the upstream
 source we do not use.]

Signed-off-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-23 17:35:27 -07:00
Junio C Hamano d281786fcd xdiff: minor changes to match libxdiff-0.21
This reformats the change 621c53cc08
introduced to match what upstream author implemented in libxdiff-0.21
without changing any logic (hopefully ;-).  This is to help keep
us in sync with the upstream.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-06-19 18:43:49 -07:00
Davide Libenzi ca557afff9 Clean-up trivially redundant diff.
Also corrects the line numbers in unified output when using
zero lines context.
2006-04-04 00:11:09 -07:00
Mark Wooding acb7257729 xdiff: Show function names in hunk headers.
The speed of the built-in diff generator is nice; but the function names
shown by `diff -p' are /really/ nice.  And I hate having to choose.  So,
we hack xdiff to find the function names and print them.

xdiff has grown a flag to say whether to dig up the function names.  The
builtin_diff function passes this flag unconditionally.  I suppose it
could parse GIT_DIFF_OPTS, but it doesn't at the moment.  I've also
reintroduced the `function name' into the test suite, from which it was
removed in commit 3ce8f089.

The function names are parsed by a particularly stupid algorithm at the
moment: it just tries to find a line in the `old' file, from before the
start of the hunk, whose first character looks plausible.  Still, it's
most definitely a start.

Signed-off-by: Mark Wooding <mdw@distorted.org.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-27 18:43:51 -08:00
Junio C Hamano 3ce8f08944 built-in diff: minimum tweaks
This fixes up a couple of minor issues with the real built-in
diff to be more usable:

 - Omit ---/+++ header unless we emit diff output;

 - Detect and punt binary diff like GNU does;

 - Honor GIT_DIFF_OPTS minimally (only -u<number> and
   --unified=<number> are currently supported);

 - Omit line count of 1 from "@@ -l,k +m,n @@" hunk header
   (i.e. when k == 1 or n == 1)

 - Adjust testsuite for the lack of -p support.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 16:50:00 -08:00
Linus Torvalds 621c53cc08 builtin-diff: \No newline at end of file.
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 16:49:59 -08:00
Linus Torvalds 3443546f6e Use a *real* built-in diff generator
This uses a simplified libxdiff setup to generate unified diffs _without_
doing  fork/execve of GNU "diff".

This has several huge advantages, for example:

Before:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null

	real    0m24.818s
	user    0m13.332s
	sys     0m8.664s

After:

	[torvalds@g5 linux]$ time git diff v2.6.16.. > /dev/null

	real    0m4.563s
	user    0m2.944s
	sys     0m1.580s

and the fact that this should be a lot more portable (ie we can ignore all
the issues with doing fork/execve under Windows).

Perhaps even more importantly, this allows us to do diffs without actually
ever writing out the git file contents to a temporary file (and without
any of the shell quoting issues on filenames etc etc).

NOTE! THIS PATCH DOES NOT DO THAT OPTIMIZATION YET! I was lazy, and the
current "diff-core" code actually will always write the temp-files,
because it used to be something that you simply had to do. So this current
one actually writes a temp-file like before, and then reads it into memory
again just to do the diff. Stupid.

But if this basic infrastructure is accepted, we can start switching over
diff-core to not write temp-files, which should speed things up even
further, especially when doing big tree-to-tree diffs.

Now, in the interest of full disclosure, I should also point out a few
downsides:

 - the libxdiff algorithm is different, and I bet GNU diff has gotten a
   lot more testing. And the thing is, generating a diff is not an exact
   science - you can get two different diffs (and you will), and they can
   both be perfectly valid. So it's not possible to "validate" the
   libxdiff output by just comparing it against GNU diff.

 - GNU diff does some nice eye-candy, like trying to figure out what the
   last function was, and adding that information to the "@@ .." line.
   libxdiff doesn't do that.

 - The libxdiff thing has some known deficiencies. In particular, it gets
   the "\No newline at end of file" case wrong. So this is currently for
   the experimental branch only. I hope Davide will help fix it.

That said, I think the huge performance advantage, and the fact that it
integrates better is definitely worth it. But it should go into a
development branch at least due to the missing newline issue.

Technical note: this is based on libxdiff-0.17, but I did some surgery to
get rid of the extraneous fat - stuff that git doesn't need, and seriously
cutting down on mmfile_t, which had much more capabilities than the diff
algorithm either needed or used. In this version, "mmfile_t" is just a
trivial <pointer,length> tuple.

That said, I tried to keep the differences to simple removals, so that you
can do a diff between this and the libxdiff origin, and you'll basically
see just things getting deleted. Even the mmfile_t simplifications are
left in a state where the diffs should be readable.

Apologies to Davide, whom I'd love to get feedback on this all from (I
wrote my own "fill_mmfile()" for the new simpler mmfile_t format: the old
complex format had a helper function for that, but I did my surgery with
the goal in mind that eventually we _should_ just do

	mmfile_t mf;

	buf = read_sha1_file(sha1, type, &size);
	mf->ptr = buf;
	mf->size = size;
	.. use "mf" directly ..

which was really a nightmare with the old "helpful" mmfile_t, and really
is that easy with the new cut-down interfaces).

[ Btw, as any hawk-eye can see from the diff, this was actually generated
  with itself, so it is "self-hosting". That's about all the testing it
  has gotten, along with the above kernel diff, which eye-balls correctly,
  but shows the newline issue when you double-check it with "git-apply" ]

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-03-25 16:49:58 -08:00