development/git - HydraGit

mirror of https://github.com/git/git synced 2024-11-05 18:59:29 +00:00

Author	SHA1	Message	Date
Junio C Hamano	df3755888b	utf8: accept "latin-1" as ISO-8859-1 Even though latin-1 is still seen in e-mail headers, some platforms only install ISO-8859-1. "iconv -f ISO-8859-1" succeeds, while "iconv -f latin-1" fails on such a system. Using the same fallback_encoding() mechanism factored out in the previous step, teach ourselves that "ISO-8859-1" has a better chance of being accepted than "latin-1". Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-26 18:16:23 -07:00
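The fallback idea in the two commits above can be sketched as a name-mapping helper that is consulted only after iconv_open() has rejected the user-supplied name. This is a minimal sketch, not the code in utf8.c: the function name fallback_name_sketch() is hypothetical, and the set of spellings it recognizes is limited to the ones the commit messages mention.

```c
#include <assert.h>
#include <string.h>
#include <strings.h> /* strcasecmp() is POSIX, not ISO C */

/*
 * Hypothetical helper: given an encoding name that iconv_open() has
 * rejected, return a spelling that is more likely to be accepted.
 * Names without a known alternative are returned unchanged.
 */
static const char *fallback_name_sketch(const char *name)
{
	assert(name != NULL);

	/* Common variants of UTF-8 map to the most official spelling. */
	if (!strcasecmp(name, "utf8") || !strcasecmp(name, "utf-8"))
		return "UTF-8";

	/* "latin-1" is common in e-mail headers; "ISO-8859-1" is more widely installed. */
	if (!strcasecmp(name, "latin-1"))
		return "ISO-8859-1";

	return name;
}
```

A caller would apply the helper to the input and the output encoding separately, then retry iconv_open() once with the mapped names.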
Junio C Hamano	3270741ea8	utf8: refactor code to decide fallback encoding The codepath we use to call iconv_open() has a provision to use a fallback encoding when it fails, hoping that "UTF-8" being spelled differently could be the reason why the library function did not like the encoding names we gave it. Essentially, we turn what we have observed to be used as variants of "UTF-8" (e.g. "utf8") into the most official spelling and use that as a fallback. We do the same thing for input and output encoding. Introduce a helper function to do just one side and call that twice. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2016-09-26 18:16:23 -07:00
Karthik Nayak	110dcda50d	utf8: add function to align a string into given strbuf Add strbuf_utf8_align() which will align a given string into a strbuf as per given align_type and width. If the width is greater than the string length then no alignment is performed. Helped-by: Eric Sunshine <sunshine@sunshineco.com> Mentored-by: Christian Couder <christian.couder@gmail.com> Mentored-by: Matthieu Moy <matthieu.moy@grenoble-inp.fr> Signed-off-by: Karthik Nayak <karthik.188@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-09-17 10:02:48 -07:00
Junio C Hamano	dde843e737	utf8-bom: introduce skip_utf8_bom() helper With the recent change to ignore the UTF8 BOM at the beginning of .gitignore files, we now have two codepaths that do such a skipping (the other one is for reading the configuration files). Introduce utf8_bom[] constant string and skip_utf8_bom() helper and teach .gitignore code how to use it. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2015-04-16 11:35:06 -07:00
Junio C Hamano	7ba46269a0	Merge branch 'maint-2.1' into maint * maint-2.1: is_hfs_dotgit: loosen over-eager match of \u{..47}	2015-01-07 13:28:10 -08:00
Junio C Hamano	3c84ac86fc	Merge branch 'maint-2.0' into maint-2.1 * maint-2.0: is_hfs_dotgit: loosen over-eager match of \u{..47}	2015-01-07 13:27:56 -08:00
Junio C Hamano	282616c72d	Merge branch 'maint-1.9' into maint-2.0 * maint-1.9: is_hfs_dotgit: loosen over-eager match of \u{..47}	2015-01-07 13:27:19 -08:00
Junio C Hamano	64a03e970a	Merge branch 'maint-1.8.5' into maint-1.9 * maint-1.8.5: is_hfs_dotgit: loosen over-eager match of \u{..47}	2015-01-07 13:27:13 -08:00
Jeff King	6aaf956b08	is_hfs_dotgit: loosen over-eager match of \u{..47} Our is_hfs_dotgit function relies on the hackily-implemented next_hfs_char to give us the next character that an HFS+ filename comparison would look at. It's hacky because it doesn't implement the full case-folding table of HFS+; it gives us just enough to see if the path matches ".git". At the end of next_hfs_char, we use tolower() to convert our 32-bit code point to lowercase. Our tolower() implementation only takes an 8-bit char, though; it throws away the upper 24 bits. This means we can't have any false negatives for is_hfs_dotgit. We only care about matching 7-bit ASCII characters in ".git", and we will correctly process 'G' or 'g'. However, we _can_ have false positives. Because we throw away the upper bits, code point \u{0147} (for example) will look like 'G' and get downcased to 'g'. It's not known whether a sequence of code points whose truncation ends up as ".git" is meaningful in any language, but it does not hurt to be more accurate here. We can just pass out the full 32-bit code point, and compare it manually to the upper and lowercase characters we care about. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-12-29 12:06:27 -08:00
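The false positive described in this commit can be reproduced in isolation. The sketch below assumes the "C" locale and uses hypothetical function names; it only illustrates why narrowing a 32-bit code point to 8 bits before lowercasing is lossy, and how comparing the full code point avoids it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Lossy variant: only the low 8 bits of the code point reach tolower(). */
static int matches_g_truncating(uint32_t cp)
{
	unsigned char low = (unsigned char)cp; /* upper 24 bits are discarded */
	return tolower(low) == 'g';
}

/* Accurate variant: compare the whole code point against both cases. */
static int matches_g_full(uint32_t cp)
{
	return cp == 'g' || cp == 'G';
}

/* U+0147 has 0x47 ('G') in its low byte, so only the lossy variant matches it. */
static void truncation_demo(void)
{
	assert(matches_g_truncating(0x0147));
	assert(!matches_g_full(0x0147));
	assert(matches_g_full('G') && matches_g_full('g'));
}
```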
Junio C Hamano	77933f4449	Sync with v2.1.4 * maint-2.1: Git 2.1.4 Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index	2014-12-17 11:46:57 -08:00
Junio C Hamano	58f1d950e3	Sync with v2.0.5 * maint-2.0: Git 2.0.5 Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index	2014-12-17 11:42:28 -08:00
Junio C Hamano	5e519fb8b0	Sync with v1.9.5 * maint-1.9: Git 1.9.5 Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index	2014-12-17 11:28:54 -08:00
Junio C Hamano	6898b79721	Sync with v1.8.5.6 * maint-1.8.5: Git 1.8.5.6 fsck: complain about NTFS ".git" aliases in trees read-cache: optionally disallow NTFS .git variants path: add is_ntfs_dotgit() helper fsck: complain about HFS+ ".git" aliases in trees read-cache: optionally disallow HFS+ .git variants utf8: add is_hfs_dotgit() helper fsck: notice .git case-insensitively t1450: refactor ".", "..", and ".git" fsck tests verify_dotfile(): reject .git case-insensitively read-tree: add tests for confusing paths like ".." and ".git" unpack-trees: propagate errors adding entries to the index	2014-12-17 11:20:31 -08:00
Jeff King	6162a1d323	utf8: add is_hfs_dotgit() helper We do not allow paths with a ".git" component to be added to the index, as that would mean repository contents could overwrite our repository files. However, asking "is this path the same as .git" is not as simple as strcmp() on some filesystems. HFS+'s case-folding does more than just fold uppercase into lowercase (which we already handle with strcasecmp). It may also skip past certain "ignored" Unicode code points, so that (for example) ".gi\u200ct" is mapped to ".git". The full list of folds can be found in the tables at: https://www.opensource.apple.com/source/xnu/xnu-1504.15.3/bsd/hfs/hfscommon/Unicode/UCStringCompareData.h Implementing a full "is this path the same as that path" comparison would require us importing the whole set of tables. However, what we want to do is much simpler: we only care about checking ".git". We know that 'G' is the only thing that folds to 'g', and so on, so we really only need to deal with the set of ignored code points, which is much smaller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-12-17 11:04:39 -08:00
Junio C Hamano	56feed1c76	Merge branch 'rs/export-strbuf-addchars' Code clean-up. * rs/export-strbuf-addchars: strbuf: use strbuf_addchars() for adding a char multiple times strbuf: export strbuf_addchars()	2014-09-19 11:38:39 -07:00
Junio C Hamano	1764e8124e	Merge branch 'nd/strbuf-utf8-replace' * nd/strbuf-utf8-replace: utf8.c: fix strbuf_utf8_replace() consuming data beyond input string	2014-09-09 12:54:02 -07:00
René Scharfe	d07235a027	strbuf: export strbuf_addchars() Move strbuf_addchars() to strbuf.c, where it belongs, and make it available for other callers. Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-09-08 11:26:45 -07:00
Nguyễn Thái Ngọc Duy	430875969a	utf8.c: fix strbuf_utf8_replace() consuming data beyond input string The main loop in strbuf_utf8_replace() could be summed up as: while ('src' is still valid) { 1) advance 'src' to copy ANSI escape sequences 2) advance 'src' to copy/replace visible characters } The problem is after #1, 'src' may have reached the end of the string (so 'src' points to NUL) and #2 will continue to copy that NUL as if it's a normal character. Because the output is stored in a strbuf, this NUL is accounted for in the 'len' field as well. Check after #1 and break the loop if necessary. The test does not look obvious, but the combination of %>>() should make a call trace like this show_log() pretty_print_commit() format_commit_message() strbuf_expand() format_commit_item() format_and_pad_commit() strbuf_utf8_replace() where %C(auto)%d would insert a color reset escape sequence in the end of the string given to strbuf_utf8_replace() and show_log() uses fwrite() to send everything to stdout (including the incorrect NUL inserted by strbuf_utf8_replace) Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-08-11 11:52:22 -07:00
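The two-step loop and the added end-of-string check can be shown on a simpler problem: counting visible characters while skipping color escape sequences. This is a sketch with hypothetical names, restricted to ASCII text and to sequences of the form ESC [ digits-or-semicolons m; it is not the body of strbuf_utf8_replace().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of an "ESC [ ... m" sequence at s, or 0 if there is none. */
static size_t esc_len_sketch(const char *s)
{
	const char *p = s;

	if (*p++ != '\033')
		return 0;
	if (*p++ != '[')
		return 0;
	while (isdigit((unsigned char)*p) || *p == ';')
		p++;
	if (*p++ != 'm')
		return 0;
	return (size_t)(p - s);
}

/* Count visible ASCII characters, ignoring escape sequences. */
static size_t visible_len_sketch(const char *s)
{
	size_t n = 0;
	size_t skip;

	assert(s != NULL);
	while (*s) {
		/* Step 1: advance past any escape sequences. */
		while ((skip = esc_len_sketch(s)) != 0)
			s += skip;
		/* Step 1 may have reached the terminating NUL; stop here. */
		if (!*s)
			break;
		/* Step 2: consume one visible character. */
		s++;
		n++;
	}
	return n;
}
```

Without the check between the two steps, a string that ends in an escape sequence would have its NUL counted as a visible character, and the pointer would move past the end of the input.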
Junio C Hamano	334d40e951	Merge branch 'tb/unicode-6.3-zero-width' Update the logic to compute the display width needed for utf8 strings and allow us to more easily maintain the tables used in that logic. We may want to let the users choose if codepoints with ambiguous widths are treated as a double or single width in a follow-up patch. * tb/unicode-6.3-zero-width: utf8: make it easier to auto-update git_wcwidth() utf8.c: use a table for double_width	2014-06-06 11:29:38 -07:00
Torsten Bögershausen	9c94389c3e	utf8: make it easier to auto-update git_wcwidth() The function git_wcwidth() returns for a given unicode code point the width on the display: -1 for control characters, 0 for combining or other non-visible code points 1 for e.g. ASCII 2 for double-width code points. This table had originally been extracted for one Unicode version, probably 3.2. We now use two tables these days, one for zero-width and another for double-width. Make it easier to update these tables to a later version of Unicode by factoring out the table from utf8.c into unicode_width.h and add the script update_unicode.sh to update the table based on the latest Unicode specification files. Thanks to Peter Krefting <peter@softwolves.pp.se> and Kevin Bracey <kevin@bracey.fi> for helping with their Unicode knowledge. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-05-12 10:20:46 -07:00
Torsten Bögershausen	08460345b5	utf8.c: use a table for double_width Refactor git_wcwidth() and replace the if-else-if chain. Use the table double_width which is scanned by the bisearch() function, which is already used to find combining code points. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-05-12 10:20:46 -07:00
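The table-driven width lookup described in the two commits above can be sketched with a binary search over sorted, non-overlapping intervals. The names below are hypothetical and the two tables are tiny illustrative excerpts (one combining-mark block, two wide East Asian blocks), not the generated contents of unicode_width.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct interval {
	uint32_t first;
	uint32_t last;
};

/* Illustrative excerpts only; real tables are generated from Unicode data. */
static const struct interval zero_width_sketch[] = {
	{ 0x0300, 0x036F }, /* combining diacritical marks */
};
static const struct interval double_width_sketch[] = {
	{ 0x1100, 0x115F }, /* Hangul Jamo initial consonants */
	{ 0x4E00, 0x9FFF }, /* CJK unified ideographs */
};

/* Binary search; the table must be sorted and non-overlapping. */
static int in_table(uint32_t cp, const struct interval *t, size_t n)
{
	size_t lo = 0, hi = n;

	assert(t != NULL);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (cp > t[mid].last)
			lo = mid + 1;
		else if (cp < t[mid].first)
			hi = mid;
		else
			return 1;
	}
	return 0;
}

/* -1 for control characters, 0 for zero-width, 2 for double-width, else 1. */
static int width_sketch(uint32_t cp)
{
	if (cp == 0)
		return 0;
	if (cp < 32 || (cp >= 0x7f && cp < 0xa0))
		return -1;
	if (in_table(cp, zero_width_sketch,
		     sizeof(zero_width_sketch) / sizeof(zero_width_sketch[0])))
		return 0;
	if (in_table(cp, double_width_sketch,
		     sizeof(double_width_sketch) / sizeof(double_width_sketch[0])))
		return 2;
	return 1;
}
```

Keeping the ranges in plain data tables is what makes a script-driven update practical: a new Unicode version changes the tables, not the lookup code.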
Junio C Hamano	9fd911a810	Merge branch 'tb/unicode-6.3-zero-width' Teach our display-column-counting logic about decomposed umlauts and friends. * tb/unicode-6.3-zero-width: utf8.c: partially update to version 6.3	2014-04-16 13:38:57 -07:00
Torsten Bögershausen	d813ab970d	utf8.c: partially update to version 6.3 Unicode 6.3 defines more code points as combining or accents. For example, the character "ö" could be expressed as an "o" followed by U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above). We should consider that such a sequence of two codepoints occupies one display column for the alignment purposes, and for that, git_wcwidth() should return 0 for them. Affected codepoints are: U+0358..U+035C U+0487 U+05A2, U+05BA, U+05C5, U+05C7 U+0604, U+0616..U+061A, U+0659..U+065F Earlier unicode standards had defined these as "reserved". Only the range 0..U+07FF has been checked to see which codepoints need to be marked as 0-width while preparing for this commit; more updates may be needed. Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-04-09 10:14:05 -07:00
John Keeping	a68a67dea3	utf8: use correct type for values in interval table We treat these as unsigned everywhere and compare against unsigned values, so declare them using the typedef we already have for this. While we're here, fix the indentation as well. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-02-18 15:51:40 -08:00
John Keeping	df5213b70d	utf8: fix iconv error detection iconv(3) returns "(size_t) -1" on error. Make sure that we cast the "-1" properly when checking for this. Signed-off-by: John Keeping <john@keeping.me.uk> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2014-02-18 15:51:33 -08:00
Ramsay Jones	980419b993	pretty: Fix bug in truncation support for %>, %< and %>< Some systems experience failures in t4205-*.sh (tests 18-20, 27) which all relate to the use of truncation with the %< padding placeholder. This capability was added in the commit `a7f01c6b` ("pretty: support truncating in %>, %< and %><", 19-04-2013). The truncation support was implemented with the assistance of a new strbuf function (strbuf_utf8_replace). This function contains the following code: strbuf_attach(sb_src, strbuf_detach(&sb_dst, NULL), sb_dst.len, sb_dst.alloc); Unfortunately, this code is subject to unspecified behaviour. In particular, the order of evaluation of the argument expressions (along with the associated side effects) is not specified by the C standard. Note that the second argument expression is a call to strbuf_detach() which, as a side effect, sets the 'len' and 'alloc' fields of the sb_dst argument to zero. Depending on the order of evaluation of the argument expressions to the strbuf_attach call, this can lead to assigning an empty string to 'sb_src'. In order to remove the undesired behaviour, we replace the above line of code with: strbuf_swap(sb_src, &sb_dst); strbuf_release(&sb_dst); which achieves the desired effect without provoking unspecified behaviour. Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk> Acked-by: Duy Nguyen <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-28 12:09:37 -07:00
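The evaluation-order hazard and the swap-then-release fix can be illustrated with a stripped-down buffer type. The struct and function names below are hypothetical stand-ins, not the strbuf API; the sketch only shows why passing both a detach call and the detached buffer's length as arguments of one call is fragile, and how swapping avoids reading a field that a sibling argument may already have reset.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf {
	char *data;
	size_t len;
};

/* Allocate a buffer holding a copy of s. */
static struct buf buf_from(const char *s)
{
	struct buf b;

	assert(s != NULL);
	b.len = strlen(s);
	b.data = malloc(b.len + 1);
	assert(b.data != NULL);
	memcpy(b.data, s, b.len + 1);
	return b;
}

/* Hands ownership to the caller and, as a side effect, zeroes the length. */
static char *buf_detach(struct buf *b)
{
	char *p = b->data;

	b->data = NULL;
	b->len = 0;
	return p;
}

static void buf_release(struct buf *b)
{
	free(buf_detach(b));
}

static void buf_swap(struct buf *a, struct buf *b)
{
	struct buf tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Fragile pattern (not compiled here): in
 *     attach(dst, buf_detach(&src), src.len);
 * the compiler may read src.len before or after buf_detach() zeroes it.
 *
 * Robust pattern: swap the two buffers, then release the old contents.
 */
static void buf_replace(struct buf *dst, struct buf *src)
{
	buf_swap(dst, src);
	buf_release(src);
}
```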
Nguyễn Thái Ngọc Duy	1640632b4f	pretty: support %>> that steal trailing spaces This is pretty useful in `%<(100)%s%Cred%>(20)% an' where %s does not use up all 100 columns and %an needs more than 20 columns. By replacing %>(20) with %>>(20), %an can steal spaces from %s. %>> understands escape sequences, so %Cred does not stop it from stealing spaces in %<(100). Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-18 16:28:29 -07:00
Nguyễn Thái Ngọc Duy	a7f01c6b4d	pretty: support truncating in %>, %< and %>< %>(N,trunc) truncates the right part after N columns and replace the last two letters with "..". ltrunc does the same on the left. mtrunc cuts the middle out. Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-18 16:28:29 -07:00
Nguyễn Thái Ngọc Duy	b782bbab94	utf8.c: add reencode_string_len() that can handle NULs in string Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-18 16:28:28 -07:00
Nguyễn Thái Ngọc Duy	2bc1e7ecba	utf8.c: add utf8_strnwidth() with the ability to skip ansi sequences Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-18 16:28:28 -07:00
Nguyễn Thái Ngọc Duy	4247fe7956	utf8.c: move display_mode_esc_sequence_len() for use by other functions Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-04-18 16:28:27 -07:00
Junio C Hamano	573f1a9cf1	Merge branch 'ks/rfc2047-one-char-at-a-time' When "format-patch" quoted a non-ascii strings on the header files, it incorrectly applied rfc2047 and chopped a single character in the middle of it. * ks/rfc2047-one-char-at-a-time: format-patch: RFC 2047 says multi-octet character may not be split	2013-03-25 14:00:46 -07:00
Junio C Hamano	31b12a1999	Merge branch 'jk/utf-8-can-be-spelled-differently' Some platforms and users spell UTF-8 differently; retry with the most official "UTF-8" when the system does not understand the user-supplied encoding name that are the common alternative spellings of UTF-8. * jk/utf-8-can-be-spelled-differently: utf8: accept alternate spellings of UTF-8	2013-03-21 14:02:58 -07:00
Kirill Smelkov	6cd3c05327	format-patch: RFC 2047 says multi-octet character may not be split Even though an earlier attempt (bafc478..41dd00bad) cleaned up RFC 2047 encoding, pretty.c::add_rfc2047() still decides where to split the output line by going through the input one byte at a time, and potentially splits a character in the middle. A subject line may end up showing like this: ".... fö?? bar". (instead of ".... föö bar".) if split incorrectly. RFC 2047, section 5 (3) explicitly forbids such behaviour: Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded- word's. that means that e.g. for Subject: .... föö bar encoding Subject: =?UTF-8?q?....=20f=C3=B6=C3=B6?= =?UTF-8?q?=20bar?= is correct, and Subject: =?UTF-8?q?....=20f=C3=B6=C3?= <-- NOTE ö is broken here =?UTF-8?q?=B6=20bar?= is not, because "ö" character UTF-8 encoding C3 B6 is split here across adjacent encoded words. To fix the problem, make the loop grab one _character_ at a time and determine its output length to see where to break the output line. Note that this version only knows about UTF-8, but the logic to grab one character is abstracted out in mbs_chrlen() function to make it possible to extend it to other encodings with the help of iconv in the future. Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-03-09 11:11:19 -08:00
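The character-at-a-time rule can be sketched as a function that finds the longest prefix fitting a byte budget without cutting a UTF-8 sequence. The names are hypothetical (utf8_chrlen_sketch() stands in for the role the commit assigns to mbs_chrlen()), and the budget is counted in raw input bytes as a simplification; the real code has to account for the longer encoded form of each character.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of the UTF-8 sequence starting at s, judged by its lead byte. */
static size_t utf8_chrlen_sketch(const char *s)
{
	unsigned char c = (unsigned char)*s;

	if (c < 0x80)
		return 1;
	if ((c & 0xe0) == 0xc0)
		return 2;
	if ((c & 0xf0) == 0xe0)
		return 3;
	if ((c & 0xf8) == 0xf0)
		return 4;
	return 1; /* invalid lead byte: treat it as a single unit */
}

/* Largest prefix length <= max_bytes that ends on a character boundary. */
static size_t split_point_sketch(const char *s, size_t len, size_t max_bytes)
{
	size_t pos = 0;

	assert(s != NULL);
	while (pos < len) {
		size_t n = utf8_chrlen_sketch(s + pos);

		if (n > len - pos)
			n = len - pos; /* truncated sequence at the end */
		if (pos + n > max_bytes)
			break; /* the next character would not fit whole */
		pos += n;
	}
	return pos;
}
```

With the input "föö" (bytes 66 C3 B6 C3 B6) and a budget of 4 bytes, the split lands after the first "ö" at offset 3 rather than between C3 and B6.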
Jeff King	5c680be113	utf8: accept alternate spellings of UTF-8 The iconv implementation on many platforms will accept variants of UTF-8, including "UTF8", "utf-8", and "utf8", but some do not. We make allowances in our code to treat them all identically, but we sometimes hand the string from the user directly to iconv. In this case, the platform iconv may or may not work. There are really four levels of platform iconv support for these synonyms: 1. All synonyms understood (e.g., glibc). 2. Only the official "UTF-8" understood (e.g., Windows). 3. Official "UTF-8" not understood, but some other synonym understood (it's not known whether such a platform exists). 4. Neither "UTF-8" nor any synonym understood (e.g., ancient systems, or ones without utf8 support installed). This patch teaches git to fall back to using the official "UTF-8" spelling when iconv_open fails (and the encoding was one of the synonym spellings). This makes things more convenient to users of type 2 systems, as they can now use any of the synonyms for the log output encoding. Type 1 systems are not affected, as iconv already works on the first try. Type 4 systems are not affected, as both attempts already fail. Type 3 systems will not benefit from the feature, but because we only use "UTF-8" as a fallback, they will not be regressed (i.e., you can continue to use "utf8" if your platform supports it). We could try all the various synonyms, but since such systems are not even known to exist, it's not worth the effort. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-02-25 13:17:22 -08:00
Junio C Hamano	3cc3cf970c	Merge branch 'jx/utf8-printf-width' Use a new helper that prints a message and counts its display width to align the help messages parse-options produces. * jx/utf8-printf-width: Add utf8_fprintf helper that returns correct number of columns	2013-02-14 10:29:08 -08:00
Jiang Xin	c082196575	Add utf8_fprintf helper that returns correct number of columns Since command usages can be translated, they may include utf-8 encoded strings, and the output in console may not align well any more. This is because strlen() is different from strwidth() on utf-8 strings. A wrapper utf8_fprintf() can help to return the correct number of columns required. Signed-off-by: Jiang Xin <worldhello.net@gmail.com> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Reviewed-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2013-02-11 11:29:45 -08:00
Junio C Hamano	71288e15df	Merge branch 'sp/shortlog-missing-lf' When a line to be wrapped has a solid run of non space characters whose length exactly is the wrap width, "git shortlog -w" failed to add a newline after such a line. * sp/shortlog-missing-lf: strbuf_add_wrapped*(): Remove unused return value shortlog: fix wrapping lines of wraplen	2013-01-02 10:40:34 -08:00
Steffen Prohaska	e0db1765c3	strbuf_add_wrapped*(): Remove unused return value Since shortlog isn't using the return value anymore (see previous commit), the functions can be changed to void. Signed-off-by: Steffen Prohaska <prohaska@zib.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-12-11 10:05:17 -08:00
Junio C Hamano	fff26a6805	Merge branch 'jc/same-encoding' into maint Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding()	2012-12-07 14:10:56 -08:00
Junio C Hamano	fd778c09b1	Merge branch 'js/format-2047' into maint Various rfc2047 quoting issues around a non-ASCII name on the From: line in the output from format-patch have been corrected. * js/format-2047: format-patch tests: check quoting/encoding in To: and Cc: headers format-patch: fix rfc2047 address encoding with respect to rfc822 specials format-patch: make rfc2047 encoding more strict format-patch: introduce helper function last_line_length() format-patch: do not wrap rfc2047 encoded headers too late format-patch: do not wrap non-rfc2047 headers too early utf8: fix off-by-one wrapping of text	2012-11-20 09:57:44 -08:00
Junio C Hamano	6b8731258d	Merge branch 'jc/same-encoding' Various codepaths checked if two encoding names are the same using ad-hoc code and some of them ended up asking iconv() to convert between "utf8" and "UTF-8". The former is not a valid way to spell the encoding name, but often people use it by mistake, and we equated them in some but not all codepaths. Introduce a new helper function to make these codepaths consistent. * jc/same-encoding: reencode_string(): introduce and use same_encoding() Conflicts: builtin/mailinfo.c	2012-11-15 10:24:05 -08:00
Jeff King	64b22a5894	Merge branch 'js/format-2047' Fixes many rfc2047 quoting issues in the output from format-patch. * js/format-2047: format-patch tests: check quoting/encoding in To: and Cc: headers format-patch: fix rfc2047 address encoding with respect to rfc822 specials format-patch: make rfc2047 encoding more strict format-patch: introduce helper function last_line_length() format-patch: do not wrap rfc2047 encoded headers too late format-patch: do not wrap non-rfc2047 headers too early utf8: fix off-by-one wrapping of text	2012-11-09 12:42:32 -05:00
Junio C Hamano	0e18bcd5e9	reencode_string(): introduce and use same_encoding() Callers of reencode_string() that re-encodes a string from one encoding to another all used ad-hoc way to bypass the case where the input and the output encodings are the same. Some did strcmp(), some did strcasecmp(), yet some others when converting to UTF-8 used is_encoding_utf8(). Introduce same_encoding() helper function to make these callers use the same logic. Notably, is_encoding_utf8() has a work-around for common misconfiguration to use "utf8" to name UTF-8 encoding, which does not match "UTF-8" hence strcasecmp() would not consider the same. Make use of it in this helper function. Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-11-04 08:10:33 -05:00
Jan H. Schönherr	14e1a4e1ff	utf8: fix off-by-one wrapping of text The wrapping logic in strbuf_add_wrapped_text() does currently not allow lines that entirely fill the allowed width, instead it wraps the line one character too early. For example, the text "This is the sixth commit." formatted via "%w(11,1,2)" (wrap at 11 characters, 1 char indent of first line, 2 char indent of following lines) results in four lines: " This is", "  the", "  sixth", "  commit." This is wrong, because "  the sixth" is exactly 11 characters long, and thus allowed. Fix this by allowing the (width+1) character of a line to be a valid wrapping point if it is a whitespace character. Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-10-18 14:20:49 -07:00
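The boundary condition in this fix can be seen in a small greedy wrapper. This is a sketch with a hypothetical function name, limited to ASCII text separated by single spaces; it is not strbuf_add_wrapped_text(). The comparison that decides whether a word still fits uses "greater than", so a line that fills the width exactly is kept; using "greater than or equal" there would reproduce the early wrap the commit describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Wrap text into out (NUL-terminated, lines joined by '\n').
 * indent1 applies to the first line, indent2 to the following lines.
 * Returns the number of bytes written, not counting the NUL.
 */
static size_t wrap_ascii_sketch(const char *text, size_t width, size_t indent1,
				size_t indent2, char *out, size_t outsz)
{
	size_t o = 0, col = 0, indent = indent1;
	int line_has_word = 0;
	const char *p = text;

	assert(text != NULL && out != NULL && outsz > 0);
	while (*p) {
		size_t wlen, pad, end_col;

		while (*p == ' ')
			p++;
		if (!*p)
			break;
		wlen = strcspn(p, " ");
		pad = line_has_word ? 1 : indent;
		end_col = (line_has_word ? col : 0) + pad + wlen;

		/* A line of exactly `width` columns is allowed. */
		if (line_has_word && end_col > width) {
			if (o + 2 > outsz)
				break;
			out[o++] = '\n';
			indent = indent2;
			line_has_word = 0;
			continue; /* retry the same word on a fresh line */
		}
		if (o + pad + wlen + 1 > outsz)
			break;
		memset(out + o, ' ', pad);
		o += pad;
		memcpy(out + o, p, wlen);
		o += wlen;
		p += wlen;
		col = end_col;
		line_has_word = 1;
	}
	out[o] = '\0';
	return o;
}
```

Called with the commit's example (width 11, indents 1 and 2), the sketch produces three lines, with "the sixth" sharing the second one.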
Torsten Bögershausen	76759c7dff	git on Mac OS and precomposed unicode Mac OS X mangles file names containing unicode on file systems HFS+, VFAT or SAMBA. When a file using unicode code points outside ASCII is created on a HFS+ drive, the file name is converted into decomposed unicode and written to disk. No conversion is done if the file name is already decomposed unicode. Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same result as open("\x41\xcc\x88",...) with a decomposed "Ä". As a consequence, readdir() returns the file names in decomposed unicode, even if the user expects precomposed unicode. Unlike on HFS+, Mac OS X stores files on a VFAT drive (e.g. an USB drive) in precomposed unicode, but readdir() still returns file names in decomposed unicode. When a git repository is stored on a network share using SAMBA, file names are send over the wire and written to disk on the remote system in precomposed unicode, but Mac OS X readdir() returns decomposed unicode to be compatible with its behaviour on HFS+ and VFAT. The unicode decomposition causes many problems: - The names "git add" and other commands get from the end user may often be precomposed form (the decomposed form is not easily input from the keyboard), but when the commands read from the filesystem to see what it is going to update the index with already is on the filesystem, readdir() will give decomposed form, which is different. - Similarly "git log", "git mv" and all other commands that need to compare pathnames found on the command line (often but not always precomposed form; a command line input resulting from globbing may be in decomposed) with pathnames found in the tree objects (should be precomposed form to be compatible with other systems and for consistency in general). - The same for names stored in the index, which should be precomposed, that may need to be compared with the names read from readdir(). 
NFS mounted from Linux is fully transparent and does not suffer from the above. As Mac OS X treats precomposed and decomposed file names as equal, we can - wrap readdir() on Mac OS X to return the precomposed form, and - normalize decomposed form given from the command line also to the precomposed form, to ensure that all pathnames used in Git are always in the precomposed form. This behaviour can be requested by setting "core.precomposedunicode" configuration variable to true. The code in compat/precomposed_utf8.c implements basically 4 new functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(), precomposed_utf8_closedir() and precompose_argv(). The first three are to wrap opendir(3), readdir(3), and closedir(3) functions. The argv[] conversion allows to use the TAB filename completion done by the shell on command line. It tolerates other tools which use readdir() to feed decomposed file names into git. When creating a new git repository with "git init" or "git clone", "core.precomposedunicode" will be set "false". The user needs to activate this feature manually. She typically sets core.precomposedunicode to "true" on HFS and VFAT, or file systems mounted via SAMBA. Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Torsten Bögershausen <tboegi@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2012-07-08 22:03:46 -07:00
Jeff King	98acc837a1	strbuf: add fixed-length version of add_wrapped_text The function strbuf_add_wrapped_text takes a NUL-terminated string. This makes it annoying to wrap strings we have as a pointer and a length. Refactoring strbuf_add_wrapped_text and all of its sub-functions to handle fixed-length strings turned out to be really ugly. So this implementation is lame; it just strdups the text and operates on the NUL-terminated version. This should be fine as the strings we are wrapping are generally pretty short. If it becomes a problem, we can optimize later. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2011-02-23 13:44:36 -08:00
Junio C Hamano	32ae5b3425	Merge branch 'rs/optim-text-wrap' * rs/optim-text-wrap: utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text() utf8.c: remove strbuf_write() utf8.c: remove print_spaces() utf8.c: remove print_wrapped_text()	2010-03-02 12:44:10 -08:00
René Scharfe	462749b728	utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text() is_utf8() works by calling utf8_width() for each character at the supplied location. In strbuf_add_wrapped_text(), we do that anyway while wrapping the lines. So instead of checking the encoding beforehand, optimistically assume that it's utf-8 and wrap along until an invalid character is hit, and when that happens start over. This pays off if the text consists only of valid utf-8 characters. The following command was run against the Linux kernel repo with git 1.7.0: $ time git log --format='%b' v2.6.32 >/dev/null real 0m2.679s user 0m2.580s sys 0m0.100s $ time git log --format='%w(60,4,8)%b' >/dev/null real 0m4.342s user 0m4.230s sys 0m0.110s And with this patch series: $ time git log --format='%w(60,4,8)%b' >/dev/null real 0m3.741s user 0m3.630s sys 0m0.110s So the cost of wrapping is reduced to 70% in this case. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2010-02-20 09:22:44 -08:00
René Scharfe	68ad5e1e9c	utf8.c: remove strbuf_write() The patch before the previous one made sure that all callers of strbuf_add_wrapped_text() supply a strbuf. Replace all calls of strbuf_write() with regular strbuf functions and remove it. Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx> Signed-off-by: Junio C Hamano <gitster@pobox.com>	2010-02-20 09:19:35 -08:00


73 commits