git/t/t4210-log-i18n.sh

94 lines
2.5 KiB
Bash
Raw Normal View History

log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
#!/bin/sh
test_description='test log with i18n features'
. ./lib-gettext.sh
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
# two forms of é
utf8_e=$(printf '\303\251')
latin1_e=$(printf '\351')
# invalid UTF-8
invalid_e=$(printf '\303\50)') # ")" at end to close opening "("
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
test_expect_success 'create commits in different encodings' '
test_tick &&
cat >msg <<-EOF &&
utf8
t${utf8_e}st
EOF
git add msg &&
git -c i18n.commitencoding=utf8 commit -F msg &&
cat >msg <<-EOF &&
latin1
t${latin1_e}st
EOF
git add msg &&
git -c i18n.commitencoding=ISO-8859-1 commit -F msg
'
test_expect_success 'log --grep searches in log output encoding (utf8)' '
cat >expect <<-\EOF &&
latin1
utf8
EOF
git log --encoding=utf8 --format=%s --grep=$utf8_e >actual &&
test_cmp expect actual
'
test_expect_success !MINGW 'log --grep searches in log output encoding (latin1)' '
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
cat >expect <<-\EOF &&
latin1
utf8
EOF
git log --encoding=ISO-8859-1 --format=%s --grep=$latin1_e >actual &&
test_cmp expect actual
'
test_expect_success !MINGW 'log --grep does not find non-reencoded values (utf8)' '
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
test_must_be_empty actual
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
'
test_expect_success !MINGW 'log --grep does not find non-reencoded values (latin1)' '
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
test_must_be_empty actual
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
'
for engine in fixed basic extended perl
do
prereq=
if test $engine = "perl"
then
prereq="PCRE"
else
prereq=""
fi
force_regex=
if test $engine != "fixed"
then
force_regex=.*
fi
test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
cat >expect <<-\EOF &&
latin1
utf8
EOF
LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$latin1_e\" >actual &&
test_cmp expect actual
"
test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not find non-reencoded values (latin1 + locale)" "
LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$utf8_e\" >actual &&
test_must_be_empty actual
"
test_expect_success !MINGW,GETTEXT_LOCALE,$prereq "-c grep.patternType=$engine log --grep does not die on invalid UTF-8 value (latin1 + locale + invalid needle)" "
LC_ALL=\"$is_IS_locale\" git -c grep.patternType=$engine log --encoding=ISO-8859-1 --format=%s --grep=\"$force_regex$invalid_e\" >actual &&
test_must_be_empty actual
"
done
log: re-encode commit messages before grepping If you run "git log --grep=foo", we will run your regex on the literal bytes of the commit message. This can provide confusing results if the commit message is not in the same encoding as your grep expression (or worse, you have commits in multiple encodings, in which case your regex would need to be written to match either encoding). On top of this, we might also be grepping in the commit's notes, which are already re-encoded, potentially leading to grepping in a buffer with mixed encodings concatenated. This is insanity, but most people never noticed, because their terminal and their commit encodings all match. Instead, let's massage the to-be-grepped commit into a standardized encoding. There is not much point in adding a flag for "this is the encoding I expect my grep pattern to match"; the only sane choice is for it to use the log output encoding. That is presumably what the user's terminal is using, and it means that the patterns found by the grep will match the output produced by git. As a bonus, this fixes a potential segfault in commit_match when commit->buffer is NULL, as we now build on logmsg_reencode, which handles reading the commit buffer from disk if necessary. The segfault can be triggered with: git commit -m 'text1' --allow-empty git commit -m 'text2' --allow-empty git log --graph --no-walk --grep 'text2' which arguably does not make any sense (--graph inherently wants a connected history, and by --no-walk the command line is telling us to show discrete points in history without connectivity), and we probably should forbid the combination, but that is a separate issue. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-02-11 20:59:58 +00:00
test_done