Commit graph

33 commits

Author SHA1 Message Date
Jeff Hostetler
f59d15bb42 convert: add classification for conv_attrs struct
Create `enum conv_attrs_classification` to express the different ways
that attributes are handled for a blob during checkout.

This will be used in a later commit when deciding whether to add a file
to the parallel or delayed queue during checkout. For now, we can also
use it in get_stream_filter_ca() to simplify the function (as the
classifying logic is the same).

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18 13:56:40 -07:00
Jeff Hostetler
3e9e82c0d8 convert: add get_stream_filter_ca() variant
Like the previous patch, we will also need to call get_stream_filter()
with a precomputed `struct conv_attrs`, when we add support for parallel
checkout workers. So add the _ca() variant which takes the conversion
attributes struct as a parameter.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18 13:56:40 -07:00
Jeff Hostetler
55b4ad0ead convert: add [async_]convert_to_working_tree_ca() variants
Separate the attribute gathering from the actual conversion by adding
_ca() variants of the conversion functions. These variants receive a
precomputed 'struct conv_attrs', not relying, thus, on an index state.
They will be used in a future patch adding parallel checkout support,
for two reasons:

- We will already load the conversion attributes in checkout_entry(),
  before conversion, to decide whether a path is eligible for parallel
  checkout. Therefore, it would be wasteful to load them again later,
  for the actual conversion.

- The parallel workers will be responsible for reading, converting and
  writing blobs to the working tree. They won't have access to the main
  process' index state, so they cannot load the attributes. Instead,
  they will receive the preloaded ones and call the _ca() variant of
  the conversion functions. Furthermore, the attributes machinery is
  optimized to handle paths in sequential order, so it's better to leave
  it for the main process, anyway.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18 13:56:40 -07:00
Jeff Hostetler
38e95844e8 convert: make convert_attrs() and convert structs public
Move convert_attrs() declaration from convert.c to convert.h, together
with the conv_attrs struct and the crlf_action enum. This function and
the data structures will be used outside convert.c in the upcoming
parallel checkout implementation. Note that crlf_action is renamed to
convert_crlf_action, which is more appropriate for the global namespace.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2021-03-18 13:56:40 -07:00
brian m. carlson
c397aac02f convert: provide additional metadata to filters
Now that we have the codebase wired up to pass any additional metadata
to filters, let's collect the additional metadata that we'd like to
pass.

The two main places we pass this metadata are checkouts and archives.
In these two situations, reading HEAD isn't a valid option, since HEAD
isn't updated for checkouts until after the working tree is written and
archives can accept an arbitrary tree.  In other situations, HEAD will
usually reflect the refname of the branch in current use.

We pass a smaller amount of data in other cases, such as git cat-file,
where we can really only logically know about the blob.

This commit updates only the parts of the checkout code where we don't
use unpack_trees.  That function and callers of it will be handled in a
future commit.

In the archive code, we leak a small amount of memory, since nothing we
pass in the archiver argument structure is freed.

Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-16 11:37:02 -07:00
brian m. carlson
ab90ecae99 convert: permit passing additional metadata to filter processes
There are a variety of situations where a filter process can make use of
some additional metadata.  For example, some people find the ident
filter too limiting and would like to include the commit or the branch
in their smudged files.  This information isn't available during
checkout as HEAD hasn't been updated at that point, and it wouldn't be
available in archives either.

Let's add a way to pass this metadata down to the filter.  We pass the
blob we're operating on, the treeish (preferring the commit over the
tree if one exists), and the ref we're operating on.  Note that we won't
pass this information in all cases, such as when renormalizing or when
we're performing diffs, since it doesn't make sense in those cases.

The data we currently get from the filter process looks like the
following:

  command=smudge
  pathname=git.c
  0000

With this change, we'll get data more like this:

  command=smudge
  pathname=git.c
  refname=refs/tags/v2.25.1
  treeish=c522f061d551c9bb8684a7c3859b2ece4499b56b
  blob=7be7ad34bd053884ec48923706e70c81719a8660
  0000

There are a couple things to note about this approach.  For operations
like checkout, treeish will always be a commit, since we cannot check
out individual trees, but for other operations, like archive, we can end
up operating on only a particular tree, so we'll provide only a tree as
the treeish.  Similar comments apply for refname, since there are a
variety of cases in which we won't have a ref.

This commit wires up the code to print this information, but doesn't
pass any of it at this point.  In a future commit, we'll have various
code paths pass the actual useful data down.

Signed-off-by: brian m. carlson <bk2204@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-03-16 11:37:02 -07:00
brian m. carlson
2c65d90f75 am: reload .gitattributes after patching it
When applying multiple patches with git am, or when rebasing using the
am backend, it's possible that one of our patches has updated a
gitattributes file. Currently, we cache this information, so if a
file in a subsequent patch has attributes applied, the file will be
written out with the attributes in place as of the time we started the
rebase or am operation, not with the attributes applied by the previous
patch. This problem does not occur when using the -m or -i flags to
rebase.

To ensure we write the correct data into the working tree, expire the
cache after each patch that touches a path ending in ".gitattributes".
Since we load these attributes in multiple separate files, we must
expire them accordingly.

Verify that both the am and rebase code paths work correctly, including
the conflict marker size with am -3.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-03 15:16:18 -07:00
Junio C Hamano
5ade034464 Merge branch 'en/incl-forward-decl'
Code hygiene improvement for the header files.

* en/incl-forward-decl:
  Remove forward declaration of an enum
  compat/precompose_utf8.h: use more common include guard style
  urlmatch.h: fix include guard
  Move definition of enum branch_track from cache.h to branch.h
  alloc: make allocate_alloc_state and clear_alloc_state more consistent
  Add missing includes and forward declarations
2018-08-20 12:41:32 -07:00
Elijah Newren
ef3ca95475 Add missing includes and forward declarations
I looped over the toplevel header files, creating a temporary two-line C
program for each consisting of
  #include "git-compat-util.h"
  #include $HEADER
This patch is the result of manually fixing errors in compiling those
tiny programs.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-15 11:52:09 -07:00
Nguyễn Thái Ngọc Duy
7f944e264e convert.c: remove an implicit dependency on the_index
Make the convert API take an index_state instead of assuming the_index
in convert.c. All external call sites are converted blindly to keep
the patch simple and retain current behavior. Individual call sites
may receive further updates to use the right index instead of
the_index.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-13 14:14:42 -07:00
Nguyễn Thái Ngọc Duy
546f70f377 convert.h: drop 'extern' from function declaration
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-08-03 10:42:55 -07:00
Junio C Hamano
1ac0ce4d32 Merge branch 'ls/checkout-encoding'
The new "checkout-encoding" attribute can ask Git to convert the
contents to the specified encoding when checking out to the working
tree (and the other way around when checking in).

* ls/checkout-encoding:
  convert: add round trip check based on 'core.checkRoundtripEncoding'
  convert: add tracing for 'working-tree-encoding' attribute
  convert: check for detectable errors in UTF encodings
  convert: add 'working-tree-encoding' attribute
  utf8: add function to detect a missing UTF-16/32 BOM
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: teach same_encoding() alternative UTF encoding names
  strbuf: add a case insensitive starts_with()
  strbuf: add xstrdup_toupper()
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
2018-05-08 15:59:22 +09:00
Lars Schneider
e92d622536 convert: add round trip check based on 'core.checkRoundtripEncoding'
UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16 11:40:56 +09:00
Lars Schneider
107642fe26 convert: add 'working-tree-encoding' attribute
Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git reencodes
the content to a canonical UTF-8 representation. On checkout Git will
reverse this operation.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-04-16 11:40:56 +09:00
brian m. carlson
1a750441a7 convert: convert to struct object_id
Convert convert.c to struct object_id.  Add a use of the_hash_algo to
replace hard-coded constants and change a strbuf_add to a strbuf_addstr
to avoid another hard-coded constant.

Note that a strict conversion using the hexsz constant would cause
problems in the future if the internal and user-visible hash algorithms
differed, as anticipated by the hash function transition plan.

Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-03-14 09:23:50 -07:00
Torsten Bögershausen
8462ff43e4 convert_to_git(): safe_crlf/checksafe becomes int conv_flags
When calling convert_to_git(), the checksafe parameter defined what
should happen if the EOL conversion (CRLF --> LF --> CRLF) does not
roundtrip cleanly. In addition, it also defined if line endings should
be renormalized (CRLF --> LF) or kept as they are.

checksafe was an safe_crlf enum with these values:
SAFE_CRLF_FALSE:       do nothing in case of EOL roundtrip errors
SAFE_CRLF_FAIL:        die in case of EOL roundtrip errors
SAFE_CRLF_WARN:        print a warning in case of EOL roundtrip errors
SAFE_CRLF_RENORMALIZE: change CRLF to LF
SAFE_CRLF_KEEP_CRLF:   keep all line endings as they are

In some cases the integer value 0 was passed as checksafe parameter
instead of the correct enum value SAFE_CRLF_FALSE. That was no problem
because SAFE_CRLF_FALSE is defined as 0.

FALSE/FAIL/WARN are different from RENORMALIZE and KEEP_CRLF. Therefore,
an enum is not ideal. Let's use a integer bit pattern instead and rename
the parameter to conv_flags to make it more generically usable. This
allows us to extend the bit pattern in a subsequent commit.

Reported-By: Randall S. Becker <rsbecker@nexbridge.com>
Helped-By: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-01-16 12:35:56 -08:00
Junio C Hamano
a17483fcfe Merge branch 'tb/apply-with-crlf'
"git apply" that is used as a better "patch -p1" failed to apply a
taken from a file with CRLF line endings to a file with CRLF line
endings.  The root cause was because it misused convert_to_git()
that tried to do "safe-crlf" processing by looking at the index
entry at the same path, which is a nonsense---in that mode, "apply"
is not working on the data in (or derived from) the index at all.
This has been fixed.

* tb/apply-with-crlf:
  apply: file commited with CRLF should roundtrip diff and apply
  convert: add SAFE_CRLF_KEEP_CRLF
2017-08-26 22:55:05 -07:00
Torsten Bögershausen
2fea9de618 convert: add SAFE_CRLF_KEEP_CRLF
When convert_to_git() is called, the caller may want to keep CRLF to
be kept as CRLF (and not converted into LF).

This will be used in the next commit, when apply works with files
that have CRLF and patches are applied onto these files.

Add the new value "SAFE_CRLF_KEEP_CRLF" to safe_crlf.

Prepare convert_to_git() to be able to run the clean filter, skip
the CRLF conversion and run the ident filter.

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-08-16 10:21:17 -07:00
Junio C Hamano
51b8aecabe Merge branch 'ls/filter-process-delayed'
The filter-process interface learned to allow a process with long
latency give a "delayed" response.

* ls/filter-process-delayed:
  convert: add "status=delayed" to filter process protocol
  convert: refactor capabilities negotiation
  convert: move multiple file filter error handling to separate function
  convert: put the flags field before the flag itself for consistent style
  t0021: write "OUT <size>" only on success
  t0021: make debug log file name configurable
  t0021: keep filter log files on comparison
2017-08-11 13:27:00 -07:00
Lars Schneider
2841e8f81c convert: add "status=delayed" to filter process protocol
Some `clean` / `smudge` filters may require a significant amount of
time to process a single blob (e.g. the Git LFS smudge filter might
perform network requests). During this process the Git checkout
operation is blocked and Git needs to wait until the filter is done to
continue with the checkout.

Teach the filter process protocol, introduced in edcc8581 ("convert: add
filter.<driver>.process option", 2016-10-16), to accept the status
"delayed" as response to a filter request. Upon this response Git
continues with the checkout operation. After the checkout operation Git
calls "finish_delayed_checkout" which queries the filter for remaining
blobs. If the filter is still working on the completion, then the filter
is expected to block. If the filter has completed all remaining blobs
then an empty response is expected.

Git has a multiple code paths that checkout a blob. Support delayed
checkouts only in `clone` (in unpack-trees.c) and `checkout` operations
for now. The optimization is most effective in these code paths as all
files of the tree are processed.

Signed-off-by: Lars Schneider <larsxschneider@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-30 13:50:41 -07:00
Brandon Williams
a33e0b2a77 convert: convert renormalize_buffer to take an index
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-13 11:40:51 -07:00
Brandon Williams
82b474e025 convert: convert convert_to_git to take an index
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-13 11:40:51 -07:00
Brandon Williams
d6c41c20e6 convert: convert convert_to_git_filter_fd to take an index
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-13 11:40:51 -07:00
Brandon Williams
a7609c54b3 convert: convert get_cached_convert_stats_ascii to take an index
Signed-off-by: Brandon Williams <bmwill@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2017-06-13 11:40:51 -07:00
Torsten Bögershausen
6523728499 convert: unify the "auto" handling of CRLF
Before this change,
$ echo "* text=auto" >.gitattributes
$ echo "* eol=crlf" >>.gitattributes

would have the same effect as
$ echo "* text" >.gitattributes
$ git config core.eol crlf

Since the 'eol' attribute had higher priority than 'text=auto', this may
corrupt binary files and is not what most users expect to happen.

Make the 'eol' attribute to obey 'text=auto' and now
$ echo "* text=auto" >.gitattributes
$ echo "* eol=crlf" >>.gitattributes
behaves the same as
$ echo "* text=auto" >.gitattributes
$ git config core.eol crlf

In other words,
$ echo "* text=auto eol=crlf" >.gitattributes
has the same effect as
$ git config core.autocrlf true

and
$ echo "* text=auto eol=lf" >.gitattributes
has the same effect as
$ git config core.autocrlf input

Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-07-06 11:53:51 -07:00
Torsten Bögershausen
a7630bd427 ls-files: add eol diagnostics
When working in a cross-platform environment, a user may want to
check if text files are stored normalized in the repository and
if .gitattributes are set appropriately.

Make it possible to let Git show the line endings in the index and
in the working tree and the effective text/eol attributes.

The end of line ("eolinfo") are shown like this:

    "-text"        binary (or with bare CR) file
    "none"         text file without any EOL
    "lf"           text file with LF
    "crlf"         text file with CRLF
    "mixed"        text file with mixed line endings.

The effective text/eol attribute is one of these:

    "", "-text", "text", "text=auto", "text eol=lf", "text eol=crlf"

git ls-files --eol gives an output like this:

    i/none   w/none   attr/text=auto      t/t5100/empty
    i/-text  w/-text  attr/-text          t/test-binary-2.png
    i/lf     w/lf     attr/text eol=lf    t/t5100/rfc2047-info-0007
    i/lf     w/crlf   attr/text eol=crlf  doit.bat
    i/mixed  w/mixed  attr/               locale/XX.po

to show what eol convention is used in the data in the index ('i'),
and in the working tree ('w'), and what attribute is in effect,
for each path that is shown.

Add test cases in t0027.

Helped-By: Eric Sunshine <sunshine@sunshineco.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2016-01-18 19:48:43 -08:00
Steffen Prohaska
9035d75a2b convert: stream from fd to required clean filter to reduce used address space
The data is streamed to the filter process anyway.  Better avoid mapping
the file if possible.  This is especially useful if a clean filter
reduces the size, for example if it computes a sha1 for binary data,
like git media.  The file size that the previous implementation could
handle was limited by the available address space; large files for
example could not be handled with (32-bit) msysgit.  The new
implementation can filter files of any size as long as the filter output
is small enough.

The new code path is only taken if the filter is required.  The filter
consumes data directly from the fd.  If it fails, the original data is
not immediately available.  The condition can easily be handled as
a fatal error, which is expected for a required filter anyway.

If the filter was not required, the condition would need to be handled
in a different way, like seeking to 0 and reading the data.  But this
would require more restructuring of the code and is probably not worth
it.  The obvious approach of falling back to reading all data would not
help achieving the main purpose of this patch, which is to handle large
files with limited address space.  If reading all data is an option, we
can simply take the old code path right away and mmap the entire file.

The environment variable GIT_MMAP_LIMIT, which has been introduced in
a previous commit is used to test that the expected code path is taken.
A related test that exercises required filters is modified to verify
that the data actually has been modified on its way from the file system
to the object store.

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-08-28 10:25:15 -07:00
Steffen Prohaska
7ce7c7607b convert: drop arguments other than 'path' from would_convert_to_git()
It is only the path that matters in the decision whether to filter
or not.  Clarify this by making path the only argument of
would_convert_to_git().

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2014-08-21 15:27:20 -07:00
Ondřej Bílka
749f763dbb typofix: in-code comments
Signed-off-by: Ondřej Bílka <neleai@seznam.cz>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-07-22 16:06:49 -07:00
Jeff King
92ac3197e4 teach convert_to_git a "dry run" mode
Some callers may want to know whether convert_to_git will
actually do anything before performing the conversion
itself (e.g., to decide whether to stream or handle blobs
in-core). This patch lets callers specify the dry run mode
by passing a NULL destination buffer. The return value,
instead of indicating whether conversion happened, will
indicate whether conversion would occur.

For readability, we also include a wrapper function which
makes it more obvious we are not actually performing the
conversion.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-02-24 14:11:27 -08:00
Junio C Hamano
4ae6670444 stream filter: add "no more input" to the filters
Some filters may need to buffer the input and look-ahead inside it
to decide what to output, and they may consume more than zero bytes
of input and still not produce any output. After feeding all the
input, pass NULL as input as keep calling stream_filter() to let
such filters know there is no more input coming, and it is time for
them to produce the remaining output based on the buffered input.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-05-26 16:47:15 -07:00
Junio C Hamano
b6691092d7 Add streaming filter API
This introduces an API to plug custom filters to an input stream.

The caller gets get_stream_filter("path") to obtain an appropriate
filter for the path, and then uses it when opening an input stream
via open_istream().  After that, the caller can read from the stream
with read_istream(), and close it with close_istream(), just like an
unfiltered stream.

This only adds a "null" filter that is a pass-thru filter, but later
changes can add LF-to-CRLF and other filters, and the callers of the
streaming API do not have to change.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-05-26 16:47:15 -07:00
Junio C Hamano
d1bf0e0831 convert.h: move declarations for conversion from cache.h
Before adding the streaming filter API to the conversion layer,
move the existing declarations related to the conversion to its
own header file.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-05-26 16:47:15 -07:00