* jc/stream-to-pack:
bulk-checkin: replace fast-import based implementation
csum-file: introduce sha1file_checkpoint
finish_tmp_packfile(): a helper function
create_tmp_packfile(): a helper function
write_pack_header(): a helper function
Conflicts:
pack.h
* nd/misc-cleanups:
unpack_object_header_buffer(): clear the size field upon error
tree_entry_interesting: make use of local pointer "item"
tree_entry_interesting(): give meaningful names to return values
read_directory_recursive: reduce one indentation level
get_tree_entry(): do not call find_tree_entry() on an empty tree
tree-walk.c: do not leak internal structure in tree_entry_len()
This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The callers do not use the returned size when the function says
it did not use any bytes and sets the type to OBJ_BAD, so this
should not matter in practice, but it is good code hygiene
anyway.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jk/maint-pack-objects-compete-with-delete:
downgrade "packfile cannot be accessed" errors to warnings
pack-objects: protect against disappearing packs
These can happen if another process simultaneously prunes a
pack. But that is not usually an error condition, because a
properly-running prune should have repacked the object into
a new pack. So we will notice that the pack has disappeared
unexpectedly, print a message, try other packs (possibly
after re-scanning the list of packs), and find it in the new
pack.
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
It's possible that while pack-objects is running, a
simultaneously running prune process might delete a pack
that we are interested in. Because we load the pack indices
early on, we know that the pack contains our item, but by
the time we try to open and map it, it is gone.
Since c715f78, we already protect against this in the normal
object access code path, but pack-objects accesses the packs
at a lower level. In the normal access path, we call
find_pack_entry, which will call find_pack_entry_one on each
pack index, which does the actual lookup. If it gets a hit,
we will actually open and verify the validity of the
matching packfile (using c715f78's is_pack_valid). If we
can't open it, we'll issue a warning and pretend that we
didn't find it, causing us to go on to the next pack (or on
to loose objects).
Furthermore, we will cache the descriptor to the opened
packfile. Which means that later, when we actually try to
access the object, we are likely to still have that packfile
opened, and won't care if it has been unlinked from the
filesystem.
Notice the "likely" above. If there is another pack access
in the interim, and we run out of descriptors, we could
close the pack. And then a later attempt to access the
closed pack could fail (we'll try to re-open it, of course,
but it may have been deleted). In practice, this doesn't
happen because we tend to look up items and then access them
immediately.
Pack-objects does not follow this code path. Instead, it
accesses the packs at a much lower level, using
find_pack_entry_one directly. This means we skip the
is_pack_valid check, and may end up with the name of a
packfile, but no open descriptor.
We can add the same is_pack_valid check here. Unfortunately,
the access patterns of pack-objects are not quite as nice
for keeping lookup and object access together. We look up
each object as we find out about it, and only later, when
writing the packfile, do we necessarily access it. Which
means that the opened packfile may be closed in the interim.
In practice, however, adding this check still has value, for
three reasons.
1. If you have a reasonable number of packs and/or a
reasonable file descriptor limit, you can keep all of
your packs open simultaneously. If this is the case,
then the race is impossible to trigger.
2. Even if you can't keep all packs open at once, you
may end up keeping the deleted one open (i.e., you may
get lucky).
3. The race window is shortened. You may notice early that
the pack is gone, and not try to access it. Triggering
the problem without this check means deleting the pack
any time after we read the list of index files, but
before we access the looked-up objects. Triggering it
with this check means deleting the pack after we do a
lookup (and successfully access the packfile), but
before we access the object. Which is a smaller window.
Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before comparing an alt object path against the alt_odb_list and
adding it, normalize the path first, since comparing normalized
paths is much easier to get right.
Use strbuf to replace some string operations, since it is cleaner and
safer.
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Hui Wang <Hui.Wang@windriver.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Cloning from a local repository blindly copies or hardlinks all the files
under objects/ hierarchy. This results in two issues:
- If the repository cloned has an "objects/info/alternates" file, and the
command line of clone specifies --reference, the ones specified on the
command line get overwritten by the copy from the original repository.
- An entry in an "objects/info/alternates" file can specify the object
stores it borrows objects from as a path relative to the "objects/"
directory. When cloning a repository with such an alternates file, if
the new repository is not sitting next to the original repository, such
relative paths need to be adjusted so that they can be used in the new
repository.
This updates add_to_alternates_file() to take the path to the alternate
object store, including the "/objects" part at the end (earlier, it was
taking the path to $GIT_DIR and was adding "/objects" itself), as it is
technically possible to specify in objects/info/alternates file the path
of a directory whose name does not end with "/objects".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Git currently reports loose objects as 'corrupt' if they've been
deflated using a window size less than 32KB, because the
experimental_loose_object() function doesn't recognise the header
byte as a zlib header. This patch makes the function tolerant of
all valid window sizes (15-bit to 8-bit) - but doesn't sacrifice
its accuracy in distinguishing the standard loose-object format
from the experimental (now abandoned) format.
On memory constrained systems zlib may use a much smaller window
size - working on Agit, I found that Android uses a 4KB window;
giving a header byte of 0x48, not 0x78. Consequently all loose
objects generated appear 'corrupt', which is why Agit is a read-only
Git client at this time - I don't want my client to generate Git
repos that other clients treat as broken :(
This patch makes Git tolerant of different deflate settings - it
might appear that it changes experimental_loose_object() to the point
where it could incorrectly identify the experimental format as the
standard one, but the two criteria (bitmask & checksum) can only
give a false result for an experimental object where both of the
following are true:
1) object size is exactly 8 bytes when uncompressed (bitmask)
2) ([single-byte in-pack git type&size header] * 256
+ [1st byte of the following zlib header]) % 31 = 0 (checksum)
As it happens, for all possible combinations of valid object type
(1-4) and window bits (0-7), the only time when the checksum will be
divisible by 31 is for 0x1838 - ie object type *1*, a Commit - which,
due to the fields all Commit objects must contain, could never be as
small as 8 bytes in size.
Given this, the combination of the two criteria (bitmask & checksum)
always correctly determines the buffer format, and is more tolerant
than the previous version.
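The two criteria and the search for false positives can be sketched in C roughly as follows. This is an illustration only, not the patch itself: the names looks_like_zlib_header() and count_false_positives() are made up for this sketch, and the enumeration assumes the byte layouts described above (a one-byte experimental header 0ttt1000 for a size-8 object, followed by a zlib header byte 0www1000).

```c
/*
 * Sketch: does the buffer start with a zlib header?
 * Criterion 1 (bitmask): first byte is 0www1000 (deflate, window bits 0-7).
 * Criterion 2 (checksum): the first 16-bit big-endian word is a multiple of 31.
 */
static int looks_like_zlib_header(const unsigned char *map)
{
	unsigned int word = (map[0] << 8) + map[1];

	return (map[0] & 0x8F) == 0x08 && word % 31 == 0;
}

/*
 * Enumerate every experimental-format prefix that passes the bitmask:
 * a one-byte header 0ttt1000 (type 1-4, size 8) followed by a zlib
 * header byte 0www1000. Returns how many of them also pass the
 * checksum and stores the last such 16-bit word in *hit.
 */
static int count_false_positives(unsigned int *hit)
{
	int type, wbits, count = 0;

	for (type = 1; type <= 4; type++) {
		for (wbits = 0; wbits <= 7; wbits++) {
			unsigned char buf[2];

			buf[0] = (unsigned char)((type << 4) | 0x08);
			buf[1] = (unsigned char)((wbits << 4) | 0x08);
			if (looks_like_zlib_header(buf)) {
				*hit = (buf[0] << 8) | buf[1];
				count++;
			}
		}
	}
	return count;
}
```

With these helpers, both 0x78 0x9C (32KB window) and 0x48 0x89 (4KB window) are accepted as zlib headers, and the enumeration finds exactly one colliding prefix, 0x1838.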
The alternative to this patch is simply removing support for the
experimental format, which I am also totally cool with.
References:
Android uses a 4KB window for deflation:
http://android.git.kernel.org/?p=platform/libcore.git;a=blob;f=luni/src/main/native/java_util_zip_Deflater.cpp;h=c0b2feff196e63a7b85d97cf9ae5bb2583409c28;hb=refs/heads/gingerbread#l53
Code snippet searching for false positives with the zlib checksum:
https://gist.github.com/1118177
Signed-off-by: Roberto Tyley <roberto.tyley@guardian.co.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/index-pack:
verify-pack: use index-pack --verify
index-pack: show histogram when emulating "verify-pack -v"
index-pack: start learning to emulate "verify-pack -v"
index-pack: a miniscule refactor
index-pack --verify: read anomalous offsets from v2 idx file
write_idx_file: need_large_offset() helper function
index-pack: --verify
write_idx_file: introduce a struct to hold idx customization options
index-pack: group the delta-base array entries also by type
Conflicts:
builtin/verify-pack.c
cache.h
sha1_file.c
* jc/zlib-wrap:
zlib: allow feeding more than 4GB in one go
zlib: zlib can only process 4GB at a time
zlib: wrap deflateBound() too
zlib: wrap deflate side of the API
zlib: wrap inflateInit2 used to accept only for gzip format
zlib: wrap remaining calls to direct inflate/inflateEnd
zlib wrapper: refactor error message formatter
Conflicts:
sha1_file.c
In a workload other than "git log" (without pathspec nor any option that
causes us to inspect trees and blobs), the recency pack order is said to
cause the access to jump around quite a bit. Add a hook to allow us to
observe how bad it is.
"git config core.logpackaccess /var/tmp/pal.txt" will give you the log
in the specified file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The size of objects we read from the repository and data we try to put
into the repository are represented in "unsigned long", so that on larger
architectures we can handle objects that weigh more than 4GB.
But the interface defined in zlib.h to communicate with inflate/deflate
limits avail_in (how many bytes of input are we calling zlib with) and
avail_out (how many bytes of output from zlib are we ready to accept)
fields effectively to 4GB by defining their type to be uInt.
In many places in our code, we allocate a large buffer (e.g. mmap'ing a
large loose object file) and tell zlib its size by assigning the size to
avail_in field of the stream, but that will truncate the high octets of
the real size. The worst part of this story is that we often pass around
z_stream (the state object used by zlib) to keep track of the number of
used bytes in input/output buffer by inspecting these two fields, which
practically limits our callchain to the same 4GB limit.
Wrap z_stream in another structure git_zstream that can express avail_in
and avail_out in unsigned long. For now, just die() when the caller gives
a size that cannot be given to a single zlib call. In later patches in the
series, we would make git_inflate() and git_deflate() internally loop to
give callers an illusion that our "improved" version of zlib interface can
operate on a buffer larger than 4GB in one go.
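The looping planned for the later patches could look roughly like the sketch below. It is deliberately free of zlib so that it stands alone: struct sketch_zstream, sketch_buf_cap(), sketch_feed_all() and the 1GB per-call ceiling are all assumptions made for illustration, not names or values taken from this series.

```c
/* Assumed per-call ceiling; any value that fits in zlib's uInt would do. */
#define SKETCH_ZLIB_BUF_MAX (1024UL * 1024UL * 1024UL) /* 1GB */

/* Stand-in for the wrapper structure: sizes are kept in unsigned long. */
struct sketch_zstream {
	unsigned long avail_in;  /* bytes of input the caller still has */
	unsigned long total_fed; /* bytes handed to the library so far */
};

/* Clamp a caller-side size to what a single library call may be given. */
static unsigned int sketch_buf_cap(unsigned long len)
{
	return (unsigned int)(len > SKETCH_ZLIB_BUF_MAX ? SKETCH_ZLIB_BUF_MAX : len);
}

/*
 * Feed the whole input in capped chunks. Returns the number of calls
 * a real wrapper would have made to the library.
 */
static unsigned long sketch_feed_all(struct sketch_zstream *s)
{
	unsigned long calls = 0;

	while (s->avail_in) {
		unsigned int chunk = sketch_buf_cap(s->avail_in);

		/* a real wrapper would set the library's avail_in to chunk here */
		s->avail_in -= chunk;
		s->total_fed += chunk;
		calls++;
	}
	return calls;
}
```

The caller only ever sees the unsigned long counters; the narrower per-call size stays hidden inside the loop.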
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Wrap deflateInit, deflate, and deflateEnd for everybody, and the sole use
of deflateInit2 in remote-curl.c to tell the library to use gzip header
and trailer in git_deflate_init_gzip().
There is only one caller that cares about the status from deflateEnd().
Introduce git_deflate_end_gently() to let that sole caller retrieve the
status and act on it (i.e. die) for now, but we would probably want to
make inflate_end/deflate_end die when they ran out of memory and get
rid of the _gently() kind.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Every time I look at the read-loose-object codepath, legacy_loose_object()
function makes my brain go through mental contortion. When we were playing
with the experimental loose object format, it may have made sense to call
the traditional format "legacy", in the hope that the experimental one
will some day replace it to become official, but it never happened.
This renames the function (and negates its return value) to detect if we
are looking at the experimental format, and moves the code around in its
caller which used to do "if we are looking at legacy, do this special case,
otherwise the normal case is this". The codepath to read from the loose
objects in experimental format is the "unlikely" case.
Someday after Git 2.0, we should drop the support of this format.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This finally gets rid of the inefficient verify-pack implementation that
walks objects in the packfile in their object name order and replaces it
with a call to index-pack --verify. As a side effect, it also removes
packed_object_info_detail() API which is rather expensive.
As this changes the way errors are reported (verify-pack used to rely on
the usual runtime error detection routine unpack_entry() to diagnose the
CRC errors in an entry in the *.idx file; index-pack --verify checks the
whole *.idx file in one go), update a test that expected the string "CRC"
to appear in the error message.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Using an unsigned type, we would fail to detect a read error and then
proceed to try to write (size_t)-1 bytes.
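A minimal sketch of the corrected pattern, assuming a plain read/write copy loop (sketch_copy_fd() is a made-up name, not the function touched by this patch, and EINTR handling is left out for brevity):

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy everything from fd `in` to fd `out`.
 * Returns 0 on success, -1 on a read or write error.
 */
static int sketch_copy_fd(int in, int out)
{
	char buf[8192];

	for (;;) {
		/* must be signed: read() reports an error as -1 */
		ssize_t len = read(in, buf, sizeof(buf));
		ssize_t off = 0;

		if (len < 0)
			return -1;
		if (!len)
			return 0; /* end of input */
		while (off < len) {
			ssize_t written = write(out, buf + off, len - off);

			if (written <= 0)
				return -1;
			off += written;
		}
	}
}
```

Had `len` been declared as size_t, the `len < 0` test could never be true and the error return from read() would have been treated as a huge byte count.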
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/bigfile:
Bigfile: teach "git add" to send a large file straight to a pack
index_fd(): split into two helper functions
index_fd(): turn write_object and format_check arguments into one flag
Make map_sha1_file(), parse_sha1_header() and unpack_sha1_header()
available to the streaming read API by exporting them via cache.h header
file.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
An object found in the delta-base cache is not guaranteed to
stay there, but we know it came from a pack and it is likely
to give us a quick access if we read_sha1_file() it right now,
which is a piece of useful information.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/replacing:
read_sha1_file(): allow selective bypassing of replacement mechanism
inline lookup_replace_object() calls
read_sha1_file(): get rid of read_sha1_file_repl() madness
t6050: make sure we test not just commit replacement
Declare lookup_replace_object() in cache.h, not in commit.h
Conflicts:
environment.c
The original interface for sha1_object_info() takes an object name and
gives back a type and its size (the latter is given only when it was
asked). The new interface wraps its implementation and exposes a bit
more pieces of information that the interface used to discard, namely:
- where the object is stored (loose? cached? packed?)
- if packed, in which packfile and where in it?
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
* In the earlier round, this used u.pack.delta to record the length of
the delta chain, but the caller is not necessarily interested in the
length of the delta chain per se, but may only want to know if it is a
delta against another object or is stored as deflated data. Calling
packed_object_info_detail() involves walking the reverse index chain to
compute the store size of the object and is unnecessarily expensive.
We could resurrect the code if a new caller wants to know, but I doubt
it.
Instead return an integer that can be given to typename() if
the caller wants a string, just like everybody else does.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* jc/convert:
convert: make it harder to screw up adding a conversion attribute
convert: make it safer to add conversion attributes
convert: give saner names to crlf/eol variables, types and functions
convert: rename the "eol" global variable to "core_eol"
* jc/bigfile:
Bigfile: teach "git add" to send a large file straight to a pack
index_fd(): split into two helper functions
index_fd(): turn write_object and format_check arguments into one flag
* jc/replacing:
read_sha1_file(): allow selective bypassing of replacement mechanism
inline lookup_replace_object() calls
read_sha1_file(): get rid of read_sha1_file_repl() madness
t6050: make sure we test not just commit replacement
Declare lookup_replace_object() in cache.h, not in commit.h
Since commit c793430 (Limit file descriptors used by packs, 2011-02-28),
the extra parameter added in f2e872aa (Work around EMFILE when there are
too many pack files, 2010-11-01) is not used anymore.
Remove it.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Acked-by: Shawn O. Pearce <spearce@spearce.org>
The way the "object replacement" mechanism was tucked into the
read_sha1_file() interface was suboptimal in a couple of ways:
- Callers that want it to die with a useful diagnosis upon seeing a corrupt
object do not have a way to say that they do not want any object
replacement.
- Callers who do not want it to die but want to handle the errors
themselves are told to arrange to call read_object(), but the function
does not use the replacement mechanism, and also it is a file scope
static function that not many callers can call to begin with.
This adds a read_sha1_file_extended() that takes a set of flags; the
callers of read_sha1_file() pass a flag READ_SHA1_FILE_REPLACE to ask
for the object replacement mechanism to kick in.
Later, we could add another flag bit to tell the function to return an
error instead of dying and then remove the misguided "call read_object()
yourself".
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Most callers want to silently get a replacement object, and they do not
care what the real name of the replacement object is. Worse yet, no sane
interface to return the underlying object without replacement is provided.
Remove the function and make only the few callers that want the name of
the replacement object find it themselves.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When adding new content to the repository, we have always slurped
the blob in its entirety in-core first, and computed the object name
and compressed it into a loose object file. Handling large binary
files (e.g. video and audio assets for games) has been problematic
because of this design.
At the middle level of "git add" callchain is an internal API
index_fd() that takes an open file descriptor to read from the
working tree file being added with its size. Teach it to call out to
fast-import when adding a large blob.
The write-out codepath in entry.c::write_entry() should be taught to
stream, instead of reading everything in core. This should not be so
hard to implement, especially if we limit ourselves only to loose
object files and non-delta representation in packfiles.
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Split out the case where we do not know the size of the input (hence we
read everything into a strbuf before doing anything) to index_pipe(), and
the other case where we mmap or read the whole data to index_bulk().
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "format_check" parameter tucked after the existing parameters is too
ugly an afterthought to live in any reasonable API.
Combine it with the other boolean parameter "write_object" into a single
"flags" parameter.
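The shape of the change can be sketched as below. The flag names and helper functions are hypothetical and chosen only for this illustration; they are not claimed to be the identifiers used by the patch.

```c
/* Hypothetical flag bits standing in for the two former boolean parameters. */
#define SKETCH_WRITE_OBJECT 01
#define SKETCH_FORMAT_CHECK 02

/* Fold the two old booleans into one flags word. */
static unsigned sketch_make_flags(int write_object, int format_check)
{
	unsigned flags = 0;

	if (write_object)
		flags |= SKETCH_WRITE_OBJECT;
	if (format_check)
		flags |= SKETCH_FORMAT_CHECK;
	return flags;
}

/* Callee side: each former parameter becomes a bit test. */
static int sketch_wants_write(unsigned flags)
{
	return !!(flags & SKETCH_WRITE_OBJECT);
}

static int sketch_wants_format_check(unsigned flags)
{
	return !!(flags & SKETCH_FORMAT_CHECK);
}
```

A later option then costs one more bit rather than one more positional parameter at every call site.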
Signed-off-by: Junio C Hamano <gitster@pobox.com>
I found that some doubled words had snuck back into projects from which
I'd already removed them, so now there's a "syntax-check" makefile rule in
gnulib to help prevent recurrence.
Running the command below spotted a few in git, too:
git ls-files | xargs perl -0777 -n \
-e 'while (/\b(then?|[iao]n|i[fst]|but|f?or|at|and|[dt])\s+\1\b/gims)' \
-e '{$n=($` =~ tr/\n/\n/ + 1); ($v=$&)=~s/\n/\\n/g;' \
-e 'print "$ARGV:$n:$v\n"}'
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The git-new-workdir script in contrib/ makes a new work tree by sharing
many subdirectories of the .git directory with the original repository.
When rerere.enabled is set in the original repository, but the user has
not encountered any conflicts yet, the original repository may not yet
have .git/rr-cache directory.
When rerere wants to run in a new work tree created from such a young
original repository, it fails to mkdir(2) .git/rr-cache that is a symlink
to a yet-to-be-created directory.
There are three possible approaches to this:
- A naive solution is not to create a symlink in the git-new-workdir
script to a directory the original does not have (yet). This is not a
solution, as we tend to lazily create subdirectories of .git/, and
having rerere.enabled configuration set is a strong indication that the
user _wants_ this lazy creation to happen;
- We could always create .git/rr-cache upon repository creation. This is
tempting but will not help people with existing repositories.
- Detect this case by seeing that mkdir(2) failed with EEXIST, checking
that the path is a symlink, and try running mkdir(2) on the link
target.
This patch solves the issue by doing the third one.
Strictly speaking, this is incomplete. It does not attempt to handle a
relative symbolic link that points into the original repository, but this
is good enough to help people who use the contrib/workdir/git-new-workdir
script.
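The third approach can be sketched with plain POSIX calls as follows. mkdir_through_symlink() is a made-up name for this illustration, and the sketch assumes the caller has already checked that the directory is missing; like the patch, it gives up on relative link targets.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <limits.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create directory `path`. If that fails with EEXIST because `path`
 * is a symlink to a directory that does not exist yet, create the
 * link target instead. Only absolute link targets are handled.
 * Returns 0 on success, -1 on failure with errno from the first mkdir().
 */
static int mkdir_through_symlink(const char *path)
{
	char target[PATH_MAX];
	struct stat st;
	ssize_t len;
	int saved_errno;

	if (!mkdir(path, 0777))
		return 0;
	saved_errno = errno;
	if (errno != EEXIST)
		return -1;
	if (lstat(path, &st) || !S_ISLNK(st.st_mode))
		goto fail;
	len = readlink(path, target, sizeof(target) - 1);
	if (len < 0 || (size_t)len >= sizeof(target) - 1)
		goto fail; /* unreadable or possibly truncated target */
	target[len] = '\0';
	if (target[0] != '/' || mkdir(target, 0777))
		goto fail;
	return 0;
fail:
	errno = saved_errno;
	return -1;
}
```

Restoring errno on failure lets the caller report the original EEXIST rather than whatever the later probing calls happened to set.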
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In the spirit of v1.5.0.2~21 (Check for PRIuMAX rather than
NO_C99_FORMAT in fast-import.c, 2007-02-20), use PRIuMAX from
git-compat-util.h on all platforms instead of C99-specific formats
like %zu with dangerous fallbacks to %u or %lu.
So now C99-challenged platforms can build git without provoking
warnings or errors from printf, even if pointers do not have the same
size as an int or long.
The need for a fallback PRIuMAX is detected in git-compat-util.h with
"#ifndef PRIuMAX". So while at it, simplify the Makefile and configure
script by eliminating the NO_C99_FORMAT knob altogether.
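As an illustration of the idiom (not code from this patch), a size_t can be printed by widening it to uintmax_t and using PRIuMAX. sketch_format_size() is a made-up helper, and the "llu" fallback shown here is an assumption about what a compatibility header might define when the platform lacks PRIuMAX.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

#ifndef PRIuMAX
#define PRIuMAX "llu" /* assumed fallback; the real one lives in a compat header */
#endif

/* Format a size_t without relying on the C99-only "%zu" conversion. */
static int sketch_format_size(char *buf, size_t buflen, size_t value)
{
	return snprintf(buf, buflen, "%" PRIuMAX, (uintmax_t)value);
}
```

The cast matters as much as the format macro: the argument must really be a uintmax_t for the conversion to be well defined.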
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* sp/maint-fd-limit:
sha1_file.c: Don't retain open fds on small packs
mingw: add minimum getrlimit() compatibility stub
Limit file descriptors used by packs
If a pack file is small enough that its entire contents fit within
one mmap window, mmap the file and then immediately close its file
descriptor. This reduces the number of file descriptors that are
needed to read from repositories with many tiny pack files, such
as one that has received 1000 pushes (and created 1000 small pack
files) since its last repack.
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Rather than using 'errno == EMFILE' after a failed open() call
to indicate the process is out of file descriptors and an LRU
pack window should be closed, place a hard upper limit on the
number of open packs based on the actual rlimit of the process.
By using a hard upper limit that is below the rlimit of the current
process it is not necessary to check for EMFILE on every single
fd-allocating system call. Instead, reserving 25 file descriptors
makes it safe to assume the system call won't fail due to being over
the file descriptor limit. Here 25 is chosen as a WAG, but considers
3 for stdin/stdout/stderr, and at least a few for other Git code
to operate on temporary files. An additional 20 is reserved as it
is not known what the C library needs to perform other services on
Git's behalf, such as nsswitch or name resolution.
This fixes a case where running `git gc --auto` in a repository
with more than 1024 packs (but an rlimit of 1024 open fds) fails
due to the temporary output file not being able to allocate a
file descriptor. The output file is opened by pack-objects after
object enumeration and delta compression are done, both of which
have already opened all of the packs and fully populated the file
descriptor table.
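The computation of the upper limit can be sketched like this. The function names are invented for the illustration, and the choices to fall back to a budget of one open pack (when the limit is tiny or the query fails) and to clamp very large limits are assumptions of this sketch.

```c
#include <limits.h>
#include <sys/resource.h>

#define SKETCH_RESERVED_FDS 25 /* stdin/stdout/stderr plus headroom, as described above */

/* Given the process fd limit, how many packs may be kept open at once? */
static unsigned int sketch_pack_fd_budget(unsigned long limit)
{
	if (limit <= SKETCH_RESERVED_FDS)
		return 1; /* assumption: always allow at least one open pack */
	if (limit - SKETCH_RESERVED_FDS > UINT_MAX)
		return UINT_MAX; /* e.g. an "unlimited" rlimit */
	return (unsigned int)(limit - SKETCH_RESERVED_FDS);
}

/* Query the soft RLIMIT_NOFILE and derive the budget from it. */
static unsigned int sketch_current_pack_fd_budget(void)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NOFILE, &lim))
		return 1; /* assumption: be conservative if the query fails */
	return sketch_pack_fd_budget((unsigned long)lim.rlim_cur);
}
```

With an rlimit of 1024 this yields a budget of 999 open packs, which leaves room for the temporary output file in the `git gc --auto` case described above.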
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Functions such as hashcmp that expect a binary SHA-1 value take
parameters of type "unsigned char *" to avoid accepting a textual
SHA-1 passed by mistake. Unfortunately, this means passing the string
literal EMPTY_TREE_SHA1_BIN requires an ugly cast. Tweak the
definition of EMPTY_TREE_SHA1_BIN to produce a value of more
convenient type.
In the future the definition might change to
extern const unsigned char empty_tree_sha1_bin[20];
#define EMPTY_TREE_SHA1_BIN empty_tree_sha1_bin
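A sketch of the idea, assuming the "more convenient type" is const unsigned char *: the cast moves into the macro so that callers no longer write it. The SKETCH_ names and sketch_hashcmp() are stand-ins for this illustration; the 20 bytes are the binary form of the empty tree's object name, 4b825dc642cb6eb9a060e54bf8d69288fbee4904.

```c
#include <string.h>

/* Binary SHA-1 of the empty tree, spelled as a string literal. */
#define SKETCH_EMPTY_TREE_LITERAL \
	"\x4b\x82\x5d\xc6\x42\xcb\x6e\xb9\xa0\x60" \
	"\xe5\x4b\xf8\xd6\x92\x88\xfb\xee\x49\x04"

/* The cast lives in the macro, so call sites stay clean. */
#define SKETCH_EMPTY_TREE_BIN \
	((const unsigned char *)SKETCH_EMPTY_TREE_LITERAL)

/* Stand-in for a hashcmp-style function that takes binary SHA-1 values. */
static int sketch_hashcmp(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, 20);
}
```

A call such as sketch_hashcmp(sha1, SKETCH_EMPTY_TREE_BIN) then compiles without a cast at the call site, while passing a plain char * textual SHA-1 still draws a diagnostic.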
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>