Commit graph

22 commits

Author SHA1 Message Date
Nguyễn Thái Ngọc Duy fbd4a7036d list-objects: mark more commits as edges in mark_edges_uninteresting
The purpose of edge commits is to let pack-objects know what objects
it can use as base, but does not need to include in the thin pack
because the other side is supposed to already have them. So far we
mark uninteresting parents of interesting commits as edges. But even
an unrelated uninteresting commit (that the other side has) may
become a good base for pack-objects and help produce more efficient
packs.

This is especially true for shallow clone, when the client issues a
fetch with a depth smaller or equal to the number of commits the
server is ahead of the client. For example, in this commit history
the client has up to "A" and the server has up to "B":

    -------A---B
     have--^   ^
              /
       want--+

If depth 1 is requested, the commit list to send to the client
includes only B. The way m_e_u is working, it checks if parent
commits of B are uninteresting, if so mark them as edges.  Due to
shallow effect, commit B is grafted to have no parents and the
revision walker never sees A as the parent of B. In fact it marks no
edges at all in this simple case and sends everything B has to the
client even if it could have excluded what A and also the client
already have.

In a slightly different case where A is not a direct parent of B
(iow there are commits in between A and B), marking A as an edge can
still save some because B may still have stuff from the far ancestor
A.

There is another case from the earlier patch, when we deepen a ref
from C->E to A->E:

    ---A---B   C---D---E
     want--^   ^       ^
       shallow-+      /
          have-------+

In this case we need to send A and B to the client, and C (i.e. the
current shallow point that the client informs the server) is a very
good base because it's closet to A and B. Normal m_e_u won't recognize
C as an edge because it only looks back to parents (i.e. A<-B) not the
opposite way B->C even if C is already marked as uninteresting commit
by the previous patch.

This patch includes all uninteresting commits from command line as
edges and lets pack-objects decide what's best to do. The upside is we
have better chance of producing better packs in certain cases. The
downside is we may need to process some extra objects on the server
side.

For the shallow case on git.git, when the client is 5 commits behind
and does "fetch --depth=3", the result pack is 99.26 KiB instead of
4.92 MiB.

Reported-and-analyzed-by: Matthijs Kooijman <matthijs@stdin.nl>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-28 11:54:18 -07:00
Nguyễn Thái Ngọc Duy e76a5fb459 list-objects: reduce one argument in mark_edges_uninteresting
mark_edges_uninteresting() is always called with this form

  mark_edges_uninteresting(revs->commits, revs, ...);

Remove the first argument and let mark_edges_uninteresting figure that
out by itself. It helps answer the question "are this commit list and
revs related in any way?" when looking at mark_edges_uninteresting
implementation.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2013-08-28 11:54:18 -07:00
Nguyễn Thái Ngọc Duy d688cf07b1 tree_entry_interesting(): give meaningful names to return values
It is a basic code hygiene to avoid magic constants that are unnamed.
Besides, this helps extending the value later on for "interesting, but
cannot decide if the entry truely matches yet" (ie. prefix matches)

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-10-27 11:38:24 -07:00
Junio C Hamano 4947367267 list-objects: pass callback data to show_objects()
The traverse_commit_list() API takes two callback functions, one to show
commit objects, and the other to show other kinds of objects. Even though
the former has a callback data parameter, so that the callback does not
have to rely on global state, the latter does not.

Give the show_objects() callback the same callback data parameter.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-09-01 15:46:12 -07:00
Junio C Hamano 1273738f05 Merge branch 'nd/struct-pathspec'
* nd/struct-pathspec:
  pathspec: rename per-item field has_wildcard to use_wildcard
  Improve tree_entry_interesting() handling code
  Convert read_tree{,_recursive} to support struct pathspec
  Reimplement read_tree_recursive() using tree_entry_interesting()
2011-05-06 10:50:06 -07:00
Nguyễn Thái Ngọc Duy 97d0b74a49 Improve tree_entry_interesting() handling code
t_e_i() can return -1 or 2 to early shortcut a search. Current code
may use up to two variables to handle it. One for saving return value
from t_e_i temporarily, one for saving return code 2.

The second variable is not needed. If we make sure the first variable
does not change until the next t_e_i() call, then we can do something
like this:

int ret = 0;

while (...) {
	if (ret != 2) {
		ret = t_e_i();
		if (ret < 0) /* no longer interesting */
			break;
		if (ret == 0) /* skip this round */
			continue;
	}
	/* ret > 0, interesting */
}

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-25 09:20:33 -07:00
Junio C Hamano 4b28cd9f2f Merge branch 'jc/maint-rev-list-culled-boundary'
* jc/maint-rev-list-culled-boundary:
  list-objects.c: don't add an unparsed NULL as a pending tree

Conflicts:
	list-objects.c
2011-03-22 21:37:59 -07:00
Junio C Hamano 6e7d0efa90 list-objects.c: don't add an unparsed NULL as a pending tree
"git rev-list --first-parent --boundary $commit^..$commit" segfaults on a
merge commit since 8d2dfc4 (process_{tree,blob}: show objects without
buffering, 2009-04-10), as it tried to dereference a commit that was
discarded as UNINTERESTING without being parsed (hence lacking "tree").

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-03-14 12:33:58 -07:00
Elijah Newren cc5fa2fdaf Make rev-list --objects work together with pathspecs
When traversing commits, the selection of commits would heed the list of
pathspecs passed, but subsequent walking of the trees of those commits
would not.  This resulted in 'rev-list --objects HEAD -- <paths>'
displaying objects at unwanted paths.

Have process_tree() call tree_entry_interesting() to determine which paths
are interesting and should be walked.

Naturally, this change can provide a large speedup when paths are specified
together with --objects, since many tree entries are now correctly ignored.
Interestingly, though, this change also gives me a small (~1%) but
repeatable speedup even when no paths are specified with --objects.

Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-03 14:10:46 -08:00
Junio C Hamano 9824a388e5 Merge branch 'lt/pack-object-memuse'
* lt/pack-object-memuse:
  show_object(): push path_name() call further down
  process_{tree,blob}: show objects without buffering

Conflicts:
	builtin-pack-objects.c
	builtin-rev-list.c
	list-objects.c
	list-objects.h
	upload-pack.c
2009-04-18 14:46:17 -07:00
Linus Torvalds 8d2dfc49b1 process_{tree,blob}: show objects without buffering
Here's a less trivial thing, and slightly more dubious one.

I was looking at that "struct object_array objects", and wondering why we
do that. I have honestly totally forgotten. Why not just call the "show()"
function as we encounter the objects? Rather than add the objects to the
object_array, and then at the very end going through the array and doing a
'show' on all, just do things more incrementally.

Now, there are possible downsides to this:

 - the "buffer using object_array" _can_ in theory result in at least
   better I-cache usage (two tight loops rather than one more spread out
   one). I don't think this is a real issue, but in theory..

 - this _does_ change the order of the objects printed. Instead of doing a
   "process_tree(revs, commit->tree, &objects, NULL, "");" in the loop
   over the commits (which puts all the root trees _first_ in the object
   list, this patch just adds them to the list of pending objects, and
   then we'll traverse them in that order (and thus show each root tree
   object together with the objects we discover under it)

   I _think_ the new ordering actually makes more sense, but the object
   ordering is actually a subtle thing when it comes to packing
   efficiency, so any change in order is going to have implications for
   packing. Good or bad, I dunno.

 - There may be some reason why we did it that odd way with the object
   array, that I have simply forgotten.

Anyway, now that we don't buffer up the objects before showing them
that may actually result in lower memory usage during that whole
traverse_commit_list() phase.

This is seriously not very deeply tested. It makes sense to me, it seems
to pass all the tests, it looks ok, but...

Does anybody remember why we did that "object_array" thing? It used to be
an "object_list" a long long time ago, but got changed into the array due
to better memory usage patterns (those linked lists of obejcts are
horrible from a memory allocation standpoint). But I wonder why we didn't
do this back then. Maybe there's a reason for it.

Or maybe there _used_ to be a reason, and no longer is.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-04-12 17:28:31 -07:00
Linus Torvalds cf2ab916af show_object(): push path_name() call further down
In particular, pushing the "path_name()" call _into_ the show() function
would seem to allow

 - more clarity into who "owns" the name (ie now when we free the name in
   the show_object callback, it's because we generated it ourselves by
   calling path_name())

 - not calling path_name() at all, either because we don't care about the
   name in the first place, or because we are actually happy walking the
   linked list of "struct name_path *" and the last component.

Now, I didn't do that latter optimization, because it would require some
more coding, but especially looking at "builtin-pack-objects.c", we really
don't even want the whole pathname, we really would be better off with the
list of path components.

Why? We use that name for two things:
 - add_preferred_base_object(), which actually _wants_ to traverse the
   path, and now does it by looking for '/' characters!
 - for 'name_hash()', which only cares about the last 16 characters of a
   name, so again, generating the full name seems to be just unnecessary
   work.

Anyway, so I didn't look any closer at those things, but it did convince
me that the "show_object()" calling convention was crazy, and we're
actually better off doing _less_ in list-objects.c, and giving people
access to the internal data structures so that they can decide whether
they want to generate a path-name or not.

This patch does that, and then for people who did use the name (even if
they might do something more clever in the future), it just does the
straightforward "name = path_name(path, component); .. free(name);" thing.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-04-12 17:28:31 -07:00
Junio C Hamano 6e353a5e5d Merge branch 'cc/bisect-filter'
* cc/bisect-filter: (21 commits)
  rev-list: add "int bisect_show_flags" in "struct rev_list_info"
  rev-list: remove last static vars used in "show_commit"
  list-objects: add "void *data" parameter to show functions
  bisect--helper: string output variables together with "&&"
  rev-list: pass "int flags" as last argument of "show_bisect_vars"
  t6030: test bisecting with paths
  bisect: use "bisect--helper" and remove "filter_skipped" function
  bisect: implement "read_bisect_paths" to read paths in "$GIT_DIR/BISECT_NAMES"
  bisect--helper: implement "git bisect--helper"
  bisect: use the new generic "sha1_pos" function to lookup sha1
  rev-list: call new "filter_skip" function
  patch-ids: use the new generic "sha1_pos" function to lookup sha1
  sha1-lookup: add new "sha1_pos" function to efficiently lookup sha1
  rev-list: pass "revs" to "show_bisect_vars"
  rev-list: make "show_bisect_vars" non static
  rev-list: move code to show bisect vars into its own function
  rev-list: move bisect related code into its own file
  rev-list: make "bisect_list" variable local to "cmd_rev_list"
  refs: add "for_each_ref_in" function to refactor "for_each_*_ref" functions
  quote: add "sq_dequote_to_argv" to put unwrapped args in an argv array
  ...
2009-04-12 16:46:40 -07:00
Björn Steinbrink de551d472e process_{tree,blob}: Remove useless xstrdup calls
The name of the processed object was duplicated for passing it to
add_object(), but that already calls path_name, which allocates a new
string anyway. So the memory allocated by the xstrdup calls just went
nowhere, leaking memory.

This reduces the RSS usage for a "rev-list --all --objects" by about 10% on
the gentoo repo (fully packed) as well as linux-2.6.git:

    gentoo:
                    | old           | new
    ----------------|-------------------------------
    RSS             |       1537284 |       1388408
    VSZ             |       1816852 |       1667952
    time elapsed    |       1:49.62 |       1:48.99
    min. page faults|        417178 |        379919

    linux-2.6.git:
                    | old           | new
    ----------------|-------------------------------
    RSS             |        324452 |        292996
    VSZ             |        491792 |        460376
    time elapsed    |       0:14.53 |       0:14.28
    min. page faults|         89360 |         81613

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-04-08 22:58:43 -07:00
Christian Couder 11c211fa06 list-objects: add "void *data" parameter to show functions
The goal of this patch is to get rid of the "static struct rev_info
revs" static variable in "builtin-rev-list.c".

To do that, we need to pass the revs to the "show_commit" function
in "builtin-rev-list.c" and this in turn means that the
"traverse_commit_list" function in "list-objects.c" must be passed
functions pointers to functions with 2 parameters instead of one.

So we have to change all the callers and all the functions passed
to "traverse_commit_list".

Anyway this makes the code more clean and more generic, so it
should be a good thing in the long run.

Signed-off-by: Christian Couder <chriscool@tuxfamily.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-04-07 22:12:38 -07:00
Martin Koegler a301b0c8f2 list-objects.c::process_tree/blob: check for NULL
As these functions are directly called with the result
from lookup_tree/blob, they must handle NULL.

Signed-off-by: Martin Koegler <mkoegler@auto.tuwien.ac.at>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-02-18 19:25:26 -08:00
Shawn O. Pearce 295dd2ad20 Fix memory leak in traverse_commit_list
If we were listing objects too then the objects were buffered in an
array only reachable from a stack allocated structure.  When this
function returns that array would be leaked as nobody would have
a reference to it anymore.

Historically this hasn't been a problem as the primary user of
traverse_commit_list() (the noble git-rev-list) would terminate
as soon as the function was finished, thus allowing the operating
system to cleanup memory.  However we have been leaking this data
in git-pack-objects ever since that program learned how to run the
revision listing internally, rather than relying on reading object
names from git-rev-list.

To better facilitate reuse of traverse_commit_list during other
builtin tools (such as git-fetch) we shouldn't leak temporary memory
like this and instead we need to clean up properly after ourselves.

Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-10 01:52:54 -08:00
Martin Waitz 302b9282c9 rename dirlink to gitlink.
Unify naming of plumbing dirlink/gitlink concept:

git ls-files -z '*.[ch]' |
xargs -0 perl -pi -e 's/dirlink/gitlink/g;' -e 's/DIRLNK/GITLINK/g;'

Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-21 23:34:54 -07:00
Linus Torvalds 6e2f441bd4 Teach git list-objects logic to not follow gitlinks
This allows us to pack superprojects and thus clone them (but not yet
check them out on the receiving side.. That's the next patch)

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-04-14 03:14:14 -07:00
Linus Torvalds 6fda5e5180 Initialize tree descriptors with a helper function rather than by hand.
This removes slightly more lines than it adds, but the real reason for
doing this is that future optimizations will require more setup of the
tree descriptor, and so we want to do it in one place.

Also renamed the "desc.buf" field to "desc.buffer" just to trigger
compiler errors for old-style manual initializations, making sure I
didn't miss anything.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-21 10:21:57 -07:00
Junio C Hamano 8d1d8f83b5 pack-objects: further work on internal rev-list logic.
This teaches the internal rev-list logic to understand options
that are needed for pack handling: --all, --unpacked, and --thin.

It also moves two functions from builtin-rev-list to list-objects
so that the two programs can share more code.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-09-07 02:46:02 -07:00
Junio C Hamano c64ed70d25 Separate object listing routines out of rev-list
Create a separate file, list-objects.c, and move object listing
routines from rev-list to it.  The next round will use it in
pack-objects directly.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-09-07 02:46:01 -07:00