git/revision.h
Derrick Stolee b45424181e revision.c: generation-based topo-order algorithm
The current --topo-order algorithm requires walking all
reachable commits up front, topo-sorting them, all before
outputting the first value. This patch introduces a new
algorithm which uses stored generation numbers to
incrementally walk in topo-order, outputting commits as
we go. This can dramatically reduce the computation time
to write a fixed number of commits, such as when limiting
with "-n <N>" or filling the first page of a pager.

When running a command like 'git rev-list --topo-order HEAD',
Git performed the following steps:

1. Run limit_list(), which parses all reachable commits,
   adds them to a linked list, and distributes UNINTERESTING
   flags. If all unprocessed commits are UNINTERESTING, then
   it may terminate without walking all reachable commits.
   This does not occur if we do not specify UNINTERESTING
   commits.

2. Run sort_in_topological_order(), which is an implementation
   of Kahn's algorithm. It first iterates through the entire
   set of important commits and computes the in-degree of each
   (plus one, as we use 'zero' as a special value here). Then,
   we walk the commits in priority order, adding them to the
   priority queue if and only if their in-degree is one. As
   we remove commits from this priority queue, we decrement the
   in-degree of their parents.

3. While we are peeling commits for output, get_revision_1()
   uses pop_commit on the full list of commits computed by
   sort_in_topological_order().

In the new algorithm, these three steps correspond to three
different commit walks. We run these walks simultaneously,
and advance each only as far as necessary to satisfy the
requirements of the 'higher order' walk. We know when we can
pause each walk by using generation numbers from the commit-
graph feature.

Recall that the generation number of a commit satisfies:

* If the commit has at least one parent, then the generation
  number is one more than the maximum generation number among
  its parents.

* If the commit has no parent, then the generation number is one.

There are two special generation numbers:

* GENERATION_NUMBER_INFINITY: this value is 0xffffffff and
  indicates that the commit is not stored in the commit-graph and
  the generation number was not previously calculated.

* GENERATION_NUMBER_ZERO: this value (0) is a special indicator
  to say that the commit-graph was generated by a version of Git
  that does not compute generation numbers (such as v2.18.0).

Since we use generation_numbers_enabled() before using the new
algorithm, we do not need to worry about GENERATION_NUMBER_ZERO.
However, the existence of GENERATION_NUMBER_INFINITY implies the
following weaker statement than the usual we expect from
generation numbers:

    If A and B are commits with generation numbers gen(A) and
    gen(B) and gen(A) < gen(B), then A cannot reach B.

Thus, we will walk in each of our stages until the "maximum
unexpanded generation number" is strictly lower than the
generation number of a commit we are about to use.

The walks are as follows:

1. EXPLORE: using the explore_queue priority queue (ordered by
   maximizing the generation number), parse each reachable
   commit until all commits in the queue have generation
   number strictly lower than needed. During this walk, update
   the UNINTERESTING flags as necessary.

2. INDEGREE: using the indegree_queue priority queue (ordered
   by maximizing the generation number), add one to the in-
   degree of each parent for each commit that is walked. Since
   we walk in order of decreasing generation number, we know
   that discovering an in-degree value of 0 means the value for
   that commit was not initialized, so should be initialized to
   two. (Recall that in-degree value "1" is what we use to say a
   commit is ready for output.) As we iterate the parents of a
   commit during this walk, ensure the EXPLORE walk has walked
   beyond their generation numbers.

3. TOPO: using the topo_queue priority queue (ordered based on
   the sort_order given, which could be commit-date, author-
   date, or typical topo-order which treats the queue as a LIFO
   stack), remove a commit from the queue and decrement the
   in-degree of each parent. If a parent has an in-degree of
   one, then we add it to the topo_queue. Before we decrement
   the in-degree, however, ensure the INDEGREE walk has walked
   beyond that generation number.

The implementations of these walks are in the following methods:

* explore_walk_step and explore_to_depth
* indegree_walk_step and compute_indegrees_to_depth
* next_topo_commit and expand_topo_walk

These methods have some patterns that may seem strange at first,
but they are probably carry-overs from their equivalents in
limit_list and sort_in_topological_order.

One thing that is missing from this implementation is a proper
way to stop walking when the entire queue is UNINTERESTING, so
this implementation is not enabled by comparisions, such as in
'git rev-list --topo-order A..B'. This can be updated in the
future.

In my local testing, I used the following Git commands on the
Linux repository in three modes: HEAD~1 with no commit-graph,
HEAD~1 with a commit-graph, and HEAD with a commit-graph. This
allows comparing the benefits we get from parsing commits from
the commit-graph and then again the benefits we get by
restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
  HEAD, w/ commit-graph: 0.02 s

Test: git rev-list --topo-order -100 HEAD -- tools
HEAD~1, no commit-graph: 9.63 s
HEAD~1, w/ commit-graph: 6.06 s
  HEAD, w/ commit-graph: 0.06 s

This speedup is due to a few things. First, the new generation-
number-enabled algorithm walks commits on order of the number of
results output (subject to some branching structure expectations).
Since we limit to 100 results, we are running a query similar to
filling a single page of results. Second, when specifying a path,
we must parse the root tree object for each commit we walk. The
previous benefits from the commit-graph are entirely from reading
the commit-graph instead of parsing commits. Since we need to
parse trees for the same number of commits as before, we slow
down significantly from the non-path-based query.

For the test above, I specifically selected a path that is changed
frequently, including by merge commits. A less-frequently-changed
path (such as 'README') has similar end-to-end time since we need
to walk the same number of commits (before determining we do not
have 100 hits). However, get the benefit that the output is
presented to the user as it is discovered, much the same as a
normal 'git log' command (no '--topo-order'). This is an improved
user experience, even if the command has the same runtime.

Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2018-11-02 12:14:22 +09:00

347 lines
8.5 KiB
C

#ifndef REVISION_H
#define REVISION_H
#include "commit.h"
#include "parse-options.h"
#include "grep.h"
#include "notes.h"
#include "pretty.h"
#include "diff.h"
#include "commit-slab-decl.h"
/* Remember to update object flag allocation in object.h */
#define SEEN (1u<<0)
#define UNINTERESTING (1u<<1)
#define TREESAME (1u<<2)
#define SHOWN (1u<<3)
#define TMP_MARK (1u<<4) /* for isolated cases; clean after use */
#define BOUNDARY (1u<<5)
#define CHILD_SHOWN (1u<<6)
#define ADDED (1u<<7) /* Parents already parsed and added? */
#define SYMMETRIC_LEFT (1u<<8)
#define PATCHSAME (1u<<9)
#define BOTTOM (1u<<10)
#define USER_GIVEN (1u<<25) /* given directly by the user */
#define TRACK_LINEAR (1u<<26)
#define ALL_REV_FLAGS (((1u<<11)-1) | USER_GIVEN | TRACK_LINEAR)
#define TOPO_WALK_EXPLORED (1u<<27)
#define TOPO_WALK_INDEGREE (1u<<28)
#define DECORATE_SHORT_REFS 1
#define DECORATE_FULL_REFS 2
struct rev_info;
struct log_info;
struct string_list;
struct saved_parents;
define_shared_commit_slab(revision_sources, char *);
struct rev_cmdline_info {
unsigned int nr;
unsigned int alloc;
struct rev_cmdline_entry {
struct object *item;
const char *name;
enum {
REV_CMD_REF,
REV_CMD_PARENTS_ONLY,
REV_CMD_LEFT,
REV_CMD_RIGHT,
REV_CMD_MERGE_BASE,
REV_CMD_REV
} whence;
unsigned flags;
} *rev;
};
#define REVISION_WALK_WALK 0
#define REVISION_WALK_NO_WALK_SORTED 1
#define REVISION_WALK_NO_WALK_UNSORTED 2
struct topo_walk_info;
struct rev_info {
/* Starting list */
struct commit_list *commits;
struct object_array pending;
/* Parents of shown commits */
struct object_array boundary_commits;
/* The end-points specified by the end user */
struct rev_cmdline_info cmdline;
/* excluding from --branches, --refs, etc. expansion */
struct string_list *ref_excludes;
/* Basic information */
const char *prefix;
const char *def;
struct pathspec prune_data;
/*
* Whether the arguments parsed by setup_revisions() included any
* "input" revisions that might still have yielded an empty pending
* list (e.g., patterns like "--all" or "--glob").
*/
int rev_input_given;
/*
* Whether we read from stdin due to the --stdin option.
*/
int read_from_stdin;
/* topo-sort */
enum rev_sort_order sort_order;
unsigned int early_output;
unsigned int ignore_missing:1,
ignore_missing_links:1;
/* Traversal flags */
unsigned int dense:1,
prune:1,
no_walk:2,
remove_empty_trees:1,
simplify_history:1,
topo_order:1,
simplify_merges:1,
simplify_by_decoration:1,
single_worktree:1,
tag_objects:1,
tree_objects:1,
blob_objects:1,
verify_objects:1,
edge_hint:1,
edge_hint_aggressive:1,
limited:1,
unpacked:1,
boundary:2,
count:1,
left_right:1,
left_only:1,
right_only:1,
rewrite_parents:1,
print_parents:1,
show_decorations:1,
reverse:1,
reverse_output_stage:1,
cherry_pick:1,
cherry_mark:1,
bisect:1,
ancestry_path:1,
first_parent_only:1,
line_level_traverse:1,
tree_blobs_in_commit_order:1,
/* for internal use only */
exclude_promisor_objects:1;
/* Diff flags */
unsigned int diff:1,
full_diff:1,
show_root_diff:1,
no_commit_id:1,
verbose_header:1,
ignore_merges:1,
combine_merges:1,
dense_combined_merges:1,
always_show_header:1;
/* Format info */
unsigned int shown_one:1,
shown_dashes:1,
show_merge:1,
show_notes:1,
show_notes_given:1,
show_signature:1,
pretty_given:1,
abbrev_commit:1,
abbrev_commit_given:1,
zero_commit:1,
use_terminator:1,
missing_newline:1,
date_mode_explicit:1,
preserve_subject:1;
unsigned int disable_stdin:1;
/* --show-linear-break */
unsigned int track_linear:1,
track_first_time:1,
linear:1;
struct date_mode date_mode;
int expand_tabs_in_log; /* unset if negative */
int expand_tabs_in_log_default;
unsigned int abbrev;
enum cmit_fmt commit_format;
struct log_info *loginfo;
int nr, total;
const char *mime_boundary;
const char *patch_suffix;
int numbered_files;
int reroll_count;
char *message_id;
struct ident_split from_ident;
struct string_list *ref_message_ids;
int add_signoff;
const char *extra_headers;
const char *log_reencode;
const char *subject_prefix;
int no_inline;
int show_log_size;
struct string_list *mailmap;
/* Filter by commit log message */
struct grep_opt grep_filter;
/* Negate the match of grep_filter */
int invert_grep;
/* Display history graph */
struct git_graph *graph;
/* special limits */
int skip_count;
int max_count;
timestamp_t max_age;
timestamp_t min_age;
int min_parents;
int max_parents;
int (*include_check)(struct commit *, void *);
void *include_check_data;
/* diff info for patches and for paths limiting */
struct diff_options diffopt;
struct diff_options pruning;
struct reflog_walk_info *reflog_info;
struct decoration children;
struct decoration merge_simplification;
struct decoration treesame;
/* notes-specific options: which refs to show */
struct display_notes_opt notes_opt;
/* interdiff */
const struct object_id *idiff_oid1;
const struct object_id *idiff_oid2;
const char *idiff_title;
/* range-diff */
const char *rdiff1;
const char *rdiff2;
int creation_factor;
const char *rdiff_title;
/* commit counts */
int count_left;
int count_right;
int count_same;
/* line level range that we are chasing */
struct decoration line_log_data;
/* copies of the parent lists, for --full-diff display */
struct saved_parents *saved_parents_slab;
struct commit_list *previous_parents;
const char *break_bar;
struct revision_sources *sources;
struct topo_walk_info *topo_walk_info;
};
int ref_excluded(struct string_list *, const char *path);
void clear_ref_exclusion(struct string_list **);
void add_ref_exclusion(struct string_list **, const char *exclude);
#define REV_TREE_SAME 0
#define REV_TREE_NEW 1 /* Only new files */
#define REV_TREE_OLD 2 /* Only files removed */
#define REV_TREE_DIFFERENT 3 /* Mixed changes */
/* revision.c */
typedef void (*show_early_output_fn_t)(struct rev_info *, struct commit_list *);
extern volatile show_early_output_fn_t show_early_output;
struct setup_revision_opt {
const char *def;
void (*tweak)(struct rev_info *, struct setup_revision_opt *);
const char *submodule;
int assume_dashdash;
unsigned revarg_opt;
};
void init_revisions(struct rev_info *revs, const char *prefix);
int setup_revisions(int argc, const char **argv, struct rev_info *revs,
struct setup_revision_opt *);
void parse_revision_opt(struct rev_info *revs, struct parse_opt_ctx_t *ctx,
const struct option *options,
const char * const usagestr[]);
#define REVARG_CANNOT_BE_FILENAME 01
#define REVARG_COMMITTISH 02
int handle_revision_arg(const char *arg, struct rev_info *revs,
int flags, unsigned revarg_opt);
void reset_revision_walk(void);
int prepare_revision_walk(struct rev_info *revs);
struct commit *get_revision(struct rev_info *revs);
char *get_revision_mark(const struct rev_info *revs,
const struct commit *commit);
void put_revision_mark(const struct rev_info *revs,
const struct commit *commit);
void mark_parents_uninteresting(struct commit *commit);
void mark_tree_uninteresting(struct tree *tree);
void show_object_with_name(FILE *, struct object *, const char *);
void add_pending_object(struct rev_info *revs,
struct object *obj, const char *name);
void add_pending_oid(struct rev_info *revs,
const char *name, const struct object_id *oid,
unsigned int flags);
void add_head_to_pending(struct rev_info *);
void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
void add_index_objects_to_pending(struct rev_info *, unsigned int flags);
enum commit_action {
commit_ignore,
commit_show,
commit_error
};
enum commit_action get_commit_action(struct rev_info *revs,
struct commit *commit);
enum commit_action simplify_commit(struct rev_info *revs,
struct commit *commit);
enum rewrite_result {
rewrite_one_ok,
rewrite_one_noparents,
rewrite_one_error
};
typedef enum rewrite_result (*rewrite_parent_fn_t)(struct rev_info *revs, struct commit **pp);
int rewrite_parents(struct rev_info *revs,
struct commit *commit,
rewrite_parent_fn_t rewrite_parent);
/*
* The log machinery saves the original parent list so that
* get_saved_parents() can later tell what the real parents of the
* commits are, when commit->parents has been modified by history
* simpification.
*
* get_saved_parents() will transparently return commit->parents if
* history simplification is off.
*/
struct commit_list *get_saved_parents(struct rev_info *revs, const struct commit *commit);
#endif