git/object.h
Derrick Stolee 8d049e182e revision: --show-pulls adds helpful merges
The default file history simplification of "git log -- <path>" or
"git rev-list -- <path>" focuses on providing the smallest set of
commits that first contributed a change. The revision walk greatly
restricts the set of walked commits by visiting only the first
TREESAME parent of a merge commit, when one exists. This means
that portions of the commit-graph are not walked, which can be a
performance benefit, but can also "hide" commits that added changes
but were ignored by a merge resolution.

The --full-history option modifies this by walking all commits and
reporting a merge commit as "interesting" if it has _any_ parent
that is not TREESAME. This tends to be an over-representation of
important commits, especially in an environment where most merge
commits are created by pull request completion.

Suppose we have a commit A and we create a commit B on top that
changes our file. When we merge the pull request, we create a merge
commit M. If no one else changed the file in the first-parent
history between M and A, then M will not be TREESAME to its first
parent, but will be TREESAME to B. Thus, the simplified history
will be "B". However, M will appear in the --full-history mode.

However, suppose that a number of topics T1, T2, ..., Tn were
created based on commits C1, C2, ..., Cn between A and M as
follows:

  A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn
   \     \     \            \  /      /    /            /
    \     \__.. \            \/ ..__T1    /           Tn
     \           \__..       /\     ..__T2           /
      \_____________________B  \____________________/

If the commits T1, T2, ... Tn did not change the file, then all of
P1 through Pn will be TREESAME to their first parent, but not
TREESAME to their second. This means that all of those merge commits
appear in the --full-history view, with edges that immediately
collapse into the lower history without introducing interesting
single-parent commits.

The --simplify-merges option was introduced to remove these extra
merge commits. By noticing that the rewritten parents are reachable
from their first parents, those edges can be simplified away. Finally,
the commits now look like single-parent commits that are TREESAME to
their "only" parent. Thus, they are removed and this issue does not
cause issues anymore. However, this also ends up removing the commit
M from the history view! Even worse, the --simplify-merges option
requires walking the entire history before returning a single result.

Many Git users are using Git alongside a Git service that provides
code storage alongside a code review tool commonly called "Pull
Requests" or "Merge Requests" against a target branch.  When these
requests are accepted and merged, they typically create a merge
commit whose first parent is the previous branch tip and the second
parent is the tip of the topic branch used for the request. This
presents a valuable order to the parents, but also makes that merge
commit slightly special. Users may want to see not only which
commits changed a file, but which pull requests merged those commits
into their branch. In the previous example, this would mean the
users want to see the merge commit "M" in addition to the single-
parent commit "C".

Users are even more likely to want these merge commits when they
use pull requests to merge into a feature branch before merging that
feature branch into their trunk.

In some sense, users are asking for the "first" merge commit to
bring in the change to their branch. As long as the parent order is
consistent, this can be handled with the following rule:

  Include a merge commit if it is not TREESAME to its first
  parent, but is TREESAME to a later parent.

These merges look like the merge commits that would result from
running "git pull <topic>" on a main branch. Thus, the option to
show these commits is called "--show-pulls". This has the added
benefit of showing the commits created by closing a pull request or
merge request on any of the Git hosting and code review platforms.

To test these options, extend the standard test example to include
a merge commit that is not TREESAME to its first parent. It is
surprising that that option was not already in the example, as it
is instructive.

In particular, this extension demonstrates a common issue with file
history simplification. When a user resolves a merge conflict using
"-Xours" or otherwise ignoring one side of the conflict, they create
a TREESAME edge that probably should not be TREESAME. This leads
users to become frustrated and complain that "my change disappeared!"
In my experience, showing them history with --full-history and
--simplify-merges quickly reveals the problematic merge. As mentioned,
this option is expensive to compute. The --show-pulls option
_might_ show the merge commit (usually titled "resolving conflicts")
more quickly. Of course, this depends on the user having the correct
parent order, which is backwards when using "git pull master" from a
topic branch.

There are some special considerations when combining the --show-pulls
option with --simplify-merges. This requires adding a new PULL_MERGE
object flag to store the information from the initial TREESAME
comparisons. This helps avoid dropping those commits in later filters.
This is covered by a test, including how the parents can be simplified.
Since "struct object" has already ruined its 32-bit alignment by using
33 bits across parsed, type, and flags member, let's not make it worse.
PULL_MERGE is used in revision.c with the same value (1u<<15) as
REACHABLE in commit-graph.c. The REACHABLE flag is only used when
writing a commit-graph file, and a revision walk using --show-pulls
does not happen in the same process. Care must be taken in the future
to ensure this remains the case.

Update Documentation/rev-list-options.txt with significant details
around this option. This requires updating the example in the
History Simplification section to demonstrate some of the problems
with TREESAME second parents.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-04-10 09:58:55 -07:00

198 lines
6.4 KiB
C

#ifndef OBJECT_H
#define OBJECT_H
#include "cache.h"
struct buffer_slab;
struct parsed_object_pool {
struct object **obj_hash;
int nr_objs, obj_hash_size;
/* TODO: migrate alloc_states to mem-pool? */
struct alloc_state *blob_state;
struct alloc_state *tree_state;
struct alloc_state *commit_state;
struct alloc_state *tag_state;
struct alloc_state *object_state;
unsigned commit_count;
/* parent substitutions from .git/info/grafts and .git/shallow */
struct commit_graft **grafts;
int grafts_alloc, grafts_nr;
int is_shallow;
struct stat_validity *shallow_stat;
char *alternate_shallow_file;
int commit_graft_prepared;
struct buffer_slab *buffer_slab;
};
struct parsed_object_pool *parsed_object_pool_new(void);
void parsed_object_pool_clear(struct parsed_object_pool *o);
struct object_list {
struct object *item;
struct object_list *next;
};
struct object_array {
unsigned int nr;
unsigned int alloc;
struct object_array_entry {
struct object *item;
/*
* name or NULL. If non-NULL, the memory pointed to
* is owned by this object *except* if it points at
* object_array_slopbuf, which is a static copy of the
* empty string.
*/
char *name;
char *path;
unsigned mode;
} *objects;
};
#define OBJECT_ARRAY_INIT { 0, 0, NULL }
/*
* object flag allocation:
* revision.h: 0---------10 15 25----28
* fetch-pack.c: 01
* negotiator/default.c: 2--5
* walker.c: 0-2
* upload-pack.c: 4 11-----14 16-----19
* builtin/blame.c: 12-13
* bisect.c: 16
* bundle.c: 16
* http-push.c: 16-----19
* commit-graph.c: 15
* commit-reach.c: 16-----19
* sha1-name.c: 20
* list-objects-filter.c: 21
* builtin/fsck.c: 0--3
* builtin/index-pack.c: 2021
* builtin/pack-objects.c: 20
* builtin/reflog.c: 10--12
* builtin/show-branch.c: 0-------------------------------------------26
* builtin/unpack-objects.c: 2021
*/
#define FLAG_BITS 29
/*
* The object type is stored in 3 bits.
*/
struct object {
unsigned parsed : 1;
unsigned type : TYPE_BITS;
unsigned flags : FLAG_BITS;
struct object_id oid;
};
const char *type_name(unsigned int type);
int type_from_string_gently(const char *str, ssize_t, int gentle);
#define type_from_string(str) type_from_string_gently(str, -1, 0)
/*
* Return the current number of buckets in the object hashmap.
*/
unsigned int get_max_object_index(void);
/*
* Return the object from the specified bucket in the object hashmap.
*/
struct object *get_indexed_object(unsigned int);
/*
* This can be used to see if we have heard of the object before, but
* it can return "yes we have, and here is a half-initialised object"
* for an object that we haven't loaded/parsed yet.
*
* When parsing a commit to create an in-core commit object, its
* parents list holds commit objects that represent its parents, but
* they are expected to be lazily initialized and do not know what
* their trees or parents are yet. When this function returns such a
* half-initialised objects, the caller is expected to initialize them
* by calling parse_object() on them.
*/
struct object *lookup_object(struct repository *r, const struct object_id *oid);
void *create_object(struct repository *r, const struct object_id *oid, void *obj);
void *object_as_type(struct repository *r, struct object *obj, enum object_type type, int quiet);
/*
* Returns the object, having parsed it to find out what it is.
*
* Returns NULL if the object is missing or corrupt.
*/
struct object *parse_object(struct repository *r, const struct object_id *oid);
/*
* Like parse_object, but will die() instead of returning NULL. If the
* "name" parameter is not NULL, it is included in the error message
* (otherwise, the hex object ID is given).
*/
struct object *parse_object_or_die(const struct object_id *oid, const char *name);
/* Given the result of read_sha1_file(), returns the object after
* parsing it. eaten_p indicates if the object has a borrowed copy
* of buffer and the caller should not free() it.
*/
struct object *parse_object_buffer(struct repository *r, const struct object_id *oid, enum object_type type, unsigned long size, void *buffer, int *eaten_p);
/** Returns the object, with potentially excess memory allocated. **/
struct object *lookup_unknown_object(const struct object_id *oid);
struct object_list *object_list_insert(struct object *item,
struct object_list **list_p);
int object_list_contains(struct object_list *list, struct object *obj);
void object_list_free(struct object_list **list);
/* Object array handling .. */
void add_object_array(struct object *obj, const char *name, struct object_array *array);
void add_object_array_with_path(struct object *obj, const char *name, struct object_array *array, unsigned mode, const char *path);
/*
* Returns NULL if the array is empty. Otherwise, returns the last object
* after removing its entry from the array. Other resources associated
* with that object are left in an unspecified state and should not be
* examined.
*/
struct object *object_array_pop(struct object_array *array);
typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
/*
* Apply want to each entry in array, retaining only the entries for
* which the function returns true. Preserve the order of the entries
* that are retained.
*/
void object_array_filter(struct object_array *array,
object_array_each_func_t want, void *cb_data);
/*
* Remove from array all but the first entry with a given name.
* Warning: this function uses an O(N^2) algorithm.
*/
void object_array_remove_duplicates(struct object_array *array);
/*
* Remove any objects from the array, freeing all used memory; afterwards
* the array is ready to store more objects with add_object_array().
*/
void object_array_clear(struct object_array *array);
void clear_object_flags(unsigned flags);
/*
* Clear the specified object flags from all in-core commit objects.
*/
void clear_commit_marks_all(unsigned int flags);
#endif /* OBJECT_H */