Is this behavior well-defined?
With Git 2.34 (Q4 2021), the revision traversal API has illustrated by two commits:
- one by taking advantage of the commit-graph, when available, to determine if a commit is reachable from any of the existing refs.
- one cancelling/reverting that change
Both illustrates how git-rev-list
works:
See commit f559d6d, commit 809ea28, commit bf9c0cb, commit f45022d (09 Aug 2021), and commit 29ef1f2 (05 Aug 2021) by Patrick Steinhardt (pks-t
).
(Merged by Junio C Hamano -- gitster
-- in commit a5619d4, 03 Sep 2021)
connected
: do not sort input revisions
Signed-off-by: Patrick Steinhardt
In order to compute whether objects reachable from a set of tips are all connected, we do a revision walk with these tips as positive references and --not --all
.
--not --all
will cause the revision walk to load all preexisting references as uninteresting, which can be very expensive in repositories with many references.
Benchmarking the git-rev-list
command highlights that by far the most expensive single phase is initial sorting of the input revisions: after all references have been loaded, we first sort commits by author date.
In a real-world repository with about 2.2 million references, it makes up about 40% of the total runtime of git-rev-list
.
Ultimately, the connectivity check shouldn't really bother about the order of input revisions at all.
We only care whether we can actually walk all objects until we hit the cut-off point.
So sorting the input is a complete waste of time.
Introduce a new "--unsorted-input
" flag to git-rev-list
which will cause it to not sort the commits and adjust the connectivity check to always pass the flag.
This results in the following speedups, executed in a clone of gitlab-org/gitlab
:
Benchmark #1: git rev-list --objects --quiet --not --all --not $(cat newrev)
Time (mean ± σ): 7.639 s ± 0.065 s [User: 7.304 s, System: 0.335 s]
Range (min … max): 7.543 s … 7.742 s 10 runs
Benchmark #2: git rev-list --unsorted-input --objects --quiet --not --all --not $newrev
Time (mean ± σ): 4.995 s ± 0.044 s [User: 4.657 s, System: 0.337 s]
Range (min … max): 4.909 s … 5.048 s 10 runs
Summary
'git rev-list --unsorted-input --objects --quiet --not --all --not $(cat newrev)' ran
1.53 ± 0.02 times faster than 'git rev-list --objects --quiet --not --all --not $newrev'
Note that not all refs are visible to clients.
rev-list-options
now includes in its man page:
--unsorted-input
Show commits in the order they were given on the command line instead
of sorting them in reverse chronological order by commit time. Cannot
be combined with --no-walk
or --no-walk=sorted
.
rev-list-options
now includes in its man page:
Cannot be combined with --graph
. Cannot be combined with
--unsorted-input
if sorted
or no argument was given.
And with the same Git 2.34 (Q4 2021), a regression fix reverts the optimization above:
See commit a7df4f5 (11 Nov 2021) by Junio C Hamano (gitster
).
(Merged by Junio C Hamano -- gitster
-- in commit 8996d68, 12 Nov 2021)
This reverts commit f45022d (connected
: do not sort input revisions, 2021-08-09, Git v2.34.0-rc0 -- merge listed in batch #3), as this is like breakage in the traversal more likely.
In a history with 10 single strand of pearls,
1-->2-->3--...->7-->8-->9-->10
asking "rev-list --unsorted-input 1 10 --not 9 8 7 6 5 4
(man) fails to paint the bottom 1 uninteresting as the traversal stops, without completing the propagation of uninteresting bit starting at 4 down through 3 and 2 to 1.
2018:
Git 2.16 (Q1 2018) will allow git describe
to give an object a human readable name based on an available ref when used as git describe <blob>
.
(See more at "Which commit has this blob?")
In that context, git rev-list adds a new order.
See commit ce5b6f9 by Stefan Beller (stefanbeller
).
revision.h
: introduce blob/tree walking in order of the commits
The functionality to list tree objects in the order they were seen
while traversing the commits will be used in one of the next commits,
where we teach git describe
to describe not only commits, but blobs, too.
That means the git rev-list
man page has a new object traversal order:
--in-commit-order::
Print tree and blob ids in order of the commits.
The tree and blob ids are printed after they are first referenced by a commit.
You have one more parameter which influences git-rev-list: GIT_COMMIT_GRAPH_PARANOIA
.
Before Git 2.44 (Q1 2024), we stopped relying on commit-graph that (still) records information about commits that are lost from the object store, which has negative performance implications.
The default has been flipped to disable this pessimization.
See commit b1df3b3 (24 Nov 2023) by Patrick Steinhardt (pks-t
).
(Merged by Junio C Hamano -- gitster
-- in commit 66685e8, 18 Dec 2023)
commit-graph
: disable GIT_COMMIT_GRAPH_PARANOIA
by default
Reported-by: Jeff King
Signed-off-by: Patrick Steinhardt
In 7a5d604 ("commit
: detect commits that exist in commit-graph but not in the ODB", 2023-10-31, Git v2.43.0-rc1 -- merge), we have introduced a new object existence check into repo_parse_commit_internal()
so that we do not parse commits via the commit-graph that don't have a corresponding object in the object database.
This new check of course comes with a performance penalty, which the commit put at around 30% for git rev-list --topo-order
(man).
Default-disable GIT_COMMIT_GRAPH_PARANOIA
and restore the behaviour and thus performance previous to the mentioned commit.
In order to not be inconsistent, also disable this behaviour by default in lookup_commit_in_graph()
, where the object existence check has been introduced right at its inception via f559d6d ("revision
: avoid hitting packfiles when commits are in commit-graph", 2021-08-09, Git v2.34.0-rc0 -- merge listed in batch #3).
git
now includes in its man page:
GIT_COMMIT_GRAPH_PARANOIA
:
When loading a commit object from the commit-graph, Git performs an
existence check on the object in the object database. This is done to
avoid issues with stale commit-graphs that contain references to
already-deleted commits, but comes with a performance penalty.
The default is "false", which disables the aforementioned behavior.
Setting this to "true" enables the existence check so that stale commits
will never be returned from the commit-graph at the cost of performance.