How does Git create commits so fast?
G

3

3

From what I understand, each commit in Git is a "snapshot" of the entire repository, which means that, at the very least, every file has to be read. My repository is 9.2 GB and a commit takes a fraction of a second. I don't understand how it can happen so fast.

Gluttonize answered 17/8, 2016 at 22:56 Comment(0)
V
9

at the very least, every file has to be read

On the contrary, that's the very most that could happen.

Running git commit to commit your staged changes is generally fast because actually staging the changes did most of the work. Creating a commit simply turns the index (aka the "staging area") into a very lightweight commit object, which contains the metadata about your commit, and a few tree objects, which contain the structure of the repository.

All the data in the files, though, gets added to git's database when you run git add on a particular file. The data about that file is then recorded in the staging area, so by the time you run git commit, all the information about that file is already in the index. So the costliest part is amortized over running git add.
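As a minimal sketch of what "added to git's database" means: in a default SHA-1 repository, a blob's object ID is the SHA-1 of a short header plus the file content. The helper name git_blob_id is made up for this illustration; repositories using SHA-256 produce different IDs.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# Identical content always maps to the same ID, so it is stored only once.
print(git_blob_id(b""))
print(git_blob_id(b"hello\n"))
```

This should agree with what git hash-object prints for the same bytes. The hashing happens at git add time, which is why git commit has so little left to do.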

The other subtle thing is that the index contains the information about all the files in your repository - and it maintains information about the working directory like the time stamp that it last examined the file and its file size. So even if you run something like git add . to stage all the changed files, it only needs to stat the file to find out if it's changed, and it can ignore it if it hasn't.
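The stat shortcut can be sketched roughly as follows. This is only an illustration: the function names and the dictionary cache are invented here, and the real index compares more fields than modification time and size.

```python
import os
import tempfile


def stat_key(path):
    """Cheap fingerprint of a file: modification time and size, no content read."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def changed_paths(paths, cache):
    """Return the paths whose fingerprint differs from the cached one."""
    return [p for p in paths if cache.get(p) != stat_key(p)]


with tempfile.TemporaryDirectory() as workdir:
    a = os.path.join(workdir, "a.txt")
    b = os.path.join(workdir, "b.txt")
    for path in (a, b):
        with open(path, "w") as f:
            f.write("v1")

    # Remember what every file looked like when it was last examined.
    cache = {p: stat_key(p) for p in (a, b)}

    # Modify only b.txt; its size changes, so the stat data no longer matches.
    with open(b, "w") as f:
        f.write("a longer second version")

    changed_names = [os.path.basename(p) for p in changed_paths([a, b], cache)]

print(changed_names)
```

Only the files reported as changed would need to be read and hashed again; everything else is skipped after a single stat call.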

Obviously looking at all the files in your working directory is a little bit expensive, but much less costly than adding a full snapshot of even the unchanged files.

So even though git stores a snapshot of the repository at each commit, it really only needs to store new data for the files that changed; it can store pointers to the old, unchanged file contents for everything else.
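A toy model of that sharing, assuming a plain dictionary as the object store and a bare SHA-1 of the content as the key (Git's real object format adds headers and also reuses whole tree objects for unchanged directories):

```python
import hashlib

store = {}  # object ID -> content; stands in for the object database


def put(data: bytes) -> str:
    """Store content under its hash; content that is already present is reused."""
    oid = hashlib.sha1(data).hexdigest()
    store.setdefault(oid, data)
    return oid


def take_snapshot(files: dict) -> dict:
    """Map every path to the ID of its content."""
    return {path: put(content) for path, content in files.items()}


first = take_snapshot({"big.bin": b"x" * 1000, "notes.txt": b"v1"})
objects_after_first = len(store)

# Second snapshot: only notes.txt changed.
second = take_snapshot({"big.bin": b"x" * 1000, "notes.txt": b"v2"})
new_objects = len(store) - objects_after_first

print(objects_after_first, new_objects)
```

Both snapshots list every file, but the second one adds a single new object; big.bin is just a pointer to content that was already stored.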

Vengeful answered 18/8, 2016 at 0:16 Comment(0)
C
3

Note: if you have a repository with a large number of commits, like the "largest Git repo on the planet", with over 250000 commits, adding new commits can actually be slow.

That is why Git 2.23 (Q3 2019) introduces commit-graph chains.

See commit 5b15eb3, commit 16110c9, commit a09c130, commit e2017c4, commit ba41112, commit 3da4b60, commit c2bc6e6, commit 8d84097, commit c523035, commit 1771be9, commit 135a712, commit 6c622f9, commit 144354b, commit 118bd57, commit 5c84b33, commit 3cbc6ed, commit d4f4d60, commit 890345a (18 Jun 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 92b1ea6, 19 Jul 2019)

commit-graph: document commit-graph chains

The documentation now has:

Commit Graphs Chains

Typically, repos grow with near-constant velocity (commits per day).
Over time, the number of commits added by a fetch operation is much smaller than the number of commits in the full history.

By creating a "chain" of commit-graphs, we enable fast writes of new commit data without rewriting the entire commit history -- at least, most of the time.

File Layout

A commit-graph chain uses multiple files, and we use a fixed naming convention to organize these files.
Each commit-graph file has a name $OBJDIR/info/commit-graphs/graph-{hash}.graph where {hash} is the hex-valued hash stored in the footer of that file (which is a hash of the file's contents before that hash).
For a chain of commit-graph files, a plain-text file at $OBJDIR/info/commit-graphs/commit-graph-chain contains the hashes for the files in order from "lowest" to "highest".

For example, if the commit-graph-chain file contains the lines:

{hash0}
{hash1}
{hash2}

then the commit-graph chain looks like the following diagram:

+-----------------------+
|  graph-{hash2}.graph  |
+-----------------------+
    |
+-----------------------+
|                       |
|  graph-{hash1}.graph  |
|                       |
+-----------------------+
    |
+-----------------------+
|                       |
|                       |
|                       |
|  graph-{hash0}.graph  |
|                       |
|                       |
|                       |
+-----------------------+

  • Let X0 be the number of commits in graph-{hash0}.graph,
  • X1 be the number of commits in graph-{hash1}.graph, and
  • X2 be the number of commits in graph-{hash2}.graph.

If a commit appears in position i in graph-{hash2}.graph, then we interpret this as being the commit in position (X0 + X1 + i), and that will be used as its "graph position".
The commits in graph-{hash2}.graph use these positions to refer to their parents, which may be in graph-{hash1}.graph or graph-{hash0}.graph.
We can navigate to an arbitrary commit in position j by checking its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
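The interval check above can be sketched like this. The function name locate and the example sizes are made up for illustration; Git's own implementation is in C and is structured differently.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(position, layer_sizes):
    """Map a global graph position to (layer index, index within that layer).

    layer_sizes is [X0, X1, X2, ...], ordered from the lowest layer to the tip.
    """
    ends = list(accumulate(layer_sizes))  # [X0, X0 + X1, X0 + X1 + X2, ...]
    total = ends[-1] if ends else 0
    if not 0 <= position < total:
        raise IndexError(f"graph position {position} is outside [0, {total})")
    layer = bisect_right(ends, position)  # first layer whose interval ends after position
    start = ends[layer - 1] if layer else 0
    return layer, position - start


sizes = [100, 20, 5]  # hypothetical X0, X1, X2
print(locate(99, sizes))
print(locate(100, sizes))
print(locate(124, sizes))
```

With these sizes, position 100 is the first commit of the middle file and position 124 is the last commit of the tip file.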


That means git commit-graph has a new write command option: --split.

commit-graph: add --split option to builtin

Add a new "--split" option to the 'git commit-graph write' subcommand.
This option allows the optional behavior of writing a commit-graph chain.

The current behavior will add a tip commit-graph containing any commits that are not in the existing commit-graph or commit-graph chain.
Later changes will allow merging the chain and expiring out-dated files.

Add a new test script (t5324-split-commit-graph.sh) that demonstrates this behavior.

And the same documentation adds:

With the --split option, write the commit-graph as a chain of multiple commit-graph files stored in <dir>/info/commit-graphs.
The new commits not already in the commit-graph are added in a new "tip" file.
This file is merged with the existing file if the following merge conditions are met:

  • If --size-multiple=<X> is not specified, let X equal 2. If the new tip file would have N commits and the previous tip has M commits and X times N is greater than M, instead merge the two files into a single file.

  • If --max-commits=<M> is specified with M a positive integer, and the new tip file would have more than M commits, then instead merge the new tip with the previous tip.

Finally, if --expire-time=<datetime> is not specified, let datetime be the current time. After writing the split commit-graph, delete all unused commit-graph whose modified times are older than datetime.
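Read literally, the two merge conditions amount to a check like the one below. This is a sketch of the documented rule for a single decision only; the function and parameter names are invented, and the real code may repeat the check further down the chain.

```python
def should_merge(new_tip_commits, prev_tip_commits, size_multiple=2, max_commits=None):
    """Decide whether the new tip should be merged into the previous tip.

    size_multiple mirrors --size-multiple=<X> (documented default 2) and
    max_commits mirrors --max-commits=<M>.
    """
    # Condition 1: X times N is greater than M.
    if size_multiple * new_tip_commits > prev_tip_commits:
        return True
    # Condition 2: the new tip would have more than --max-commits commits.
    if max_commits is not None and max_commits > 0 and new_tip_commits > max_commits:
        return True
    return False


print(should_merge(10, 100))                 # small tip on a large layer
print(should_merge(60, 100))                 # tip is more than half the previous layer
print(should_merge(10, 100, max_commits=5))  # tip exceeds the commit cap
```

The effect is that small, frequent writes stay cheap, while layers of similar size get collapsed so the chain does not grow without bound.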


That will help with forks:

commit-graph: allow cross-alternate chains

In an environment like a fork network, it is helpful to have a commit-graph chain that spans both the base repo and the fork repo.
The fork is usually a small set of data on top of the large repo, but sometimes the fork is much larger.
For example, git-for-windows/git has almost double the number of commits as git/git because it rebases its commits on every major version update.

The documentation now includes:

Chains across multiple object directories

In a repo with alternates, we look for the commit-graph-chain file starting in the local object directory and then in each alternate.
The first file that exists defines our chain.
As we look for the graph-{hash} files for each {hash} in the chain file, we follow the same pattern for the host directories.
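A rough sketch of that lookup order, using throwaway directories in place of real object directories. The helper names are hypothetical; only the file layout (info/commit-graphs/commit-graph-chain and graph-{hash}.graph) comes from the documentation quoted above.

```python
import os
import tempfile


def graphs_dir(objdir):
    return os.path.join(objdir, "info", "commit-graphs")


def find_chain(object_dirs):
    """Return (object dir, hashes) for the first commit-graph-chain file found."""
    for objdir in object_dirs:
        chain_path = os.path.join(graphs_dir(objdir), "commit-graph-chain")
        if os.path.isfile(chain_path):
            with open(chain_path) as f:
                return objdir, [line.strip() for line in f if line.strip()]
    return None, []


def find_layer_dir(object_dirs, graph_hash):
    """Return the first object dir that holds graph-<hash>.graph, or None."""
    for objdir in object_dirs:
        if os.path.isfile(os.path.join(graphs_dir(objdir), f"graph-{graph_hash}.graph")):
            return objdir
    return None


with tempfile.TemporaryDirectory() as root:
    local = os.path.join(root, "local")  # the fork's object directory
    base = os.path.join(root, "base")    # the alternate (base repo)
    for objdir in (local, base):
        os.makedirs(graphs_dir(objdir))

    # The fork's chain names one layer stored in the base and one stored locally.
    with open(os.path.join(graphs_dir(local), "commit-graph-chain"), "w") as f:
        f.write("hash0\nhash1\n")
    open(os.path.join(graphs_dir(base), "graph-hash0.graph"), "wb").close()
    open(os.path.join(graphs_dir(local), "graph-hash1.graph"), "wb").close()

    search_order = [local, base]  # local object directory first, then alternates
    chain_dir, hashes = find_chain(search_order)
    chain_owner = os.path.basename(chain_dir)
    layer_owners = [os.path.basename(find_layer_dir(search_order, h)) for h in hashes]

print(chain_owner, hashes, layer_owners)
```

Here the chain file lives in the fork, while its lowest layer is found in the base repository's object directory.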

This allows commit-graphs to be split across multiple forks in a fork network.
The typical case is a large "base" repo with many smaller forks.

As the base repo advances, it will likely update and merge its commit-graph chain more frequently than the forks.
If a fork updates their commit-graph after the base repo, then it should "reparent" the commit-graph chain onto the new chain in the base repo.
When reading each graph-{hash} file, we track the object directory containing it. During a write of a new commit-graph file, we check for any changes in the source object directory and read the commit-graph-chain file for that source and create a new file based on those files.
During this "reparent" operation, we necessarily need to collapse all levels in the fork, as all of the files are invalid against the new base file.


That also involves expiring commit-graph files:

commit-graph: expire commit-graph files

As we merge commit-graph files in a commit-graph chain, we should clean up the files that are no longer used.

This change introduces an 'expiry_window' value to the context, which is always zero (for now).
We then check the modified time of each graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and unlink the files that are older than the expiry_window.

The documentation now references:

Deleting graph-{hash} files

After a new tip file is written, some graph-{hash} files may no longer be part of a chain. It is important to remove these files from disk, eventually.
The main reason to delay removal is that another process could read the commit-graph-chain file before it is rewritten, but then look for the graph-{hash} files after they are deleted.

To allow holding old split commit-graphs for a while after they are unreferenced, we update the modified times of the files when they become unreferenced.
Then, we scan the $OBJDIR/info/commit-graphs/ directory for graph-{hash} files whose modified times are older than a given expiry window.
This window defaults to zero, but can be changed using command-line arguments or a config setting.
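The cleanup pass could look roughly like this. It is a sketch under assumptions: the function name, the window expressed in seconds, and the "older than or equal" comparison are choices made for this example, not Git's actual code.

```python
import os
import tempfile
import time


def expire_graph_files(graph_dir, chain_hashes, expiry_window_seconds=0, now=None):
    """Delete unreferenced graph-{hash}.graph files older than the expiry window."""
    now = time.time() if now is None else now
    in_chain = {f"graph-{h}.graph" for h in chain_hashes}
    removed = []
    for name in sorted(os.listdir(graph_dir)):
        if not (name.startswith("graph-") and name.endswith(".graph")):
            continue  # leave commit-graph-chain and unrelated files alone
        if name in in_chain:
            continue  # still part of the current chain
        path = os.path.join(graph_dir, name)
        if os.stat(path).st_mtime <= now - expiry_window_seconds:
            os.unlink(path)
            removed.append(name)
    return removed


with tempfile.TemporaryDirectory() as graph_dir:
    now = time.time()
    for name in ("commit-graph-chain", "graph-aaa.graph", "graph-bbb.graph", "graph-ccc.graph"):
        open(os.path.join(graph_dir, name), "wb").close()
    # bbb became unreferenced an hour ago, ccc only just now.
    os.utime(os.path.join(graph_dir, "graph-bbb.graph"), (now - 3600, now - 3600))
    os.utime(os.path.join(graph_dir, "graph-ccc.graph"), (now, now))

    removed = expire_graph_files(graph_dir, ["aaa"], expiry_window_seconds=600, now=now)
    remaining = sorted(os.listdir(graph_dir))

print(removed, remaining)
```

With a ten-minute window, the file that became unreferenced an hour ago is removed, while the recently unreferenced one survives for readers that may still be looking for it.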


With Git 2.27 (Q2 2020), "git commit-graph write" learned different ways to write out split files.

See commit dbd5e0a (29 Apr 2020) by Junio C Hamano (gitster).
See commit 7a9ce02 (15 Apr 2020), and commit 6830c36, commit f478106, commit 8a6ac28, commit fdbde82, commit 4f02735, commit 2fa05f3 (14 Apr 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 6a1c17d, 01 May 2020)

builtin/commit-graph.c: introduce split strategy 'no-merge'

Signed-off-by: Taylor Blau

In the previous commit, we laid the groundwork for supporting different splitting strategies. In this commit, we introduce the first splitting strategy: 'no-merge'.

Passing '--split=no-merge' is useful for callers which wish to write a new incremental commit-graph, but do not want to spend effort condensing the incremental chain (*1).

Previously, this was possible by passing '--size-multiple=0', but this is no longer the case following 63020f175f ("commit-graph: prefer default size_mult when given zero", 2020-01-02, Git v2.25.0-rc2 -- merge).

When '--split=no-merge' is given, the commit-graph machinery will never condense an existing chain, and it will always write a new incremental.

(*1): This might occur when, for example, a server administrator running some program after each push may want to ensure that each job runs proportional in time to the size of the push, and does not "jump" when the commit-graph machinery decides to trigger a merge.


"git fsck --no-progress" still spewed noise from the commit-graph subsystem, which has been corrected with Git 2.42 (Q3 2023).

See commit 9281cd0, commit 7248857, commit f5facaa, commit eb319d6, commit 39bdd30, commit eda206f (07 Jul 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 6016ee0, 18 Jul 2023)

commit-graph.c: avoid duplicated progress output during verify

Signed-off-by: Taylor Blau
Acked-by: Derrick Stolee

When git commit-graph verify was taught how to verify commit-graph chains in 3da4b60 ("commit-graph: verify chains with --shallow mode", 2019-06-18, Git v2.23.0-rc0 -- merge listed in batch #6), it produced one line of progress per layer of the commit-graph chain.

$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (4356/4356), done.
Verifying commits in commit graph: 100% (131912/131912), done.

This could be somewhat confusing to users, who may wonder why there are multiple occurrences of "Verifying commits in commit graph".

There are likely good arguments on whether or not there should be one line of progress output per commit-graph layer.
On the one hand, the existing output shows us verifying each individual layer of the chain.
But on the other hand, the fact that a commit-graph may be stored among multiple layers is an implementation detail that the caller need not be aware of.

Clarify this by showing a single progress meter regardless of the number of layers in the commit-graph chain.
After this patch, the output reflects the logical contents of a commit-graph chain, instead of showing one line of output per commit-graph layer:

$ git.compile commit-graph verify
Verifying commits in commit graph: 100% (136268/136268), done.
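As a quick consistency check, the single progress line is simply the sum of the two per-layer counts from the earlier output (a sketch; the variable names are made up):

```python
layer_sizes = [4356, 131912]  # commits per layer, taken from the old two-line output
total = sum(layer_sizes)
print(f"Verifying commits in commit graph: 100% ({total}/{total}), done.")
```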

With Git 2.43 (Q4 2023), "git commit-graph verify" is more robust against read errors when verifying graph chain.

See commit 5f25919, commit 7754a56, commit 47d06bb, commit 2d45710, commit 8298b54, commit 7ed76b4 (28 Sep 2023) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit c3c0020, 04 Oct 2023)

commit-graph: detect read errors when verifying graph chain

Signed-off-by: Jeff King

Because it's OK to not have a graph file at all, the graph_verify() function needs to tell the difference between a missing file and a real error.
So when loading a traditional graph file, we call open_commit_graph() separately from load_commit_graph_chain_fd_st(), and don't complain if the first one fails with ENOENT.

When the function learned about chain files in 3da4b60 ("commit-graph: verify chains with --shallow mode", 2019-06-18, Git v2.23.0-rc0 -- merge listed in batch #6), we couldn't be as careful, since the only way to load a chain was with read_commit_graph_one(), which did both the open/load as a single unit.
So we'll miss errors in chain files we load, thinking instead that there was just no chain file at all.

Note that we do still report some of these problems to stderr, as the loading function calls error() and warning().
But we'd exit with a successful exit code, which is wrong.

We can fix that by using the recently split open/load functions for chains.
That lets us treat the chain file just like a single file with respect to error handling here.
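The distinction being drawn, "no file at all" versus "a file we failed to load", can be illustrated with a small Python analogue. This is not Git's C code: the function name and return codes are invented here, and a missing layer file stands in for any load error.

```python
import os
import tempfile


def verify_chain(graph_dir):
    """Return an exit code: 0 if there is no chain or it loads, 1 on a real error."""
    chain_path = os.path.join(graph_dir, "commit-graph-chain")
    try:
        with open(chain_path) as f:  # the "open" step
            hashes = [line.strip() for line in f if line.strip()]
    except FileNotFoundError:
        return 0  # having no chain file at all is fine
    for h in hashes:  # the "load" step
        if not os.path.isfile(os.path.join(graph_dir, f"graph-{h}.graph")):
            return 1  # the chain names a layer we cannot load: a real error
    return 0


with tempfile.TemporaryDirectory() as graph_dir:
    no_chain = verify_chain(graph_dir)

    with open(os.path.join(graph_dir, "commit-graph-chain"), "w") as f:
        f.write("hash0\n")
    broken_chain = verify_chain(graph_dir)

    open(os.path.join(graph_dir, "graph-hash0.graph"), "wb").close()
    intact_chain = verify_chain(graph_dir)

print(no_chain, broken_chain, intact_chain)
```

Once opening and loading are separate steps, only the first one may fail silently; any failure after the chain file exists turns into a non-zero exit code.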

An existing test (from 3da4b60) shows off the problem; we were expecting "commit-graph verify" to report success, but that makes no sense.
We did not even verify the contents of the graph data, because we couldn't load it! I don't think this was an intentional exception, but rather just the test covering what happened to occur.

Crosslet answered 21/7, 2019 at 1:11 Comment(0)
S
0

As far as I understand it so far: imagine you have many commits on the master branch and another branch that also has many commits. If a VCS does not use Git's hash-based approach and only stores the differences between file versions, then changing branches is expensive: it either has to revert all changes back to the shared commit and then apply all changes of the other branch, or it has to compare all files one by one. In my opinion Git's hashing approach seems to be the better one, even if Git still has to do a lot of iterating and searching, I guess. I don't know if I'm right; I only started reading about Git today. Feel free to downvote/upvote and comment :D I think it's a topic where only a few people have really in-depth knowledge.

Strunk answered 6/6, 2017 at 21:46 Comment(1)
Hi! It would be better if you checked out the Answering Questions format for future answers on Stack Overflow. Thank you. - Voiceful
