From what I understand, each commit in Git is a "snapshot" of the entire repository, which means that, at the very least, every file has to be read. My repository is 9.2 GB and a commit takes a fraction of a second. Makes no sense how it happens so fast.
at the very least, every file has to be read
On the contrary, that's the very most that could happen.
Running git commit to commit your staged changes is generally fast because actually staging the changes did most of the work. Creating a commit simply turns the index (aka the "staging area") into a very lightweight commit object, which contains the metadata about your commit, and a few tree objects, which contain the structure of the repository.
All the data in the files, though, gets added to git's database when you run git add on a particular file. The data about that file is then stored in the staging area, so by the time you run git commit all the information about that file is already in the index. So the costliest part is amortized over running git add.
The other subtle thing is that the index contains the information about all the files in your repository - and it maintains information about the working directory like the time stamp that it last examined the file and its file size. So even if you run something like git add . to stage all the changed files, it only needs to stat the file to find out if it's changed, and it can ignore it if it hasn't.
Obviously looking at all the files in your working directory is a little bit expensive, but much less costly than adding a full snapshot of even the unchanged files.
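To make the stat check concrete, here is a rough Python sketch. It is not Git's actual implementation: the cached dictionary is a hypothetical stand-in for the index, and it compares only modification time and size, which is a simplification of the metadata the index keeps.

```python
import os
import tempfile

def stat_signature(path):
    """Return the cheap metadata pair this sketch uses to detect changes."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)

def changed_paths(cached, paths):
    """Return the paths whose current signature differs from the cached one."""
    return [p for p in paths if cached.get(p) != stat_signature(p)]

with tempfile.TemporaryDirectory() as workdir:
    a = os.path.join(workdir, "a.txt")
    b = os.path.join(workdir, "b.txt")
    for path in (a, b):
        with open(path, "w") as f:
            f.write("hello\n")

    # Hypothetical stand-in for the index: path -> signature at the last "add".
    cached = {p: stat_signature(p) for p in (a, b)}

    with open(b, "a") as f:
        f.write("more\n")  # the size changes, so the signature changes

    # No file contents are read here, only metadata.
    modified = changed_paths(cached, [a, b])
    print([os.path.basename(p) for p in modified])
```

Only the files reported as modified would need to be re-read and re-hashed; everything else is skipped after a single stat call.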
So even though git stores a snapshot of the repository at each commit, it really only needs to store new data for the files that changed; it can store pointers to the old, unchanged file contents for everything else.
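The "pointers to unchanged contents" idea can be sketched in Python as follows. The blob_id function follows the way Git computes a blob ID with the SHA-1 object format; object_store and snapshot are hypothetical simplifications (real Git also writes tree and commit objects and compresses what it stores).

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Object ID Git assigns to file contents (SHA-1 object format)."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical stand-in for .git/objects: object ID -> contents.
object_store = {}

def snapshot(files):
    """Store each file's contents at most once; return a path -> ID mapping."""
    tree = {}
    for path, data in files.items():
        oid = blob_id(data)
        object_store.setdefault(oid, data)  # no-op if these contents already exist
        tree[path] = oid
    return tree

first = snapshot({"big.bin": b"x" * 1_000_000, "notes.txt": b"v1\n"})
second = snapshot({"big.bin": b"x" * 1_000_000, "notes.txt": b"v2\n"})

# Both snapshots list every file, but the large unchanged file is stored once.
print(first["big.bin"] == second["big.bin"], len(object_store))
```

Each snapshot is complete in the sense that it names every file, yet the second one adds only one new object to the store.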
Note: if you have a repository with a large number of commits, like the "largest Git repo on the planet", with over 250000 commits, adding new commits can actually be slow.
That is why Git 2.23 (Q3 2019) introduces commit-graph chains.
See commit 5b15eb3, commit 16110c9, commit a09c130, commit e2017c4, commit ba41112, commit 3da4b60, commit c2bc6e6, commit 8d84097, commit c523035, commit 1771be9, commit 135a712, commit 6c622f9, commit 144354b, commit 118bd57, commit 5c84b33, commit 3cbc6ed, commit d4f4d60, commit 890345a (18 Jun 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 92b1ea6, 19 Jul 2019)
commit-graph: document commit-graph chains
The documentation now has:
Commit Graphs Chains
Typically, repos grow with near-constant velocity (commits per day).
Over time, the number of commits added by a fetch operation is much smaller than the number of commits in the full history. By creating a "chain" of commit-graphs, we enable fast writes of new commit data without rewriting the entire commit history -- at least, most of the time.
File Layout
A commit-graph chain uses multiple files, and we use a fixed naming convention to organize these files.
Each commit-graph file has a name $OBJDIR/info/commit-graphs/graph-{hash}.graph where {hash} is the hex-valued hash stored in the footer of that file (which is a hash of the file's contents before that hash).
For a chain of commit-graph files, a plain-text file at $OBJDIR/info/commit-graphs/commit-graph-chain contains the hashes for the files in order from "lowest" to "highest".
For example, if the commit-graph-chain file contains the lines:
    {hash0}
    {hash1}
    {hash2}
then the commit-graph chain looks like the following diagram:
    +-----------------------+
    |  graph-{hash2}.graph  |
    +-----------------------+
                |
    +-----------------------+
    |                       |
    |  graph-{hash1}.graph  |
    |                       |
    +-----------------------+
                |
    +-----------------------+
    |                       |
    |                       |
    |                       |
    |  graph-{hash0}.graph  |
    |                       |
    |                       |
    |                       |
    +-----------------------+
Let X0 be the number of commits in graph-{hash0}.graph, X1 be the number of commits in graph-{hash1}.graph, and X2 be the number of commits in graph-{hash2}.graph.
If a commit appears in position i in graph-{hash2}.graph, then we interpret this as being the commit in position (X0 + X1 + i), and that will be used as its "graph position".
The commits in graph-{hash2}.graph use these positions to refer to their parents, which may be in graph-{hash1}.graph or graph-{hash0}.graph.
We can navigate to an arbitrary commit in position j by checking its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 + X2).
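The interval check described above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not Git's code; the function name locate and the layer sizes are made up for the example.

```python
from bisect import bisect_right
from itertools import accumulate

def locate(position, layer_sizes):
    """Map a global graph position to (layer index, offset within that layer).

    layer_sizes lists the commit counts from the lowest layer up,
    i.e. [X0, X1, X2] in the notation of the quoted documentation.
    """
    ends = list(accumulate(layer_sizes))  # [X0, X0+X1, X0+X1+X2]
    total = ends[-1] if ends else 0
    if not 0 <= position < total:
        raise IndexError("position outside the chain")
    layer = bisect_right(ends, position)  # first interval whose end exceeds position
    start = ends[layer - 1] if layer else 0
    return layer, position - start

sizes = [100, 20, 5]  # hypothetical X0, X1, X2

print(locate(0, sizes))        # lowest layer, first commit
print(locate(100, sizes))      # first commit of the middle layer
print(locate(100 + 20 + 3, sizes))  # position i = 3 in the tip layer
```

Going the other way, a commit at offset i in the tip layer has graph position X0 + X1 + i, which is why a tip file can refer to parents stored in lower layers.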
That means the git commit-graph write command has a new option: --split.
commit-graph: add --split option to builtin
Add a new "--split" option to the 'git commit-graph write' subcommand.
This option allows the optional behavior of writing a commit-graph chain. The current behavior will add a tip commit-graph containing any commits that are not in the existing commit-graph or commit-graph chain.
Later changes will allow merging the chain and expiring out-dated files. Add a new test script (t5324-split-commit-graph.sh) that demonstrates this behavior.
And the same documentation adds:
With the --split option, write the commit-graph as a chain of multiple commit-graph files stored in <dir>/info/commit-graphs.
The new commits not already in the commit-graph are added in a new "tip" file.
This file is merged with the existing file if the following merge conditions are met:
- If --size-multiple=<X> is not specified, let X equal 2. If the new tip file would have N commits and the previous tip has M commits and X times N is greater than M, instead merge the two files into a single file.
- If --max-commits=<M> is specified with M a positive integer, and the new tip file would have more than M commits, then instead merge the new tip with the previous tip.
Finally, if --expire-time=<datetime> is not specified, let datetime be the current time. After writing the split commit-graph, delete all unused commit-graph whose modified times are older than datetime.
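Read literally, the two merge conditions amount to a small predicate. The Python sketch below is one reading of the quoted text, not Git's implementation: the function name should_merge is invented, and it evaluates the conditions for a single pair of files only.

```python
def should_merge(new_tip, prev_tip, size_multiple=2, max_commits=None):
    """Decide whether the new tip should be merged with the previous tip.

    new_tip  -- N, the number of commits the new tip file would have
    prev_tip -- M, the number of commits in the previous tip file
    """
    # Size condition: X times N is greater than M (X defaults to 2).
    if size_multiple * new_tip > prev_tip:
        return True
    # Count condition: applies only when a positive --max-commits is given.
    if max_commits is not None and max_commits > 0 and new_tip > max_commits:
        return True
    return False

print(should_merge(10, 100))                 # small tip on a large layer
print(should_merge(60, 100))                 # 2 * 60 > 100
print(should_merge(10, 100, max_commits=5))  # tip exceeds the commit limit
```

With the default multiple of 2, a new tip is kept separate only while it is at most half the size of the layer below it, which keeps the chain short without rewriting the large base layer on every write.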
That will help with forks:
commit-graph: allow cross-alternate chains
In an environment like a fork network, it is helpful to have a commit-graph chain that spans both the base repo and the fork repo.
The fork is usually a small set of data on top of the large repo, but sometimes the fork is much larger.
For example, git-for-windows/git has almost double the number of commits as git/git because it rebases its commits on every major version update.
The documentation now includes:
Chains across multiple object directories
In a repo with alternates, we look for the commit-graph-chain file starting in the local object directory and then in each alternate.
The first file that exists defines our chain.
As we look for the graph-{hash} files for each {hash} in the chain file, we follow the same pattern for the host directories.
This allows commit-graphs to be split across multiple forks in a fork network.
The typical case is a large "base" repo with many smaller forks. As the base repo advances, it will likely update and merge its commit-graph chain more frequently than the forks.
If a fork updates their commit-graph after the base repo, then it should "reparent" the commit-graph chain onto the new chain in the base repo.
When reading each graph-{hash} file, we track the object directory containing it. During a write of a new commit-graph file, we check for any changes in the source object directory and read the commit-graph-chain file for that source and create a new file based on those files.
During this "reparent" operation, we necessarily need to collapse all levels in the fork, as all of the files are invalid against the new base file.
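The lookup order across object directories might be sketched like this in Python. It is only an illustration of the search order described in the quoted documentation; find_chain, find_layer, the directory names and the short hashes are all made up, and reparenting is not modelled.

```python
import os
import tempfile

def find_chain(object_dirs):
    """Return (directory, hashes) for the first directory holding a chain file."""
    for objdir in object_dirs:
        chain = os.path.join(objdir, "info", "commit-graphs", "commit-graph-chain")
        try:
            with open(chain) as f:
                return objdir, f.read().split()
        except FileNotFoundError:
            continue
    return None, []

def find_layer(object_dirs, graph_hash):
    """Return the first existing graph-{hash}.graph path, local directory first."""
    name = f"graph-{graph_hash}.graph"
    for objdir in object_dirs:
        path = os.path.join(objdir, "info", "commit-graphs", name)
        if os.path.exists(path):
            return path
    return None

with tempfile.TemporaryDirectory() as root:
    local = os.path.join(root, "fork", "objects")  # hypothetical fork
    base = os.path.join(root, "base", "objects")   # hypothetical alternate
    for d in (local, base):
        os.makedirs(os.path.join(d, "info", "commit-graphs"))

    def touch(objdir, name, text=""):
        with open(os.path.join(objdir, "info", "commit-graphs", name), "w") as f:
            f.write(text)

    touch(base, "graph-aaa.graph")                    # large base layer
    touch(local, "graph-bbb.graph")                   # small fork layer
    touch(local, "commit-graph-chain", "aaa\nbbb\n")  # lowest to highest

    dirs = [local, base]  # local object directory first, then alternates
    chain_dir, hashes = find_chain(dirs)
    layers = [find_layer(dirs, h) for h in hashes]
    owners = ["base" if p.startswith(base) else "fork" for p in layers]
    print(hashes, owners)
```

In this toy layout the fork's chain file names both layers, while the lower layer physically lives in the base repository's object directory.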
That also involves expiring commit-graph files:
commit-graph: expire commit-graph files
As we merge commit-graph files in a commit-graph chain, we should clean up the files that are no longer used.
This change introduces an 'expiry_window' value to the context, which is always zero (for now).
We then check the modified time of each graph-{hash}.graph file in the $OBJDIR/info/commit-graphs folder and unlink the files that are older than the expiry_window.
The documentation now references:
Deleting graph-{hash} files
After a new tip file is written, some graph-{hash} files may no longer be part of a chain. It is important to remove these files from disk, eventually.
The main reason to delay removal is that another process could read the commit-graph-chain file before it is rewritten, but then look for the graph-{hash} files after they are deleted.
To allow holding old split commit-graphs for a while after they are unreferenced, we update the modified times of the files when they become unreferenced.
Then, we scan the $OBJDIR/info/commit-graphs/ directory for graph-{hash} files whose modified times are older than a given expiry window.
This window defaults to zero, but can be changed using command-line arguments or a config setting.
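As an illustration of that scan, here is a minimal Python sketch. It is an assumption-laden approximation rather than Git's code: expire_layers is an invented name, the window is expressed in seconds, and the current time is passed in explicitly so the example is deterministic.

```python
import os
import tempfile
import time

def expire_layers(graph_dir, chain_hashes, expiry_window=0, now=None):
    """Delete unreferenced graph-{hash}.graph files older than the window."""
    now = time.time() if now is None else now
    keep = {f"graph-{h}.graph" for h in chain_hashes}
    removed = []
    for name in sorted(os.listdir(graph_dir)):
        if not (name.startswith("graph-") and name.endswith(".graph")):
            continue  # e.g. the commit-graph-chain file itself
        if name in keep:
            continue  # still part of the current chain
        path = os.path.join(graph_dir, name)
        if os.path.getmtime(path) < now - expiry_window:
            os.unlink(path)
            removed.append(name)
    return removed

with tempfile.TemporaryDirectory() as graph_dir:
    now = 1_700_000_000  # fixed clock for the example
    ages = {"graph-live.graph": 7200, "graph-old.graph": 3600, "graph-recent.graph": 10}
    for name, age in ages.items():
        path = os.path.join(graph_dir, name)
        open(path, "w").close()
        os.utime(path, (now - age, now - age))  # set access and modified times

    removed = expire_layers(graph_dir, ["live"], expiry_window=60, now=now)
    remaining = sorted(os.listdir(graph_dir))
    print(removed, remaining)
```

The recently unreferenced file survives the scan, which is the grace period that protects a concurrent reader still working from the old chain file.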
With Git 2.27 (Q2 2020), "git commit-graph write" learned different ways to write out split files.
See commit dbd5e0a (29 Apr 2020) by Junio C Hamano (gitster).
See commit 7a9ce02 (15 Apr 2020), and commit 6830c36, commit f478106, commit 8a6ac28, commit fdbde82, commit 4f02735, commit 2fa05f3 (14 Apr 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 6a1c17d, 01 May 2020)
builtin/commit-graph.c: introduce split strategy 'no-merge'
Signed-off-by: Taylor Blau
In the previous commit, we laid the groundwork for supporting different splitting strategies. In this commit, we introduce the first splitting strategy: 'no-merge'.
Passing '--split=no-merge' is useful for callers which wish to write a new incremental commit-graph, but do not want to spend effort condensing the incremental chain (*1).
Previously, this was possible by passing '--size-multiple=0', but this is no longer the case following 63020f175f ("commit-graph: prefer default size_mult when given zero", 2020-01-02, Git v2.25.0-rc2 -- merge).
When '--split=no-merge' is given, the commit-graph machinery will never condense an existing chain, and it will always write a new incremental.
(*1): This might occur when, for example, a server administrator running some program after each push may want to ensure that each job runs proportional in time to the size of the push, and does not "jump" when the commit-graph machinery decides to trigger a merge.
"git fsck --no-progress" still spewed noise from the commit-graph subsystem, which has been corrected with Git 2.42 (Q3 2023).
See commit 9281cd0, commit 7248857, commit f5facaa, commit eb319d6, commit 39bdd30, commit eda206f (07 Jul 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 6016ee0, 18 Jul 2023)
commit-graph.c: avoid duplicated progress output during verify
Signed-off-by: Taylor Blau
Acked-by: Derrick Stolee
When git commit-graph verify was taught how to verify commit-graph chains in 3da4b60 ("commit-graph: verify chains with --shallow mode", 2019-06-18, Git v2.23.0-rc0 -- merge listed in batch #6), it produced one line of progress per layer of the commit-graph chain.
    $ git.compile commit-graph verify
    Verifying commits in commit graph: 100% (4356/4356), done.
    Verifying commits in commit graph: 100% (131912/131912), done.
This could be somewhat confusing to users, who may wonder why there are multiple occurrences of "Verifying commits in commit graph".
There are likely good arguments on whether or not there should be one line of progress output per commit-graph layer.
On the one hand, the existing output shows us verifying each individual layer of the chain.
But on the other hand, the fact that a commit-graph may be stored among multiple layers is an implementation detail that the caller need not be aware of.
Clarify this by showing a single progress meter regardless of the number of layers in the commit-graph chain.
After this patch, the output reflects the logical contents of a commit-graph chain, instead of showing one line of output per commit-graph layer:
    $ git.compile commit-graph verify
    Verifying commits in commit graph: 100% (136268/136268), done.
With Git 2.43 (Q4 2023), "git commit-graph verify" is more robust against read errors when verifying graph chain.
See commit 5f25919, commit 7754a56, commit 47d06bb, commit 2d45710, commit 8298b54, commit 7ed76b4 (28 Sep 2023) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit c3c0020, 04 Oct 2023)
commit-graph: detect read errors when verifying graph chain
Signed-off-by: Jeff King
Because it's OK to not have a graph file at all, the graph_verify() function needs to tell the difference between a missing file and a real error.
So when loading a traditional graph file, we call open_commit_graph() separately from load_commit_graph_chain_fd_st(), and don't complain if the first one fails with ENOENT.
When the function learned about chain files in 3da4b60 ("commit-graph: verify chains with --shallow mode", 2019-06-18, Git v2.23.0-rc0 -- merge listed in batch #6), we couldn't be as careful, since the only way to load a chain was with read_commit_graph_one(), which did both the open/load as a single unit.
So we'll miss errors in chain files we load, thinking instead that there was just no chain file at all.
Note that we do still report some of these problems to stderr, as the loading function calls error() and warning().
But we'd exit with a successful exit code, which is wrong.
We can fix that by using the recently split open/load functions for chains.
That lets us treat the chain file just like a single file with respect to error handling here.
An existing test (from 3da4b60) shows off the problem; we were expecting "commit-graph verify" to report success, but that makes no sense.
We did not even verify the contents of the graph data, because we couldn't load it! I don't think this was an intentional exception, but rather just the test covering what happened to occur.
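The general pattern behind that fix (open first, load second, and treat only "file not found" as benign) can be sketched in Python. This is an analogy, not a port of Git's C code: load_optional and verify are hypothetical names, and the check for the CGPH signature at the start of the file merely stands in for real verification.

```python
import os
import tempfile

def load_optional(path):
    """Return the file's bytes, or None when the file simply does not exist."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except FileNotFoundError:  # ENOENT: having no graph at all is fine
        return None
    # Any other OSError propagates, so it cannot be mistaken for "no file".

def verify(path, signature=b"CGPH"):
    """Return an exit code: 0 for success or no file, 1 for a bad file."""
    data = load_optional(path)
    if data is None:
        return 0  # nothing to verify
    if not data.startswith(signature):
        return 1  # the file exists but could not be loaded as a graph
    return 0

with tempfile.TemporaryDirectory() as d:
    missing = os.path.join(d, "missing.graph")
    broken = os.path.join(d, "broken.graph")
    good = os.path.join(d, "good.graph")
    with open(broken, "wb") as f:
        f.write(b"garbage")
    with open(good, "wb") as f:
        f.write(b"CGPH" + b"\0" * 16)

    codes = [verify(missing), verify(broken), verify(good)]
    print(codes)
```

Before the fix, the chain case behaved as if the broken file were missing: the problem was printed to stderr, but the command still exited successfully.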
As far as I understand it so far... Imagine you have many commits in the master branch and another branch that also has many commits. If a VCS does not use Git's concept of hashes and just stores the differences between file versions, then when you want to change branches it either has to revert all changes back to the shared commit and apply all the changes of the other branch, or it has to compare all the files one by one. In my opinion Git's hashing approach seems to be the better one, even if Git has to do a lot of iterating/searching, I guess. I don't know if I'm right; I only started reading about Git today. Feel free to downvote/upvote and comment :D I think it's a topic where only a few people have really in-depth knowledge.