If my repository contains several copies of the same files with only small changes (don't ask why), will git save space by only storing the differences between the files?
It could, but it is very hard to say whether it will. There are situations where it is guaranteed that it won't.
To understand this answer (and its limitations) we must look at the way git stores objects. There's a good description of the format of "git objects" (as stored in .git/objects/) in this stackoverflow answer or in the Pro Git book.
When storing "loose objects" like this—which git does for what we might call "active" objects—they are zlib-deflated, as the Pro Git book says, but not otherwise compressed. So two different (not bit-for-bit identical) files stored in two different objects are never compressed against each other.
On the other hand, eventually objects can be "packed" into a "pack file". See another section of the Pro Git book for information on pack files. Objects stored in pack files are "delta-compressed" against other objects in the same file. Precisely what criteria git uses for choosing which objects are compressed against which other objects is quite obscure. Here's a snippet from the Pro Git Book again:
When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The git verify-pack plumbing command allows you to see what was packed up [...]
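For instance (the hash embedded in the pack file's name will differ in your repository), once a pack exists you can list its entries and see which ones were stored as deltas against a base object:

git gc
git verify-pack -v .git/objects/pack/pack-*.idx | head -20
# delta entries show a small size-in-pack plus the SHA-1 of their base object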
If git decides to delta-compress "pack entry for big file A" vs "pack entry for big file B", then—and only then—can git save space in the way you asked.
Git makes pack files every time git gc runs (or more precisely, through git pack-objects and git repack; higher-level operations, including git gc, run these for you when needed/appropriate). At this time, git gathers up loose objects, and/or explodes and re-packs existing packs. If your close-but-not-quite-identical files get delta-compressed against each other at this point, you may see some very large space-savings.
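If you want to trigger that re-packing by hand, these standard commands are one way to do it (git will also do it on its own eventually):

git repack -a -d     # gather everything into one fresh pack; drop redundant packs
git gc --aggressive  # or: spend more CPU searching for better deltas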
If you then go to modify the files, though, you'll work on the expanded and uncompressed versions in your work tree and then git add the result. This will make a new "loose object", and by definition that won't be delta-compressed against anything (no other loose object, nor any pack).
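You can watch the loose objects reappear with the standard git count-objects plumbing command (the file name below is just an example):

git count-objects -v                # "count" = loose objects, "in-pack" = packed
echo change >> some-file.txt && git add some-file.txt
git count-objects -v                # "count" and "size" grow: a new loose object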
When you clone a repository, generally git makes packs (or even "thin packs", which are packs that are not stand-alone) out of the objects to be transferred, so that what is sent across the Intertubes is as small as possible. So here you may get the benefit of delta compression even if the objects are loose in the source repository. Again, you'll lose the benefit as soon as you start working on those files (turning them into loose objects), and regain it only if-and-when the loose objects are packed again and git's heuristics compress them against each other.
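To observe the packing-on-clone effect locally, you can force git to use its normal transport instead of hardlinking; --no-local is a standard git clone flag, and the paths here are placeholders:

git clone --no-local path/to/source-repo cloned-repo
du -sh cloned-repo/.git   # the objects arrive packed, even if loose in the source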
The real takeaway here is that to find out, you can simply try it, using the method outlined in the Pro Git book.
will git save space by only storing the differences between the files?
Yes, git can pack the files into a compressed format.
You have two nearly identical 4K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?
It turns out that it can. The initial format in which Git saves objects on disk is called a loose object format. However, occasionally Git packs up several of these objects into a single binary file called a packfile in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server. To see what happens, you can manually ask Git to pack up the objects by calling the git gc command:
git gc
A commenter wondered whether their data would shrink "from 100 GB to 1~10 GB or so if I put it into a git repo and run git gc on it. We shall see!" – Shafer
Yes, it can. Running git gc is the magic that may make it happen. See the answer by @Emil Davtyan here, for instance. @torek also mentions some of this.
See this link in particular: 10.4 Git Internals - Packfiles, which, in addition to the quote earlier in this answer, says this (emphasis added):
What is cool is that although the objects on disk before you ran the gc command were collectively about 15K in size, the new packfile is only 7K. You've cut your disk usage by half by packing your objects. How does Git do this? When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next.
How to try it out yourself and see how much space you can save
cd path/to/my_repo
# check the size of your repo's .git folder
du -sh .git
# try compressing your repo by running "git garbage collection"
time git gc
# re-check the size of your repo's .git folder
du -sh .git
Here are some real results for me:
On a small documentation repo with mostly just markdown .md text docs: 1.7M --> 288K:
$ du -sh .git
1.7M    .git
$ git gc
Enumerating objects: 182, done.
Counting objects: 100% (182/182), done.
Delta compression using up to 20 threads
Compressing objects: 100% (178/178), done.
Writing objects: 100% (182/182), done.
Total 182 (delta 103), reused 4 (delta 0), pack-reused 0
$ du -sh .git
288K    .git
On a larger ~150 MB repo with code and some binary build files: 50M --> 48M:
$ du -sh .git
50M     .git
$ time git gc
Enumerating objects: 8449, done.
Counting objects: 100% (8449/8449), done.
Delta compression using up to 20 threads
Compressing objects: 100% (2872/2872), done.
Writing objects: 100% (8449/8449), done.
Total 8449 (delta 5566), reused 8376 (delta 5524), pack-reused 0

real    0m1.603s
user    0m2.098s
sys     0m0.167s
$ du -sh .git
48M     .git
On a brand-new 107 GB directory with 2.1M (2.1 million) files from 25 years of semi-duplicate data, where someone just copied the same 300 MB folder over and over again (hundreds of times) as their version control system:
11 GB after the initial git gc packing process, which it automatically did after first running git commit to add all of the files. git commit took 11 minutes on a very high-end laptop with a very high-speed SSD. So, since git gc had just run automatically after git commit, there's no change to see below, but it's very impressive that 2.1M files comprising 107 GB got packed down to an 11 GB .git folder!
$ du -sh .git
11G     .git
$ time git gc
Enumerating objects: 190027, done.
Counting objects: 100% (190027/190027), done.
Delta compression using up to 20 threads
Compressing objects: 100% (60886/60886), done.
Writing objects: 100% (190027/190027), done.
Total 190027 (delta 124418), reused 190025 (delta 124417), pack-reused 0

real    0m43.456s
user    0m34.286s
sys     0m6.565s
$ du -sh .git
11G     .git
For more details, see my longer answer on this, here: What are the file limits in Git (number and size)?
See also this suggestion from a comment:
git maintenance run --auto
– Kenyakenyatta
Finally, a note from a comment: an empty file's blob object consists of nothing but the object header blob 0\x00. All empty files will have exactly the same SHA1 hash and therefore there will be only one such blob in your repo, regardless of whether you committed one or one thousand empty files. – Rikki
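You can check this in any repository; the empty blob always hashes to the same well-known SHA-1:

printf '' | git hash-object --stdin   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
git hash-object /dev/null             # same hash: the one and only empty blob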