Does git de-duplicate between files?
If my repository contains several copies of the same files with only small changes (don't ask why), will git save space by only storing the differences between the files?

Fact answered 4/9, 2014 at 9:32 Comment(2)
Let's take this to the extreme: suppose you have committed several files that have no content (i.e., files that are completely empty). The git representation of such a file, before zlib compression, is blob 0\x00. All empty files have exactly the same SHA-1 hash, so there will be only one such blob in your repo, regardless of whether you committed one empty file or one thousand. – Rikki
@DavidHammen That's true, but it only applies to identical files. The question is about files that are not identical, but similar. – Tifanie
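The de-duplication of *identical* content described in the comment above is easy to check for yourself. A quick sketch (file names are just for illustration):

```shell
# Identical content always hashes to the same blob ID, so git stores
# it only once, no matter how many paths contain that content.
cd "$(mktemp -d)"
echo 'same content' > a.txt
cp a.txt b.txt
git hash-object a.txt b.txt   # prints the identical SHA-1 twice
```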

It could, but it is very hard to say whether it will. There are situations where it is guaranteed that it won't.

To understand this answer (and its limitations) we must look at the way git stores objects. There's a good description of the format of "git objects" (as stored in .git/objects/) in this stackoverflow answer or in the Pro Git book.

When storing "loose objects" like this—which git does for what we might call "active" objects—they are zlib-deflated, as the Pro Git book says, but not otherwise compressed. So two different (not bit-for-bit identical) files stored in two different objects are never compressed against each other.
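You can watch this happen with plumbing commands. A small sketch (file names and contents are just illustrative):

```shell
# Two similar-but-not-identical files become two independent loose
# objects; nothing is shared between them at this stage.
cd "$(mktemp -d)" && git init -q .
printf 'line 1\nline 2\n'         > a.txt
printf 'line 1\nline 2 changed\n' > b.txt
git hash-object -w a.txt          # writes the first blob
git hash-object -w b.txt          # a different blob, stored in full
git count-objects                 # reports 2 loose objects, no pack yet
```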

On the other hand, eventually objects can be "packed" into a "pack file". See another section of the Pro Git book for information on pack files. Objects stored in pack files are "delta-compressed" against other objects in the same file. Precisely what criteria git uses for choosing which objects are compressed against which other objects is quite obscure. Here's a snippet from the Pro Git Book again:

When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next. You can look into the packfile and see what Git did to save space. The git verify-pack plumbing command allows you to see what was packed up [...]

If git decides to delta-compress "pack entry for big file A" vs "pack entry for big file B", then—and only then—can git save space in the way you asked.
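To check whether that actually happened in a given repository, you can pack it and inspect the result with git verify-pack. A sketch (the repo setup here is illustrative; objects this small may or may not get delta-compressed):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
printf 'line 1\nline 2\n'         > a.txt
printf 'line 1\nline 2 changed\n' > b.txt
git add . && git commit -qm demo
git gc -q
# -v lists every packed object; delta entries carry two extra columns:
# the delta chain depth and the SHA-1 of the base object
git verify-pack -v .git/objects/pack/pack-*.idx
```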

Git makes pack files every time git gc runs (or more precisely, through git pack-objects and git repack; higher level operations, including git gc, run these for you when needed/appropriate). At this time, git gathers up loose objects, and/or explodes and re-packs existing packs. If your close-but-not-quite-identical files get delta-compressed against each other at this point, you may see some very large space-savings.

If you then go to modify the files, though, you'll work on the expanded and uncompressed versions in your work tree and then git add the result. This will make a new "loose object", and by definition that won't be delta-compressed against anything (no other loose object, nor any pack).

When you clone a repository, generally git makes packs (or even "thin packs", which are packs that are not stand-alone) out of the objects to be transferred, so that what is sent across the Intertubes is as small as possible. So here you may get the benefit of delta compression even if the objects are loose in the source repository. Again, you'll lose the benefit as soon as you start working on those files (turning them into loose objects), and regain it only if-and-when the loose objects are packed again and git's heuristics compress them against each other.

The real takeaway here is that to find out, you can simply try it, using the method outlined in the Pro Git book.

Klong answered 4/9, 2014 at 10:7 Comment(0)

will git save space by only storing the differences between the files?

Yes, git can pack the files into a compressed format.

You have two nearly identical 4K objects on your disk. Wouldn’t it be nice if Git could store one of them in full but then the second object only as the delta between it and the first?

It turns out that it can. The initial format in which Git saves objects on disk is called a loose object format. However, occasionally Git packs up several of these objects into a single binary file called a packfile in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server. To see what happens, you can manually ask Git to pack up the objects by calling the git gc command.
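One way to watch the loose-to-packed transition is git count-objects -v, which reports loose and packed objects separately. A minimal sketch (assuming git is configured with a user name and email):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
echo 'some content' > file.txt
git add file.txt && git commit -qm 'initial'
git count-objects -v   # "count" = loose objects (non-zero here)
git gc -q
git count-objects -v   # loose count drops (typically to 0);
                       # "in-pack" is now non-zero
```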

Keijo answered 4/9, 2014 at 9:46 Comment(2)
Upvoted! Also from that link: "How does Git do this? When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next." This gives me hope. I have a 100 GB directory with 25 years of duplicate data where someone just copied the same 300 MB folder over again and again (hundreds of times) as their version control system. I'm hoping it will compress down by git gc from 100 GB to 1~10 GB or so if I put it into a git repo and run git gc on it. We shall see! – Shafer
And we have seen. – Shafer

Yes, it can. Running git gc is the magic that may make it happen. See the answer by @Emil Davtyan here, for instance. @torek also mentions some of this.

See this link in particular, 10.4 Git Internals - Packfiles, in addition to the quote in this answer here (emphasis added):

What is cool is that although the objects on disk before you ran the gc command were collectively about 15K in size, the new packfile is only 7K. You’ve cut your disk usage by half by packing your objects.

How does Git do this? When Git packs objects, it looks for files that are named and sized similarly, and stores just the deltas from one version of the file to the next.

How to try it out yourself and see how much space you can save

cd path/to/my_repo

# check the size of your repo's .git folder
du -sh .git

# try compressing your repo by running "git garbage collection"
time git gc

# re-check the size of your repo's .git folder
du -sh .git
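If a plain git gc doesn't reclaim much, a more aggressive repack is sometimes worth the CPU time. These are standard git options, but the gains vary a lot by repo, so treat this as something to experiment with (the repo setup below is just illustrative):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.email you@example.com && git config user.name you
seq 1 2000 > data.txt && git add . && git commit -qm v1
seq 1 2001 > data.txt && git add . && git commit -qm v2
# Recompute all deltas from scratch (-f) with a larger delta search
# window; slower, but can find better cross-file deltas than defaults.
git repack -a -d -f --window=250 --depth=50
git count-objects -vH        # inspect the resulting pack size
```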

Here are some real results for me:

  1. On a small documentation repo with mostly just markdown .md text docs:

    1.7M --> 288K:

    $ du -sh .git
    1.7M    .git
    $ git gc
    Enumerating objects: 182, done.
    Counting objects: 100% (182/182), done.
    Delta compression using up to 20 threads
    Compressing objects: 100% (178/178), done.
    Writing objects: 100% (182/182), done.
    Total 182 (delta 103), reused 4 (delta 0), pack-reused 0
    $ du -sh .git
    288K    .git
    
  2. On a larger ~150 MB repo with code and some binary build files:

    50M --> 48M:

    $ du -sh .git
    50M .git
    $ time git gc
    Enumerating objects: 8449, done.
    Counting objects: 100% (8449/8449), done.
    Delta compression using up to 20 threads
    Compressing objects: 100% (2872/2872), done.
    Writing objects: 100% (8449/8449), done.
    Total 8449 (delta 5566), reused 8376 (delta 5524), pack-reused 0
    
    real    0m1.603s
    user    0m2.098s
    sys 0m0.167s
    $ du -sh .git
    48M .git
    
  3. On a brand-new 107 GB directory with 2.1M (2.1 million) files from 25 years of semi-duplicate data where someone just copied the same 300 MB folder over again and again (hundreds of times) as their version control system:

The repo was 11 GB after the initial git gc packing, which git ran automatically after the first git commit added all of the files.

    git commit took 11 minutes on a very high-end laptop with a very high-speed SSD.

So, since git gc had just run automatically after git commit, there's no change to see here, but it's very impressive that 2.1M files comprising 107 GB got packed down to only 11 GB!

    11 GB .git folder

    $ du -sh .git
    11G .git
    $ time git gc
    Enumerating objects: 190027, done.
    Counting objects: 100% (190027/190027), done.
    Delta compression using up to 20 threads
    Compressing objects: 100% (60886/60886), done.
    Writing objects: 100% (190027/190027), done.
    Total 190027 (delta 124418), reused 190025 (delta 124417), pack-reused 0
    
    real    0m43.456s
    user    0m34.286s
    sys 0m6.565s
    $ du -sh .git
    11G .git
    

    For more details, see my longer answer on this, here: What are the file limits in Git (number and size)?

See also:

  1. What are the file limits in Git (number and size)?
    1. my answer
Shafer answered 13/7, 2023 at 17:53 Comment(1)
And do not forget the more recent (2020+, Git 2.29+) git maintenance run --auto – Kenyakenyatta
