git gc --aggressive vs git repack
I'm looking for ways to reduce the size of a git repository. Searching leads me to git gc --aggressive most of the time. I have also read that this isn't the preferred approach.

Why? What should I be aware of when running gc --aggressive?

git repack -a -d --depth=250 --window=250 is recommended over gc --aggressive. Why? How does repack reduce the size of a repository? Also, I'm not quite clear about the flags --depth and --window.

Which should I choose between gc and repack, and when should I use each?

Gathard answered 25/2, 2015 at 13:23 Comment(0)

Nowadays there is no difference: git gc --aggressive operates according to the suggestion Linus made in 2007; see below. As of version 2.11 (Q4 2016), git defaults to a depth of 50. A window of 250 is good because it makes git consider more candidate objects when searching for deltas, but a depth of 250 is bad because it makes every chain refer to very deep old objects, which slows down all future git operations for marginally lower disk usage.
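To see what your own Git will use, you can query the relevant configuration; a runnable sketch on a throwaway repository (the built-in defaults noted in the comments assume Git 2.11 or later):

```shell
# Sketch: inspect the depth/window values an aggressive gc would use.
repo=$(mktemp -d) && cd "$repo" && git init -q

# Unset values mean Git falls back to its built-in defaults.
git config --get gc.aggressiveDepth  || echo "gc.aggressiveDepth unset (built-in default: 50)"
git config --get gc.aggressiveWindow || echo "gc.aggressiveWindow unset (built-in default: 250)"

# Current object/pack statistics for the repository:
git count-objects -v
```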


Historical Background

Linus suggested (see below for the full mailing list post) using git gc --aggressive only when you have, in his words, “a really bad pack” or “really horribly bad deltas”; however, “almost always, in other cases, it’s actually a really bad thing to do.” The result may even leave your repository in worse condition than when you started!

The command he suggests for doing this properly after having imported “a long and involved history” is

git repack -a -d -f --depth=250 --window=250

But this assumes you have already removed unwanted gunk from your repository history and that you have followed the checklist for shrinking a repository found in the git filter-branch documentation.

git-filter-branch can be used to get rid of a subset of files, usually with some combination of --index-filter and --subdirectory-filter. People expect the resulting repository to be smaller than the original, but you need a few more steps to actually make it smaller, because Git tries hard not to lose your objects until you tell it to. First make sure that:

  • You really removed all variants of a filename, if a blob was moved over its lifetime. git log --name-only --follow --all -- filename can help you find renames.

  • You really filtered all refs: use --tag-name-filter cat -- --all when calling git filter-branch.

Then there are two ways to get a smaller repository. The safer way is to clone, which keeps your original intact.

  • Clone it with git clone file:///path/to/repo. The clone will not have the removed objects. See git-clone. (Note that cloning with a plain path just hardlinks everything!)
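A minimal runnable sketch of this clone-based shrink (the repository and paths below are fabricated for the demo):

```shell
# Sketch: shrink by cloning over the file:// transport.
demo=$(mktemp -d) && cd "$demo"
git init -q original
git -C original -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "initial commit"

# file:// forces a real object transfer, so only reachable objects
# are copied; a plain local path would just hardlink the old packs,
# removed objects included.
git clone -q "file://$demo/original" slimmed

du -sh original/.git slimmed/.git   # compare on-disk sizes
```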

If you really don’t want to clone it, for whatever reasons, check the following points instead (in this order). This is a very destructive approach, so make a backup or go back to cloning it. You have been warned.

  • Remove the original refs backed up by git-filter-branch: say

    git for-each-ref --format="%(refname)" refs/original/ |
      xargs -n 1 git update-ref -d
    
  • Expire all reflogs with git reflog expire --expire=now --all.

  • Garbage collect all unreferenced objects with git gc --prune=now (or if your git gc is not new enough to support arguments to --prune, use git repack -ad; git prune instead).
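The three destructive steps above can be sketched end to end on a throwaway repository (the commit and the refs/original backup ref are fabricated here to simulate what git filter-branch leaves behind):

```shell
# Sketch of the full destructive cleanup on a demo repo.
demo=$(mktemp -d) && cd "$demo" && git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "some commit"
# Simulate the backup ref that git filter-branch leaves behind:
git update-ref refs/original/refs/heads/demo HEAD

# 1. Remove the refs/original/* backups
git for-each-ref --format="%(refname)" refs/original/ |
  xargs -n 1 git update-ref -d

# 2. Expire all reflogs immediately
git reflog expire --expire=now --all

# 3. Drop every object that is no longer reachable
git gc --prune=now --quiet
```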


Date: Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Daniel Berlin <dberlin at dberlin dot org>
cc: David Miller <davem at davemloft dot net>,
    ismail at pardus dot org dot tr,
    gcc at gcc dot gnu dot org,
    git at vger dot kernel dot org
Subject: Re: Git and GCC
In-Reply-To: <[email protected]>
Message-ID: <[email protected]>
References: <[email protected]>
            <[email protected]>
            <[email protected]>
            <[email protected]>
            <[email protected]>

On Thu, 6 Dec 2007, Daniel Berlin wrote:

Actually, it turns out that git-gc --aggressive does this dumb thing to pack files sometimes regardless of whether you converted from an SVN repo or not.

Absolutely. git --aggressive is mostly dumb. It’s really only useful for the case of “I know I have a really bad pack, and I want to throw away all the bad packing decisions I have done.”

To explain this, it’s worth explaining (you are probably aware of it, but let me go through the basics anyway) how git delta-chains work, and how they are so different from most other systems.

In other SCMs, a delta-chain is generally fixed. It might be “forwards” or “backwards,” and it might evolve a bit as you work with the repository, but generally it’s a chain of changes to a single file represented as some kind of single SCM entity. In CVS, it’s obviously the *,v file, and a lot of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more “loosely.” There is no fixed entity. Deltas are generated against any random other version that git deems to be a good delta candidate (with various fairly successful heuristics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It’s good for various conceptual reasons (i.e., git internally never really even needs to care about the whole revision chain — it doesn’t really think in terms of deltas at all), but it’s also great because getting rid of the inflexible delta rules means that git doesn’t have any problems at all with merging two files together, for example — there simply are no arbitrary *,v “revision files” that have some hidden meaning.

It also means that the choice of deltas is a much more open-ended question. If you limit the delta chain to just one file, you really don’t have a lot of choices on what to do about deltas, but in git, it really can be a totally different issue.

And this is where the really badly named --aggressive comes in. While git generally tries to re-use delta information (because it’s a good idea, and it doesn’t waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say “let’s start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas.”

So --aggressive is not really about being aggressive, but about wasting CPU time re-doing a decision we already did earlier!

Sometimes that is a good thing. Some import tools in particular could generate really horribly bad deltas. Anything that uses git fast-import, for example, likely doesn’t have much of a great delta layout, so it might be worth saying “I want to start from a clean slate.”

But almost always, in other cases, it’s actually a really bad thing to do. It’s going to waste CPU time, and especially if you had actually done a good job at deltaing earlier, the end result isn’t going to re-use all those good deltas you already found, so you’ll actually end up with a much worse end result too!

I’ll send a patch to Junio to just remove the git gc --aggressive documentation. It can be useful, but it generally is useful only when you really understand at a very deep level what it’s doing, and that documentation doesn’t help you do that.

Generally, doing incremental git gc is the right approach, and better than doing git gc --aggressive. It’s going to re-use old deltas, and when those old deltas can’t be found (the reason for doing incremental GC in the first place!) it’s going to create new ones.

On the other hand, it’s definitely true that an “initial import of a long and involved history” is a point where it can be worth spending a lot of time finding the really good deltas. Then, every user ever after (as long as they don’t use git gc --aggressive to undo it!) will get the advantage of that one-time event. So especially for big projects with a long history, it’s probably worth doing some extra work, telling the delta finding code to go wild.

So the equivalent of git gc --aggressive — but done properly — is to do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be (make them longer for old history — it’s worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (which is the “drop all old deltas” flag), since you now are actually trying to make sure that this one actually finds good candidates.

And then it’s going to take forever and a day (i.e., a “do it overnight” thing). But the end result is that everybody downstream from that repository will get much better packs, without having to spend any effort on it themselves.

          Linus
Stereopticon answered 25/2, 2015 at 14:5 Comment(1)
Your comment about depth is a bit confusing. At first I was going to complain that you are dead wrong, since an aggressive gc can greatly speed up a git repository: after an aggressive garbage collection, a HUGE repo where git status took five minutes dropped to seconds. But then I realised you didn't mean that the aggressive gc slowed down the repo, just an extremely large depth setting. – Beta

When should I use gc & repack?

As I mentioned in "Git Garbage collection doesn't seem to fully work", a git gc --aggressive is not sufficient on its own.
And, as I explain below, it is often not even needed.

The most effective combination would be adding git repack, but also git prune:

git gc
git repack -Ad      # kills in-pack garbage
git prune           # kills loose garbage
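A runnable sketch of this combination on a throwaway repository, with git count-objects before and after to verify the effect (repo contents fabricated for the demo):

```shell
# Sketch: snapshot object stats, run the combo, compare.
demo=$(mktemp -d) && cd "$demo" && git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "demo commit"

git count-objects -vH          # before: note "count" and "size-pack"

git gc --quiet
git repack -A -d -q            # kills in-pack garbage
git prune                      # kills loose garbage

git count-objects -vH          # after: loose objects folded into a pack
```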

Note: Git 2.11 (Q4 2016) will set the default gc aggressive depth to 50

See commit 07e7dbf (11 Aug 2016) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 0952ca8, 21 Sep 2016)

gc: default aggressive depth to 50

"git gc --aggressive" used to limit the delta-chain length to 250, which is way too deep for gaining additional space savings and is detrimental for runtime performance.
The limit has been reduced to 50.

The summary is: the current default of 250 doesn't save much space, and costs CPU. It's not a good tradeoff.

The "--aggressive" flag to git-gc does three things:

  1. use "-f" to throw out existing deltas and recompute from scratch
  2. use "--window=250" to look harder for deltas
  3. use "--depth=250" to make longer delta chains

Items (1) and (2) are good matches for an "aggressive" repack.
They ask the repack to do more computation work in the hopes of getting a better pack. You pay the costs during the repack, and other operations see only the benefit.

Item (3) is not so clear.
Allowing longer chains means fewer restrictions on the deltas, which means potentially finding better ones and saving some space.
But it also means that operations which access the deltas have to follow longer chains, which affects their performance.
So it's a tradeoff, and it's not clear that the tradeoff is even a good one.

(See commit for study)

You can see that the CPU savings for regular operations improve as we decrease the depth.
But we can also see that the space savings are not that great as the depth goes higher. Saving 5-10% between 10 and 50 is probably worth the CPU tradeoff. Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250 is probably not.
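Given these numbers, an explicit repack roughly matching the post-2.11 aggressive settings would look like the following sketch (window 250 and depth 50 are assumptions based on the commit above; the demo repo is fabricated):

```shell
# Sketch: recompute all deltas (-f), search a wide window, keep chains short.
demo=$(mktemp -d) && cd "$demo" && git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "demo"

git repack -a -d -f -q --window=250 --depth=50
ls .git/objects/pack/            # the recomputed pack
```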


Speaking of CPU savings, "git repack" learned to accept the --threads=<n> option and pass it to pack-objects.

See commit 40bcf31 (26 Apr 2017) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 31fb6f4, 29 May 2017)

repack: accept --threads=<n> and pass it down to pack-objects

We already do so for --window=<n> and --depth=<n>; this will help when the user wants to force --threads=1 for reproducible testing without getting affected by racing multiple threads.
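A minimal sketch of that option, assuming Git 2.14 or later (the first release carrying this change), on a fabricated demo repo:

```shell
# Sketch: single-threaded delta search for reproducible pack results.
demo=$(mktemp -d) && cd "$demo" && git init -q
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "demo"

git repack -a -d -q --threads=1
ls .git/objects/pack/
```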

Incessant answered 25/2, 2015 at 13:36 Comment(8)
I mentioned the Linus thread in the "Git Garbage collection doesn't seem to fully work" link – Incessant
Thanks for this modern update! Every other answer here is old. Now we can see that git gc --aggressive has been fixed twice: First, to do what Linus suggested in 2007 as a "better packing method". And then in Git 2.11 to avoid the excessive object depth that Linus had suggested but which turned out to be harmful (slows down all future Git operations and didn't save any space worth speaking of). – Fontana
git gc, followed by git repack -Ad and git prune, increases the size of my repository... why? – Truck
@Truck Not sure: what version of Git are you using? You can ask a new question for that (with more details such as the OS, the general size of your repo, ...) – Incessant
man git-repack says for -d: `Also run git prune-packed to remove redundant loose object files.` Or does git prune also do that? man git-prune says In most cases, users should run git gc, which calls git prune., so what's the use after git gc? Wouldn't it be better or sufficient to use only git repack -Ad && git gc? – Unstopped
@Unstopped git prune (git-scm.com/docs/git-prune) calls git prune-packed (git-scm.com/docs/git-prune-packed), so I would keep it in there. – Incessant
@Incessant sorry, overlooked that. But still, as git gc runs git prune, is there any good in running gc + repack + prune (in this specific order) instead of repack + gc? – Unstopped
@Unstopped In my test, using prune after repack was useful. But that might have changed since 2015. – Incessant

The problem with git gc --aggressive is that the option name and documentation are misleading.

As Linus himself explains in this mail, what git gc --aggressive basically does is this:

While git generally tries to re-use delta information (because it's a good idea, and it doesn't waste CPU time re-finding all the good deltas we found earlier), sometimes you want to say "let's start all over, with a blank slate, and ignore all the previous delta information, and try to generate a new set of deltas".

Usually there is no need to recalculate deltas in git, since git chooses these deltas very flexibly. Recalculating only makes sense if you know that you have really, really bad deltas; as Linus explains, it is mainly tools that use git fast-import which fall into this category.

Most of the time git does a pretty good job at determining useful deltas, and using git gc --aggressive will waste a lot of CPU time while potentially leaving you with even worse deltas.


Linus ends his mail with the conclusion that git repack with a large --depth and --window is the better choice most of the time; especially after you have imported a large project and want to make sure that git finds good deltas.

So the equivalent of git gc --aggressive - but done properly - is to do (overnight) something like

git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be (make them longer for old history - it's worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan.

And here, you might well want to add the -f flag (which is the "drop all old deltas" flag), since you now are actually trying to make sure that this one actually finds good candidates.

Marshamarshal answered 25/2, 2015 at 13:41 Comment(0)

Caution: do not run git gc --aggressive on a repository that is not synchronized with a remote if you have no backups.

This operation recreates deltas from scratch and can lead to data loss if it is interrupted ungracefully.

On my 8 GB machine, an aggressive gc ran out of memory on a 1 GB repository with 10k small commits. When the OOM killer terminated the git process, it left me with an almost empty repository; only the working tree and a few deltas survived.

Of course, it was not the only copy of the repository, so I just recreated it and pulled from the remote (fetch did not work on the broken repo and deadlocked on the 'resolving deltas' step the few times I tried). But if your repo is a single-developer local repo without any remotes, back it up first.
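A cheap safety net before any aggressive repack is a full mirror backup; a runnable sketch with fabricated repo names:

```shell
# Sketch: back up before risky maintenance.
demo=$(mktemp -d) && cd "$demo"
git init -q work
git -C work -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "precious commit"

# --mirror copies every ref (branches, tags, notes) plus all objects:
git clone -q --mirror work work-backup.git

# If the original is ever destroyed, restore from the bare mirror:
git clone -q work-backup.git restored
git -C restored log --oneline
```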

Insignia answered 3/6, 2018 at 21:4 Comment(0)

Note: beware of using git gc --aggressive, as Git 2.22 (Q2 2019) clarifies.

See commit 0044f77, commit daecbf2, commit 7384504, commit 22d4e3b, commit 080a448, commit 54d56f5, commit d257e0f, commit b6a8d09 (07 Apr 2019), and commit fc559fb, commit cf9cd77, commit b11e856 (22 Mar 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit ac70c53, 25 Apr 2019)

gc docs: downplay the usefulness of --aggressive

The existing "gc --aggressive" docs come just short of recommending to users that they run it regularly.
I've personally talked to many users who've taken these docs as advice to use this option, and usually it's (mostly) a waste of time.

So let's clarify what it really does, and let the user draw their own conclusions.

Let's also clarify the "The effects [...] are persistent" to paraphrase a brief version of Jeff King's explanation.

That means the git-gc documentation now includes:

AGGRESSIVE

When the --aggressive option is supplied, git-repack will be invoked with the -f flag, which in turn will pass --no-reuse-delta to git-pack-objects.
This will throw away any existing deltas and re-compute them, at the expense of spending much more time on the repacking.

The effects of this are mostly persistent, e.g. when packs and loose objects are coalesced into one another pack the existing deltas in that pack might get re-used, but there are also various cases where we might pick a sub-optimal delta from a newer pack instead.

Furthermore, supplying --aggressive will tweak the --depth and --window options passed to git-repack.
See the gc.aggressiveDepth and gc.aggressiveWindow settings below.
By using a larger window size we're more likely to find more optimal deltas.

It's probably not worth it to use this option on a given repository without running tailored performance benchmarks on it.
It takes a lot more time, and the resulting space/delta optimization may or may not be worth it. Not using this at all is the right trade-off for most users and their repositories.

And (commit 080a448):

gc docs: note how --aggressive impacts --window & --depth

Since 07e7dbf (gc: default aggressive depth to 50, 2016-08-11, Git v2.10.1) we somewhat confusingly use the same depth under --aggressive as we do by default.

As noted in that commit, that makes sense: it's unlikely that anyone wants deeper delta chains to save disk space at the expense of runtime performance, which is the opposite of what someone asking for an "aggressive gc" wants.
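The "tailored performance benchmarks" mentioned in the documentation above can be as crude as comparing disk usage and timing a common read operation before and after. A sketch on a throwaway repository (on a real repo, run this against a scratch clone instead):

```shell
# Sketch: crude before/after benchmark of an aggressive gc.
demo=$(mktemp -d) && cd "$demo" && git init -q
for i in 1 2 3 4 5; do
  git -c user.email=demo@example.com -c user.name=demo \
      commit -q --allow-empty -m "commit $i"
done

du -sh .git/objects              # size before
time git log --oneline >/dev/null

git gc --aggressive --quiet

du -sh .git/objects              # size after
time git log --oneline >/dev/null
```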

Incessant answered 27/4, 2019 at 22:12 Comment(0)
