git push is very slow for a branch

We have a git repo that is quite large (iOS app resources). I appreciate that git is going to be slow when working with it, but if I create a new branch and edit a couple of files (not binary ones) and push, it takes forever.

It feels like the entire repo is being pushed. I was under the impression that git would only send the diff, is that wrong? (I know git stores compressed versions of the whole file, I mean the diff between my branch and where I branched from).

If I run git diff --stat --cached origin/foo then I see a short list of files that looks like what I would expect, e.g. 34 files changed, 1117 insertions(+), 72 deletions(-). But when I push it gets to Writing objects: 21% (2317/10804) and grinds to a halt, as if it's pushing all 2.4GB of binary data.

Am I missing something (I've googled it pretty hard)? Is this the expected behaviour? I'm using git 2.2.2 on OS X (Mavericks), and ssh ([email protected]).

I found a similar question here: Git - pushing a remote branch for a large project is really slow, but it has no real answers.

Tolliver answered 18/3, 2015 at 9:55 Comment(4)
see discussion on git mailing list - comments.gmane.org/gmane.comp.version-control.git/265716 – Tolliver
Live link to the thread mentioned by @Tolliver: public-inbox.org/git/… – Overunder
For large repos only, you now (Q1 2019) have, with Git for Windows 2.21, the config pack.sparse, which can help the performance of the push. That will be generalized for all platforms. See "git push is very slow for a huge repo" – Annelieseannelise
See also, with Git 2.38 (Q3 2022), the new setting git -c push.useBitmaps=false push – Annelieseannelise

You're using a "smart" transport (this is a good thing), so you do get deltas, or more specifically, "delta compression". But that's not to say that git pushes diffs.

Both push and fetch work the same way here: on a smart transport, your git calls up the remote and both ends have a mini conversation to figure out who has which repository objects, identified by SHA-1 and attached to specific labels (typically branch and tag names although other labels are allowed as well).

For instance, in this case, your git calls up theirs and says: "I propose to have you set your branch master to SHA-1 1234567.... I see that your master is currently 333333..., here's what I think you need to get from there to 1234567...." Theirs should reply with "ok, I need some of those but I already have ...". Once your git has figured out what needs to be sent, and what is already present, your git builds a "thin pack"1 containing all the to-be-sent objects. (This is the "delta compressing using up to %d threads" phase.)

The resulting thin pack is then sent over the smart transport; this is where you see the "writing objects" messages. (The entire thin pack must be sent successfully, after which the receiver "fattens it up" again using git index-pack --fix-thin and drops it into the repository.)
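
If you want to watch that conversation and the pack transfer as they happen, Git can trace them for you. This is just an illustrative invocation (the remote name origin and branch master are assumptions; the output is verbose):

$ GIT_TRACE_PACKET=1 git push origin master 2>&1 | head -40
[shows the ref advertisement and the want/have negotiation]
$ GIT_TRACE=1 git push origin master
[shows the ssh and pack-objects commands Git runs to build and send the thin pack]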

Exactly what data is sent depends on the objects in the thin pack. That should be just the set of commits between "what they have" and "what you're sending", plus any objects (trees and blobs) needed for those commits, plus any annotated tags you're sending and any objects needed for those, minus anything they already have.

You can find the commits in question by using git fetch to pick up their latest information, then using git rev-list to see what commits you'd send them. For instance, if you're just going to push things on master:

$ git fetch origin   # assuming the remote name is origin
[wait for it to finish]
$ git rev-list origin/master..master

Examining these commits may show a very large binary file that is contained in one of the middle ones, then removed again in a later commit:

$ git log --name-status origin/master..master

If one commit has A giantfile.bin and then a subsequent (probably listed first in git log output) commit has D giantfile.bin, you're probably getting hung up sending the blob for giantfile.bin.
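
If the log output is too long to eyeball, you can also list the blobs that are reachable only from your unpushed commits, sorted by size. This is a rough approximation of what ends up in the thin pack, not an exact prediction, and it assumes the origin/master..master range from above:

$ git rev-list --objects origin/master..master |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $3, $4 }' |
    sort -rn | head -20
[largest blobs first: size in bytes, then the path rev-list saw them at]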

If that's the case, you can use git rebase -i to eliminate the commit that adds the giant binary file, so that git push won't have to send that commit.
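
A rough sketch of that cleanup (the file name giantfile.bin is just the placeholder from above, and you should only rewrite commits that you haven't pushed anywhere else):

$ git rebase -i origin/master
[in the editor: delete the line for the commit that only adds giantfile.bin;
 or, if that commit also has changes you want to keep, mark it "edit", then:]
$ git rm --cached giantfile.bin
$ git commit --amend
$ git rebase --continue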

(If your history is linear—has no merges to push—then you can also, or instead, use git format-patch to create a series of email messages that contain patches. These are suitable for emailing to someone at the other site—not that there's someone at github waiting to receive them, but you can easily examine the patch files to see if any of them are enormous.)
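
For example (assuming origin/master..master again, and using /tmp/outgoing as a scratch directory):

$ git format-patch -o /tmp/outgoing origin/master..master
$ ls -lhS /tmp/outgoing
[patch files listed largest-first; one enormous file points at the offending commit]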


1The pack is "thin" in that it violates a normal pack-file rule that requires any delta-compression "downstream" object to be in the pack itself. Instead, the "downstream" objects can (in fact, must) be in the repository receiving the thin pack.

Kishke answered 18/3, 2015 at 10:29 Comment(7)
unfortunately, the format-patch totals 1.7MB, which doesn't explain the slow push – Tolliver
At this point the thing to investigate is what's in the thin-pack, and whether your ssh transport is simply getting stuck (due to some probably-not-git-related network issue). A network tracer like tcpdump may be helpful for the latter. – Kishke
is the thin-pack === git format-patch origin/foo..foo? – Tolliver
No: the key difference is that a thin pack contains actual git objects, while format-patch makes diffs suitable for emailing. The diffs provide enough information to reconstruct files, but not a complete commit graph, nor tags. Or, in other words, format-patch gets you most but not all information, in a different format. There's no direct way to compare sizes, although one might reasonably expect a strong positive correlation. – Kishke
On how Git builds the objects to send, see (with Git 2.21+) "git push is very slow for a huge repo". – Annelieseannelise
I don't know what was going on, but in my case there was only one remote branch (master) and I ran into the same issue of a slow push; when I pulled master and merged it into my local branch, it was fixed, even though there were no giant files in between... – Annabel
@ShashankBhatt: you may have run into the bug VonC mentions in his answer. The fetch-and-merge might have triggered a repack. Or, you may have run into the case that occurs with depth 1 shallow clones. We don't have enough information about your setup to say. – Kishke

Note that Git 2.25 fixes an extreme slowdown in pack-objects when you have more than 1023 packs. See below for numbers.

Another option: Git 2.38 (Q3 2022) adds the setting push.useBitmaps, so git -c push.useBitmaps=false push disables the use of pack bitmaps when git push builds the pack it sends.
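
If that option helps, the same thing can be set once per repository instead of passing -c every time (requires Git 2.38 or later):

$ git config push.useBitmaps false
[subsequent git push runs in this repository skip reachability bitmaps when building the pack]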

Back to the Git 2.25 fix: it might have a positive influence on your case if you have a large number of pack files.
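
To check whether that applies to your repository, count its pack files; roughly 1024 or more is where the slow path described below used to kick in:

$ git count-objects -v
[the "packs:" line reports the number of pack files]
$ ls .git/objects/pack/*.pack | wc -l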

See commit f66e040 (11 Nov 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 8faff38, 01 Dec 2019)

pack-objects: avoid pointless oe_map_new_pack() calls

Signed-off-by: Jeff King
Reviewed-by: Derrick Stolee

Since 43fa44fa3b (pack-objects: move in_pack out of struct object_entry, 2018-04-14), we use a complicated system to save some per-object memory.

Each object_entry struct gets a 10-bit field to store the index of the pack it's in. We map those indices into pointers using packing_data->in_pack_by_idx, which we initialize at the start of the program.
If we have 2^10 or more packs, then we instead create an array of pack pointers, one per object. This is packing_data->in_pack.

So far so good. But there's one other tricky case: if a new pack arrives after we've initialized in_pack_by_idx, it won't have an index yet. We solve that by calling oe_map_new_pack(), which just switches on the fly to the less-optimal in_pack mechanism, allocating the array and back-filling it for already-seen objects.

But that logic kicks in even when we've switched to it already (whether because we really did see a new pack, or because we had too many packs in the first place). It doesn't produce a wrong outcome, but it's very slow. What happens is this:

  • imagine you have a repo with 500k objects and 2000 packs that you want to repack.

  • before looking at any objects, we call prepare_in_pack_by_idx().
    It starts allocating an index for each pack.
    On the 1024th pack, it sees there are too many, so it bails, leaving in_pack_by_idx as NULL.

  • while actually adding objects to the packing list, we call oe_set_in_pack(), which checks whether the pack already has an index.
    If it's one of the packs after the first 1023, then it doesn't have one, and we'll call oe_map_new_pack().

But there's no useful work for that function to do.
We're already using in_pack, so it just uselessly walks over the complete list of objects, trying to backfill in_pack.

And we end up doing this for almost 1000 packs (each of which may be triggered by more than one object). And each time it triggers, we may iterate over up to 500k objects. So in the absolute worst case, this is quadratic in the number of objects.

The solution is simple: we don't need to bother checking whether the pack has an index if we've already converted to using in_pack, since by definition we're not going to use it. So we can just push the "does the pack have a valid index" check down into that half of the conditional, where we know we're going to use it.

The current test in p5303 sadly doesn't notice this problem, since it maxes out at 1000 packs. If we add a new test to it at 2000 packs, it does show the improvement:

Test                      HEAD^               HEAD
----------------------------------------------------------------------
5303.12: repack (2000)    26.72(39.68+0.67)   15.70(28.70+0.66) -41.2%

However, these many-pack test cases are rather expensive to run, so adding larger and larger numbers isn't appealing. Instead, we can show it off more easily by using GIT_TEST_FULL_IN_PACK_ARRAY, which forces us into the absolute worst case: no pack has an index, so we'll trigger oe_map_new_pack() pointlessly for every single object, making it truly quadratic.

Here are the numbers (on git.git) with the included change to p5303:

Test                      HEAD^               HEAD
----------------------------------------------------------------------
5303.3: rev-list (1)      2.05(1.98+0.06)     2.06(1.99+0.06) +0.5%
5303.4: repack (1)        33.45(33.46+0.19)   2.75(2.73+0.22) -91.8%
5303.6: rev-list (50)     2.07(2.01+0.06)     2.06(2.01+0.05) -0.5%
5303.7: repack (50)       34.21(35.18+0.16)   3.49(4.50+0.12) -89.8%
5303.9: rev-list (1000)   2.87(2.78+0.08)     2.88(2.80+0.07) +0.3%
5303.10: repack (1000)    41.26(51.30+0.47)   10.75(20.75+0.44) -73.9%

Again, those improvements aren't realistic for the 1-pack case (because in the real world, the full-array solution doesn't kick in), but it's more useful to be testing the more-complicated code path.

While we're looking at this issue, we'll tweak one more thing: in oe_map_new_pack(), we call REALLOC_ARRAY(pack->in_pack). But we'd never expect to get here unless we're back-filling it for the first time, in which case it would be NULL.
So let's switch that to ALLOC_ARRAY() for clarity, and add a BUG() to document the expectation. Unfortunately this code isn't well-covered in the test suite because it's inherently racy (it only kicks in if somebody else adds a new pack while we're in the middle of repacking).

Annelieseannelise answered 9/12, 2019 at 19:22 Comment(0)

In my case, I just happened to unintentionally add a really large file to my commit.

Raphael answered 26/4, 2022 at 14:43 Comment(0)
