Git fetch for many files is slow against a high-latency disk

What I'm interested in here is some insight into git's internals -

If I have a repo hosted remotely on Bitbucket with many files (say ~25000, they're all around 2K in size), why is the first fetch so slow when targeting a high-latency disk?

I would expect operations like the first checkout to be slow, due to the need to write lots of files, but the fetch should only be receiving a handful of metadata and pack files and writing those to disk. The disk is high-latency but throughput is fine, so the performance of writing a small number of large files is generally fine.
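
A rough way to see where the time actually goes is to run the fetch with Git's standard trace variables (nothing specific to my setup, just generic diagnostics):

  GIT_TRACE_PERFORMANCE=1 GIT_TRACE_PACKET=1 git fetch origin 2>trace.log
  # GIT_TRACE_PERFORMANCE times each step, including the index-pack that writes the received pack to disk
  # GIT_TRACE_PACKET logs the protocol traffic, i.e. what is actually sent and received over the wire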

Necker answered 24/10, 2017 at 21:23 Comment(2)
Are you saying that the same operation is much faster on an otherwise-identical PC on the same network? Does your git repo have a long history? (git clone by default will also fetch every version from every branch for all of history.)Macias
Same operation is much faster on the same PC on a different drive (local drive as opposed to a particularly slow network drive). Repo has only one checkin.Necker

The fetch should only be receiving a handful of metadata and pack files and writing those to disk.

Still, Git 2.20 (Q4 2018) will improve fetching speed.

That is because, when creating a thin pack (one in which an object may be stored as a delta against a base object that is not in the resulting pack but is known to be present on the receiving end), the code learned to take advantage of the reachability bitmap; this allows the server to send a delta against a base beyond the "boundary" commit.

See commit 6a1e32d, commit 30cdc33 (21 Aug 2018), and commit 198b349, commit 22bec79, commit 5a924a6, commit 968e77a (17 Aug 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 3ebdef2, 17 Sep 2018)

pack-objects: reuse on-disk deltas for thin "have" objects

When we serve a fetch, we pass the "wants" and "haves" from the fetch negotiation to pack-objects. That tells us not only which objects we need to send, but we also use the boundary commits as "preferred bases": their trees and blobs are candidates for delta bases, both for reusing on-disk deltas and for finding new ones.

However, this misses some opportunities. Modulo some special cases like shallow or partial clones, we know that every object reachable from the "haves" could be a preferred base.
We don't use all of them for two reasons:

  1. It's expensive to traverse the whole history and enumerate all of the objects the other side has.
  2. The delta search is expensive, so we want to keep the number of candidate bases sane. The boundary commits are the most likely to work.

When we have reachability bitmaps, though, reason 1 no longer applies.
We can efficiently compute the set of reachable objects on the other side (and in fact already did so as part of the bitmap set-difference to get the list of interesting objects). And using this set conveniently covers the shallow and partial cases, since we have to disable the use of bitmaps for those anyway.

The second reason argues against using these bases in the search for new deltas.

But there's one case where we can use this information for free: when we have an existing on-disk delta that we're considering reusing, we can do so if we know the other side has the base object. This in fact saves time during the delta search, because it's one less delta we have to compute.

And that's exactly what this patch does: when we're considering whether to reuse an on-disk delta, if bitmaps tell us the other side has the object (and we're making a thin-pack), then we reuse it.

Here are the results on p5311 using linux.git, which simulates a client fetching after N days since their last fetch:

 Test                         origin              HEAD
 --------------------------------------------------------------------------
 5311.3: server   (1 days)    0.27(0.27+0.04)     0.12(0.09+0.03) -55.6%
 5311.4: size     (1 days)               0.9M              237.0K -73.7%
 5311.5: client   (1 days)    0.04(0.05+0.00)     0.10(0.10+0.00) +150.0%
 5311.7: server   (2 days)    0.34(0.42+0.04)     0.13(0.10+0.03) -61.8%
 5311.8: size     (2 days)               1.5M              347.7K -76.5%
 5311.9: client   (2 days)    0.07(0.08+0.00)     0.16(0.15+0.01) +128.6%
 5311.11: server   (4 days)   0.56(0.77+0.08)     0.13(0.10+0.02) -76.8%
 5311.12: size     (4 days)              2.8M              566.6K -79.8%
 5311.13: client   (4 days)   0.13(0.15+0.00)     0.34(0.31+0.02) +161.5%
 5311.15: server   (8 days)   0.97(1.39+0.11)     0.30(0.25+0.05) -69.1%
 5311.16: size     (8 days)              4.3M                1.0M -76.0%
 5311.17: client   (8 days)   0.20(0.22+0.01)     0.53(0.52+0.01) +165.0%
 5311.19: server  (16 days)   1.52(2.51+0.12)     0.30(0.26+0.03) -80.3%
 5311.20: size    (16 days)              8.0M                2.0M -74.5%
 5311.21: client  (16 days)   0.40(0.47+0.03)     1.01(0.98+0.04) +152.5%
 5311.23: server  (32 days)   2.40(4.44+0.20)     0.31(0.26+0.04) -87.1%
 5311.24: size    (32 days)             14.1M                4.1M -70.9%
 5311.25: client  (32 days)   0.70(0.90+0.03)     1.81(1.75+0.06) +158.6%
 5311.27: server  (64 days)   11.76(26.57+0.29)   0.55(0.50+0.08) -95.3%
 5311.28: size    (64 days)             89.4M               47.4M -47.0%
 5311.29: client  (64 days)   5.71(9.31+0.27)     15.20(15.20+0.32) +166.2%
 5311.31: server (128 days)   16.15(36.87+0.40)   0.91(0.82+0.14) -94.4%
 5311.32: size   (128 days)            134.8M              100.4M -25.5%
 5311.33: client (128 days)   9.42(16.86+0.49)    25.34(25.80+0.46) +169.0%

In all cases we save CPU time on the server (sometimes significant) and the resulting pack is smaller.
We do spend more CPU time on the client side, because it has to reconstruct more deltas.

But that's the right tradeoff to make, since clients tend to outnumber servers.
It just means the thin pack mechanism is doing its job.

From the user's perspective, the end-to-end time of the operation will generally be faster. E.g., in the 128-day case, we saved 15s on the server at a cost of 16s on the client.
Since the resulting pack is 34MB smaller, this is a net win if the network speed is less than 270Mbit/s. And that's actually the worst case.
The 64-day case saves just over 11s at a cost of just under 11s. So it's a slight win at any network speed, and the 40MB saved is pure bonus.
That trend continues for the smaller fetches.
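
Note that this server-side optimization only kicks in when the serving repository has reachability bitmaps. On a repository you control, a minimal sketch for enabling them (these are long-standing repack options) would be:

  git config repack.writeBitmaps true      # keep writing the bitmap on future repacks
  git repack -a -d --write-bitmap-index    # consolidate into a single pack and write the .bitmap file

(On a hosted service like Bitbucket, repacking happens on the server side and is not something the client can configure.)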


With Git 2.22 (Q2 2019), another option will help on the repacking side: the pathname hash-cache is now created by default, to avoid making crappy deltas when repacking.

See commit 36eba03 (14 Mar 2019) by Eric Wong (ele828).
See commit d431660, commit 90ca149 (15 Mar 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 2bfb182, 13 May 2019)

pack-objects: default to writing bitmap hash-cache

Enabling pack.writebitmaphashcache should always be a performance win.
It costs only 4 bytes per object on disk, and the timings in ae4f07f (pack-bitmap: implement optional name_hash cache, 2013-12-21, Git v2.0.0-rc0) show it improving fetch and partial-bitmap clone times by 40-50%.

The only reason we didn't enable it by default at the time is that early versions of JGit's bitmap reader complained about the presence of optional header bits it didn't understand.
But that was changed in JGit's d2fa3987a (Use bitcheck to check for presence of OPT_FULL option, 2013-10-30), which made it into JGit v3.5.0 in late 2014.

So let's turn this option on by default.
It's backwards-compatible with all versions of Git, and if you are also using JGit on the same repository, you'd only run into problems using a version that's almost 5 years old.

We'll drop the manual setting from all of our test scripts, including perf tests. This isn't strictly necessary, but it has two advantages:

  1. If the hash-cache ever stops being enabled by default, our perf regression tests will notice.

  2. We can use the modified perf tests to show off the behavior of an otherwise unconfigured repo, as shown below.

These are the results of a few perf tests against linux.git that showed interesting changes.
You can see the expected speedup in 5310.4, which was noted in ae4f07f (Dec. 2013, Git v2.0.0-rc0).
Curiously, 5310.8 did not improve (and actually got slower), despite seeing the opposite in ae4f07f. I don't have an explanation for that.

The tests from p5311 did not exist back then, but do show improvements (a smaller pack due to better deltas, which we found in less time).

  Test                                    HEAD^                HEAD
  -------------------------------------------------------------------------------------
  5310.4: simulated fetch                 7.39(22.70+0.25)     5.64(11.43+0.22) -23.7%
  5310.8: clone (partial bitmap)          18.45(24.83+1.19)    19.94(28.40+1.36) +8.1%
  5311.31: server (128 days)              0.41(1.13+0.05)      0.34(0.72+0.02) -17.1%
  5311.32: size   (128 days)                         7.4M                 7.0M -4.8%
  5311.33: client (128 days)              1.33(1.49+0.06)      1.29(1.37+0.12) -3.0%
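
If you are on a Git older than 2.22, or the default has been overridden, the same behaviour can be opted into explicitly; it only takes effect the next time the bitmap is written:

  git config pack.writeBitmapHashCache true
  git repack -a -d --write-bitmap-index    # rewrite the pack so the bitmap includes the name-hash cache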

Git 2.23 (Q3 2019) makes sure the generation of pack bitmaps is disabled when .keep files exist, as these are mutually exclusive features.

See commit 7328482 (29 Jun 2019) by Eric Wong (ele828).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d60dc1a, 19 Jul 2019)

This fixes 36eba03 ("repack: enable bitmaps by default on bare repos", March 2019, Git v2.22.0-rc0)

repack: disable bitmaps-by-default if .keep files exist

Bitmaps aren't useful with multiple packs, and users with .keep files ended up with redundant packs when bitmaps got enabled by default in bare repos.

So detect when .keep files exist and stop enabling bitmaps by default in that case.

Wasteful (but otherwise harmless) race conditions with .keep files documented by Jeff King still apply and there's a chance we'd still end up with redundant data on the FS, as discussed here.

v2: avoid subshell in test case, be multi-index aware
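
In practice, if a bare repository that should have bitmaps ends up without a .bitmap file, a quick check (my own sketch, not part of the patch) is whether leftover .keep files are suppressing them:

  ls objects/pack/*.keep                   # bare-repo layout; use .git/objects/pack/ otherwise
  rm objects/pack/pack-*.keep              # only if you are sure nothing still needs those packs kept
  git repack -a -d --write-bitmap-index    # repack into one pack so the bitmap can be written again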


However, the same Git 2.23 (Q3 2019) squelches unneeded and misleading warnings from "repack" when the command attempts to generate pack bitmaps without being explicitly asked to by the user.

See commit 7ff024e, commit 2557501, commit cc2649a (31 Jul 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 51cf315, 01 Aug 2019)

repack: simplify handling of auto-bitmaps and .keep files

Commit 7328482 (repack: disable bitmaps-by-default if .keep files exist, 2019-06-29, Git v2.23.0-rc0) taught repack to prefer disabling bitmaps to duplicating objects (unless bitmaps were asked for explicitly).

But there's an easier way to do this: if we keep passing the --honor-pack-keep flag to pack-objects when auto-enabling bitmaps, then pack-objects already makes the same decision (it will disable bitmaps rather than duplicate).
Better still, pack-objects can actually decide to do so based not just on the presence of a .keep file, but on whether that .keep file actually impacts the new pack we're making (so if we're racing with a push or fetch, for example, their temporary .keep file will not block us from generating bitmaps if they haven't yet updated their refs).

And because repack uses the --write-bitmap-index-quiet flag, we don't have to worry about pack-objects generating confusing warnings when it does see a .keep file.
We can confirm this by tweaking the .keep test to check repack's stderr.
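
You can see the quieter behaviour directly (my own check, mirroring what the updated test does): run an automatic repack in a repository whose pack directory contains a *.keep file and inspect stderr:

  git repack -a -d 2>repack.err
  grep -i bitmap repack.err || echo "no bitmap warning from repack"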


With Git 2.31 (Q1 2021), there are various improvements to the codepath that writes out pack bitmaps.

See commit f077b0a, commit 45f4eeb, commit 341fa34, commit 089f751, commit 928e3f4, commit 1467b95, commit 597b2c3, commit ed03a58, commit 6dc5ef7 (08 Dec 2020) by Derrick Stolee (derrickstolee).
See commit 8357805, commit 98c31f3, commit c6b0c39, commit 3b1ca60 (08 Dec 2020) by Taylor Blau (ttaylorr).
See commit 449fa5e, commit 010e5ea, commit 4a9c581, commit ccae08e, commit 3ed6751, commit 2e2d141, commit d574bf4, commit 2978b00, commit c5cd749, commit ec6c7b4, commit ca51090 (08 Dec 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit c256631, 06 Jan 2021)

pack-bitmap-write: relax unique revwalk condition

Signed-off-by: Derrick Stolee
Helped-by: Johannes Schindelin
Signed-off-by: Taylor Blau

The previous commits improved the bitmap computation process for very long, linear histories with many refs by removing quadratic growth in how many objects were walked. The strategy of computing "intermediate commits" using bitmasks for which refs can reach those commits partitioned the poset of reachable objects so each part could be walked exactly once. This was effective for linear histories.

However, there was a (significant) drawback: wide histories with many refs had an explosion of memory costs to compute the commit bitmasks during the exploration that discovers these intermediate commits. Since these wide histories are unlikely to repeat walking objects, the cost of walking objects multiple times was low before. But now, the commit walk before computing bitmaps is incredibly expensive.

In an effort to discover a happy medium, this change reduces the walk for intermediate commits to only the first-parent history. This focuses the walk on how the histories converge, which still has significant reduction in repeat object walks. It is still possible to create quadratic behavior in this version, but it is probably less likely in realistic data shapes.

Here is some data taken on a fresh clone of the kernel:

|   runtime (sec)    |   peak heap (GB)   |
|                    |                    |
|   from  |   with   |   from  |   with   |
| scratch | existing | scratch | existing |
+---------+----------+---------+----------+
|  64.044 |   83.241 |   2.088 |    2.194 |
|  45.049 |   37.624 |   2.267 |    2.334 |
|  88.478 |   53.218 |   2.157 |    2.224 |
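
For a rough comparison on your own hardware, you can simply time a bitmap-writing repack on a fresh clone of the kernel (a sketch, not the same harness used for the numbers above):

  time git repack -a -d --write-bitmap-index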

With Git 2.36 (Q2 2022), there are a couple of optimizations added to git fetch.

See commit b18aaaa, commit 6fd1cc8 (10 Feb 2022) by Patrick Steinhardt (pks-t).
(Merged by Junio C Hamano -- gitster -- in commit 68fd3b3, 23 Feb 2022)

fetch-pack: use commit-graph when computing cutoff

Signed-off-by: Patrick Steinhardt

During packfile negotiation, we iterate over all refs announced by the remote side to check whether their IDs refer to commits already known to us.
If a commit is known to us already, then its date is a potential cutoff point for commits we have in common with the remote side.

There are potentially a lot of commits announced by the remote, depending on how many refs there are in the remote repository, and for every one of them we need to search for it in our object database and, if found, parse the corresponding object to find out whether it is a candidate for the cutoff date.
This can be sped up by trying to look up commits via the commit-graph first, which is a lot more efficient.

Benchmarks in a repository with about 2.1 million refs and an up-to-date commit-graph show an almost 20% speedup when mirror-fetching:

Benchmark 1: git fetch +refs/*:refs/* (v2.35.0)
  Time (mean ± σ):     115.587 s ±  2.009 s    [User: 109.874 s, System: 11.305 s]
  Range (min … max):   113.584 s … 118.820 s    5 runs

Benchmark 2: git fetch +refs/*:refs/* (HEAD)
  Time (mean ± σ):     96.859 s ±  0.624 s    [User: 91.948 s, System: 10.980 s]
  Range (min … max):   96.180 s … 97.875 s    5 runs

Summary
  'git fetch +refs/*:refs/* (HEAD)' ran
    1.19 ± 0.02 times faster than 'git fetch +refs/*:refs/* (v2.35.0)'
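
To get this benefit on the fetching side, the local repository needs an up-to-date commit-graph. These are standard commands/configuration (a sketch):

  git commit-graph write --reachable        # build or refresh the commit-graph file
  git config fetch.writeCommitGraph true    # keep it up to date after each fetch (Git 2.24+)
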
Cubical answered 22/9, 2018 at 1:41 Comment(1)
See also github.com/git/git/commit/…Cubical
