How to git fetch efficiently from a shallow clone

We use git to distribute an operating system and keep it up to date. We can't distribute the full repository since it's too large (>2GB), so we have been using shallow clones (~300M). However, fetching from a shallow clone has recently started to pull the entire >2GB repository. This is an untenable waste of bandwidth for deployments.

The git documentation says you cannot fetch from a shallow repository, though that's strictly not true. Are there any workarounds to make a git clone --depth 1 able to fetch just what's changed from it? Or some other strategy to keep the distribution size as small as possible whilst having all the bits git needs to do an update?

I tried cloning with --depth 20 to see whether it would upgrade more efficiently, but that didn't work. I also looked into http://git-scm.com/docs/git-bundle, but that seems to create huge bundles.

Lanitalank answered 14/10, 2013 at 3:0 Comment(4)
"but that seems to create huge bundles": only for the first one. After that, you can create incremental bundles. - Olodort
My initial distribution cannot be huge... - Lanitalank
You will have to try fetching from your shallow clone again with Git 1.9/2.0 (Q1 2014): those operations are now much more efficient. See my answer below. - Olodort
Git 2.5 (Q2 2015) supports a single fetch commit! I have edited my answer below, now referencing "Pull a specific commit from a remote git repository". - Olodort

--depth is a git fetch option. I see the doc doesn't really highlight that git clone does a fetch.

When you fetch, the two repos swap info on who has what by starting from the remote's heads and searching backward for the most recent shared commit in the fetched refs' histories, then filling in all the missing objects to complete just the new commits between the most recent shared commits and the newly fetched ones.

A --depth=1 fetch just gets the branch tips and no prior history. Further fetches of those histories will fetch everything new by the above procedure, but if the previously-fetched commits aren't in the newly fetched history, fetch will retrieve all of it -- unless you limit the fetch with --depth.

Your client did a depth=1 fetch from one repo and switched urls to a different repo. At least one long ancestry path in this new repo's refs apparently shares no commits with anything currently in your repo. That might be worth investigating, but either way unless there's some particular reason, your clients can just do every fetch --depth=1.
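
To make the last point concrete, here is a minimal sketch of a client that clones with --depth 1 and then keeps every later fetch at --depth 1 as well. It runs entirely against a throwaway local repository; the names upstream, client, main and the add_commit helper are made up for the demo and are not part of the original setup. The file:// URL is used because a plain local path would make git ignore --depth.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream repository with three commits on "main".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"

add_commit() {  # demo-only helper: add one commit to upstream
  echo "$1" > upstream/file.txt
  git -C upstream add file.txt
  git -C upstream commit -q -m "commit $1"
}
add_commit 1; add_commit 2; add_commit 3

# Initial distribution: branch tip only, no prior history.
git clone -q --depth 1 "file://$work/upstream" client

# Upstream moves on by two commits.
add_commit 4; add_commit 5

# Update: limit the fetch with --depth again, then move to the new tip.
git -C client fetch -q --depth 1 origin
git -C client reset -q --hard origin/main

commits_in_client=$(git -C client rev-list --count HEAD)
tip_subject=$(git -C client log -1 --format=%s)
echo "history length: $commits_in_client, tip: $tip_subject"
```

The client ends up on the newest upstream commit while its visible history stays one commit deep, so the transfer is bounded by the size of the new tip rather than by the length of the history.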

Odeen answered 16/10, 2013 at 3:49 Comment(9)
As you can see in my test, I reset hard to a26424 which is in the remote github.com/Webconverger/webc/commits/master. So I don't understand why it just doesn't fetch everything new. How can I compare remote refs? git ls-remote only shows tags/branches ... - Lanitalank
You switched repos. You have ten branches and seventeen tags in this new repo, and at least one of them references a long ancestry having no commits in common with any history presently in your repo. - Odeen
So.. IIUC, I should prune the branches/tags on github.com/webconverger/webc (the new repo), to ensure everything is in common with say "a26424"? - Lanitalank
Or fetch only the refs you want (to set defaults, see the remote.<name>.fetch entry (https://www.kernel.org/pub/software/scm/git/docs/git-fetch.html#_named_remote_in_configuration_file) in fetch's discussion of how to configure remotes). - Odeen
IIUC --depth 1 is the way to go, though we didn't implement it that way since my colleague discovered a bug, which is now fixed in github.com/git/git/commit/… So we are waiting to hear back from github whether that's deployed, and then we will be using it. - Lanitalank
That might be easiest. With an unrestricted refspec you'll still be fetching 27 commits the first time. Have you checked what refs the old and new repos have in common, or rather don't? - Odeen
I haven't actually figured out how to check common or excluded refs easily. Any tips? - Lanitalank
git ls-remote will tell you all the remote's refs, git branch -a and git tag will tell you all the ones you have. - Odeen
git fetch -v -v -v I've found to be very useful btw. - Lanitalank
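
One possible way to automate the comparison suggested in the comments above is sketched below. It lists the refs that the remote advertises but the local clone does not have. Everything here (the upstream and client repositories, the feature branch, the remote_refs and local_refs files) is a hypothetical demo setup, and it uses git for-each-ref in place of git branch -a and git tag only because its output is easier to post-process.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with two branches: main and feature.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"
echo 1 > upstream/file.txt
git -C upstream add file.txt
git -C upstream commit -q -m "commit on main"
git -C upstream checkout -q -b feature
echo 2 > upstream/file.txt
git -C upstream commit -q -a -m "commit on feature"
git -C upstream checkout -q main

# --depth 1 implies --single-branch, so only main is fetched.
git clone -q --depth 1 "file://$work/upstream" client

# Ref names the remote advertises (peeled tag entries filtered out).
git -C client ls-remote --heads --tags origin |
  cut -f2 | grep -v '\^{}$' | sort > "$work/remote_refs"

# Ref names we have, with remote-tracking names mapped back to the remote's names.
git -C client for-each-ref --format='%(refname)' refs/remotes/origin refs/tags |
  sed 's#^refs/remotes/origin/#refs/heads/#' | sort > "$work/local_refs"

# Lines present only in the remote list.
missing=$(comm -13 "$work/local_refs" "$work/remote_refs")
echo "refs on the remote that are missing locally:"
echo "$missing"
```

Any ref that shows up in this list and points at a long, unrelated history is a candidate for the kind of oversized fetch discussed in this thread.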

I just ran git clone github.com:torvalds/linux and it took so long that I cancelled it with CTRL+C.

Then I ran git clone github.com:torvalds/linux --depth 1 and it cloned quite fast. And I have only one commit in git log.

So clone --depth 1 should work. If you need to update existing repository, you should use git fetch origin remoteBranch:localBranch --depth 1. It works too, it fetches only one commit.

Summing up:

Initial clone:

git clone git_url --depth 1

Code update:

git fetch origin remoteBranch:localBranch --depth 1
Linette answered 22/10, 2013 at 21:7 Comment(3)
I'd like to add the depth thing to the config, so I can do git fetch origin without needing to remember the depth filter. Is that possible? - Ichor
Yes, you may want to create an alias. Here is the manual on aliasing in git: git-scm.com/book/en/v2/Git-Basics-Git-Aliases - Linette
Only this solution worked for me (--unshallow doesn't work). Key was branch:branch - Franklynfrankness
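
As a rough sketch of the alias idea from the comments above: the alias name sfetch is made up for this example, and the repositories are throwaway local ones. The alias is stored in the client repository's own config here; adding --global to the git config call would make it available in every repository.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with two commits on "main".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"

add_commit() {  # demo-only helper: add one commit to upstream
  echo "$1" > upstream/file.txt
  git -C upstream add file.txt
  git -C upstream commit -q -m "commit $1"
}
add_commit 1; add_commit 2

git clone -q --depth 1 "file://$work/upstream" client

# "sfetch" is a hypothetical alias name; it expands to "fetch --depth 1".
git -C client config alias.sfetch 'fetch --depth 1'

add_commit 3

# Extra arguments are appended to the alias expansion.
git -C client sfetch -q origin

depth_after=$(git -C client rev-list --count origin/main)
echo "commits reachable from origin/main: $depth_after"
```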

If you can select a specific branch, it can be even faster. Here's an example using Spark master branch and latest tag:

Initial clone

git clone git@github.com:apache/spark.git --branch master --single-branch --depth 1

Update to specific tag

git fetch --depth 1 origin tags/v1.6.0

It becomes very fast to switch tags/branch this way.
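
The same pattern can be tried offline with a small local repository. This is only a sketch under assumed names (upstream, client, main, and a lightweight tag called v1.6.0 to mirror the example above). It uses the tag <name> form of the fetch refspec so that the tag is also created locally, which makes it possible to check it out by name afterwards.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream: three commits, with tag v1.6.0 on the second one.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"

add_commit() {  # demo-only helper: add one commit to upstream
  echo "$1" > upstream/file.txt
  git -C upstream add file.txt
  git -C upstream commit -q -m "commit $1"
}
add_commit 1; add_commit 2
git -C upstream tag v1.6.0
add_commit 3

# Initial clone: one branch, one commit.
git clone -q --depth 1 --branch main --single-branch "file://$work/upstream" client

# Update to a specific tag, again one commit deep.
git -C client fetch -q --depth 1 origin tag v1.6.0
git -C client checkout -q v1.6.0

tag_subject=$(git -C client log -1 --format=%s)
tag_depth=$(git -C client rev-list --count HEAD)
echo "checked out: $tag_subject (history length $tag_depth)"
```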

Wes answered 7/1, 2016 at 14:35 Comment(0)

Note that Git 1.9/2.0 (Q1 2014) could be more efficient in fetching for a shallow clone.
See commit 82fba2b, from Nguyễn Thái Ngọc Duy (pclouds):

Now that git supports data transfer from or to a shallow clone, these limitations are not true anymore.

All the details are in "shallow.c: the 8 steps to select new commits for .git/shallow".

You can see the consequence in commits like 0d7d285, f2c681c, and c29a7b8 which support clone, send-pack /receive-pack with/from shallow clones.
smart-http now supports shallow fetch/clone too.
You can even clone from a shallow repo.

Update 2015: git 2.5+ (Q2 2015) will even allow for a single commit fetch! See "Pull a specific commit from a remote git repository".
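
A minimal sketch of such a single commit fetch, run against a throwaway local repository, could look like this. The repository names are made up, and the server-side setting uploadpack.allowReachableSHA1InWant is enabled explicitly because, as far as I know, a server may otherwise refuse a request for a commit that is not a branch or tag tip.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with three commits on "main".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"

add_commit() {  # demo-only helper: add one commit to upstream
  echo "$1" > upstream/file.txt
  git -C upstream add file.txt
  git -C upstream commit -q -m "commit $1"
}
add_commit 1; add_commit 2; add_commit 3

# Server side: allow clients to ask for reachable commits by SHA-1.
git -C upstream config uploadpack.allowReachableSHA1InWant true

# The commit we want is the middle one, which is not a ref tip.
sha=$(git -C upstream rev-parse HEAD~1)

# Client side: empty repository, fetch exactly that commit, one deep.
git init -q client
git -C client remote add origin "file://$work/upstream"
git -C client fetch -q --depth 1 origin "$sha"
git -C client checkout -q FETCH_HEAD

fetched_type=$(git -C client cat-file -t "$sha")
fetched_subject=$(git -C client log -1 --format=%s)
fetched_depth=$(git -C client rev-list --count HEAD)
echo "fetched $fetched_type: $fetched_subject (history length $fetched_depth)"
```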

Update 2016 (Oct.): git 2.11+ (Q4 2016) allows for fetching:


You also have git fetch --update-shallow, with Git v1.9, Q4 2013:

By default when fetching from a shallow repository, git fetch refuses refs that require updating .git/shallow.
This option updates .git/shallow and accepts such refs.

With Git 2.45 (Q2 2024), batch 6, make sure failure return from merge_bases_many() is properly caught.

See commit 25fd20e, commit 81a34cb (09 Mar 2024), and commit caaf1a2, commit 5317380, commit f87056c, commit 76e2a09, commit 8226e15, commit fb02c52, commit 896a0e1, commit 2d2da17, commit 24876eb, commit 207c40e, commit e67431d (28 Feb 2024) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 7745f92, 11 Mar 2024)

commit-reach(paint_down_to_common): prepare for handling shallow commits

Signed-off-by: Johannes Schindelin

When git fetch --update-shallow needs to test for commit ancestry, it can naturally run into a missing object (e.g. if it is a parent of a shallow commit).
For the purpose of --update-shallow, this needs to be treated as if the child commit did not even have that parent, i.e. the commit history needs to be clamped.

For all other scenarios, clamping the commit history is actually a bug, as it would hide repository corruption (for an analysis regarding shallow and partial clones, see the analysis further down).

Add a flag to optionally ask the function to ignore missing commits, as --update-shallow needs it to, while detecting missing objects as a repository corruption error by default.

This flag is needed, and cannot be replaced by is_repository_shallow() to indicate that situation, because that function would return 0 in the --update-shallow scenario: There is not actually a shallow file in that scenario, as demonstrated e.g. by t5537.10 ("add new shallow root with receive.updateshallow on") and t5538.4 ("add new shallow root with receive.updateshallow on").

Note: shallow commits' parents are set to NULL internally already, therefore there is no need to special-case shallow repositories here, as the merge-base logic will not try to access parent commits of shallow commits.

Likewise, partial clones aren't an issue either: If a commit is missing during the revision walk in the merge-base logic, it is fetched via promisor_remote_get_direct().
And not only the single missing commit object: Due to the way the "promised" objects are fetched (in fetch_objects() in promisor-remote.c, using fetch --filter=blob:none), there is no actual way to fetch a single commit object, as the remote side will pass that commit OID to pack-objects --revs [...] which in turn passes it to rev-list which interprets this as a commit range instead of a single object.
Therefore, in partial clones (unless they are shallow in addition), all commits reachable from a commit that is in the local object database are also present in that local database.

Olodort answered 19/1, 2014 at 13:24 Comment(0)

I don't know if it suits your setup, but what I do is keep a full clone of the repo in a separate directory. Then I make a shallow clone from the remote repository with a reference to the local one.

git clone --depth 1 --reference /path/to/local/clone [email protected]/group/repo.git 

That way only the differences with the reference repository and remote are actually fetched. To make it even quicker you can use the --shared option, but be sure to read about the restrictions in the git documentation (it can be dangerous).

I also found that in some circumstances, when the remote has changed a lot, the clone starts fetching too much data. In that case it is better to interrupt it and update the reference repo (which, strangely, takes much less bandwidth than it took in the first place), and then start the clone again.
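
For anyone who wants to try the reference approach without touching a real server, here is a small local sketch. The directory names upstream, full and shallow are made up; full plays the role of the full-depth local clone, and it must not itself be shallow. The alternates file printed at the end is how git records that the new clone borrows objects from the reference repository.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical upstream with three commits on "main".
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream config user.name "Demo"
git -C upstream config user.email "demo@example.com"

add_commit() {  # demo-only helper: add one commit to upstream
  echo "$1" > upstream/file.txt
  git -C upstream add file.txt
  git -C upstream commit -q -m "commit $1"
}
add_commit 1; add_commit 2; add_commit 3

# Full-depth local clone that will serve as the reference repository.
git clone -q "file://$work/upstream" full

# Shallow clone from the remote that borrows objects from the reference.
git clone -q --depth 1 --reference "$work/full" "file://$work/upstream" shallow

# The borrowed object store is recorded here.
alternates=$(cat shallow/.git/objects/info/alternates)
echo "objects borrowed from: $alternates"
```

Because the new clone depends on the reference repository's object store, deleting or pruning the reference later can break it, which is the same kind of caveat the documentation raises for --shared.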

Sly answered 22/10, 2013 at 19:35 Comment(4)
I tried your command with and without several options and I am getting things like fatal: reference repository '[email protected]/group/repo.git' is shallow. - Knorring
You cannot use a shallow repository/clone as a reference. It has to be a full depth clone. - Sly
Do you mean I can’t use --depth 1? - Knorring
Not for the reference repository. - Sly
