Pre-load git repository?

Instead of pulling down a large directory of files from a remote repository, is there a way to "pre-load" my local repository with files I already have? The same files that are on the remote already exist on my machine; they just aren't in the local repo.

Here's my situation:

I've got a remote web site that has a large (many gigs) directory of resources (images, PDFs, SWFs, FLVs). I've set up a Git repository for this remote site and cloned it locally, using the .gitignore file to exclude the big resource directory from the repo.

I'd like to make the big resources directory part of the remote repo now, but that will drastically increase the size of the repo, and my next local pull will mean a really long wait/download. So I'm basically hoping there is a way of telling Git, "I'm about to pull a repo that has suddenly become much bigger, but I've already got most of what's making it so big." Or would this go the other way: I add the files to my local repo first, and then the repositories somehow work out that they've got the same files, so no transfer is necessary?

This would also come in handy when new developers are brought onto a large project and the bulk of it could be provided on DVDs instead of them having to clone/download a huge repo.

Align answered 3/1, 2014 at 18:33 Comment(3)
Even if you could load the files in their current state, Git would still have to get the history for each of them. From my understanding that's the crux of the problem with binary files in source control - not that they are large in and of themselves, but that there is no efficient way to track changes.Ecclesia
If you already have the repo and want to add large files on both ends: if you make the exact same commit to each repo, their hashes should be the same and when you try to fetch it should consider the commits to be the same. I think.Ecclesia
I just tried to add the same file in separate clones of a repo and the commits ended up having different hashes and conflicting, so my previous comment won't helpEcclesia

I suggest you be very careful about making a habit of adding gigabytes of binaries to your Git repository without first looking into options like git-annex.

Now, just having the files themselves locally isn't enough for Git to use them. You could use git hash-object to manually add the big binaries to Git's object database on either side of the great network divide and create a commit containing the exact same files on the other side. But when pushing/fetching such a commit, Git isn't smart enough to figure out that those objects already exist on the other side: because the commit that needs to be transmitted doesn't exist there, the big blobs will be included in the resulting packfile that's sent over the wire. To avoid this you'd have to manually copy all the commit and tree objects while omitting the big blobs. Doable, but probably more trouble than it's worth.
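
For illustration, a minimal sketch of that manual route using Git's plumbing commands (the file name, path and mode are hypothetical, and you would still have to build matching commits on both sides):

git hash-object -w resources/big-video.flv                            # write the blob into the object database, prints its SHA-1
git update-index --add --cacheinfo 100644 <blob-sha1> resources/big-video.flv   # stage that blob at a path
git write-tree                                                        # build a tree from the index, prints its SHA-1
git commit-tree -p HEAD -m "Add big resources" <tree-sha1>            # create the commit object; point a branch at the printed SHA-1 (e.g. with git update-ref)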

A more realistic approach is to take the hit of the network transfer once and be smart about future transfers. You can keep a local mirror that people clone from. If that's still not fast enough, it's a sign that your repository is simply too big.
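
For example, a rough sketch (the mirror path is made up; the URL is the same placeholder used below):

git clone --mirror http://example.com/some/git /srv/mirrors/some.git   # run once, over the slow link
git clone /srv/mirrors/some.git dirname                                # everyone else clones from the local mirror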

You can also clone the repository with git clone --reference <ref> <url>, where <ref> is a local directory containing the repository you're cloning. This will reuse all objects from the reference repository, making the clone extremely fast. However, as noted in the git clone manpage, the new clone will refer directly to the objects in the old clone, so if the old clone is deleted you're in trouble. To actually copy the objects you can run git repack -a after cloning.

git clone --reference /some/old/clone http://example.com/some/git dirname
cd dirname
git repack -a
rm .git/objects/info/alternates

The last command deletes the link to the reference repository so Git won't try to look for objects there in the future.

To distribute a Git repository on, e.g., a DVD or similar media, look into git bundle. See e.g. How to git bundle a complete repo.
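
A rough sketch of that workflow (file and directory names are made up):

git bundle create repo.bundle --all                       # on the machine that already has the repo; copy repo.bundle to the DVD
git clone repo.bundle dirname                              # on the new developer's machine
cd dirname
git remote set-url origin http://example.com/some/git      # point origin back at the real remote for future fetches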

Constance answered 3/1, 2014 at 19:45 Comment(1)
Thank you very much. My guess was that it would be more trouble than editing .gitignore, typing a couple commands and waiting 10 hours, but you've confirmed it for me with your very complete answer. Thanks for cluing me in to git-annex, which looks useful for many other things as well.Align

I would only put the source code in Git. For all of the pictures, PDFs, and images, I would create a separate storage server and link to that storage from my application.

Simonne answered 3/1, 2014 at 18:39 Comment(1)
With a large, complex web site I need EVERYTHING in the repository, both for emulating the operation of the site on my local machine and for version control (tracking changes). Often php code, CSS and images are all included in the deployment of a new feature or content change. I want the large resources directory tracked, I just don't want to have to download it since I've already got the files.Align

As of Q4 2017/Q1 2018 (four years after the OP's question), the only Git-related way to clone a huge Git repo is through GVFS (Git Virtual File System), announced in February 2017.

As tweeted, for a 270GB repo:

“The Windows codebase has over 3.5M files. With GVFS (Git Virtual File System), cloning now takes a few minutes instead of 12+ hours.”

See github.com/Microsoft/GVFS.
GVFS is based on a fork of Git: github.com/Microsoft/git.
It also relies on a protocol whose specifications are described here.

This is not yet supported by EGit, or even by regular Git, but the integration of such a mechanism has begun with Git 2.16 (Q1 2018) and the implementation of a narrow/partial clone, in which the object-walking machinery can be told to "filter" some objects out of the enumeration.
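
(For reference: in later Git releases this surfaced for end users as the --filter option of git clone, provided the server supports partial clone. A minimal sketch, with a placeholder URL:)

git clone --filter=blob:none http://example.com/some/git dirname       # no blobs up front; they are fetched lazily on checkout
git clone --filter=blob:limit=1m http://example.com/some/git dirname   # omit only blobs larger than 1 MiB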

This is the result of a discussion around partial cloning, documented here (Nov 2017), even though the issue was highlighted back in May 2013:

When working with large repositories, having to fetch all objects in the region of history the user is interested in is wasteful.
This is especially true in two cases:

  1. Using sparse checkout: objects outside the directory the user is looking at are not likely to ever be needed. Later, the user should be able to fetch objects outside that directory if they turn out to be needed (e.g. if the sparse checkout expands). This is especially useful when combined with a virtual filesystem that determines the sparse checkout pattern to use automatically (https://blogs.msdn.microsoft.com/devops/2017/02/03/announcing-gvfs-git-virtual-file-system/).
  2. The repository contains large binary files: historical versions of large files are not needed in order to build the latest version of the code.
     Using a shallow clone loses the ability to use "git log" to understand the project's history during development.
     Using Git LFS requires anticipating this problem and deciding in advance which files to offload to LFS.
     Having native support in Git for omitting large blobs avoids this dilemma.

Both Microsoft and Google internally use patches to support partial clone and have published their patches. This issue tracks incorporating the functionality into Git upstream.

As a result:

See commit f4371a8, commit 4875c97 (05 Dec 2017), and commit 9535ce7, commit caf3827, commit 25ec7bc, commit c3a9ad3, commit 314f354, commit 578d81d (21 Nov 2017) by Jeff Hostetler (jeffhostetler).
See commit 1dde5fa (05 Dec 2017) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit 61061ab, 27 Dec 2017)

rev-list/pack-objects: add list-objects filtering support

Teach rev-list to use the filtering provided by the traverse_commit_list_filtered() interface to omit unwanted objects from the result.

In the future, we will introduce a "partial clone" mechanism wherein an object in a repo, obtained from a remote, may reference a missing object that can be dynamically fetched from that remote once needed.
This "partial clone" mechanism will have a way, sometimes slow, of determining if a missing link is one of the links expected to be produced by this mechanism.

This patch introduces handling of missing objects to help debugging and development of the "partial clone" mechanism, and once the mechanism is implemented, for a power user to perform operations that are missing-object aware without incurring the cost of checking if a missing link is expected.
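
(As a small illustration, the filtering that rev-list learned here is exposed through the --filter option and requires --objects; the size threshold below is only an example:)

git rev-list --objects --filter=blob:none HEAD       # enumerate reachable objects, omitting every blob
git rev-list --objects --filter=blob:limit=1m HEAD   # omit only blobs larger than 1 MiB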


With Git 2.29 (Q4 2020), a new helper function has_object() has been introduced to make it easier to mark object existence checks that do and don't want to trigger lazy fetches, and a few such checks are converted using it.

See commit 9eb86f4, commit ee47243, commit 3318238, commit 1d8d9cb (05 Aug 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d1a8a89, 13 Aug 2020)

pack-objects: no fetch when allow-{any,promisor}

Signed-off-by: Jonathan Tan

The options --missing=allow-{any,promisor} were introduced in caf3827e2f ("rev-list: add list-objects filtering support", 2017-11-22, Git v2.16.0-rc0 -- merge listed in batch #11) with the following note in the commit message:

This patch introduces handling of missing objects to help
debugging and development of the "partial clone" mechanism,
and once the mechanism is implemented, for a power user to
perform operations that are missing-object aware without
incurring the cost of checking if a missing link is expected.  

The idea that these options are missing-object aware (and thus do not need to lazily fetch objects, unlike unaware commands that assume that all objects are present) is assumed in later commits such as 07ef3c6604 ("fetch test: use more robust test for filtered objects", 2020-01-15, Git v2.26.0-rc0 -- merge listed in batch #5).

However, the current implementations of these options use has_object_file(), which indeed lazily fetches missing objects.
Teach these implementations not to do so.
Also, update the documentation of these options to be clearer.

git pack-objects now includes in its man page:

The form --missing=error requests that pack-objects stop with an error if a missing object is encountered. If the repository is a partial clone, an attempt to fetch missing objects will be made before declaring them missing. This is the default action.
The form --missing=allow-any will allow object traversal to continue if a missing object is encountered. No fetch of a missing object will occur. Missing objects will silently be omitted from the results.
The form --missing=allow-promisor is like allow-any, but will only allow object traversal to continue for EXPECTED promisor missing objects. No fetch of a missing object will occur. An unexpected missing object will raise an error.
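
(A hedged example of what those actions look like from the command line, using git rev-list, which accepts the same --missing actions plus --missing=print:)

git rev-list --objects --missing=allow-any HEAD   # silently skip missing objects, never lazily fetch them
git rev-list --objects --missing=print HEAD       # print the IDs of missing objects (prefixed with '?') instead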

Sanies answered 28/12, 2017 at 17:47 Comment(0)
