Partial clone with Git and Mercurial
Is it possible to clone only one branch (or from a given commit) in Git and Mercurial? I mean, I want to clone a central repo but since it's huge I'd like to only get part of it and still be able to contribute back my changes. Is it possible? Like, I only want from Tag 130 onwards or something like that?

If so, how?

Templeton answered 6/4, 2010 at 17:15 Comment(1)
See also Git 2.17 partial clone (or "narrow clone") https://mcmap.net/q/183655/-is-there-any-distributed-revision-control-system-that-supports-partial-checkout-cloneConah

In Git land you are talking about three different types of partial clones:

  • shallow clones: I want history from revision point X onward.

    Use git clone --depth <n> <url> for that, but remember that shallow clones are somewhat limited in how they can interact with other repositories; you would still be able to generate patches and send them via email.

  • partial clone by filepath: I want all revision history in some directory /path.

    Not possible in Git. With modern Git though you can have sparse checkout, i.e. you have whole history but you check out (have in working area) only subset of all files.

  • cloning only selected branch: I want to clone only one branch (or selected subset of branches).

    Possible, and

    before git 1.7.10 it was not simple: you had to do manually what clone does, i.e. git init [<directory>], then git remote add origin <url>, then edit .git/config to replace the * in remote.origin.fetch with the requested branch (probably 'master'), then git fetch.

    as of git 1.7.10, git clone offers the --single-branch option, which was added for exactly this purpose and is straightforward to use.

    Note however that because branches usually share most of their history, the gain from cloning only a subset of branches might be smaller than you think.
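A minimal sketch of the first and third options, using a throwaway local repository so the commands can be run as-is (the repo layout and the branch name "topic" are made up for illustration):

```shell
# Build a small repository with two commits and two branches to clone from.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q origin-repo
git -C origin-repo -c user.name=a -c user.email=a@example.com \
    commit -q --allow-empty -m "first"
git -C origin-repo -c user.name=a -c user.email=a@example.com \
    commit -q --allow-empty -m "second"
git -C origin-repo branch topic

# Shallow clone: only the most recent commit of history.
# (file:// forces the non-local transport, which --depth needs.)
git clone -q --depth 1 "file://$tmp/origin-repo" shallow
git -C shallow rev-list --count HEAD    # prints 1

# Single-branch clone (git >= 1.7.10): fetch refs for one branch only.
git clone -q --single-branch --branch topic "file://$tmp/origin-repo" one-branch
git -C one-branch rev-parse --abbrev-ref HEAD    # prints topic
```

The two options combine, too: `git clone --depth 1 --single-branch --branch topic <url>` gives a shallow clone of just that branch.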

You can also do a shallow clone of only selected subset of branches.

If you know how people will want to break things down by filepath (multiple projects in the same repository) you can use submodules (sort of like svn:externals) to pre-split the repo into separately cloneable portions.
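A sketch of that pre-splitting with submodules, using placeholder local repositories (libfoo, super, and the libs/foo path are all made up; protocol.file.allow is only needed because the example uses a local file:// URL):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# A component repository that should be cloneable on its own.
git init -q libfoo
git -C libfoo -c user.name=a -c user.email=a@example.com \
    commit -q --allow-empty -m "libfoo initial"

# The super-project references it as a submodule.
git init -q super
cd super
git -c protocol.file.allow=always \
    submodule --quiet add "file://$tmp/libfoo" libs/foo
git -c user.name=a -c user.email=a@example.com \
    commit -q -m "add libfoo submodule"

# The mapping is recorded in .gitmodules:
git config -f .gitmodules --get submodule.libs/foo.path    # prints libs/foo
```

Consumers of the super-project then run `git submodule update --init <path>` for only the components they need.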

Brackett answered 6/4, 2010 at 18:16 Comment(8)
So, if I clone branch "XX" it will get all the parent commits from "master", right? Or only the single commit I've done on that branch?Templeton
If you clone (fetch) only branch "XX", you would get all its commits, including those commits that branch "XX" has in common with "master" branch. In Git commits do not 'belong' to a branch.Palmar
Ok, then it's not a partial clone anyway, since you get all the parents and hence the entire repo (ok, the biggest part, which is on master)Templeton
What do you mean by Git vs modern Git? A partial clone is possible in the second but not the first?Threescore
@Chris: by saying that "sparse checkout" requires modern Git, I meant that this feature had only recently been added (at the time of posting), so it was available only in the newest versions.Palmar
Ah, looks like version 1.7.0 added partial checkouts. Thanks.Threescore
In 1.8.0 (or a bit earlier), making a single-branch clone is now much easier.Palmar
You might add to that list "partial clone" (or "narrow clone") with Git 2.17 (Q2 2018): https://mcmap.net/q/183655/-is-there-any-distributed-revision-control-system-that-supports-partial-checkout-cloneConah

In mercurial land you're talking about three different types of partial clones:

  • shallow clones: I want the history from revision point X onward: use the remotefilelog extension.
  • partial clones by filepath: I want all revision history in directory /path: use the experimental narrowhg extension. Or: I want only the files in directory /path in my working directory: use the experimental sparse extension (shipped since version 4.3, see hg help sparse).
  • partial clones by branch: I want all revision history on branch Y: use hg clone -r Y.

If you know how people will want to break things down by filepath (multiple projects in the same repo (shame on you)) you can use subrepositories (sort of like svn externals) to pre-split the repo into separately cloneable portions

Also, as to the "so huge I'd like to only get a part of it": you really only have to do that once. Just clone it while you have lunch, and then you have it forevermore. Subsequently you can pull and get deltas efficiently. And if you want another clone of it, just clone your first clone; where you got a clone from doesn't matter (and local clones take up no additional disk space, since they're hard links under the covers).

Polyethylene answered 6/4, 2010 at 17:55 Comment(7)
also, tags aren't the same as branches, unlike in some VCSs, so this comes under the first pointSanguinaria
There are the trimming history (mercurial.selenic.com/wiki/TrimmingHistory) and shallow clone (mercurial.selenic.com/wiki/ShallowClone) plugins for mercurial. I don't know how good they are, though.Analgesia
Both of those are rejected proposals without implementations.Polyethylene
* Shallow clones are now possible using 'remotefilelog': bitbucket.org/facebook/remotefilelog * Partial clones by filepath are possible (but still experimental), see comments.gmane.org/gmane.comp.version-control.mercurial.devel/…Clockmaker
Yeah, that's exciting. It relies on a centralized cache for operation, so it's not for all environments, but it's a very nice bit of work to come out of facebook.Polyethylene
Early 2017: partial clones by filepath (aka narrow clone) still isn't in mainline Mercurial but is possible with an extension from Google - bitbucket.org/Google/narrowhg . Similarly sparse checkout (aka narrow checkout) isn't in mainline Mercurial but is possible using the sparse.py Mercurial extension from Facebook - bitbucket.org/facebook/hg-experimental .Gains
2018: both narrow and sparse are now experimental extensions in Mercurial itself (no longer need to be downloaded).Clockmaker

The selected answer provides a good overview, but lacks a complete example.

Minimize your download and checkout footprint (a), (b):

git clone --no-checkout --depth 1 --single-branch --branch (name) (repo) (folder)
cd (folder)
git config core.sparseCheckout true
echo "target/path/1" >> .git/info/sparse-checkout
echo "target/path/2" >> .git/info/sparse-checkout
git checkout

Periodically optimize your local repository footprint (c) (optional, use with care):

git clean --dry-run # consider and tweak results then switch to --force
git gc
git repack -Ad
git prune

See also: How to handle big repositories with git

Trepidation answered 6/2, 2016 at 0:48 Comment(0)

This method creates an unversioned archive without subrepositories:

hg clone -U ssh://machine//directory/path/to/repo/project projecttemp

cd projecttemp

hg archive -r tip ../project-no-subrepos

The unversioned source code, without the subrepositories, is in the project-no-subrepos directory.

Conducive answered 29/2, 2012 at 19:21 Comment(0)

Regarding Git, it might be of historical significance that Linus Torvalds answered this question from a conceptual perspective back in 2007, in a talk that was recorded and is available online.

The question is whether it is possible to check out only some files out of a Git repository.

Tech Talk: Linus Torvalds on git t=43:10

To summarize, he said that one of the design decisions of Git that sets it apart from other source management systems (he cites BitKeeper and SVN) is that Git manages content, not files. One implication is that, for example, a diff of a subset of files between two revisions is computed by first taking the whole diff and then pruning it to only the files that were requested. Another is that you have to check out the whole history, in an all-or-nothing fashion. For this reason, he suggests splitting loosely related components among multiple repositories, and mentions a then-ongoing effort to implement a user interface for managing a repository structured as a super-project holding smaller repositories.

As far as I know, this fundamental design decision still applies today. The super-project thing probably became what are now submodules.

Mcmillen answered 4/1, 2014 at 12:58 Comment(1)
I know the post... I originally submitted it to slashdot :PTempleton

If, as in Brent Bradburn's answer, you do a repack in a Git partial clone, make sure to:

git clone --filter=blob:none --no-checkout https://github.com/me/myRepo
cd myRepo
git sparse-checkout init
# Add the expected pattern, to include just a subfolder without top files:
git sparse-checkout set /mySubFolder/

# populate working-tree with only the right files:
git read-tree -mu HEAD

Regarding the local optimization in a partial clone, as in:

git clean --dry-run # consider and tweak results then switch to --force
git gc
git repack -Ad
git prune

use Git 2.32 (Q2 2021): before 2.32, "git repack -A -d"(man) in a partial clone unnecessarily loosened objects in the promisor pack; this has been fixed.

See commit a643157 (21 Apr 2021) by Rafael Silva (raffs).
(Merged by Junio C Hamano -- gitster -- in commit a0f521b, 10 May 2021)

repack: avoid loosening promisor objects in partial clones

Reported-by: SZEDER Gábor
Helped-by: Jeff King
Helped-by: Jonathan Tan
Signed-off-by: Rafael Silva

When git repack -A -d(man) is run in a partial clone, pack-objects is invoked twice: once to repack all promisor objects, and once to repack all non-promisor objects.
The latter pack-objects invocation is with --exclude-promisor-objects and --unpack-unreachable, which loosens all objects unused during this invocation.
Unfortunately, this includes promisor objects.

Because the -d argument to git repack(man) subsequently deletes all loose objects that are also in packs, these just-loosened promisor objects will be immediately deleted.
However, this extra disk churn is unnecessary in the first place.
For example, in a newly-cloned partial repo that filters all blob objects (e.g. --filter=blob:none), repack ends up unpacking all trees and commits into the filesystem because every object, in this particular case, is a promisor object.
Depending on the repo size, this increases the disk usage considerably: in my copy of linux.git, the object directory peaked at 26GB of extra disk usage.

In order to avoid this extra disk churn, pass the names of the promisor packfiles as --keep-pack arguments to the second invocation of pack-objects.
This informs pack-objects that the promisor objects are already in a safe packfile and, therefore, do not need to be loosened.

For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the process which prevents us from inspecting the object directory.
Instead, let's teach pack-objects to count loosened objects and emit the count via trace2, thus allowing inspection of the debug events after the process is finished.
This new event is used on the added regression test.

Lastly, add a new perf test to evaluate the performance impact made by this change (tested on git.git):

Test          HEAD^                 HEAD
----------------------------------------------------------
5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%

For a bigger repository, such as linux.git, the improvement is even bigger:

Test          HEAD^                     HEAD
-------------------------------------------------------------------
5600.3: gc    6833.00(918.07+3162.74)   268.79(227.02+39.18) -96.1%

These improvements are particularly big because every object in the newly-cloned partial repository is a promisor object.


As noted with Git 2.33 (Q3 2021), the git-repack(man) doc now clearly states that it does operate on promisor packfiles (in a separate partition) when "-a" is specified.

Presumably the statements there were outdated, as they date from the first version of the doc in 2017 (while repack support was added in 2018).

See commit ace6d8e (02 Jun 2021) by Tao Klerks (TaoK).
(Merged by Junio C Hamano -- gitster -- in commit 4009809, 08 Jul 2021)

Signed-off-by: Tao Klerks
Reviewed-by: Taylor Blau
Acked-by: Jonathan Tan

See technical/partial-clone man page.

Plus, still with Git 2.33 (Q3 2021), "git read-tree"(man) had a codepath where blobs are fetched one-by-one from the promisor remote, which has been corrected to fetch in bulk.

See commit d3da223, commit b2896d2 (23 Jul 2021) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8230107, 02 Aug 2021)

cache-tree: prefetch in partial clone read-tree

Signed-off-by: Jonathan Tan

"git read-tree"(man) checks the existence of the blobs referenced by the given tree, but does not bulk prefetch them.
Add a bulk prefetch.

The lack of prefetch here was noticed at $DAYJOB during a merge involving some specific commits, but I couldn't find a minimal merge that didn't also trigger the prefetch in check_updates() in unpack-trees.c (and in all these cases, the lack of prefetch in cache-tree.c didn't matter because all the relevant blobs would have already been prefetched by then).
This is why I used read-tree here to exercise this code path.


Git 2.39 (Q4 2022) avoids calling 'cache_tree_update()' when doing so would be redundant.

See commit 652bd02, commit dc5d40f, commit 0e47bca, commit 68fcd48, commit 94fcf0e (10 Nov 2022) by Victoria Dye (vdye).
(Merged by Taylor Blau -- ttaylorr -- in commit a92fce4, 18 Nov 2022)

read-tree: use 'skip_cache_tree_update' option

Signed-off-by: Victoria Dye
Signed-off-by: Taylor Blau

When running 'read-tree' with a single tree and no prefix, 'prime_cache_tree()' is called after the tree is unpacked.
In that situation, skip a redundant call to 'cache_tree_update()' in 'unpack_trees()' by enabling the 'skip_cache_tree_update' unpack option.

Removing the redundant cache tree update provides a substantial performance improvement to 'git read-tree'(man) <tree-ish>, as shown by a test added to 'p0006-read-tree-checkout.sh':

Test                             before            after
----------------------------------------------------------------------
read-tree `br_ballast_plus_1`    3.94(1.80+1.57)   3.00(1.14+1.28) -23.9%

Note that the 'read-tree' in 't1022-read-tree-partial-clone.sh' is updated to read two trees, rather than one.
The test was first introduced in d3da223 ("cache-tree: prefetch in partial clone read-tree", 2021-07-23, Git v2.33.0-rc0 -- merge) to exercise the 'cache_tree_update()' code path, as used in 'git merge'(man).
Since this patch drops the call to 'cache_tree_update()' in single-tree 'git read-tree', change the test to use the two-tree variant so that 'cache_tree_update()' is called as intended.

Conah answered 14/5, 2021 at 8:39 Comment(0)

In Mercurial, you should be able to do some of this using:

hg convert --branchmap FILE SOURCE DEST REVMAP

You may also want:

--config convert.hg.startrev=REV

The source can be git, mercurial, or a variety of other systems.

I haven't tried it, but convert is quite rich.

Skiing answered 21/2, 2012 at 17:9 Comment(1)
The convert extension rewrites the hashes, so this is not a partial clone of the existing repo but rather a new one, meaning it will be a separate repository that cannot pull from or push to the original one.Humpback
