Shallow AND Sparse GIT Repository Clone

I have a shallow cloned git repository that is over 1 GB. I use sparse checkout for the files/dirs needed.

How can I reduce the repository clone to just the sparse checkout files/dirs?

Initially I was able to limit the cloned repository to only the sparse checkout by disabling checkout when cloning, then setting up sparse checkout before doing the initial checkout. This limited the repository to only about 200 MB, which is much more manageable. However, updating remote branch info at some point in the future causes the rest of the files and dirs to be included in the repository clone, sending the clone size back to over 1 GB, and I don't know how to reduce it to just the sparse checkout files and dirs.

In short what I want is a shallow AND sparse repository clone. Not just sparse checkout of a shallow repo clone. The full repo is a waste of space and performance for certain tasks suffers.

Hope someone can share a solution. Thanks.

Instinctive answered 26/9, 2018 at 21:56 Comment(1)
This is fully implemented with Git 2.25 (Q1 2020): see the example at the end of my answer below. -- Marshamarshal

Shallow and sparse means "partial" or "narrow".

A partial clone (or "narrow clone") is in theory possible, and was implemented first in Dec 2017 with Git 2.16, as seen here.
But:

That is further optimized in Git 2.20 (Q4 2018), since in a partial clone that will lazily be hydrated from the originating repository, we generally want to avoid "does this object exist (locally)?" on objects that we deliberately omitted when we created the (partial/sparse) clone.
The cache-tree codepath (which is used to write a tree object out of the index) however insisted that the object exists, even for paths that are outside of the partial checkout area.
The code has been updated to avoid such a check.

See commit 2f215ff (09 Oct 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit a08b1d6, 19 Oct 2018)

cache-tree: skip some blob checks in partial clone

In a partial clone, whenever a sparse checkout occurs, the existence of all blobs in the index is verified, whether they are included or excluded by the .git/info/sparse-checkout specification.
This significantly degrades performance because a lazy fetch occurs whenever the existence of a missing blob is checked.
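
To see which objects a partial clone deliberately left out (the ones whose existence checks could trigger a lazy fetch), `git rev-list --missing=print` lists them with a leading `?`. The following is a minimal sketch, assuming a throwaway local repository: the names `origin-repo` and `partial` and the file contents are made up for illustration, and the `uploadpack.*` settings are only there so that the local "server" honors the filter request.

```shell
#!/bin/sh
# Sketch: list the blobs that a blob:none partial clone omitted.
# Repository names and file contents below are hypothetical.
set -e
demo=$(mktemp -d)
cd "$demo"

# Throwaway "server" repository with two distinct blobs.
git init -q origin-repo
git -C origin-repo config user.name "Demo"
git -C origin-repo config user.email "demo@example.com"
git -C origin-repo config uploadpack.allowFilter true
git -C origin-repo config uploadpack.allowAnySHA1InWant true
echo "one" > origin-repo/a.txt
echo "two" > origin-repo/b.txt
git -C origin-repo add a.txt b.txt
git -C origin-repo commit -q -m "two files"

# Partial clone without a checkout, so no blob is fetched yet.
git clone -q --no-checkout --filter=blob:none "file://$demo/origin-repo" partial

# Missing (promised) objects are printed with a leading "?".
missing=$(git -C partial rev-list --objects --all --missing=print | grep -c '^?' || true)
echo "omitted blobs: $missing"
```

Commits and trees are present locally; only the blobs show up as missing until a checkout or another command needs them.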


With Git 2.24 (Q4 2019), the cache-tree code has been taught to be less aggressive in attempting to see if a tree object it computed already exists in the repository.

See commit f981ec1 (03 Sep 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit ae203ba, 07 Oct 2019)

cache-tree: do not lazy-fetch tentative tree

The cache-tree datastructure is used to speed up the comparison between the HEAD and the index, and when the index is updated by a cherry-pick (for example), a tree object that would represent the paths in the index in a directory is constructed in-core, to see if such a tree object exists already in the object store.

When the lazy-fetch mechanism was introduced, we converted this "does the tree exist?" check into an "if it does not, and if we lazily cloned, see if the remote has it" call by mistake.
Since the whole point of this check is to repair the cache-tree by recording an already existing tree object opportunistically, we shouldn't even try to fetch one from the remote.

Pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag to make sure we only check for existence in the local object store without triggering the lazy fetch mechanism.


Before Git 2.25 (Q1 2020), the "git fetch" codepath had a big "do not lazily fetch missing objects when I ask if something exists" switch.

This has been corrected by marking the "does this thing exist?" calls with "if not please do not lazily fetch it" flag.

See commit 603960b, commit e362fad (13 Nov 2019), and commit 6462d5e (05 Nov 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit fce9e83, 01 Dec 2019)

clone: remove fetch_if_missing=0

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from clone as well. But doing so reveals a bug - when the server does not send an object directly pointed to by a ref, this should be an error, not a trigger for a lazy fetch. (This case in the fetching mechanism was covered by a test using "git clone", not "git fetch", which is why the aforementioned commit didn't uncover the bug.)

The bug can be fixed by suppressing lazy-fetching during the connectivity check. Fix this bug, and remove fetch_if_missing from clone.

And:

promisor-remote: remove fetch_if_missing=0

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) strove to remove the need for fetch_if_missing=0 from the fetching mechanism, so it is plausible to attempt removing fetch_if_missing=0 from the lazy-fetching mechanism in promisor-remote as well.

But doing so reveals a bug - when the server does not send an object pointed to by a tag object, an infinite loop occurs: Git attempts to fetch the missing object, which causes a dereferencing of all refs (for negotiation), which causes a lazy fetch of that missing object, and so on.
This bug is because of unnecessary use of the fetch negotiator during lazy fetching - it is not used after initialization, but it is still initialized (which causes the dereferencing of all refs).

Thus, when the negotiator is not used during fetching, refrain from initializing it. Then, remove fetch_if_missing from promisor-remote.


See more with "Bring your monorepo down to size with sparse-checkout" from Derrick Stolee

Pairing sparse-checkout with the partial clone feature accelerates these workflows even more.
This combination speeds up the data transfer process since you don’t need every reachable Git object, and instead, can download only those you need to populate your cone of the working directory.

$ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (23/23), done.
 
$ cd sparse-checkout-example/
 
$ git sparse-checkout init --cone
Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
 
$ git sparse-checkout set client/android
Receiving objects: 100% (26/26), 985.91 KiB | 5.76 MiB/s, done.
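
For the "shallow AND sparse" case from the question, the same idea can be combined with `--depth`. What follows is only a sketch under stated assumptions: it uses a throwaway local repository instead of a hosting service, the names `demo-origin` and `demo-clone` and the directory layout are hypothetical, and it relies on `git clone --sparse` and `git sparse-checkout` (Git 2.25 or later) on the client plus a server that accepts `--filter`.

```shell
#!/bin/sh
# Sketch: shallow + partial + sparse clone against a throwaway local repository.
# All names (demo-origin, demo-clone, client/android, service) are illustrative.
set -e
work=$(mktemp -d)
cd "$work"

# 1. Build a small "server" repository with two commits and two directories.
git init -q demo-origin
git -C demo-origin config user.name "Demo"
git -C demo-origin config user.email "demo@example.com"
# Let clients request object filters and on-demand objects.
git -C demo-origin config uploadpack.allowFilter true
git -C demo-origin config uploadpack.allowAnySHA1InWant true
mkdir -p demo-origin/client/android demo-origin/service
echo "app" > demo-origin/client/android/app.txt
echo "svc" > demo-origin/service/svc.txt
echo "readme" > demo-origin/README
git -C demo-origin add .
git -C demo-origin commit -q -m "initial"
echo "more" >> demo-origin/README
git -C demo-origin commit -q -a -m "second"

# 2. Clone one commit deep, without blobs, checking out only root-level files.
git clone -q --depth=1 --filter=blob:none --sparse "file://$work/demo-origin" demo-clone
cd demo-clone

# 3. Narrow the working tree to one cone; needed blobs are fetched on demand.
git sparse-checkout init --cone
git sparse-checkout set client/android

depth=$(git rev-list --count HEAD)
echo "commits in history: $depth"
```

In this sketch the `service` directory is never written to the working tree and its blob is never downloaded, while the history is limited to the single most recent commit.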

Before Git 2.25.1 (Feb. 2020), has_object_file() said "no" given an object registered to the system via pretend_object_file(), making it inconsistent with read_object_file(), causing lazy fetch to attempt fetching an empty tree from promisor remotes.

See discussion.

I tried to reproduce this with

empty_tree=$(git mktree </dev/null)
git init --bare x
git clone --filter=blob:none file://$(pwd)/x y
cd y
echo hi >README
git add README
git commit -m 'nonempty tree'
GIT_TRACE=1 git diff-tree "$empty_tree" HEAD

and indeed, it looks like Git serves the empty tree even from repositories that don't contain it.

See commit 9c8a294 (02 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit e26bd14, 22 Jan 2020)

sha1-file: remove OBJECT_INFO_SKIP_CACHED

Signed-off-by: Jonathan Tan

In a partial clone, if a user provides the hash of the empty tree ("git mktree </dev/null" - for SHA-1, this is 4b825d...) to a command which requires that that object be parsed, for example:

git diff-tree 4b825d <a non-empty tree>

then Git will lazily fetch the empty tree, unnecessarily, because parsing of that object invokes repo_has_object_file(), which does not special-case the empty tree.

Instead, teach repo_has_object_file() to consult find_cached_object() (which handles the empty tree), thus bringing it in line with the rest of the object-store-accessing functions.
A cost is that repo_has_object_file() will now need to oideq upon each invocation, but that is trivial compared to the filesystem lookup or the pack index search required anyway. (And if find_cached_object() needs to do more because of previous invocations to pretend_object_file(), all the more reason to be consistent in whether we present cached objects.)

As a historical note, the function now known as repo_read_object_file() was taught the empty tree in 346245a1bb ("hard-code the empty tree object", 2008-02-13, Git v1.5.5-rc0 -- merge), and the function now known as oid_object_info() was taught the empty tree in c4d9986f5f ("sha1_object_info: examine cached_object store too", 2011-02-07, Git v1.7.4.1).

repo_has_object_file() was never updated, perhaps due to oversight.
The flag OBJECT_INFO_SKIP_CACHED, introduced later in dfdd4afcf9 ("sha1_file: teach sha1_object_info_extended more flags", 2017-06-26, Git v2.14.0-rc0) and used in e83e71c5e1 ("sha1_file: refactor has_sha1_file_with_flags", 2017-06-26, Git v2.14.0-rc0), was introduced to preserve this difference in empty-tree handling, but now it can be removed.


Git 2.25.1 will also warn programmers about pretend_object_file() that allows the code to tentatively use in-core objects.

See commit 60440d7 (04 Jan 2020) by Jonathan Nieder (artagnon).
(Merged by Junio C Hamano -- gitster -- in commit b486d2e, 12 Feb 2020)

sha1-file: document how to use pretend_object_file

Inspired-by: Junio C Hamano
Signed-off-by: Jonathan Nieder

Like in-memory alternates, pretend_object_file contains a trap for the unwary: careless callers can use it to create references to an object that does not exist in the on-disk object store.

Add a comment documenting how to use the function without risking such problems.

The only current caller is blame, which uses pretend_object_file to create an in-memory commit representing the working tree state. Noticed during a discussion of how to safely use this function in operations like "git merge" which, unlike blame, are not read-only.

So the comment is now:

/*
 * Add an object file to the in-memory object store, without writing it
 * to disk.
 *
 * Callers are responsible for calling write_object_file to record the
 * object in persistent storage before writing any other new objects
 * that reference it.
 */
int pretend_object_file(void *, unsigned long, enum object_type,
            struct object_id *oid);

Git 2.25.1 (Feb. 2020) includes future-proofing to make sure a test does not depend on a current implementation detail.

See commit b54128b (13 Jan 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 3f7553a, 12 Feb 2020)

t5616: make robust to delta base change

Signed-off-by: Jonathan Tan

Commit 6462d5eb9a ("fetch: remove fetch_if_missing=0", 2019-11-08) contains a test that relies on having to lazily fetch the delta base of a blob, but assumes that the tree being fetched (as part of the test) is sent as a non-delta object.
This assumption may not hold in the future; for example, a change in the length of the object hash might result in the tree being sent as a delta instead.

Make the test more robust by relying on having to lazily fetch the delta base of the tree instead, and by making no assumptions on whether the blobs are sent as delta or non-delta.


Git 2.25.2 (March 2020) fixes a bug revealed by a recent change to make the protocol v2 the default.

See commit 3e96c66, commit d0badf8 (21 Feb 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 444cff6, 02 Mar 2020)

partial-clone: avoid fetching when looking for objects

Signed-off-by: Derrick Stolee

While testing partial clone, I noticed some odd behavior. I was testing a way of running 'git init', followed by manually configuring the remote for partial clone, and then running 'git fetch'.
Astonishingly, I saw the 'git fetch' process start asking the server for multiple rounds of pack-file downloads! When tweaking the situation a little more, I discovered that I could cause the remote to hang up with an error.

Add two tests that demonstrate these two issues.

In the first test, we find that when fetching with blob filters from a repository that previously did not have any tags, the 'git fetch --tags origin' command fails because the server sends "multiple filter-specs cannot be combined". This only happens when using protocol v2.

In the second test, we see that a 'git fetch origin' request with several ref updates results in multiple pack-file downloads.
This must be due to Git trying to fault-in the objects pointed by the refs. What makes this matter particularly nasty is that this goes through the do_oid_object_info_extended() method, so there are no "haves" in the negotiation.
This leads the remote to send every reachable commit and tree from each new ref, providing a quadratic amount of data transfer! This test is fixed if we revert 6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0), but that revert causes other test failures.
The real fix will need more care.
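
The setup described in that commit message ('git init', manual partial-clone configuration, then 'git fetch') can be sketched as below. The commit message does not spell out the exact commands, so the two `remote.origin.*` keys are an assumption about what "manually configuring the remote" amounts to (they are the documented promisor settings in recent Git versions); the repository names `srv` and `client` are hypothetical.

```shell
#!/bin/sh
# Sketch: "git init" + manual promisor configuration + "git fetch".
# The config keys are an assumption about the manual setup the commit
# message refers to; repository names below are hypothetical.
set -e
top=$(mktemp -d)
cd "$top"

# Throwaway "server" repository that accepts filter requests.
git init -q srv
git -C srv config user.name "Demo"
git -C srv config user.email "demo@example.com"
git -C srv config uploadpack.allowFilter true
git -C srv config uploadpack.allowAnySHA1InWant true
echo "payload" > srv/data.txt
git -C srv add data.txt
git -C srv commit -q -m "add data"

# Client: empty repository whose remote is marked as a promisor
# with a default filter.
git init -q client
git -C client remote add origin "file://$top/srv"
git -C client config remote.origin.promisor true
git -C client config remote.origin.partialclonefilter blob:none

# A plain fetch from origin is expected to reuse the configured filter.
git -C client fetch -q origin

promisor=$(git -C client config remote.origin.promisor)
omitted=$(git -C client rev-list --objects --all --missing=print | grep -c '^?' || true)
echo "promisor=$promisor omitted=$omitted"
```

Because the filter is stored with the remote, later fetches of updated branches should keep omitting blobs instead of pulling in the full repository, which is the behavior the question asks for.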

Fix:

When using partial clone, find_non_local_tags() in builtin/fetch.c checks each remote tag to see if its object also exists locally. There is no expectation that the object exist locally, but this function nevertheless triggers a lazy fetch if the object does not exist. This can be extremely expensive when asking for a commit, as we are completely removed from the context of the non-existent object and thus supply no "haves" in the request.

6462d5eb9a (fetch: remove fetch_if_missing=0, 2019-11-05, Git v2.25.0-rc0) removed a global variable that prevented these fetches in favor of a bitflag. However, some object existence checks were not updated to use this flag.

Update find_non_local_tags() to use OBJECT_INFO_SKIP_FETCH_OBJECT in addition to OBJECT_INFO_QUICK.
The _QUICK option only prevents repreparing the pack-file structures. We need to be extremely careful about supplying _SKIP_FETCH_OBJECT when we expect an object to not exist due to updated refs.

This resolves a broken test in t5616-partial-clone.sh.


The logic to auto-follow tags by "git clone --single-branch" was not careful to avoid lazy-fetching unnecessary tags, which has been corrected with Git 2.27 (Q2 2020).

See commit 167a575 (01 Apr 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 3ea2b46, 22 Apr 2020)

clone: use "quick" lookup while following tags

Signed-off-by: Jeff King

When cloning with --single-branch, we implement git fetch's usual tag-following behavior, grabbing any tag objects that point to objects we have locally.

When we're a partial clone, though, our has_object_file() check will actually lazy-fetch each tag.

That not only defeats the purpose of --single-branch, but it does it incredibly slowly, potentially kicking off a new fetch for each tag.
This is even worse for a shallow clone, which implies --single-branch, because even tags which are supersets of each other will be fetched individually.

We can fix this by passing OBJECT_INFO_SKIP_FETCH_OBJECT to the call, which is what git fetch does in this case.

Likewise, let's include OBJECT_INFO_QUICK, as that's what git fetch does.
The rationale is discussed in 5827a03545 (fetch: use "quick" has_sha1_file for tag following, 2016-10-13, Git v2.10.2), but here the tradeoff would apply even more so because clone is very unlikely to be racing with another process repacking our newly-created repository.

This may provide a very small speedup even in the non-partial case, as we'd avoid calling reprepare_packed_git() for each tag (though in practice, we'd only have a single packfile, so that reprepare should be quite cheap).


Before Git 2.27 (Q2 2020), serving a "git fetch" client over "git://" and "ssh://" protocols using the on-wire protocol version 2 was buggy on the server end when the client needs to make a follow-up request to e.g. auto-follow tags.

See commit 08450ef (08 May 2020) by Christian Couder (chriscool).
(Merged by Junio C Hamano -- gitster -- in commit a012588, 13 May 2020)

upload-pack: clear filter_options for each v2 fetch command

Helped-by: Derrick Stolee
Helped-by: Jeff King
Helped-by: Taylor Blau
Signed-off-by: Christian Couder

Because of the request/response model of protocol v2, the upload_pack_v2() function is sometimes called twice in the same process, while 'struct list_objects_filter_options filter_options' was declared as static at the beginning of 'upload-pack.c'.

This made the check in list_objects_filter_die_if_populated(), which is called by process_args(), fail the second time upload_pack_v2() is called, as filter_options had already been populated the first time.

To fix that, filter_options is not static any more. It's now owned directly by upload_pack(). It's now also part of 'struct upload_pack_data', so that it's owned indirectly by upload_pack_v2().

In the long term, the goal is to also have upload_pack() use 'struct upload_pack_data', so adding filter_options to this struct makes more sense than to have it owned directly by upload_pack_v2().

This fixes the first of the 2 bugs documented by d0badf8797 ("partial-clone: demonstrate bugs in partial fetch", 2020-02-21, Git v2.26.0-rc0 -- merge listed in batch #8).


Before Git 2.29 (Q4 2020), the pretend-object mechanism checked if the given object already exists in the object store before deciding to keep the data in-core, but that check would trigger lazy fetching of such an object from a promisor remote.

See commit a64d2aa (21 Jul 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 5b137e8, 04 Aug 2020)

sha1-file: make pretend_object_file() not prefetch

Signed-off-by: Jonathan Tan

When pretend_object_file() is invoked with an object that does not exist (as is the typical case), there is no need to fetch anything from the promisor remote, because the caller already knows what the object is supposed to contain. Therefore, suppress the fetch. (The OBJECT_INFO_QUICK flag is added for the same reason.)

This was noticed at $DAYJOB when "blame" was run on a file that had uncommitted modifications.


Before Git 2.37 (Q3 2022), "git mktree --missing" lazily fetched objects that are missing from the local object store, which was totally unnecessary for the purpose of creating the tree object(s) from its input.

See commit 817b0f6 (21 Jun 2022) by Richard Oliver (RichardBray).
(Merged by Junio C Hamano -- gitster -- in commit 6fccbda, 13 Jul 2022)

mktree: do not check type of remote objects

Signed-off-by: Richard Oliver

With 31c8221 ("mktree: validate entry type in input", 2009-05-14, Git v1.6.4-rc0 -- merge), we called the sha1_object_info() API to obtain the type information, but allowed the call to silently fail when the object was missing locally, so that we can sanity-check the types opportunistically when the object did exist.

The implementation is understandable because back then there was no lazy/on-demand downloading of individual objects from the promisor remotes that causes a long delay and materializes the object, hence defeating the point of using "--missing".
The design is hurting us now.

We could bypass the opportunistic type/mode consistency check altogether when "--missing" is given, but instead, use the oid_object_info_extended() API and tell it that we are only interested in objects that locally exist and are immediately available by passing OBJECT_INFO_SKIP_FETCH_OBJECT bit to it.
That way, we will still retain the cheap and opportunistic sanity check for local objects.

Marshamarshal answered 26/9, 2018 at 22:12 Comment(6)
Don't see any documentation re: "partial" "narrow" or the "--filter" option. git version 2.19.0.windows.1 -- Instinctive
@Instinctive You will see it in github.com/git/git/commit/… -- Marshamarshal
Apparently partial cloning (--filter) is not supported by github. "warning: filtering not recognized by server, ignoring" The entire 1.4 GB repo is still cloned. git clone --no-checkout --filter=blob:none "github.com/freebsd/freebsd-ports" pc1 -- Instinctive
@Instinctive Exactly: this is still being deployed, and not yet supported by the major Git repo hosting services. You would need to run your own server with the latest Git as a mirror in order to clone/push/pull from that mirror. -- Marshamarshal
Thanks for enlightening me about the upcoming partial clone capability. Once it is supported by github it should be a very useful feature. But for now needing a local full clone would defeat the purpose for me. Any idea on a rough likely time-frame for github support? Thanks -- Instinctive
@Instinctive No idea: you would need to contact GitHub support for that question: github.com/contact -- Marshamarshal

In short what I want is a shallow AND sparse repository clone.

That will work faster with Git 2.42 (Q3 2023): "git diff-tree" has been taught to take advantage of the sparse-index feature.

See commit 48c5fbf (18 May 2023) by Shuqi Liang (none).
(Merged by Junio C Hamano -- gitster -- in commit ca9c063, 13 Jun 2023)

diff-tree: integrate with sparse index

Helped-by: Victoria Dye
Signed-off-by: Shuqi Liang

The index is read in 'cmd_diff_tree' at two points:

  1. The first index read was added in fd66bcc ("diff-tree: read the index so attribute checks work in bare repositories", 2017-12-06, Git v2.16.0-rc0 -- merge listed in batch #10) to deal with reading '.gitattributes' content.
    77efbb3 ("attr: be careful about sparse directories", 2021-09-08, Git v2.34.0-rc0 -- merge listed in batch #7) established that, in a sparse index, we do not try to load a '.gitattributes' file from within a sparse directory.

  2. The second index access point is involved in rename detection, specifically when reading from stdin. This was initially added in f0c6b2a ("[PATCH] Optimize diff-tree -[CM]--stdin", 2005-05-27, Git v0.99 -- merge), where 'setup' was set to 'DIFF_SETUP_USE_SIZE_CACHE | DIFF_SETUP_USE_CACHE'.
    That assignment was later modified to drop the 'DIFF_SETUP_USE_CACHE' in ff7fe37 ("diff.c: move read_index() code back to the caller", 2018-08-13, Git v2.19.0-rc0 -- merge). However, 'DIFF_SETUP_USE_SIZE_CACHE' seems to be unused as of 6e0b8ed ("diff.c: do not use a separate "size cache".", 2007-05-07, Git v1.5.2-rc3 -- merge) and nothing about 'detect_rename' otherwise indicates index usage.

Hence we can just set the requires-full-index to false for "diff-tree".

The p2000 tests demonstrate a ~98% execution time reduction for 'git diff-tree' using a sparse index:

Test                                                 before  after 
-----------------------------------------------------------------------
2000.94: git diff-tree HEAD (full-v3)                0.05    0.04 -20.0% 
2000.95: git diff-tree HEAD (full-v4)                0.06    0.05 -16.7% 
2000.96: git diff-tree HEAD (sparse-v3)              0.59    0.01 -98.3% 
2000.97: git diff-tree HEAD (sparse-v4)              0.61    0.01 -98.4% 
2000.98: git diff-tree HEAD -- f2/f4/a (full-v3)     0.05    0.05 +0.0% 
2000.99: git diff-tree HEAD -- f2/f4/a (full-v4)     0.05    0.04 -20.0% 
2000.100: git diff-tree HEAD -- f2/f4/a (sparse-v3)  0.58    0.01 -98.3% 
2000.101: git diff-tree HEAD -- f2/f4/a (sparse-v4)  0.55    0.01 -98.2%
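
To try this on a small scale, the sketch below sets up a cone-mode sparse checkout, asks Git to write a sparse index, and runs `git diff-tree` on the latest commit. It is an illustration under assumptions, not the benchmark setup from the commit message: the repository name `mono` and its layout are hypothetical, and the `index.sparse` key only has an effect on Git versions that support the sparse index (older versions ignore it and use a full index).

```shell
#!/bin/sh
# Sketch: cone-mode sparse checkout with the sparse index requested,
# then a diff-tree of the latest commit. Names are hypothetical.
set -e
root=$(mktemp -d)
cd "$root"

git init -q mono
cd mono
git config user.name "Demo"
git config user.email "demo@example.com"
mkdir -p client/android service
echo "v1" > client/android/app.txt
echo "svc" > service/svc.txt
git add .
git commit -q -m "initial"

# Keep only client/android (plus root-level files) in the working tree.
git sparse-checkout init --cone
git sparse-checkout set client/android

# Request the sparse form of the index; Git versions without
# sparse-index support simply ignore this key.
git config index.sparse true

echo "v2" > client/android/app.txt
git add client/android/app.txt
git commit -q -m "update app"

# Paths changed by the latest commit, compared with its parent.
changed=$(git diff-tree --no-commit-id --name-only -r HEAD)
echo "$changed"
```

On recent Git versions, `git sparse-checkout set --sparse-index` is an alternative way to request the same index format.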

And even faster, through git diff-index:

Before Git 2.47 (Q4 2024), batch 11, "git diff-index" fully expanded the sparse index upfront, even though its underlying machinery had long been able to expand the sparse index only as needed; the command has now been taught not to do so.

See commit b44c926 (22 Aug 2024) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 11fd53a, 29 Aug 2024)

diff-index: integrate with the sparse index

Signed-off-by: Derrick Stolee

The sparse index allows focusing the index data structure on the files present in the sparse-checkout, leaving only tree entries for directories not within the sparse-checkout.
Each builtin needs a repository setting to indicate that it has been tested with the sparse index before Git will allow the index to be loaded into memory in its sparse form.
This is a safety precaution.

There are still some builtins that haven't been integrated due to the complexity of the integration and the lack of significant use.
However, 'git diff-index' was neglected only because of initial data showing low usage.
The diff machinery was already integrated and there is no more work to be done there but add some tests to be sure 'git diff-index' behaves as expected.

For this purpose, we can follow the testing pattern used in 51ba65b ("diff: enable and test the sparse index", 2021-12-06, Git v2.35.0-rc0 -- merge listed in batch #4).
One difference here is that we only verify that the sparse index case agrees with the full index case, but do not generate the expected output.
The 'git diff' tests use the '--name-status' option to ease the creation of the expected output, but that's not an option for 'diff-index'.
Since the underlying diff machinery is the same, a simple comparison is sufficient to give some coverage.

Marshamarshal answered 16/6, 2023 at 8:20 Comment(0)
