What is the git clone --filter option's syntax?
This is at least clearer with Git 2.27 (Q2 2020).
Before that, here is a quick TL;DR example of that command, combined with a (cone) sparse-checkout:
#fastest clone possible:
git clone --filter=blob:none --no-checkout https://github.com/git/git
cd git
git sparse-checkout init --cone
git read-tree -mu HEAD
That will bring back only the top folder files, excluding by default any subfolder.
The initial clone remains fast because of the git clone --filter=blob:none --no-checkout step: no blob is downloaded until a checkout needs it.
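To try the sequence above without touching the network, here is a minimal sketch against a throwaway local repository. Everything in it (the temporary directory, the file names, the docs subfolder) is made up for illustration, and it assumes Git 2.25 or later for git sparse-checkout. The two uploadpack.* settings are only there so that the local "server" accepts the filter and the later lazy fetches.

```shell
#!/bin/sh
# Sketch: partial clone + cone sparse-checkout against a local throwaway repo.
# Assumes Git 2.25+; all paths and file names are hypothetical.
set -e
work=$(mktemp -d)

# 1. Build a tiny "server" repository: one top-level file, one subfolder.
git init -q "$work/src"
git -C "$work/src" config user.name demo
git -C "$work/src" config user.email demo@example.com
mkdir "$work/src/docs"
echo top > "$work/src/top.txt"
echo doc > "$work/src/docs/readme.txt"
git -C "$work/src" add .
git -C "$work/src" commit -q -m init
# Let this local "server" honor --filter and serve lazily fetched objects.
git -C "$work/src" config uploadpack.allowFilter true
git -C "$work/src" config uploadpack.allowAnySHA1InWant true

# 2. Same sequence as in the answer, pointed at the local repository.
git clone -q --filter=blob:none --no-checkout "file://$work/src" "$work/clone"
cd "$work/clone"
git sparse-checkout init --cone
git read-tree -mu HEAD
before=$(ls)            # only top-level files are expected here

# 3. Widen the cone to one subfolder; its blobs are fetched on demand.
git sparse-checkout set docs
after=$(ls docs)

echo "before: $before"
echo "after:  docs/$after"
```

The point of the sketch is step 3: the subfolder content only gets downloaded once the cone is widened with git sparse-checkout set.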
Now, onto that git clone --filter option's syntax:
See commit 4a46544 (22 Mar 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit fa0c1eb, 22 Apr 2020)
clone: document --filter options
Signed-off-by: Derrick Stolee
It turns out that the "--filter=<filter-spec>" option is not documented anywhere in the "git clone" page, and instead is detailed carefully in "git rev-list" where it serves a different purpose.
Add a small bit about this option in the documentation. It would be worth some time to create a subsection in the "git clone" documentation about partial clone as a concept and how it can be a surprising experience. For example, "git checkout" will likely trigger a pack download.
The git clone documentation now includes:
--filter=<filter-spec>:
Use the partial clone feature and request that the server sends a subset of reachable objects according to a given object filter.
When using --filter, the supplied <filter-spec> is used for the partial clone filter.
For example, --filter=blob:none will filter out all blobs (file contents) until needed by Git.
Also, --filter=blob:limit=<size> will filter out all blobs of size at least <size>.
For more details on filter specifications, see the --filter option in git rev-list.
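To see what "filter out all blobs of size at least <size>" means in practice, here is a small sketch on a local throwaway repository (the file names and sizes are invented for the example). It clones with --filter=blob:limit=1k and then asks git rev-list to print the objects that are missing locally, which it prefixes with "?".

```shell
#!/bin/sh
# Sketch: which blobs does --filter=blob:limit=1k leave out?
# All paths and file names are hypothetical; the source repo is local.
set -e
work=$(mktemp -d)

git init -q "$work/src"
git -C "$work/src" config user.name demo
git -C "$work/src" config user.email demo@example.com
printf 'small\n' > "$work/src/small.txt"                      # 6 bytes
dd if=/dev/zero of="$work/src/big.bin" bs=1024 count=4 2>/dev/null  # 4096 bytes
git -C "$work/src" add .
git -C "$work/src" commit -q -m init
# The serving side must accept filters, or the clone silently becomes a full one.
git -C "$work/src" config uploadpack.allowFilter true

# --no-checkout: nothing forces a lazy fetch of the filtered blob.
git clone -q --no-checkout --filter=blob:limit=1k "file://$work/src" "$work/partial"

# --missing=print lists absent objects as "?<oid>" instead of fetching them.
missing=$(git -C "$work/partial" rev-list --objects --missing=print HEAD | grep -c '^?')
echo "missing objects: $missing"
```

With one file under the limit and one above it, only the larger blob should be reported as missing.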
That option is less useful than I had hoped (it can't be used to combine clone and filter-branch).
And yet this filtering mechanism is the extension of one associated with clone, for implementing the partial cloning (or narrow clone) introduced in Dec. 2017 with Git 2.16.
But your Git repo hosting server must support the protocol v2, supported for now (Oct. 2018) only by GitLab.
Meaning you can use --filter with git clone, as a recent Git 2.20 patch illustrates (see below).
That filter was then added to git fetch in this patch series.
It is part of a new pack-protocol capability "filter", added to the fetch-pack and upload-pack negotiation.
See "filter" in Documentation/technical/pack-protocol, which refers to the rev-list options.
With Git 2.20 (Q4 2018), a partial clone that is configured to lazily fetch missing objects will on-demand issue a "git fetch" request to the originating repository to fill not-yet-obtained objects.
The request has been optimized for requesting a tree object (and not the leaf blob objects contained in it) by telling the originating repository that no blobs are needed.
See commit 4c7f956, commit 12f19a9 (03 Oct 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit fa54ccc, 19 Oct 2018)
fetch-pack: exclude blobs when lazy-fetching trees
A partial clone with missing trees can be obtained using "git clone --filter=tree:none <repo>".
In such a repository, when a tree needs to be lazily fetched, any tree or blob it directly or indirectly references is fetched as well, regardless of whether the original command required those objects, or if the local repository already had some of them.
This is because the fetch protocol, which the lazy fetch uses, does not
allow clients to request that only the wanted objects be sent, which
would be the ideal solution. This patch implements a partial solution:
specify the "blob:none" filter, somewhat reducing the fetch payload.
This change has no effect when lazily fetching blobs (due to how filters
work). And if lazily fetching a commit (such repositories are difficult
to construct and is not a use case we support very well, but it is
possible), referenced commits and trees are still fetched - only the
blobs are not fetched.
You can see further optimization with:
See commit e70a303, commit 6ab4055, commit 0177565, commit 99bcb88 (27 Sep 2018) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 0527fba, 19 Oct 2018)
transport: allow skipping of ref listing
The get_refs_via_connect() function both performs the handshake (including determining the protocol version) and obtaining the list of remote refs.
However, the fetch protocol v2 supports fetching objects without the listing of refs, so make it possible for the user to skip the listing by creating a new handshake() function.
Note the syntax has evolved with Git 2.21 (Q1 2019), which updated the protocol message specification to allow only limited use of scaled quantities.
This is to ensure that potential compatibility issues do not get out of hand.
See commit 87c2d9d (08 Jan 2019) by Josh Steadmon (steadmon).
See commit 8272f26, commit c813a7c (09 Jan 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit 073312b, 05 Feb 2019)
filter-options: expand scaled numbers
When communicating with a remote server or a subprocess, use expanded numbers rather than numbers with scaling suffix in the object filter spec (e.g. "limit:blob=1k" becomes "limit:blob=1024").
Update the protocol docs to note that clients should always perform this expansion, to allow for more compatibility between server implementations.
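As an illustration of that expansion, here is a small shell sketch. It is not Git's implementation: expand_scaled and expand_filter_spec are hypothetical helper names, and the sketch assumes the usual k/m/g suffixes with 1024-based multipliers. It uses the blob:limit=<size> spelling from the git clone documentation quoted earlier.

```shell
#!/bin/sh
# Sketch only: mimic the "expand scaled numbers" step for a filter-spec.
# expand_scaled and expand_filter_spec are hypothetical helpers, not Git code.

# Turn "1k", "2m", "1g" (case-insensitive suffix) into a plain byte count.
expand_scaled() {
    n=$1
    case $n in
        *[kK]) echo $(( ${n%?} * 1024 )) ;;
        *[mM]) echo $(( ${n%?} * 1024 * 1024 )) ;;
        *[gG]) echo $(( ${n%?} * 1024 * 1024 * 1024 )) ;;
        *)     echo "$n" ;;
    esac
}

# Only blob:limit=<size> carries a size; other specs pass through unchanged.
expand_filter_spec() {
    spec=$1
    case $spec in
        blob:limit=*) echo "blob:limit=$(expand_scaled "${spec#blob:limit=}")" ;;
        *)            echo "$spec" ;;
    esac
}

expand_filter_spec blob:limit=1k
expand_filter_spec blob:none
```

The design point is that the client does the arithmetic, so a server implementation only ever has to parse a plain integer.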
As an aside, Git 2.23 (Q3 2019) considers the "invalid filter-spec" message to be user-facing and not a BUG, so it makes it localizable.
See commit 5c03bc8 (31 May 2019) by Matthew DeVore (matvore).
(Merged by Junio C Hamano -- gitster -- in commit ca02d36, 21 Jun 2019)
list-objects-filter-options: error is localizeable
The "invalid filter-spec" message is user-facing and not a BUG, so make it localizeable.
For reference, the message appears in this context:
$ git rev-list --filter=blob:nonse --objects HEAD
fatal: invalid filter-spec 'blob:nonse'
With Git 2.24 (Q4 2019), the http transport, which lacked some optimization the native transports learned to avoid unnecessary ref advertisement, has been fixed.
See commit fddf2eb, commit ac3fda8 (21 Aug 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit f67bf53, 18 Sep 2019)
transport-helper: skip ls-refs if unnecessary
Commit e70a303 ("fetch: do not list refs if fetching only hashes", 2018-10-07, Git v2.20.0-rc0) and its ancestors taught Git, as an optimization, to skip the ls-refs step when it is not necessary during a protocol v2 fetch (for example, when lazy fetching a missing object in a partial clone, or when running "git fetch --no-tags <remote> <SHA-1>").
But that was only done for natively supported protocols; in particular, HTTP was not supported.
Teach Git to skip ls-refs when using remote helpers that support connect or stateless-connect.
Another optimization in Git 2.24 (Q4 2019):
See commit d8bc1a5 (08 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit c7d2ced, 15 Oct 2019)
send-pack: never fetch when checking exclusions
Signed-off-by: Jonathan Tan
When building the packfile to be sent, send_pack() is given a list of remote refs to be used as exclusions.
For each ref, it first checks if the ref exists locally, and if it does, passes it with a "^" prefix to pack-objects.
However, in a partial clone, the check may trigger a lazy fetch.
The additional commit ancestry information obtained during such fetches may show that certain objects that would have been sent are already known to the server, resulting in a smaller pack being sent.
But this is at the cost of fetching from many possibly unrelated refs, and the lazy fetches do not help at all in the typical case where the client is up-to-date with the upstream of the branch being pushed.
Ensure that these lazy fetches do not occur.
Finally, Git 2.24 (Q4 2019) includes a last-minute work-around for a lazy fetch glitch, which illustrates one usage of the filter syntax.
See commit c7aadcc (23 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit c32ca69, 04 Nov 2019)
fetch: delay fetch_if_missing=0 until after config
Signed-off-by: Jonathan Tan
Suppose, from a repository that has ".gitmodules", we clone with --filter=blob:none:
git clone --filter=blob:none --no-checkout \
https://kernel.googlesource.com/pub/scm/git/git
Then we fetch:
git -C git fetch
This will cause a "unable to load config blob object", because the fetch_config_from_gitmodules() invocation in cmd_fetch() will attempt to load ".gitmodules" (which Git knows to exist because the client has the tree of HEAD) while fetch_if_missing is set to 0.
fetch_if_missing is set to 0 too early - ".gitmodules" here should be lazily fetched.
Git must set fetch_if_missing to 0 before the fetch because as part of the fetch, packfile negotiation happens (and we do not want to fetch any missing objects when checking existence of objects), but we do not need to set it so early.
Move the setting of fetch_if_missing to the earliest possible point in cmd_fetch(), right before any fetching happens.
With Git 2.25 (Q1 2020), debugging support for lazy cloning has been improved a bit.
git fetch v2 now makes good use of promisor files.
See commit 5374a29 (15 Oct 2019) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 026587c, 10 Nov 2019)
fetch-pack: write fetched refs to .promisor
Signed-off-by: Jonathan Tan
Acked-by: Josh Steadmon
The specification of promisor packfiles (in partial-clone.txt) states that the .promisor files that accompany packfiles do not matter (just like .keep files), so whenever a packfile is fetched from the promisor remote, Git has been writing empty .promisor files.
But these files could contain more useful information.
So instead of writing empty files, write the refs fetched to these files.
This makes it easier to debug issues with partial clones, as we can identify what refs (and their associated hashes) were fetched at the time the packfile was downloaded, and if necessary, compare those hashes against what the promisor remote reports now.
This is implemented by teaching fetch-pack to write its own non-empty .promisor file whenever it knows the name of the pack's lockfile.
This covers the case wherein the user runs "git fetch" with an internal protocol or HTTP protocol v2 (fetch_refs_via_pack() in transport.c sets lock_pack) and with HTTP protocol v0/v1 (fetch_git() in remote-curl.c passes "--lock-pack" to "fetch-pack").
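If you want to look at those .promisor files yourself, a sketch along these lines should do, again on a local throwaway repository with made-up names. I am only assuming that a partial clone leaves at least one .promisor file next to its packfile; what the file contains depends on your Git version (empty before 2.25, fetched refs afterwards, per the commit above), so the sketch prints it rather than asserting anything about it.

```shell
#!/bin/sh
# Sketch: locate and print the .promisor files of a partial clone.
# Paths and names are hypothetical; the source repo is local.
set -e
work=$(mktemp -d)

git init -q "$work/src"
git -C "$work/src" config user.name demo
git -C "$work/src" config user.email demo@example.com
echo hello > "$work/src/file.txt"
git -C "$work/src" add .
git -C "$work/src" commit -q -m init
git -C "$work/src" config uploadpack.allowFilter true

git clone -q --filter=blob:none --no-checkout "file://$work/src" "$work/clone"

# Each promisor packfile pack-<hash>.pack has a sibling pack-<hash>.promisor.
promisor_count=0
for f in "$work/clone/.git/objects/pack/"*.promisor; do
    [ -f "$f" ] || continue          # glob did not match anything
    promisor_count=$((promisor_count + 1))
    echo "== $f"
    cat "$f"
done
echo "promisor files: $promisor_count"
```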
Before Git 2.29 (Q4 2020), fetching from a lazily cloned repository resulted at the server side in attempts to lazy fetch objects that the client side has, many of which will not be available from the third-party anyway.
See commit 77aa094 (16 Jul 2020) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 37f382a, 30 Jul 2020)
upload-pack: do not lazy-fetch "have" objects
Signed-off-by: Jonathan Tan
When upload-pack receives a request containing "have" hashes, it (among other things) checks if the served repository has the corresponding objects. However, it does not do so with the OBJECT_INFO_SKIP_FETCH_OBJECT flag, so if serving a partial clone, a lazy fetch will be triggered first.
This was discovered at $DAYJOB when a user fetched from a partial clone (into another partial clone - although this would also happen if the repo to be fetched into is not a partial clone).
Therefore, whenever "have" hashes are checked for existence, pass the OBJECT_INFO_SKIP_FETCH_OBJECT flag.
Also add the OBJECT_INFO_QUICK flag to improve performance, as it is typical that such objects do not exist in the serving repo, and the consequences of a false negative are minor (usually, a slightly larger pack sent).
With Git 2.29 (Q4 2020), the component that responds to "git fetch" requests is made more configurable, to selectively allow or reject the object filtering specifications used for partial cloning.
See commit 6cc275e (05 Aug 2020) by Jeff King (peff).
See commit 5b01a4e, commit 6dd3456 (03 Aug 2020), and commit b9ea214 (31 Jul 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 73a9255, 11 Aug 2020)
upload-pack.c: allow banning certain object filter(s)
Helped-by: Jeff King
Signed-off-by: Taylor Blau
Git clients may ask the server for a partial set of objects, where the set of objects being requested is refined by one or more object filters. Server administrators can configure 'git upload-pack' to allow or ban these filters by setting the 'uploadpack.allowFilter' variable to 'true' or 'false', respectively.
However, administrators using bitmaps may wish to allow certain kinds of object filters, but ban others. Specifically, they may wish to allow object filters that can be optimized by the use of bitmaps, while rejecting other object filters which aren't and represent a perceived performance degradation (as well as an increased load factor on the server).
Allow configuring 'git upload-pack' to support object filters on a case-by-case basis by introducing two new configuration variables:
- 'uploadpackfilter.allow'
- 'uploadpackfilter.<kind>.allow'
where '<kind>' may be one of 'blobNone', 'blobLimit', 'tree', and so on.
Setting the second configuration variable for any valid value of '<kind>' explicitly allows or disallows restricting that kind of object filter.
If a client requests the object filter <kind> and the respective configuration value is not set, 'git upload-pack' will default to the value of 'uploadpackfilter.allow', which itself defaults to 'true' to maintain backwards compatibility.
Note that this differs from 'uploadpack.allowfilter', which controls whether or not the 'filter' capability is advertised.
git config now includes in its man page:
uploadpackfilter.allow
Provides a default value for unspecified object filters (see: the below configuration variable).
Defaults to true.
uploadpackfilter.<filter>.allow
Explicitly allow or ban the object filter corresponding to <filter>, where <filter> may be one of: blob:none, blob:limit, tree, sparse:oid, or combine.
If using combined filters, both combine and all of the nested filter kinds must be allowed.
Defaults to uploadpackfilter.allow.
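On the server side, that could look like the sketch below: ban every filter by default, then allow only blob:none. The bare repository is a throwaway created just for the example, and filter_allowed is a hypothetical helper that merely mirrors the documented fallback order (specific setting, then uploadpackfilter.allow, then true); it is not how upload-pack itself is written.

```shell
#!/bin/sh
# Sketch: per-filter allow/ban settings on a (throwaway) serving repository.
set -e
srv=$(mktemp -d)
git init -q --bare "$srv"

git -C "$srv" config uploadpack.allowFilter true            # advertise "filter"
git -C "$srv" config uploadpackfilter.allow false           # default: ban
git -C "$srv" config uploadpackfilter.blob:none.allow true  # exception: allow

# Hypothetical helper mirroring the documented fallback order.
filter_allowed() {
    repo=$1
    kind=$2
    git -C "$repo" config --get "uploadpackfilter.$kind.allow" \
        || git -C "$repo" config --get uploadpackfilter.allow \
        || echo true
}

echo "blob:none -> $(filter_allowed "$srv" blob:none)"
echo "tree      -> $(filter_allowed "$srv" tree)"
```

With these settings, a client asking for --filter=tree:0 would be expected to be refused, while --filter=blob:none goes through.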
With Git 2.30 (Q1 2021), potential server-side resource deallocation issues when responding to a partial clone request are fixed.
See commit 8d133f5, commit aab179d (03 Dec 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 21127fa, 17 Dec 2020)
upload-pack.c: don't free allowed_filters util pointers
Signed-off-by: Taylor Blau
To keep track of which object filters are allowed or not, 'git upload-pack' stores the name of each filter in a string_list, and sets its ->util pointer to be either 0 or 1, indicating whether it is banned or allowed.
Later on, we attempt to clear that list, but we incorrectly ask for the util pointers to be free()'d, too. This behavior (introduced back in 6dd3456a8c ("[upload-pack.c](https://github.com/git/git/blob/8d133f500a5390a089988141cdec8154a732764d/upload-pack.c): allow banning certain object filter(s)", 2020-08-03, Git v2.29.0-rc0 -- merge listed in batch #6)) leads to an invalid free, and causes us to crash.
In order to trigger this, one needs to fetch from a server that
(a) has at least one object filter allowed, and
(b) issue a fetch that contains a subset of the allowed filters (i.e., we cannot ask for a banned filter, since this causes us to die() before we hit the bogus string_list_clear()).
In that case, whatever banned filters exist will cause a noop free() (since those ->util pointers are set to 0), but the first allowed filter we try to free will crash us.
We never noticed this in the tests because we didn't have an example of setting 'uploadPackFilter' configuration variables and then following up with a valid fetch. The first new 'git clone' prevents further regression here. For good measure on top, add a test which checks the same behavior at a tree depth greater than 0.
A recent "git clone" fix left a temporary directory behind when the transport layer returned a failure.
That has been corrected with Git 2.33 (Q3 2021).
See commit 6aacb7d (19 May 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit f4f7304, 14 Jun 2021)
clone: clean up directory after transport_fetch_refs() failure
Signed-off-by: Jeff King
git-clone started respecting errors from the transport subsystem in aab179d ("builtin/clone.c: don't ignore transport_fetch_refs() errors", 2020-12-03, Git v2.30.0-rc1 -- merge).
However, that commit didn't handle the cleanup of the filesystem quite right.
The cleanup of the directory that cmd_clone() creates is done by an atexit() handler, which we control with a flag.
It starts as JUNK_LEAVE_NONE ("clean up everything"), then progresses to JUNK_LEAVE_REPO when we know we have a valid repo but not working tree, and then finally JUNK_LEAVE_ALL when we have a successful checkout.
blob: does can be seen at: stackoverflow.com/questions/600079/…
--filter=combine: also shown there. – Elisabethelisabethville