If, as in Brent Bradburn's answer, you do a repack in a Git partial clone, make sure to:
git clone --filter=blob:none --no-checkout https://github.com/me/myRepo
cd myRepo
git sparse-checkout init
# Add the expected pattern, to include just a subfolder without top files:
git sparse-checkout set /mySubFolder/
# populate working-tree with only the right files:
git read-tree -mu HEAD
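To check what these steps produce without touching a real remote, here is a minimal sketch that replays them against a throwaway local repository. The repository layout (`top.txt`, `mySubFolder/file.txt`), the demo identity, and the function name `demo_partial_sparse_clone` are assumptions made up for this illustration; only the Git commands themselves come from the steps above.

```shell
#!/bin/sh
# Sketch: partial clone + sparse-checkout against a local throwaway repo.
# Everything under $work is hypothetical demo data.
demo_partial_sparse_clone() (
    set -eu
    work=$(mktemp -d)
    trap 'rm -rf "$work"' EXIT

    # Throwaway "server" repository with one top file and one subfolder.
    git init -q "$work/src"
    mkdir "$work/src/mySubFolder"
    echo top > "$work/src/top.txt"
    echo sub > "$work/src/mySubFolder/file.txt"
    git -C "$work/src" add .
    git -C "$work/src" -c user.name=demo -c user.email=demo@example.com \
        commit -q -m init
    # The serving side must allow filters for --filter to take effect.
    git -C "$work/src" config uploadpack.allowFilter true

    # Same steps as above, pointed at the local repository.
    git clone -q --filter=blob:none --no-checkout "file://$work/src" "$work/clone"
    cd "$work/clone"
    git sparse-checkout init
    git sparse-checkout set /mySubFolder/
    git read-tree -mu HEAD

    # A partial clone records its promisor remote and the filter used.
    echo "promisor=$(git config remote.origin.promisor)"
    echo "filter=$(git config remote.origin.partialclonefilter)"
    # The subfolder content should now be in the working tree.
    cat mySubFolder/file.txt
)

demo_partial_sparse_clone
```

The two `remote.origin.*` settings are what mark the clone as partial; the later repack discussion only applies to repositories where they are set.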
Regarding the local optimization in a partial clone, as in:
git clean --dry-run # consider and tweak results then switch to --force
git gc
git repack -Ad
git prune
use Git 2.32 (Q2 2021) or later: before 2.32, "git repack -A -d" (man) in a partial clone unnecessarily loosened objects in promisor packs; this has been fixed.
See commit a643157 (21 Apr 2021) by Rafael Silva (raffs).
(Merged by Junio C Hamano -- gitster -- in commit a0f521b, 10 May 2021)
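Since the fix depends on the Git version, a script that runs the cleanup commands can check the version first. The following is a minimal sketch, not part of Git: the helper names `version_at_least` and `git_version_number` are made up here, and the comparison assumes a `sort` that supports `-V` (GNU coreutils and recent BSDs).

```shell
#!/bin/sh
# version_at_least A B: succeeds when version A >= version B.
# Assumes `sort -V` (version sort) is available.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# git_version_number STRING: extract "2.32.0" from "git version 2.32.0".
# Takes the string as an argument so it can be exercised without Git.
git_version_number() {
    printf '%s\n' "$1" | awk '{ print $3 }'
}

# Guarded usage: only report when git is actually installed.
if command -v git >/dev/null 2>&1; then
    v=$(git_version_number "$(git --version)")
    if version_at_least "$v" 2.32; then
        echo "Git $v: repack -A -d leaves promisor objects packed"
    else
        echo "Git $v: consider upgrading before repacking a partial clone"
    fi
fi
```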
repack: avoid loosening promisor objects in partial clones
Reported-by: SZEDER Gábor
Helped-by: Jeff King
Helped-by: Jonathan Tan
Signed-off-by: Rafael Silva
When git repack -A -d (man) is run in a partial clone, pack-objects is invoked twice: once to repack all promisor objects, and once to repack all non-promisor objects.
The latter pack-objects invocation is with --exclude-promisor-objects and --unpack-unreachable, which loosens all objects unused during this invocation.
Unfortunately, this includes promisor objects.
Because the -d argument to git repack (man) subsequently deletes all loose objects also in packs, these just-loosened promisor objects will be immediately deleted.
However, this extra disk churn is unnecessary in the first place.
For example, in a newly-cloned partial repo that filters all blob objects (e.g. --filter=blob:none), repack ends up unpacking all trees and commits into the filesystem because every object, in this particular case, is a promisor object.
Depending on the repo size, this increases the disk usage considerably: In my copy of the linux.git, the object directory peaked 26GB of more disk usage.
In order to avoid this extra disk churn, pass the names of the promisor packfiles as --keep-pack arguments to the second invocation of pack-objects.
This informs pack-objects that the promisor objects are already in a safe packfile and, therefore, do not need to be loosened.
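To make the idea concrete: a promisor pack is marked by a `.promisor` file sitting next to its `.pack` file in `objects/pack`. The sketch below only illustrates how such pack names map to `--keep-pack` arguments; `git repack` does this internally, and the helper name `promisor_keep_pack_args` is an assumption for this example, not a Git command.

```shell
#!/bin/sh
# promisor_keep_pack_args PACK_DIR
# Print one "--keep-pack=<name>.pack" line per promisor pack found in
# PACK_DIR (a pack is a promisor pack when a matching .promisor file exists).
promisor_keep_pack_args() {
    pack_dir=$1
    for marker in "$pack_dir"/pack-*.promisor; do
        [ -e "$marker" ] || continue          # glob matched nothing
        base=$(basename "$marker" .promisor)  # e.g. pack-1234abcd
        printf '%s%s.pack\n' '--keep-pack=' "$base"
    done
}
```

Inside a partial clone, the pack directory can be obtained with `git rev-parse --git-path objects/pack` and passed to the function to see which packs the second pack-objects run would leave alone.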
For testing, we need to validate whether any object was loosened.
However, the "evidence" (loosened objects) is deleted during the process which prevents us from inspecting the object directory.
Instead, let's teach pack-objects to count loosened objects and emit via trace2 thus allowing inspecting the debug events after the process is finished.
This new event is used on the added regression test.
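The same trace2 event can be inspected by hand. This is a hedged sketch: the key name `loosen_unused_packed_objects/loosened` is what I recall the regression test grepping for, so treat it as an assumption and check your own trace output; the helper name `loosened_count` is made up for this example.

```shell
#!/bin/sh
# loosened_count TRACE_FILE
# Print the last loosened-object count recorded in a trace2 "perf" trace.
# Assumes entries of the form "loosen_unused_packed_objects/loosened:<n>".
loosened_count() {
    sed -n 's|.*loosen_unused_packed_objects/loosened:\([0-9][0-9]*\).*|\1|p' "$1" |
        tail -n 1
}

# Typical use inside a partial clone (GIT_TRACE2_PERF needs an absolute path):
#   GIT_TRACE2_PERF="$PWD/trace" git repack -A -d
#   loosened_count trace
```

With the fix in place, the expectation is a count of 0 in a freshly made `--filter=blob:none` clone, since every object there is a promisor object.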
Lastly, add a new perf test to evaluate the performance impact made by this changes (tested on git.git):
Test          HEAD^                 HEAD
----------------------------------------------------------
5600.3: gc    134.38(41.93+90.95)   7.80(6.72+1.35) -94.2%
For a bigger repository, such as linux.git, the improvement is even bigger:
Test          HEAD^                      HEAD
-------------------------------------------------------------------
5600.3: gc    6833.00(918.07+3162.74)    268.79(227.02+39.18) -96.1%
These improvements are particular big because every object in the newly-cloned partial repository is a promisor object.
As noted with Git 2.33 (Q3 2021), the git-repack (man) doc clearly states that it does operate on promisor packfiles (in a separate partition), with "-a" specified.
Presumably the statements here are outdated, as they feature from the first doc in 2017 (and the repack support was added in 2018)
See commit ace6d8e (02 Jun 2021) by Tao Klerks (TaoK).
(Merged by Junio C Hamano -- gitster -- in commit 4009809, 08 Jul 2021)
Signed-off-by: Tao Klerks
Reviewed-by: Taylor Blau
Acked-by: Jonathan Tan
See technical/partial-clone man page.
Plus, still with Git 2.33 (Q3 2021), "git read-tree" (man) had a codepath where blobs are fetched one-by-one from the promisor remote, which has been corrected to fetch in bulk.
See commit d3da223, commit b2896d2 (23 Jul 2021) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 8230107, 02 Aug 2021)
cache-tree: prefetch in partial clone read-tree
Signed-off-by: Jonathan Tan
"git read-tree" (man) checks the existence of the blobs referenced by the given tree, but does not bulk prefetch them.
Add a bulk prefetch.
The lack of prefetch here was noticed at $DAYJOB during a merge involving some specific commits, but I couldn't find a minimal merge that didn't also trigger the prefetch in check_updates() in unpack-trees.c (and in all these cases, the lack of prefetch in cache-tree.c didn't matter because all the relevant blobs would have already been prefetched by then).
This is why I used read-tree here to exercise this code path.
Git 2.39 (Q4 2022) avoids calling 'cache_tree_update()' when doing so would be redundant.
See commit 652bd02, commit dc5d40f, commit 0e47bca, commit 68fcd48, commit 94fcf0e (10 Nov 2022) by Victoria Dye (vdye).
(Merged by Taylor Blau -- ttaylorr -- in commit a92fce4, 18 Nov 2022)
read-tree: use 'skip_cache_tree_update' option
Signed-off-by: Victoria Dye
Signed-off-by: Taylor Blau
When running 'read-tree' with a single tree and no prefix, 'prime_cache_tree()' is called after the tree is unpacked.
In that situation, skip a redundant call to 'cache_tree_update()' in 'unpack_trees()' by enabling the 'skip_cache_tree_update' unpack option.
Removing the redundant cache tree update provides a substantial performance improvement to 'git read-tree' (man) <tree-ish>, as shown by a test added to 'p0006-read-tree-checkout.sh':
Test                              before            after
----------------------------------------------------------------------
read-tree `br_ballast_plus_1`     3.94(1.80+1.57)   3.00(1.14+1.28) -23.9%
Note that the 'read-tree' in 't1022-read-tree-partial-clone.sh' is updated to read two trees, rather than one.
The test was first introduced in d3da223 ("cache-tree: prefetch in partial clone read-tree", 2021-07-23, Git v2.33.0-rc0 -- merge) to exercise the 'cache_tree_update()' code path, as used in 'git merge' (man).
Since this patch drops the call to 'cache_tree_update()' in single-tree 'git read-tree', change the test to use the two-tree variant so that 'cache_tree_update()' is called as intended.