I mentioned before, in "Why do excluded files keep reappearing in my git sparse checkout?", that since Git 2.27 any skip-worktree file should no longer be modified, or even looked at, during a sparse checkout.
But with the new sparseIndex option in Git 2.32 (Q2 2021), that changes again:
Git 2.32 (Q2 2021) adds sparse-index.
And Git 2.39 (Q4 2022) documents it in Documentation/technical/sparse-checkout.txt, as explained below.
See "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee.
See commit 4589bca, commit 71f82d0, commit 5f11669 (12 Apr 2021), commit f5fed74, commit dc26b23, commit 0c18c05, commit 465a04a, commit f7ef64b, commit 3450a30, commit d425f65, commit 2508df0, commit a029120, commit e43e2a1, commit 299e2c4, commit 42f44e8, commit 46eb6e3, commit 2227ea1, commit 48b3c7d, commit cb8388d, commit 0f6d3ba, commit 1b850d3, commit 54beed2, commit 118a2e8, commit 95e0321, commit 847a9e5, commit 839a663 (01 Apr 2021), and commit c9e40ae, commit 9ad2d5e, commit 2de37c5, commit dcc5fd5, commit 122ba1f, commit 58300f4, commit 0938e6f, commit 13e1331, commit f442313, commit 6e77352, commit cd42415, commit 836e25c, commit 6863df3, commit 2782db3, commit e2df6c3, commit ecfc47c, commit 4300f84, commit 3964fc2, commit 4b3f765, commit 0b5fcb0, commit 0ad6090 (30 Mar 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 8e97852, 30 Apr 2021)
sparse-index: design doc and format update
Signed-off-by: Derrick Stolee
This begins a long effort to update the index format to allow sparse directory entries.
This should result in a significant improvement to Git commands when HEAD contains millions of files, but the user has selected many fewer files to keep in their sparse-checkout definition.
Currently, the index format is only updated in the presence of extensions.sparseIndex instead of increasing a file format version number.
This is temporary, and index v5 is part of the plan for future work in this area.
The design document details many of the reasons for embarking on this work, and also the plan for completing it safely.
technical/index-format now includes in its man page:
An index entry typically represents a file. However, if sparse-checkout is enabled in cone mode (core.sparseCheckoutCone is enabled) and the extensions.sparseIndex extension is enabled, then the index may contain entries for directories outside of the sparse-checkout definition. These entries have mode 040000, include the SKIP_WORKTREE bit, and the path ends in a directory separator.
technical/sparse-index now includes in its man page:
Git Sparse-Index Design Document
The sparse-checkout feature allows users to focus a working directory on a subset of the files at HEAD. The cone mode patterns, enabled by core.sparseCheckoutCone, allow for very fast pattern matching to discover which files at HEAD belong in the sparse-checkout cone.
Three important scale dimensions for a Git working directory are:
- HEAD: How many files are present at HEAD?
- Populated: How many files are within the sparse-checkout cone.
- Modified: How many files has the user modified in the working directory?
We will use big-O notation -- O(X) -- to denote how expensive certain operations are in terms of these dimensions.
These dimensions are ordered by their magnitude: users (typically) modify fewer files than are populated, and we can only populate files at HEAD.
Problems occur if there is an extreme imbalance in these dimensions. For example, if HEAD contains millions of paths but the populated set has only tens of thousands, then commands like git status and git add can be dominated by operations that require O(HEAD) operations instead of O(Populated). Primarily, the cost is in parsing and rewriting the index, which is filled primarily with files at HEAD that are marked with the SKIP_WORKTREE bit.
The sparse-index intends to take these commands that read and modify the index from O(HEAD) to O(Populated).
To do this, we need to modify the index format in a significant way: add "sparse directory" entries.
With cone mode patterns, it is possible to detect when an entire directory will have its contents outside of the sparse-checkout definition. Instead of listing all of the files it contains as individual entries, a sparse-index contains an entry with the directory name, referencing the object ID of the tree at HEAD and marked with the SKIP_WORKTREE bit.
If we need to discover the details for paths within that directory, we can parse trees to find that list.
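To make the idea of a "sparse directory" entry concrete, here is a toy Python model (not Git's implementation) that collapses a full list of paths at HEAD into a sparse-index-like list for a given cone. The function name `collapse_to_sparse` and the sample paths are made up for illustration. A real sparse directory entry also records mode 040000, the tree's object ID and the SKIP_WORKTREE bit, which this sketch leaves out.

```python
def collapse_to_sparse(head_paths, cone_dirs):
    """Toy model: turn a full index (one entry per file at HEAD) into a
    sparse one by replacing every directory that lies entirely outside the
    cone with a single entry ending in '/'."""
    cone = [d.strip("/") + "/" for d in cone_dirs]
    entries, seen = [], set()
    for path in sorted(head_paths):
        parts = path.split("/")
        entry = path  # default: keep the file as its own entry
        # Walk the parent directories from the root downwards.
        for depth in range(1, len(parts)):
            prefix = "/".join(parts[:depth]) + "/"
            inside = any(prefix.startswith(c) for c in cone)
            ancestor = any(c.startswith(prefix) for c in cone)
            if not inside and not ancestor:
                # Nothing below this directory is populated: collapse it.
                entry = prefix
                break
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


# Hypothetical repository layout, with a cone containing only deep/deeper1.
head = [
    "README",
    "deep/a",
    "deep/deeper1/a",
    "deep/deeper1/deepest/a",
    "deep/deeper2/a",
    "deep/deeper2/b",
    "folder1/a",
    "folder1/sub/b",
    "folder2/a",
]
sparse = collapse_to_sparse(head, ["deep/deeper1"])
print(sparse)
```

In this model, the nine file entries shrink to seven: the files in the cone and in its parent directories stay as they are, while deep/deeper2/, folder1/ and folder2/ each become a single directory entry. The saving grows with the number of files hidden behind each collapsed directory, which is the O(HEAD) to O(Populated) shift the design document describes.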
So you have a new option to git sparse-checkout init: --[no-]sparse-index
sparse-checkout: toggle sparse index from builtin
Signed-off-by: Derrick Stolee
The sparse index extension is used to signal that index writes should be in sparse mode.
This was only updated using GIT_TEST_SPARSE_INDEX=1.
Add a '--[no-]sparse-index' option to 'git sparse-checkout init'(man) that specifies if the sparse index should be used.
It also updates the index to use the correct format, either way.
Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools.
'git sparse-checkout init' already sets extension.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools.
git sparse-checkout now includes in its man page:
Use the --[no-]sparse-index option to toggle the use of the sparse index format. This reduces the size of the index to be more closely aligned with your sparse-checkout definition. This can have significant performance advantages for commands such as git status or git add.
This feature is still experimental. Some commands might be slower with a sparse index until they are properly integrated with the feature.
WARNING: Using a sparse index requires modifying the index in a way that is not completely understood by external tools. If you have trouble with this compatibility, then run git sparse-checkout init --no-sparse-index to rewrite your index to not be sparse. Older versions of Git will not understand the sparse directory entries index extension and may fail to interact with your repository until it is disabled.
With Git 2.33 (Q3 2021), "git status"(man) codepath learned to work with sparsely populated index without hydrating it fully.
See commit e5ca291, commit f8fe49e, commit fe0d576, commit d76723e, commit bf48e5a, commit 9eb00af, commit 69bdbdb, commit 523506d, commit bd6a3fd, commit cd807a5, commit 17a1bb5, commit bf26c06, commit e669ffb, commit 3d814b5, commit 4741077, commit fc6609d (14 Jul 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b271a30, 28 Jul 2021)
status: skip sparse-checkout percentage with sparse-index
Reviewed-by: Elijah Newren
Signed-off-by: Derrick Stolee
'git status'(man) began reporting a percentage of populated paths when sparse-checkout is enabled in 051df3c ("wt-status: show sparse checkout status as well", 2020-07-18, Git v2.28.0-rc0 -- merge listed in batch #7).
This percentage is incorrect when the index has sparse directories.
It would also be expensive to calculate as we would need to parse trees to count the total number of possible paths.
Avoid the expensive computation by simplifying the output to only report that a sparse checkout exists, without the percentage.
This change is the reason we use 'git status --porcelain=v2' in t1092-sparse-checkout-compatibility.sh.
We don't want to ensure that this message is equal across both modes, but instead just the important information about staged, modified, and untracked files are compared.
Warning: Recent sparse-index work broke safety against attempts to add paths with trailing slashes to the index, which has been corrected with Git 2.34 (Q4 2021).
See commit c8ad9d0, commit 2a1ae64, commit fc5e90b (07 Oct 2021) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit a86ed75, 18 Oct 2021)
read-cache: let verify_path() reject trailing dir separators again
Signed-off-by: René Scharfe
6e77352 ("sparse-index: convert from full to sparse", 2021-03-30, Git v2.32.0-rc0 -- merge listed in batch #13) made verify_path() accept trailing directory separators for directories, which is necessary for sparse directory entries.
This clemency causes "git stash"(man) to stumble over sub-repositories, though, and there may be more unintended side-effects.
Avoid them by restoring the old verify_path() behavior and accepting trailing directory separators only in places that are supposed to handle sparse directory entries.
With Git 2.35 (Q1 2022), ensure that the sparseness of the in-core index matches the index.sparse configuration specified by the repository immediately after the on-disk index file is read.
See commit 7ca4fc8, commit b93fea0, commit 13f69f3, commit 336d82e (23 Nov 2021) by Victoria Dye (vdye).
(Merged by Junio C Hamano -- gitster -- in commit 5396d7b, 10 Dec 2021)
sparse-index: update do_read_index to ensure correct sparsity
Helped-by: Junio C Hamano
Co-authored-by: Derrick Stolee
Signed-off-by: Victoria Dye
Reviewed-by: Elijah Newren
Unless command_requires_full_index forces index expansion, ensure in-core index sparsity matches config settings on read by calling ensure_correct_sparsity.
This makes the behavior of the in-core index more consistent between different methods of updating sparsity: manually changing the index.sparse config setting vs. executing git sparse-checkout --[no-]sparse-index init(man)
Although index sparsity is normally updated with git sparse-checkout init, ensuring correct sparsity after a manual index.sparse change has some practical benefits:
- It allows for command-by-command sparsity toggling with -c index.sparse=<true|false>, e.g. when troubleshooting issues with the sparse index.
- It prevents users from experiencing abnormal slowness after setting index.sparse to true due to use of a full index in all commands until the on-disk index is updated.
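The read-time decision described in that commit message can be summarized with a small Python sketch. It only models the logic as stated above: `sparsity_after_read` and its parameters are hypothetical names, and the extra cone-mode condition is an assumption based on the index-format documentation quoted earlier, not a transcription of Git's C code.

```python
def sparsity_after_read(command_requires_full_index, index_sparse, cone_mode_enabled):
    """Return 'full' or 'sparse': the shape of the in-core index once the
    on-disk index has been read, however it was stored on disk."""
    if command_requires_full_index:
        # Commands not yet integrated with the sparse index always expand it.
        return "full"
    if index_sparse and cone_mode_enabled:
        # The config asks for a sparse index and cone mode makes it possible.
        return "sparse"
    # index.sparse is false (or cannot be honored): make sure the index is full.
    return "full"


# Models `git -c index.sparse=false status` in a repository whose stored
# config has index.sparse=true: this one run sees a full in-core index.
print(sparsity_after_read(False, index_sparse=False, cone_mode_enabled=True))

# Models the same command without the override.
print(sparsity_after_read(False, index_sparse=True, cone_mode_enabled=True))
```

The point of the model is that the on-disk format is not an input: under this reading of the commit, the in-core index follows the configuration in effect for the current command.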
Warning: before Git 2.35 (Q1 2022), the sparse-index/sparse-checkout feature had a bug in its use of the matching code to determine which path is in or outside the sparse checkout patterns.
See commit 8c5de0d, commit 1b38efc (06 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit e1d9288, 15 Dec 2021)
unpack-trees: use traverse_path instead of name
Reported-by: Gustave Granroth
Reported-by: Mike Marcelais
Signed-off-by: Derrick Stolee
The sparse_dir_matches_path() method compares a cache entry that is a sparse directory entry against a 'struct traverse_info *info' and a 'struct name_entry *p' to see if the cache entry has exactly the right name for those other inputs.
This method was introduced in 523506d ("unpack-trees: unpack sparse directory entries", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7), but included a significant mistake.
The path comparisons used 'info->name' instead of 'info->traverse_path'.
Since 'info->name' only stores a single tree entry name while 'info->traverse_path' stores the full path from root, this method does not work when 'info' is in a subdirectory of a directory.
Replacing the right strings and their corresponding lengths make the method work properly.
The previous change included a failing test that exposes this issue.
That test now passes.
The critical detail is that as we go deep into unpack_trees(), the logic for merging a sparse directory entry with a tree entry during 'git checkout'(man) relies on this sparse_dir_matches_path() in order to avoid calling traverse_trees_recursive() during unpack_callback() in this hunk:
    if (!is_sparse_directory_entry(src[0], names, info) &&
        traverse_trees_recursive(n, dirmask, mask & ~dirmask,
                                 names, info) < 0) {
            return -1;
    }
For deep paths, the short-circuit never occurred and traverse_trees_recursive() was being called incorrectly and that was causing other strange issues.
Specifically, the error message from the now-passing test previously included this:
    error: Your local changes to the following files would be overwritten by checkout:
            deep/deeper1/deepest2/a
            deep/deeper1/deepest3/a
    Please commit your changes or stash them before you switch branches.
    Aborting
These messages occurred because the 'current' cache entry in twoway_merge() was showing as NULL because the index did not contain entries for the paths contained within the sparse directory entries.
We instead had 'oldtree' given as the entry at HEAD and 'newtree' as the entry in the target tree.
This led to reject_merge() listing these paths.
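The 'info->name' versus 'info->traverse_path' mix-up is easy to reproduce in a toy Python model. This is only an illustration of the comparison described in the commit message, not Git's C code: the `TraverseInfo` class, both function names, and the choice to store `traverse_path` with a trailing slash are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class TraverseInfo:
    name: str           # only the last path component, e.g. "deeper1"
    traverse_path: str  # full path from the root, e.g. "deep/deeper1/"


def matches_using_name(sparse_dir, info, entry):
    """Buggy variant: builds the expected path from a single component."""
    prefix = info.name + "/" if info.name else ""
    return sparse_dir == prefix + entry + "/"


def matches_using_traverse_path(sparse_dir, info, entry):
    """Fixed variant: builds the expected path from the full traversal path."""
    return sparse_dir == info.traverse_path + entry + "/"


# One level below the root, both variants agree.
shallow = TraverseInfo(name="deep", traverse_path="deep/")
print(matches_using_name("deep/deeper2/", shallow, "deeper2"))           # True
print(matches_using_traverse_path("deep/deeper2/", shallow, "deeper2"))  # True

# Two levels down, only the traverse_path variant recognizes the entry.
nested = TraverseInfo(name="deeper1", traverse_path="deep/deeper1/")
print(matches_using_name("deep/deeper1/deepest2/", nested, "deepest2"))           # False
print(matches_using_traverse_path("deep/deeper1/deepest2/", nested, "deepest2"))  # True
```

In this model, the failed match in the nested case corresponds to the short-circuit never occurring for deep paths such as deep/deeper1/deepest2/, the same paths listed in the error message above.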
With Git 2.35 (Q1 2022), teach diff and blame to work well with sparse index.
See commit add4c86, commit 51ba65b, commit 338e2a9, commit 44c7e62, commit 27a443b, commit 0803f9c, commit e5b17bd (06 Dec 2021) by Lessley Dennington (ldennington).
See commit ea6ae41 (29 Nov 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 8d2c373, 21 Dec 2021)
blame: enable and test the sparse index
Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren
Enable the sparse index for the 'git blame'(man) command.
The index was already not expanded with this command, so the most interesting thing to do is to add tests that verify that 'git blame' behaves correctly when the sparse index is enabled and that its performance improves.
More specifically, these cases are:
1. The index is not expanded for 'blame' when given paths in the sparse checkout cone at multiple levels.
2. Performance measurably improves for 'blame' with sparse index when given paths in the sparse checkout cone at multiple levels.
We do not include paths outside the sparse checkout cone because blame does not support blaming files that are not present in the working directory.
This is true in both sparse and full checkouts.
And:
diff: enable and test the sparse index
Co-authored-by: Derrick Stolee
Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren
Enable the sparse index within the 'git diff'(man) command.
Its implementation already safely integrates with the sparse index because it shares code with the 'git status'(man) and 'git checkout'(man) commands that were already integrated.
For more details see:
- d76723e ("status: use sparse-index throughout", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7)
- 1ba5f45 ("checkout: stop expanding sparse indexes", 2021-06-29, Git v2.33.0-rc1 -- merge)
The most interesting thing to do is to add tests that verify that 'git diff' behaves correctly when the sparse index is enabled.
These cases are:
1. The index is not expanded for 'diff' and 'diff --staged'
2. 'diff' and 'diff --staged' behave the same in full checkout, sparse checkout, and sparse index repositories in the following partially-staged scenarios (i.e. the index, HEAD, and working directory differ at a given path):
   - Path is within sparse-checkout cone
   - Path is outside sparse-checkout cone
   - A merge conflict exists for paths outside sparse-checkout cone