How to use git sparse-checkout in 2.27+

I was trying to reproduce the few tutorial steps from:

https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout

That tutorial was written for Git 2.25, but now with 2.27, nothing happens at all when running:

$ git sparse-checkout set client/android

I can't find a way to make it work.

Here is a MWE:

$ git clone --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...

$ cd sparse-checkout-example/

$ git sparse-checkout init --cone

Using git 2.25, I obtain a non-empty directory:

$ ls -a
.  .. bootstrap.sh LICENSE.md  README.md .git

Using git 2.27, I obtain an empty directory:

$ ls -a
.  .. .git
Canthus answered 17/6, 2020 at 7:45 Comment(6)
What do you expect to see that differs from what you actually see? - Raucous
Is it clearer now? Running the 3 given commands gives totally different results, that's all. - Canthus
Much better, thanks! Unfortunately I can't tell why you're seeing this behavior, but hopefully someone else will know. - Raucous
Is there a place to report git issues? - Canthus
Let me refer you to git-scm.com/community - Raucous
See also https://mcmap.net/q/11245/-why-do-excluded-files-keep-reappearing-in-my-git-sparse-checkout - Anywise

I believe I found the reason for this. Commit f56f31af0301 to Git changed the implementation of sparse-checkout so that, when you have an uninitialized working tree (as you would right after running git clone --no-checkout), running git sparse-checkout init will not check out any files into your working tree. In previous versions, the command would actually check out files, which could have unexpected effects given that you wouldn't have an active branch at that point.

The relevant commit, f56f31af0301, was included in Git 2.27, but not in 2.25. That accounts for why the behavior you see is not the behavior shown on the web page you're trying to follow. Basically, the behavior on the web page was a bug that nobody realized was a bug at the time, but with Git 2.27, it has been fixed.

This is explained very well, I think, in the message for commit b5bfc08a972d:

So...that brings us to the special case: a git clone performed with --no-checkout. As per the meaning of the flag, --no-checkout does not check out any branch, with the implication that you aren't on one and need to switch to one after the clone. Implementationally, HEAD is still set (so in some sense you are partially on a branch), but

  • the index is "unborn" (non-existent)
  • there are no files in the working tree (other than .git/)
  • the next time git switch (or git checkout) is run it will run unpack_trees with initial_checkout flag set to true.

It is not until you run, e.g. git switch <somebranch> that the index will be written and files in the working tree populated.

With this special --no-checkout case, the traditional read-tree -mu HEAD behavior would have done the equivalent of acting like checkout -- switch to the default branch (HEAD), write out an index that matches HEAD, and update the working tree to match. This special case slipped through the avoid-making-changes checks in the original sparse-checkout command and thus continued there.

After update_sparsity() was introduced and used (see commit f56f31a ("sparse-checkout: use new update_sparsity() function", 2020-03-27)), the behavior for the --no-checkout case changed: Due to git's auto-vivification of an empty in-memory index (see do_read_index() and note that must_exist is false), and due to sparse-checkout's update_working_directory() code to always write out the index after it was done, we got a new bug. That made it so that sparse-checkout would switch the repository from a clone with an "unborn" index (i.e. still needing an initial_checkout), to one that had a recorded index with no entries. Thus, instead of all the files appearing deleted in git status being known to git as a special artifact of not yet being on a branch, our recording of an empty index made it suddenly look to git as though it was definitely on a branch with ALL files staged for deletion! A subsequent checkout or switch then had to contend with the fact that it wasn't on an initial_checkout but had a bunch of staged deletions.
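A minimal sketch of what this means in practice, assuming a Git release that contains commit b5bfc08a972d: after git clone --no-checkout, the sparse-checkout commands only record the patterns, and the working tree is populated once a branch is checked out. The throwaway local repository, its file names, and the demo identity are made up for illustration, so the example does not need the network.

```shell
set -eu
work=$(mktemp -d)

# Build a tiny throwaway "remote" (hypothetical layout, not the tutorial repo).
git init -q "$work/origin"
cd "$work/origin"
mkdir -p client/android service
echo root > README.md
echo app > client/android/app.txt
echo svc > service/svc.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Clone without checking anything out: HEAD is set, but the index is unborn.
git clone -q --no-checkout "$work/origin" "$work/sparse"
cd "$work/sparse"
branch=$(git symbolic-ref --short HEAD)

# Record the sparsity patterns. With the fix described above, this does not
# write any files into the working tree yet.
git sparse-checkout init --cone
git sparse-checkout set client/android

# The initial checkout populates the working tree, honoring the patterns.
git checkout -q "$branch"
ls -A
```

With cone mode, the listing should contain the root files and the client directory, but not service.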

Raucous answered 19/6, 2020 at 23:51 Comment(0)

Here is a solution that will populate only files in the root folder:

$ git clone --filter=blob:none --sparse https://github.com/derrickstolee/sparse-checkout-example

Then subsequent sparse-checkout calls work like a charm.
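As a sketch of those subsequent calls: the clone below starts with only the root files, and a later sparse-checkout set adds one directory. It uses a throwaway local repository with made-up file names rather than the tutorial repository, and it leaves out --filter=blob:none so that it does not depend on partial-clone support in the source repository. Cone mode is requested explicitly because the default mode has differed between Git releases.

```shell
set -eu
work=$(mktemp -d)

# Throwaway source repository (hypothetical layout).
git init -q "$work/origin"
cd "$work/origin"
mkdir -p client/android service
echo root > README.md
echo app > client/android/app.txt
echo svc > service/svc.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# --sparse checks out only the files in the top-level directory.
git clone -q --sparse "$work/origin" "$work/sparse"
cd "$work/sparse"
before=$(ls -A)

# Ask for cone mode explicitly, then widen the checkout to one directory.
git sparse-checkout init --cone
git sparse-checkout set client/android
after=$(ls -A)

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
```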

Still no idea why the tutorial is broken.

Canthus answered 20/6, 2020 at 0:1 Comment(0)

With Git 2.35 (Q1 2022), the "init" and "set" subcommands in "git sparse-checkout" have been unified for a better user experience and performance.

See commit dfac9b6 (23 Dec 2021), and commit d359541, commit d30e2bb, commit ba2f3f5, commit 4e25673, commit f2e3a21, commit be61fd1, commit f85751a, commit 45c5e47, commit 0b624e0, commit 1530ff3 (14 Dec 2021) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit 2dc94da, 03 Jan 2022)

sparse-checkout: enable set to initialize sparse-checkout mode

Reviewed-by: Derrick Stolee
Reviewed-by: Victoria Dye
Signed-off-by: Elijah Newren

The previously suggested workflow:

  git sparse-checkout init ...
  git sparse-checkout set ...

Suffered from three problems:

  1. It would delete nearly all files in the first step, then restore them in the second.
    That was poor performance and forced unnecessary rebuilds.
  2. The two-step process resulted in two progress bars, which was suboptimal from a UI point of view for wrappers that invoked both of these commands but only exposed a single command to their end users.
  3. With cone mode, the first step would delete nearly all ignored files everywhere, because everything was considered to be outside of the specified sparsity paths.
    (The user was not allowed to specify any sparsity paths in the init step.)

Avoid these problems by teaching set to understand the extra parameters that init takes and performing any necessary initialization if not already in a sparse checkout.
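A rough sketch of the difference for a script that wraps these commands: on Git 2.35 or newer a single set can enable sparse-checkout and pick the directories, while older versions still need the two-step form. The version parsing, the throwaway repository, and its file names are assumptions made for illustration.

```shell
set -eu
work=$(mktemp -d)

# Throwaway source repository (hypothetical layout).
git init -q "$work/origin"
cd "$work/origin"
mkdir -p client/android service
echo root > README.md
echo app > client/android/app.txt
echo svc > service/svc.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

# Start from a full checkout so the sparsity change has files to remove.
git clone -q "$work/origin" "$work/full"
cd "$work/full"

# Parse "git version X.Y.Z ..." into major and minor numbers.
ver=$(git version | awk '{print $3}')
major=${ver%%.*}
rest=${ver#*.}
minor=${rest%%.*}

if [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 35 ]; }; then
    # Git 2.35+: "set" takes the options "init" takes and initializes as needed.
    git sparse-checkout set --cone client/android
else
    # Older Git: initialize first, then choose the directories.
    git sparse-checkout init --cone
    git sparse-checkout set client/android
fi
ls -A
```

Either branch should leave the same working tree; the one-step form just avoids the intermediate state in which nearly everything is deleted.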


Those commands are documented in more detail with Git 2.39 (Q4 2022):

See commit 20d87d3 (06 Nov 2022) by Elijah Newren (newren).
(Merged by Taylor Blau -- ttaylorr -- in commit e87a229, 18 Nov 2022)

sparse-checkout.txt: new document with sparse-checkout directions

Signed-off-by: Elijah Newren
Signed-off-by: Taylor Blau

Once upon a time, Matheus wrote some patches to make git grep [--cached | ] ... restrict its output to the sparsity specification when working in a sparse checkout (thread, see his second link in that email in particular).
That effort got derailed by two things:

  1. The --sparse-index work just beginning which we wanted to avoid creating conflicts for
  2. Never deciding on flag and config names and planned high level behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus' patches that only affected --cached, using a different flag name, but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released into production even if some small edges remain) for quite some time now.
We have also had several discussions on flag and config names, though we never came to solid conclusions.
Stolee once upon a time suggested putting all these into some document in Documentation/technical (Scroll to the very end for the final few paragraphs), which Victoria recently also requested.
I'm behind the times, but here's a patch attempting to finally do that.

technical/sparse-checkout now includes in its man page:

Table of contents:

  • Terminology
  • Purpose of sparse-checkouts
  • Usecases of primary concern
  • Oversimplified mental models ("Cliff Notes" for this document!)
  • Desired behavior
  • Behavior classes
  • Subcommand-dependent defaults
  • Sparse specification vs. sparsity patterns
  • Implementation Questions
  • Implementation Goals/Plans
  • Known bugs
  • Reference Emails

Before Git 2.44 (Q1 2024), "git sparse-checkout (add|set) --[no-]cone --end-of-options" did not handle "--end-of-options" correctly after a recent update; Git 2.44 fixes this.

See commit f8ab66f (26 Dec 2023) by Elijah Newren (newren).
See commit 2e13ed4 (20 Dec 2023) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit d73db00, 08 Jan 2024)

sparse-checkout: be consistent with end of options markers

Signed-off-by: Elijah Newren

9385174 ("parse-options: decouple "--end-of-options" and "--"", 2023-12-06, Git v2.44.0 -- merge listed in batch #3) updated the world order to make callers of parse-options that set PARSE_OPT_KEEP_UNKNOWN_OPT responsible for deciding what to do with "--end-of-options" they may see after parse_options() returns.

This made a previous bug in sparse-checkout more visible; namely, that

git sparse-checkout [add|set] --[no-]cone --end-of-options ...

would simply treat "--end-of-options" as one of the paths to include in the sparse-checkout.
But this was already problematic before; namely,

git sparse-checkout [add|set] --[no-]cone --sikp-checks ...

would not give an error on the mis-typed "--skip-checks" but instead simply treat "--sikp-checks" as a path or pattern to include in the sparse-checkout, which is highly unfriendly.

This behavior began when the command was converted to parse-options in 7bffca9 ("sparse-checkout: add '--stdin' option to set subcommand", 2019-11-21, Git v2.25.0-rc0 -- merge).
Back then it was just called KEEP_UNKNOWN.
Later it was renamed to KEEP_UNKNOWN_OPT in 99d86d6 ("parse-options: PARSE_OPT_KEEP_UNKNOWN only applies to --options", 2022-08-19, Git v2.38.0-rc0 -- merge listed in batch #17) to clarify that it was only about dashed options; we always keep non-option arguments.
Looking at that original patch, both Peff and I think that the author was simply confused about the mis-named option, and really just wanted to keep the non-option arguments.
We never should have used the flag all along (and the other cases were cargo-culted within the file).

Remove the erroneous PARSE_OPT_KEEP_UNKNOWN_OPT flag now to fix this bug.
Note that this does mean that anyone who might have been using

git sparse-checkout [add|set] [--[no-]cone] --foo --bar

to request paths or patterns '--foo' and '--bar' will now have to use

git sparse-checkout [add|set] [--[no-]cone] -- --foo --bar

That makes sparse-checkout more consistent with other git commands, provides users much friendlier error messages and behavior, and is consistent with the all-caps warning in git-sparse-checkout.txt that this command "is experimental...its behavior...will likely change".
:-)
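To illustrate that last point, here is a small sketch with a directory whose name begins with dashes. The name --foo is taken from the commit message above; the throwaway repository and its other file names are invented for the example. Passing -- before the paths marks everything after it as a path, which is the form the commit message says is now required.

```shell
set -eu
work=$(mktemp -d)

# Throwaway source repository with a directory literally named "--foo".
git init -q "$work/origin"
cd "$work/origin"
mkdir -p ./--foo service
echo root > README.md
echo dashed > ./--foo/file.txt
echo svc > service/svc.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

git clone -q --no-checkout "$work/origin" "$work/sparse"
cd "$work/sparse"
branch=$(git symbolic-ref --short HEAD)

git sparse-checkout init --cone
# "--" ends option parsing, so "--foo" is taken as a directory, not an option.
git sparse-checkout set -- --foo
git checkout -q "$branch"
git sparse-checkout list
```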

Anywise answered 6/1, 2022 at 0:39 Comment(1)
Thanks, it is not entirely clear to me what a new minimal workflow would be now. But since all this is clearly a work in progress, I'll wait for it to stabilize. Last time I tested it, the main drawback was that you can "uncheckout" data from your working directory, but not from your local copy (inside the .git), so it is useless for data versioning purposes, as your local copy will always end up in a non-sparse state. - Canthus

I mentioned before in "Why do excluded files keep reappearing in my git sparse checkout?" how, with Git 2.27+, any skip-worktree file should no longer be modified or even looked at during a sparse checkout.

But with the new sparse-index feature in Git 2.32 (Q2 2021), that changes again:

Git 2.32 (Q2 2021) adds sparse-index.

And Git 2.39 (Q4 2022) documents it in Documentation/technical/sparse-checkout.txt, as explained below.

See "Make your monorepo feel small with Git’s sparse index" from Derrick Stolee.

sparse-index

See commit 4589bca, commit 71f82d0, commit 5f11669 (12 Apr 2021), commit f5fed74, commit dc26b23, commit 0c18c05, commit 465a04a, commit f7ef64b, commit 3450a30, commit d425f65, commit 2508df0, commit a029120, commit e43e2a1, commit 299e2c4, commit 42f44e8, commit 46eb6e3, commit 2227ea1, commit 48b3c7d, commit cb8388d, commit 0f6d3ba, commit 1b850d3, commit 54beed2, commit 118a2e8, commit 95e0321, commit 847a9e5, commit 839a663 (01 Apr 2021), and commit c9e40ae, commit 9ad2d5e, commit 2de37c5, commit dcc5fd5, commit 122ba1f, commit 58300f4, commit 0938e6f, commit 13e1331, commit f442313, commit 6e77352, commit cd42415, commit 836e25c, commit 6863df3, commit 2782db3, commit e2df6c3, commit ecfc47c, commit 4300f84, commit 3964fc2, commit 4b3f765, commit 0b5fcb0, commit 0ad6090 (30 Mar 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 8e97852, 30 Apr 2021)

sparse-index: design doc and format update

Signed-off-by: Derrick Stolee

This begins a long effort to update the index format to allow sparse directory entries.
This should result in a significant improvement to Git commands when HEAD contains millions of files, but the user has selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of extensions.sparseIndex instead of increasing a file format version number.
This is temporary, and index v5 is part of the plan for future work in this area.

The design document details many of the reasons for embarking on this work, and also the plan for completing it safely.

technical/index-format now includes in its man page:

An index entry typically represents a file. However, if sparse-checkout is enabled in cone mode (core.sparseCheckoutCone is enabled) and the extensions.sparseIndex extension is enabled, then the index may contain entries for directories outside of the sparse-checkout definition. These entries have mode 040000, include the SKIP_WORKTREE bit, and the path ends in a directory separator.

technical/sparse-index now includes in its man page:

Git Sparse-Index Design Document

The sparse-checkout feature allows users to focus a working directory on a subset of the files at HEAD. The cone mode patterns, enabled by core.sparseCheckoutCone, allow for very fast pattern matching to discover which files at HEAD belong in the sparse-checkout cone.

Three important scale dimensions for a Git working directory are:

  • HEAD: How many files are present at HEAD?

  • Populated: How many files are within the sparse-checkout cone.

  • Modified: How many files has the user modified in the working directory?

We will use big-O notation -- O(X) -- to denote how expensive certain operations are in terms of these dimensions.

These dimensions are ordered by their magnitude: users (typically) modify fewer files than are populated, and we can only populate files at HEAD.

Problems occur if there is an extreme imbalance in these dimensions. For example, if HEAD contains millions of paths but the populated set has only tens of thousands, then commands like git status and git add can be dominated by operations that require O(HEAD) operations instead of O(Populated). Primarily, the cost is in parsing and rewriting the index, which is filled primarily with files at HEAD that are marked with the SKIP_WORKTREE bit.

The sparse-index intends to take these commands that read and modify the index from O(HEAD) to O(Populated).

To do this, we need to modify the index format in a significant way: add "sparse directory" entries.

With cone mode patterns, it is possible to detect when an entire directory will have its contents outside of the sparse-checkout definition. Instead of listing all of the files it contains as individual entries, a sparse-index contains an entry with the directory name, referencing the object ID of the tree at HEAD and marked with the SKIP_WORKTREE bit. If we need to discover the details for paths within that directory, we can parse trees to find that list.

So you have a new option for git sparse-checkout init: --[no-]sparse-index

sparse-checkout: toggle sparse index from builtin

Signed-off-by: Derrick Stolee

The sparse index extension is used to signal that index writes should be in sparse mode.
This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that specifies if the sparse index should be used.
It also updates the index to use the correct format, either way.
Add a warning in the documentation that the use of a repository extension might reduce compatibility with third-party tools.
'git sparse-checkout init' already sets extensions.worktreeConfig, which places most sparse-checkout users outside of the scope of most third-party tools.

git sparse-checkout now includes in its man page:

Use the --[no-]sparse-index option to toggle the use of the sparse index format.

This reduces the size of the index to be more closely aligned with your sparse-checkout definition.

This can have significant performance advantages for commands such as git status or git add. This feature is still experimental. Some commands might be slower with a sparse index until they are properly integrated with the feature.

WARNING: Using a sparse index requires modifying the index in a way that is not completely understood by external tools. If you have trouble with this compatibility, then run git sparse-checkout init --no-sparse-index to rewrite your index to not be sparse.

Older versions of Git will not understand the sparse directory entries index extension and may fail to interact with your repository until it is disabled.


With Git 2.33 (Q3 2021), the "git status" codepath learned to work with a sparsely populated index without hydrating it fully.

See commit e5ca291, commit f8fe49e, commit fe0d576, commit d76723e, commit bf48e5a, commit 9eb00af, commit 69bdbdb, commit 523506d, commit bd6a3fd, commit cd807a5, commit 17a1bb5, commit bf26c06, commit e669ffb, commit 3d814b5, commit 4741077, commit fc6609d (14 Jul 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b271a30, 28 Jul 2021)

status: skip sparse-checkout percentage with sparse-index

Reviewed-by: Elijah Newren
Signed-off-by: Derrick Stolee

'git status' began reporting a percentage of populated paths when sparse-checkout is enabled in 051df3c ("wt-status: show sparse checkout status as well", 2020-07-18, Git v2.28.0-rc0 -- merge listed in batch #7).
This percentage is incorrect when the index has sparse directories.
It would also be expensive to calculate as we would need to parse trees to count the total number of possible paths.

Avoid the expensive computation by simplifying the output to only report that a sparse checkout exists, without the percentage.

This change is the reason we use 'git status --porcelain=v2' in t1092-sparse-checkout-compatibility.sh.
We don't want to ensure that this message is equal across both modes, but instead just the important information about staged, modified, and untracked files are compared.


Warning: Recent sparse-index work broke safety against attempts to add paths with trailing slashes to the index, which has been corrected with Git 2.34 (Q4 2021).

See commit c8ad9d0, commit 2a1ae64, commit fc5e90b (07 Oct 2021) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit a86ed75, 18 Oct 2021)

read-cache: let verify_path() reject trailing dir separators again

Signed-off-by: René Scharfe

6e77352 ("sparse-index: convert from full to sparse", 2021-03-30, Git v2.32.0-rc0 -- merge listed in batch #13) made verify_path() accept trailing directory separators for directories, which is necessary for sparse directory entries.
This clemency causes "git stash" to stumble over sub-repositories, though, and there may be more unintended side-effects.

Avoid them by restoring the old verify_path() behavior and accepting trailing directory separators only in places that are supposed to handle sparse directory entries.


With Git 2.35 (Q1 2022), ensure that the sparseness of the in-core index matches the index.sparse configuration specified by the repository immediately after the on-disk index file is read.

See commit 7ca4fc8, commit b93fea0, commit 13f69f3, commit 336d82e (23 Nov 2021) by Victoria Dye (vdye).
(Merged by Junio C Hamano -- gitster -- in commit 5396d7b, 10 Dec 2021)

sparse-index: update do_read_index to ensure correct sparsity

Helped-by: Junio C Hamano
Co-authored-by: Derrick Stolee
Signed-off-by: Victoria Dye
Reviewed-by: Elijah Newren

Unless command_requires_full_index forces index expansion, ensure in-core index sparsity matches config settings on read by calling ensure_correct_sparsity.
This makes the behavior of the in-core index more consistent between different methods of updating sparsity: manually changing the index.sparse config setting vs. executing git sparse-checkout --[no-]sparse-index init

Although index sparsity is normally updated with git sparse-checkout init, ensuring correct sparsity after a manual index.sparse change has some practical benefits:

  1. It allows for command-by-command sparsity toggling with -c index.sparse=<true|false>, e.g. when troubleshooting issues with the sparse index.
  2. It prevents users from experiencing abnormal slowness after setting index.sparse to true due to use of a full index in all commands until the on-disk index is updated.
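The first benefit can be sketched as follows: the same command is run twice with index.sparse overridden on the command line, which is handy when the sparse index is suspected of causing a problem. The repository and its files are invented for the example, and comparing git status output is just one possible check.

```shell
set -eu
work=$(mktemp -d)

# Throwaway source repository (hypothetical layout).
git init -q "$work/origin"
cd "$work/origin"
mkdir -p client/android service
echo root > README.md
echo app > client/android/app.txt
echo svc > service/svc.txt
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "initial"

git clone -q "$work/origin" "$work/repo"
cd "$work/repo"
git sparse-checkout init --cone
git sparse-checkout set client/android

# Same command, with the sparse index switched on and off for one invocation each.
with_sparse=$(git -c index.sparse=true status --porcelain=v2)
without_sparse=$(git -c index.sparse=false status --porcelain=v2)

if [ "$with_sparse" = "$without_sparse" ]; then
    echo "status output is identical in both modes"
else
    echo "status output differs; worth investigating"
fi
```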

Warning: before Git 2.35 (Q1 2022), the sparse-index/sparse-checkout feature had a bug in its use of the matching code to determine which path is in or outside the sparse checkout patterns.

See commit 8c5de0d, commit 1b38efc (06 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit e1d9288, 15 Dec 2021)

unpack-trees: use traverse_path instead of name

Reported-by: Gustave Granroth
Reported-by: Mike Marcelais
Signed-off-by: Derrick Stolee

The sparse_dir_matches_path() method compares a cache entry that is a sparse directory entry against a 'struct traverse_info *info' and a 'struct name_entry *p' to see if the cache entry has exactly the right name for those other inputs.

This method was introduced in 523506d ("unpack-trees: unpack sparse directory entries", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7), but included a significant mistake.
The path comparisons used 'info->name' instead of 'info->traverse_path'.
Since 'info->name' only stores a single tree entry name while 'info->traverse_path' stores the full path from root, this method does not work when 'info' is in a subdirectory of a directory.
Replacing the right strings and their corresponding lengths make the method work properly.

The previous change included a failing test that exposes this issue.
That test now passes.
The critical detail is that as we go deep into unpack_trees(), the logic for merging a sparse directory entry with a tree entry during 'git checkout' relies on this sparse_dir_matches_path() in order to avoid calling traverse_trees_recursive() during unpack_callback() in this hunk:

if (!is_sparse_directory_entry(src[0], names, info) &&
    traverse_trees_recursive(n, dirmask, mask & ~dirmask,
                  names, info) < 0) {
  return -1;
}

For deep paths, the short-circuit never occurred and traverse_trees_recursive() was being called incorrectly and that was causing other strange issues.
Specifically, the error message from the now-passing test previously included this:

error: Your local changes to the following files would be overwritten by checkout:
        deep/deeper1/deepest2/a
        deep/deeper1/deepest3/a
Please commit your changes or stash them before you switch branches.
Aborting

These messages occurred because the 'current' cache entry in twoway_merge() was showing as NULL because the index did not contain entries for the paths contained within the sparse directory entries.
We instead had 'oldtree' given as the entry at HEAD and 'newtree' as the entry in the target tree.
This led to reject_merge() listing these paths.


With Git 2.35 (Q1 2022), teach diff and blame to work well with sparse index.

See commit add4c86, commit 51ba65b, commit 338e2a9, commit 44c7e62, commit 27a443b, commit 0803f9c, commit e5b17bd (06 Dec 2021) by Lessley Dennington (ldennington).
See commit ea6ae41 (29 Nov 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 8d2c373, 21 Dec 2021)

blame: enable and test the sparse index

Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren

Enable the sparse index for the 'git blame' command.
The index was already not expanded with this command, so the most interesting thing to do is to add tests that verify that 'git blame' behaves correctly when the sparse index is enabled and that its performance improves.
More specifically, these cases are:

  1. The index is not expanded for 'blame' when given paths in the sparse checkout cone at multiple levels.

  2. Performance measurably improves for 'blame' with sparse index when given paths in the sparse checkout cone at multiple levels.

We do not include paths outside the sparse checkout cone because blame does not support blaming files that are not present in the working directory.
This is true in both sparse and full checkouts.

And:

diff: enable and test the sparse index

Co-authored-by: Derrick Stolee
Signed-off-by: Lessley Dennington
Reviewed-by: Elijah Newren

Enable the sparse index within the 'git diff' command.
Its implementation already safely integrates with the sparse index because it shares code with the 'git status' and 'git checkout' commands that were already integrated.
For more details see:

  • d76723e ("status: use sparse-index throughout", 2021-07-14, Git v2.33.0-rc0 -- merge listed in batch #7)
  • 1ba5f45 ("checkout: stop expanding sparse indexes", 2021-06-29, Git v2.33.0-rc1 -- merge)

The most interesting thing to do is to add tests that verify that 'git diff' behaves correctly when the sparse index is enabled.
These cases are:

  1. The index is not expanded for 'diff' and 'diff --staged'
  2. 'diff' and 'diff --staged' behave the same in full checkout, sparse checkout, and sparse index repositories in the following partially-staged scenarios (i.e. the index, HEAD, and working directory differ at a given path):
    1. Path is within sparse-checkout cone
    2. Path is outside sparse-checkout cone
    3. A merge conflict exists for paths outside sparse-checkout cone
Anywise answered 2/5, 2021 at 0:9 Comment(0)
