How to shallow pull submodule that is tracked by branch name

Hi, I have a superproject that contains a submodule. The submodule is tracked by a branch name and not by a SHA commit ID. On our build server I would like to pull as little as possible, so I tried:

git submodule update --remote --init 

This, however, is not shallow: it seems to pull everything and then switch to the branch. Next I tried:

git submodule update --remote --init --depth 1

This doesn't work; it fails like this:

git submodule update --remote --init --depth 1 ThirdParty/protobuf
Submodule 'ThirdParty/protobuf' (ssh://myrepo/thirdparty/protobuf.git) 
registered for path 'ThirdParty/protobuf'
Cloning into '/home/martin/jenkins/workspace/test_log_service/repo/ThirdParty/protobuf'...
fatal: Needed a single revision
Unable to find current origin/version/3.2.0-era revision in submodule path 'ThirdParty/protobuf'

There is a different question on shallow submodules; however, I don't see that working for branches, only for SHA commits.

Beachhead answered 28/4, 2020 at 15:8 Comment(4)
What happens if you try with a slightly bigger depth? (--depth 5 or 10, just for testing)Twospot
Does this answer your question? How to make shallow git submodules?Mccutchen
stackoverflow.com/search?q=%5Bgit-submodules%5D+shallowMccutchen
Actually I don't see the answer for tracking branches. As I mentioned in the question, it works if I am tracking a SHA-1, but I am tracking a branch. I would just expect the submodule init to trigger a git clone ... --branch my-branch --single-branch --depth 1 ...Beachhead

TL;DR

I think you have hit a bug in Git. To work around it, use --no-single-branch or configure the branch manually.

Other things to know:

  • If you have recursive submodules, make sure your Git is recent and use --recommend-shallow to enable shallow submodules recursively, or --no-recommend-shallow to disable them.

  • You may need to do this in two steps. I'll show this as a two-step sequence below. I know this code has evolved a lot between Git 1.7 and current (2.26 or so) Git, and I expect the two-step sequence will work for most older versions too.

The two steps are:

N=...        # set your depth here, or expand it in the two commands
git submodule update --init --depth $N --no-single-branch
git submodule update --remote --depth $N

The Git folks have been fixing various shallow-clone submodule bugs recently as part of adding --recommend-shallow with recursive submodules, so this might all work as one command. Based on the analysis below, it should all work as one command in current Git. However, --no-single-branch fetches more objects than --single-branch.

Another option may be to allow single-branch mode but fix the fetch refspec in the submodule. This requires three steps—well, three separate Git commands, anyway:

branch=...   # set this to the branch you want
git submodule update --init --depth $N
(cd path/to/submodule &&
 git config remote.origin.fetch +refs/heads/$branch:refs/remotes/origin/$branch)
git submodule update --remote --depth $N

(You could do this in all submodules with git submodule foreach, but remember to pick the right branch name per-submodule.)
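
A minimal sketch of that foreach variant. It assumes each submodule's branch is recorded as submodule.<name>.branch in the superproject's .gitmodules, that the remote is named origin, and that falling back to master is acceptable when no branch is recorded. The names SET_REFSPEC_SNIPPET and fix_submodule_refspecs are hypothetical, and I have only reasoned about this, not run it against every Git version:

```shell
# Per-submodule snippet, run by `git submodule foreach` inside each submodule.
# $name and $toplevel are provided by foreach. Falling back to "master" when
# .gitmodules records no branch is an assumption of this sketch.
SET_REFSPEC_SNIPPET='
branch=$(git config -f "$toplevel/.gitmodules" "submodule.$name.branch" || echo master)
git config --replace-all remote.origin.fetch \
    "+refs/heads/$branch:refs/remotes/origin/$branch"
'

# Apply the snippet to every submodule that has already been cloned.
fix_submodule_refspecs() {
    git submodule foreach --quiet "$SET_REFSPEC_SNIPPET"
}

# Usage, from the top level of the superproject:
#   N=1
#   git submodule update --init --depth "$N"
#   fix_submodule_refspecs
#   git submodule update --remote --depth "$N"
```

Reading the branch from .gitmodules inside the snippet keeps the per-submodule branch choice in one place instead of hard-coding it in the build script.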

Just in general—this is not specific to your error—I recommend avoiding shallow submodules: they tend not to work very well. If you really want to use them, use a pretty-big depth: e.g., 50, or 100, or more. Tune this based on your own repositories and needs. (Your current setup does allow --depth 1, provided you work around the other problem.)

Long: it's probably a bug in Git

Note that the analysis below is based on the source code. I have not actually tested this so it's possible I missed something. The principles are all sound, though.

All submodules are always "sha commits", or maybe "sha1" commits. Git used to call them that, but now calls them OIDs, where OID stands for Object ID. A future Git will probably use SHA-2.[1] So "OID", or "hash ID" if one wishes to avoid TLA syndrome,[2] is certainly a better term. So let me put it this way: all submodules use OID / hash-ID commits.

What do I mean by "all submodules always use OIDs / hash IDs"? Well, that's one of the keys to shallow submodules. Shallow submodules are inherently fragile, and it's tricky to get Git to use them correctly in all cases. This claim:

The submodule is tracked by a branch name and not by a sha commit number.

is wrong, in an important way. No matter how hard you try, submodules—or more precisely, submodule commits—are tracked by hash ID.

Now, it's true that there are branch names involved in cloning and fetching in the submodules. When you use --depth with submodules, this can become very important, because most servers do not allow fetch-by-hash-ID (side note, Jan 2021: this is changing because some new features in Git need it; GitHub already allows fetch by ID, so over time this situation should improve). The depth you choose, together with the single branch name (since --depth implies --single-branch), must therefore be deep enough to reach the commit the superproject Git chooses.

If you override Git's tracked-by-hash-ID commit tracking with submodules, you can bypass one fragility issue. That's what you're doing, but you've hit a bug.


[1] And won't that be fun. Git depends rather heavily on each commit having a unique OID; the introduction of a new OID namespace, so that each object has two OIDs, with each one being unique within its namespace, means commits won't necessarily have the appropriate OID. All of the protocols get more complicated: any Git that only supports the old scheme requires a SHA-1 hash for the (single) OID, while any Git that uses the new scheme would like a SHA-2 hash, perhaps along with a SHA-1 hash to give to old Gits. Once we have the object, we can use it to compute the other hash(es), but if we only have one of the two hashes, it needs to be the right one.

The straightforward way to handle this is to put the burden of computing the "other guy's hash" on the Git that has the object, in the case of an object existing in a repository that uses a different OID namespace. But SHA-1 Gits cannot be changed, so we can't use that method. The burden has to be on new SHA-2 Gits.

[2] Note that "SHA" itself is a TLA: a Three Letter Acronym. TLAS, which stands for TLA Syndrome, is an ETLA: an Extended Three Letter Acronym. 😀


How does a superproject Git choose a submodule Git commit?

The git submodule command is currently still a big shell script, but uses a C language helper for much of its operation. While it is a complex shell script, the heart of it is to run:

(cd $path && git $command)

in order to do things within each submodule. The $path is the path for the submodule, and $command is the command to run within that submodule.

There's some chicken-and-egg stuff here though, because $path is initially just an empty directory: there's no actual clone yet, right after cloning the superproject. Until there is a clone, no Git command will work! Well, nothing except git clone itself, that is.

Meanwhile, each superproject commit has two items:

  • a .gitmodules file, listing the name of the submodule and any configuration data, and instructions for cloning it if/when needed; and
  • a gitlink for the submodule(s).

The gitlink contains the directive: this commit requires that submodule S be checked out as commit hash hash-value. At an interesting point below, we get a chance to use or ignore this hash value, but for now, note that each commit, in effect, says: I need a clone, and in that clone, I need one particular commit, by its hash ID.
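
You can look at the gitlink yourself: git ls-tree prints it as a mode-160000 entry of type commit. A small sketch that extracts the recorded hash; the function names parse_gitlink and gitlink_hash are hypothetical:

```shell
# Extract the commit hash from `git ls-tree` output, but only for gitlink
# entries (mode 160000, type "commit").
# Each input line has the form: "<mode> <type> <hash><TAB><path>"
parse_gitlink() {
    awk '$1 == "160000" && $2 == "commit" { print $3 }'
}

# Print the hash that the current superproject commit records for the
# submodule at path $1. Must be run inside the superproject.
gitlink_hash() {
    git ls-tree HEAD -- "$1" | parse_gitlink
}

# Usage:
#   gitlink_hash ThirdParty/protobuf
```

This is the hash that a plain git submodule update (without --remote) tries to check out in the submodule.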

Cloning a submodule repository

To clone a submodule, we need its URL. We'll run:

git clone $url $path

or maybe:

git clone --depth $N --no-single-branch $url $path

or similar. The URL and path are the most important parts. They're in the .gitmodules file, but that's not where Git wants them: Git wants them in the configuration file in the Git repository.

Running git submodule init copies the data from the .gitmodules file to where Git wants it. This command otherwise does not do anything interesting, really. Nobody seems to use it because git submodule update --init will do this for you every time. The separate init command exists so that you can, as the documentation puts it, "customize ... submodule locations" (tweak the URLs).
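
To see what init copies, you can list the submodule URLs on both sides. This is only a sketch; the function names are hypothetical, and the optional file argument exists just to make the first helper easy to try on any .gitmodules file:

```shell
# URLs as committed in the superproject's .gitmodules file
# (or in the file given as $1).
list_gitmodules_urls() {
    git config -f "${1:-.gitmodules}" --get-regexp '^submodule\..*\.url$'
}

# URLs as stored in the superproject's own Git configuration, i.e. what
# `git submodule init` (or `update --init`) has copied over so far.
list_active_urls() {
    git config --get-regexp '^submodule\..*\.url$'
}

# Usage, inside the superproject:
#   list_gitmodules_urls     # what the commit recommends
#   git submodule init
#   list_active_urls         # what Git will actually clone from
```

Between the init and the first update you can override an entry with git config submodule.<name>.url, which is the customization the documentation refers to.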

Running git submodule update (with or without --remote, --init, and/or --depth) will notice whether the clone exists. It does need the information that git submodule init would save, so if you haven't done a git submodule init yet, you need the --init option to make that happen. If the submodule itself is missing—if the superproject does not yet have a clone of the submodule—git submodule update will now run git clone. It's actually the submodule helper that runs git clone; see line 558 ff., though the line numbers will no doubt change in future Git releases.

Note these things about this git clone:

  1. It gets a --depth argument if you use --depth.
  2. If it does get a --depth argument, it sets --single-branch by default, unless you use --no-single-branch.
  3. It creates the actual repository for the submodule, but it is always told --no-checkout so it never does an initial git checkout of any commit.
  4. It never gets a -b / --branch argument. This is surprising to me, and possibly wrong, but see clone_submodule in the submodule--helper.c source.

Now, combine item 2 with item 4. Cloning with --depth implies --single-branch, which sets up the submodule repository to have:

remote.origin.fetch=+refs/heads/<name>:refs/remotes/origin/<name>

as its pre-configured fetch setting. But Git did not supply a branch name here so the default name is the one recommended by the other Git, i.e., the Git that you're cloning. It's not any name you have configured yourself, in your superproject.

Using --no-single-branch on the git submodule update --init line forces the clone to be made without --single-branch mode. This gets you --depth commits from the tip commit of all branches, and leaves the fetch line configured as:

remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*

so that your submodule repository has all branch names in it (plus the depth-50, or however deep you specified, commits reachable from those names). Or, as I mentioned at the top, you could use git config in the submodule, at this point, to fix the remote.origin.fetch setting.
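
Instead of editing remote.origin.fetch by hand, git remote set-branches rewrites the same setting. A sketch, assuming the remote is named origin; the function name track_only_branch is hypothetical:

```shell
# Re-point the fetch refspec of the existing clone at $1 so that it tracks
# only branch $2 of the remote "origin". Without --add, set-branches
# replaces whatever branch list was configured before.
track_only_branch() {
    sm_path=$1 branch=$2
    git -C "$sm_path" remote set-branches origin "$branch"
}

# Usage, inside the superproject, after the initial shallow clone:
#   track_only_branch ThirdParty/protobuf version/3.2.0-era
#   git submodule update --remote --depth 1 ThirdParty/protobuf
```

This keeps the clone in single-branch mode, so it should transfer fewer objects than --no-single-branch, at the cost of one extra command per submodule.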

Checking out the right commit

Once we have a clone, the remaining task is to run the right git checkout (or other Git command) in the submodule. That is, of the:

(cd $path && git $command)

commands, we now have the path with the submodule work-tree; all we need is to find a hash ID and run git checkout on that hash ID.

The hash ID is stored in the gitlink. Normally, that's what Git would use here. With --remote, though, the git submodule script will now run the submodule helper to figure out the "right" branch name. That is, the submodule helper will find the name you configured (submodule.<name>.branch), if you configured one, or fall back to a default if you didn't: master in older Git versions, the remote's HEAD in newer ones. (The special value . means "use the superproject's current branch name".)

Note that this is rather late: the submodule is already cloned, and already has its remote.origin.fetch set to some other name. (Unless, perhaps, you're lucky: perhaps the other Git recommended the same name you'll get here with --remote. But probably not.)

Here is the interesting bit of code, from those source lines I linked above:

# enter here with:
#    $sm_path: set to the submodule path
#    $sha1: set to the hash from the gitlink
#    $just_cloned: a flag set to 1 if we just ran `git clone`

if test $just_cloned -eq 1
then
    subsha1=    # i.e., set this to the empty string
else
    subsha1=(...find hash ID that is currently checked out...)
fi

if test -n "$remote"
then
    branch=(...find the branch you want...)
    ... fetch_in_submodule "$sm_path" $depth ...
    sha1=(...use git rev-parse to find the hash ID for origin/$branch...)
fi

if test "$subsha1" != "$sha1" || test -n "$force"; then
    ... do stuff to the submodule ...
    ... in this case, git checkout -q $sha1 ...
fi

(I've omitted some irrelevant pieces and replaced a few $(...) sections with descriptions of what they do, rather than actual code).

What all of this work is about is this:

  • A submodule repository is normally in detached HEAD mode, with one particular commit checked out by hash ID. Even if it's in the other mode—on a branch, or attached HEAD mode to use the obvious opposite—it still has one particular commit hash ID checked out.

    (The only real exception here is right after the initial clone, when literally nothing is checked out.)

  • The subsha1 code section figures out which hash ID that is.

  • The remainder of the code figures out which hash ID should be checked out. With the --remote option, you tell the superproject Git: ignore the gitlink setting entirely. All other options use the gitlink setting, and any of those can cause trouble with --depth 1.

Your error message is triggered here

You're using --remote to tell your superproject Git: ignore the gitlink hash ID. This uses the branch=(...) and then sha1=(...) assignments to override the gitlink hash ID.

That sha1= assignment is literally this code:

sha1=$(sanitize_submodule_env; cd "$sm_path" &&
    git rev-parse --verify "${remote_name}/${branch}") ||
die "$(eval_gettext "Unable to find current \${remote_name}/\${branch} revision in submodule path '\$sm_path'")"

and here you'll recognize the error message you are getting:

Unable to find current origin/version/3.2.0-era revision in submodule path '...'

Now, a git fetch command should, one might hope, have fetched the commit named by the branch-name version/3.2.0-era. If it did fetch that commit, one would hope that it would have updated the right remote-tracking name, in this case, origin/version/3.2.0-era.

The only candidate git fetch command, however, is the one invoked by:

fetch_in_submodule "$sm_path" $depth

This command runs git fetch with the --depth parameter you provided. It doesn't provide any branch names! Other fetch_in_submodule calls, particularly this one on line 628, provide a raw hash ID (still not a branch name), but this only provides the --depth argument if you gave one.

Without a refspec, such as a branch name, git fetch origin only fetches whatever is configured in remote.origin.fetch. That's the name from the other Git.

If the fetch= setting doesn't fetch the desired branch name—and with a single-branch clone, that's pretty likely here—the git fetch won't fetch the commit we want, and the subsequent git rev-parse to turn the remote-tracking name origin/$branch into a hash ID will fail. That's the error you're seeing.
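
You can check for this condition directly, and work around it once, by fetching the branch with an explicit refspec. A sketch under the assumption that the remote is named origin; ensure_remote_branch is a hypothetical helper, not something Git provides:

```shell
# Make sure refs/remotes/origin/<branch> exists in the clone at $1.
# $2 is the branch name, $3 the fetch depth (defaults to 1).
ensure_remote_branch() {
    sm_path=$1 branch=$2 depth=${3:-1}
    # Same lookup that git-submodule performs, minus the error message.
    if git -C "$sm_path" rev-parse --verify --quiet \
        "refs/remotes/origin/$branch" >/dev/null
    then
        return 0
    fi
    # Show why the plain fetch could not have created the name.
    echo "origin/$branch is missing; configured fetch refspec(s):" >&2
    git -C "$sm_path" config --get-all remote.origin.fetch >&2
    # Fetch exactly the branch we want into its remote-tracking name.
    git -C "$sm_path" fetch --depth "$depth" origin \
        "+refs/heads/$branch:refs/remotes/origin/$branch"
}

# Usage:
#   ensure_remote_branch ThirdParty/protobuf version/3.2.0-era 1
```

Note that this does not change remote.origin.fetch, so later plain fetches will still skip the branch; fixing the refspec is the more durable option.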

I am not going to try to say exactly where the bug is—and therefore, how to fix it, in terms of setting the right configuration and/or issuing a git fetch with appropriate arguments—here, but clearly the current Git setup doesn't work for your case. In the end, though, what Git tries to do here is find the right OID, or in this case, fail to find it.

Having found the right OID—using git rev-parse origin/version/3.2.0-era for your particular case—your superproject Git would then run:

(cd $path && git checkout $hash)

in the submodule, leaving you with a detached HEAD pointing to the same hash ID you asked for by branch-name. When you fix the problem, you will be in this commit-by-OID detached-HEAD mode. The only way to get out of it is manual: you have to do your own (cd $path && git checkout branch-name) operation.
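
If you do want the submodule on a branch afterwards, the manual step could look like the sketch below. It assumes origin/<branch> already exists in the submodule; attach_submodule_branch is a hypothetical name, and using checkout -B (create or reset the local branch) is my choice here, not something git submodule does for you:

```shell
# Put the submodule at $1 on local branch $2, created or reset to match
# origin/$2. Any previous position of the local branch is overwritten.
attach_submodule_branch() {
    sm_path=$1 branch=$2
    git -C "$sm_path" checkout -q -B "$branch" "origin/$branch"
}

# Usage:
#   attach_submodule_branch ThirdParty/protobuf version/3.2.0-era
```

The next git submodule update will normally detach HEAD again, so on a build server this step is usually unnecessary.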

Whenever you don't use git submodule update --remote (that is, if you have your CI system build the commit that the superproject repository says to build, rather than depending on some branch name that's under someone else's control), a shallow clone must contain that commit after a git fetch. This is where the depth stuff is fragile: how deep should N be? There isn't a right answer, which is why you have to set it yourself.

If you configure the origin Git with uploadpack.allowReachableSHA1InWant or uploadpack.allowAnySHA1InWant set to true, the git fetch-by-hash-ID can fetch an arbitrary commit, allowing --depth 1 to work, but you need to have control over the origin Git repository to do this (and see the caveats in the git config documentation regarding these settings).
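
As a sketch of both sides of that setup: the first helper is meant to run on the server against the repository you administer, the second on the client inside the superproject. Both function names are hypothetical, and whether your hosting setup lets you change this setting at all is an assumption:

```shell
# Server side: allow clients to fetch any reachable commit by its hash.
# $1 is the path to the (usually bare) origin repository.
allow_fetch_by_hash() {
    git -C "$1" config uploadpack.allowReachableSHA1InWant true
}

# Client side: fetch exactly one commit, by hash, at depth 1.
# $1 is the submodule path, $2 the full commit hash from the gitlink.
fetch_commit_by_hash() {
    sm_path=$1 hash=$2
    git -C "$sm_path" fetch --depth 1 origin "$hash"
}

# Usage:
#   allow_fetch_by_hash /path/to/protobuf.git               # on the server
#   fetch_commit_by_hash ThirdParty/protobuf <full-hash>    # on the client
```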

Tarbes answered 29/4, 2020 at 0:38 Comment(13)
hmm actually it would be nice if the command git submodule init would trigger a git clone <repo> -b branch_name --single-branch --depth 1 ... or possibly if there was a way to completely override the git clone command (for example specifying a shell script, with some arguments. who knows what the arguments are but for flexibility, it would be path to .gitmodules and submodule name)Beachhead
According to your answer, I guess I will just create a custom script in my CI/CD to clone the submodules manually. I need shallow just to level 1, as most modules aren't recursive.Beachhead
In theory, after git submodule init, you should be able to use git submodule foreach to do your preferred cloning, but I'd have to try it (or dig into the script again) to see if that actually works.Tarbes
You are right! Now the only problem would be to transform relative URLs in the .gitmodules to absolute, although I am not very sure if git submodule init creates the folders.Beachhead
Nope, it doesn't work: after git submodule init, the foreach doesn't do anything.Beachhead
Ah, apparently having the information in the config is not sufficient. (It seems like it should be, but submodules have a lot of rough edges.)Tarbes
The doc for git submodule foreach starts with "Evaluates an arbitrary shell command in each checked out submodule", so that's why it can't work for cloning the submodulesNicolenicolea
On the fetch single hash front, I suppose you are alluding to https://mcmap.net/q/13621/-retrieve-specific-commit-from-a-remote-git-repository or https://mcmap.net/q/13875/-git-fetch-a-specific-commit-by-hash, with Git 2.5+ (July 2015), and the server config uploadpack.allowReachableSHA1InWant? That seems available on GitHub since 2015 indeed (github.com/isaacs/github/issues/436).Twospot
@VonC: no, this is a change made for partial clones: the new default for uploadpack.AllowReachableSHA1InWant is now true. (It's been set true on GitHub for a while now, but it used to default to false.)Tarbes
@VonC, @torek: regarding the uploadpack.allow*SHA1InWant configs, I think I remember that some (all?) of these config now default to "true" in the Git protocol v2, but I can't find a source for this...Nicolenicolea
@philb: yes, that's the partial clone enabler. I might have goofed in the comment 2 or 3 lines above as I think it's any OID, not just a reachable one (the reachability test being computationally expensive, it was just disabled entirely).Tarbes
git submodule update --depth 1 should do what OP wants, if it doesn't I agree that's a flat bug. git submodule init exists to set up the default config for any needed tweaking, git submodule update --init --depth 1 exists to do the update and just take the defaults.Whiffet
@Nicolenicolea Source for your comment: https://mcmap.net/q/12208/-how-does-git-39-s-transfer-protocol-workTwospot
