[Submodules] appear to be unnecessarily complicated ...
Probably true. However, submodules are also necessarily complicated. :-) I will also note that submodule support is noticeably better in Git 2.x than it was in the bad old days of Git 1.5 or 1.6 or so, which is when I learned why people called them sob-modules. Some of that history is probably why some of the complexity is here.
Before I dive into the longer answer, here's the short way to get started: use git clone --recurse-submodules
, or run git submodule update --init --recursive
right after cloning. (The second --recursive
is only required if the submodule has submodules of its own.) Adding the --recurse-submodules
option to git clone
just tells git clone
to do that git submodule update --init --recursive
after its normal sequence of operations. Note that this won't help you with the process of working within the submodules, though.
Long
How do I ...
Git is a tool, not a solution (a common saying in the construction business, apparently, but generally applicable to most technology). As with most tools, there are multiple ways to use them.
The thing to know about a submodule is that each submodule is just another Git repository. The only thing that makes a Git repository a "submodule" is the fact that there is some "outer" repository that is controlling the inner repository in some way. From within the inner repository, we refer to the outer one as the superproject.
Within any Git repository in which you will do any work, you have a work-tree. The work-tree holds the files in their ordinary everyday form, where you (and the other programs on your computer) can work with them. Each Git repository also has an index, which is where you build up the next commit you will make. The index is also called the staging area and sometimes the cache, reflecting either its extremely important roles, or the poor choice of the word "index" for its original name (or perhaps both). And, of course, each Git repository has a collection of commits, with various branch names and/or tag names that identify specific commit hashes by some sort of human-readable name.
If that Git repository were standing on its own, those names—the branch and tag names—would be the useful ones to us humans, doing work in that repository. But we've just declared that this repository is a submodule that lives (or dies) at the command of some other repository—the superproject. Our own branch and tag names are nearly useless. They become useful if and when we treat this repository as a regular repository, not a mere adjunct to some superproject. When we treat this repository as a controlled entity, we want this repository to have a detached HEAD instead. The superproject, not the submodule therein, dictates the commit hash to check out, not by some sort of human-readable name, but by raw hash ID.
This feeds into all of the "how do I" answers. The superproject records, in the superproject's index, by its raw hash ID, the specific commit that should be checked out in the submodule.
Cloning
[How do I] Clone the repo ... such that [the clone] is fully populated with ... all the submodules checked out?
Like any clone, this can be made via git clone url [dir]
, which really consists of about six steps:
- Create a new, empty directory
dir
and switch (cd
) to it, or use some existing empty directory if so told: ([ -d dir ] || mkdir dir) && cd dir
. (If this fails, stop: don't do any of the remaining steps. If a subsequent step fails, remove the new directory if we made it, and remove all the files we made, leaving no trace of the partial failed clone.) If we don't give git clone
a directory name, it computes one from the url
argument.
- Create a new, empty repository:
git init
. This creates the .git
directory and an initial configuration.
- Do any required additional configuration from
-c
options given after git clone
.
- Add a remote given a url:
git remote add remote url
. The usual name for the remote is origin
but you can control this with the -o
option.
- Obtain commits from the remote:
git fetch remote
.
- Check out some branch or tag name:
git checkout name
. If this is a branch name, the branch does not exist yet, so this creates the branch the same way that git checkout
does. If this is a tag name, this checks out the commit as a detached HEAD. The name
here is the one you gave with a -b
option. If you did not give one, the name is obtained by asking the Git at the other end of the git fetch
operation which branch it recommends, which is pretty commonly main
. If that also fails—if the other Git has no name to recommend—the name used is main
.
The last step, step 6, checks out some specific commit, typically by getting "on" a branch such as main
, creating that branch name based off the names obtained during step 5 (git fetch
which made origin/main
). The act of checking out this particular commit fills in the repository's index and work-tree, so that now you have in your work-tree all the files required.
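The six steps above can be run by hand, which I find is a good way to convince yourself that there is no magic in git clone. Here is a minimal sketch, not what git clone literally executes internally: the repository names (upstream, copy), the temporary directory, and the demo identity are all made up for illustration, and step 3 is skipped because there are no -c options.

```shell
set -eu
# Throwaway identity so the demo commit works on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# A throwaway "remote" repository to clone from (hypothetical content).
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/main
echo hello > "$work/upstream/README"
git -C "$work/upstream" add README
git -C "$work/upstream" commit -q -m "initial commit"

# Steps 1 and 2: create the directory, enter it, make an empty repository.
mkdir "$work/copy" && cd "$work/copy"
git init -q
# Step 3 (configuration from -c options) is skipped in this sketch.
# Step 4: add the remote under the usual name.
git remote add origin "$work/upstream"
# Step 5: obtain the commits; this creates origin/main.
git fetch -q origin
# Step 6: check out the branch; Git creates main from origin/main.
git checkout -q main

branch=$(git symbolic-ref --short HEAD)
echo "on branch $branch, work-tree contains: $(ls)"
```

After step 5 the repository has all the commits but an empty index and work-tree; it is step 6 that fills both in.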
Submodules and gitlinks
If the commit you just checked out has submodules, it has a file named .gitmodules
and has, in that commit that you just checked out, one or more special entries each called a gitlink. A gitlink entry looks much like a file (blob
) entry or a tree
entry, but has type-code 160000
rather than 100644
(regular file) or 100755
(executable file) or 040000
(tree).1 These gitlink entries go into your index, and your Git creates an empty directory at the path given by the gitlink, the same way your Git would create a subdirectory for a tree
or a file for a blob
.2 The hash ID associated with these gitlink entries—every index entry has a hash ID—is that of one particular commit in the submodule, which Git can, but won't just yet, check out as a detached HEAD.
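You can look at a gitlink entry directly. The sketch below builds a throwaway superproject with one submodule and then prints the entry from the index and from the commit. The names sub, super and libs/sub are hypothetical, and the -c protocol.file.allow=always override is an assumption about your Git version: recent releases refuse to clone a submodule from a local path without it, and older ones should simply ignore it.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# The repository that will become the submodule.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"

# The superproject, with the submodule added at libs/sub.
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/main
git -c protocol.file.allow=always submodule --quiet add "$work/sub" libs/sub
git commit -q -m "super: add libs/sub"

# Index entry: mode, hash, stage number, path.
entry=$(git ls-files -s libs/sub)
mode=${entry%% *}
recorded=$(git ls-files -s libs/sub | awk '{print $2}')
echo "$entry"
# The same gitlink as stored in the commit's tree.
git ls-tree HEAD libs/sub
```

The hash printed here is a commit hash in the submodule repository, not the hash of any object stored in the superproject.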
Note that I said here if the commit you just checked out has submodules. This is another key realization: the "submodule-ness" of a submodule is controlled by the specific commit in the superproject. That commit needs to have a gitlink entry, to give the hash ID to check out in the submodule, and a .gitmodules
file. But what is this .gitmodules
file for?
1There's one more index type-code, 120000
, for symbolic links. These are handled almost exactly the same way as blob
objects except that as long as symlinks are enabled, Git writes the contents as a symlink rather than as a file. If symlinks are disabled, Git writes the contents as a regular file, so that you can edit it and re-add it as a symlink later using git update-index
, if you know all the magic for dealing with index entries.
2The fact that Git will create an empty directory for a tree
object has led people to try to use Git's semi-secret empty tree to store empty directories. Unfortunately, the index itself has weird corner cases here and Git turns the empty tree into a gitlink
entry under various conditions. This then acts as a broken submodule—a gitlink without a .gitmodules
entry—which makes Git behave slightly badly.
The .gitmodules
file
We just saw, above, that git clone
needs at least one argument: the url for the repository to clone. The superproject stores the desired commit hash ID in the gitlink, but how will it know what url to use? The answer is to look in the .gitmodules
file.
The contents of a .gitmodules
are formatted the same way as .git/config
or $HOME/.gitconfig
or any other Git configuration file, and in fact, Git uses git config
to read them:
git config -f .gitmodules --get submodule.path/to/x.url
This looks for
[submodule "path/to/x"]
url = <whatever you put here>
in the .gitmodules
file, and when we find it, that provides the URL.
In fact, the contents will be:
[submodule "path/to/x"]
path = path/to/x
url = <whatever you put here>
and perhaps also one or both of:
branch = <name>
update = <control>
The path
must correspond to the relative path of the submodule within the superproject, and the name of the submodule must be the relative path of the submodule within the superproject. (What happens if one or the other of these are wrong / don't match, I am not quite sure. Git's submodule commands generally make sure they do match, so that the question never arises.)
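Since .gitmodules is an ordinary Git configuration file, you can experiment with it outside of any repository. This is a small sketch with a hypothetical submodule path and a placeholder URL (example.com), showing the same git config lookups described above.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# A hand-written .gitmodules; the path, URL and branch are made up.
cat > .gitmodules <<'EOF'
[submodule "path/to/x"]
    path = path/to/x
    url = https://example.com/x.git
    branch = develop
EOF

# Read individual settings, exactly as with any other config file.
url=$(git config -f .gitmodules --get submodule.path/to/x.url)
branch=$(git config -f .gitmodules --get submodule.path/to/x.branch)
echo "url=$url branch=$branch"

# List the path of every submodule declared in the file.
git config -f .gitmodules --get-regexp '^submodule\..*\.path$'
```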
This lets git submodule
find the URL to make the clone. This process is complicated. When you run git submodule init
or git submodule update --init
, Git will copy the url
setting from .gitmodules
to .git/config
. If there is an update = control
setting, it will copy that too, unless there's already a setting in .git/config
. (This is one of those "unnecessary complications" you mention, though I think it's to correct for historical mistakes.)
Without --init
, the git submodule update
command will only look at the entries in .git/config
, not the ones in .gitmodules
. This means you could use the two step sequence git submodule init && git submodule update
to do the same thing, but git submodule update --init
is easier to enter. More importantly, git submodule init
does not have a --recursive
option while git submodule update
does. This is actually sensible, because git submodule init
only copies from .gitmodules
to .git/config
(see below for more about this). The git submodule update
operation actually creates the clone, using the six-step process outlined above.
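To see the division of labor between the two commands, here is a sketch that clones a superproject without recursion and then runs init and update separately. As before, the repository names are hypothetical, the identity is a throwaway, and the protocol.file.allow override is only there because this demo clones the submodule from a local path.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Set up a submodule repository and a superproject that uses it.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"
git init -q "$work/super"
git -C "$work/super" symbolic-ref HEAD refs/heads/main
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/sub" libs/sub
git -C "$work/super" commit -q -m "super: add libs/sub"

# A plain clone: the gitlink is there, but the submodule is not populated.
git clone -q "$work/super" "$work/clone"
cd "$work/clone"
before=$(git config --get submodule.libs/sub.url || true)
contents_before=$(ls -A libs/sub)

# init only copies settings from .gitmodules into .git/config.
git submodule --quiet init
after=$(git config --get submodule.libs/sub.url)
contents_after_init=$(ls -A libs/sub)

# update does the real work: clone the submodule, check out the commit.
git -c protocol.file.allow=always submodule --quiet update
echo "url before init: '$before', after init: '$after'"
```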
Detaching HEAD onto the correct commit in the submodule
We saw that the superproject lists the correct hash ID for the submodule, as a gitlink entry. This means Git needs to start in the superproject, read the gitlink entry out of the index, then switch into the submodule (cd path
) and git checkout
the correct commit by its hash ID. That will result in a detached HEAD with the correct commit checked out.
The command that does this is git submodule update
. And, that's usually what we want: to check out that specific commit, by its hash ID, as a detached HEAD. Now that we've gotten what we want in the submodule, we're done ... or are we? What if this Git repository—remember, each submodule is an ordinary Git repository, in its own right—what if this Git repository has submodules of its own?
Submodules can have submodules
If this submodule has its own submodules, we now want this sub-Git to git checkout
the correct commit, run git submodule init
to initialize its .git/config
for its submodules, and run git submodule update
to make its own submodules get checked-out to the correct commit. That's just what git submodule update
is already doing on behalf of our superproject, so we just want this git submodule update
to recursively operate on the submodule's submodules. This means that git submodule update
needs to be able to recurse into submodules and also --init
them.
So that's why git submodule update --init --recursive
exists: it's the workhorse that goes into each submodule from the superproject, sets up its .git/config
if needed, checks out the correct detached-HEAD hash, and then recurses on submodules of the submodule.
git clone
can invoke git submodule update
If we now rewind all the way back to git clone
, we can see that what we need after step 6 is a step 7: git submodule update --init --recursive
, to go into each submodule listed in the superproject and initialize it and check out the correct detached HEAD, and if that submodule is a superproject of additional submodules, handle them recursively. In the end, we'll have the superproject, with its particular commit, controlling all of its submodules which are on the correct commit as a detached HEAD, and for each of those submodules that is itself a superproject with submodules, the submodule-as-superproject's commit will control the submodule-as-superproject's submodules, recursively.
If you don't have recursive submodules, all of the recursion winds up doing nothing: it's a little bit of extra work but is harmless. So this is usually the way to go: just run git clone --recurse-submodules
and you get the clone created with its submodules checked out as detached HEAD repositories, and you are done.
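Here is the whole thing end to end, as a sketch with the same hypothetical repositories as before (sub, super, libs/sub), a throwaway identity, and the local-path protocol override that only this kind of demo needs. After the clone, the submodule is on a detached HEAD at exactly the commit the superproject recorded.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Set up a submodule repository and a superproject that uses it.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"
git init -q "$work/super"
git -C "$work/super" symbolic-ref HEAD refs/heads/main
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "$work/sub" libs/sub
git -C "$work/super" commit -q -m "super: add libs/sub"

# One command: clone, then init and update all submodules recursively.
git -c protocol.file.allow=always clone -q --recurse-submodules \
    "$work/super" "$work/clone"
cd "$work/clone"

# Compare the gitlink in the superproject with the submodule's HEAD.
recorded=$(git ls-tree HEAD libs/sub | awk '{print $3}')
actual=$(git -C libs/sub rev-parse HEAD)
# symbolic-ref fails when HEAD is detached.
if git -C libs/sub symbolic-ref -q HEAD >/dev/null; then
    state=attached
else
    state=detached
fi
echo "recorded=$recorded actual=$actual HEAD is $state"
```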
Working within the submodules
You had what is almost a separate question:
How do I then update a file in other/submodule?
We saw above that the way a superproject controls / uses a submodule is by having the superproject specify, by absolute hash ID, which commit the submodule is to be locked into, as a detached HEAD. That's great for controlling and using the submodule, except when we want to update the submodule to some newer commit.
The traditional answer, dating back to the Git 1.5 days, is that since the submodule is a Git repository, just cd
into the submodule and git checkout <branchname>
and start working. This still works! It has an obvious drawback, though: how do you know which branch name to use?
In some cases, you just know. That's fine; go ahead and use them that way. If you want the superproject to know, though, this is where the superproject's branch =
setting comes in, and where arguments to git submodule update
and/or the submodule.name.update
settings (also in the superproject) come in. Remember, these settings come from the .git/config
file in the superproject, not from the submodule itself, and (normally3) not from the .gitmodules
file either—but the .gitmodules
file contents set up the default .git/config
settings. So there are a lot of ways to control this configuration.
Next, there's the question of what each configuration does, and how you want to set it up for your own purposes. These are enumerated (rather poorly in my opinion) in the git submodule
documentation. Here's my own summary of their summary, with additional commentary.
checkout
: the commit recorded in the superproject will be checked out in the submodule on a detached HEAD.
This is the default and is what we saw above.
rebase
: the current branch of the submodule will be rebased onto the commit recorded in the superproject.
This isn't useful unless you've already gone into the submodule and done something there. However, there's also a --remote
option described later in the documentation, which makes it more useful.
merge
: the commit recorded in the superproject will be merged into the current branch in the submodule.
As with rebase
, this isn't useful by itself: you need either --remote
or to do your own work in the submodule before doing this.
custom command: arbitrary shell command that takes a single argument (the sha1 of the commit recorded in the superproject) is executed.
This one is useful by itself, but requires that you do some up-front work in the superproject, to set up the configuration and define the command.
none
: the submodule is not updated.
This is primarily useful to mark a submodule that doesn't get updated when all the other submodules of this particular superproject do. If you have only one submodule, this setting has no function at all.
So far, we have not seen any use for the branch
setting copied from .gitmodules
to .git/config
. It's this --remote
option, described further down in the same documentation, that talks about how this setting is used:
... Instead of using the superproject's recorded SHA-1 to update the submodule, use the status of the submodule's remote-tracking branch.
That is, the superproject has a gitlink entry that says use hash a1b2c3d... or whatever, but instead of using that hash, when the superproject git submodule update
command goes poking around with the Git repository holding the submodule, the superproject command will look up, e.g., origin/main
in the submodule. The name main
here comes from that branch setting, so setting submodule.name.branch
to, say, develop
instead will make the superproject use origin/develop
instead of origin/main
.4
To make this useful, the superproject Git runs git fetch
in the submodule before starting any of this. That causes the submodule to bring over any new commits from its origin
Git, updating its origin/main
, origin/develop
, and so on. The assumption here is that you did not do any work in the submodule yourself! You are just grabbing work that someone else did in the origin
repository from which the submodule repository was cloned (whew!).
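A sketch of --remote in action, under the same assumptions as my other examples (hypothetical repositories sub and super, submodule at libs/sub, throwaway identity, local-path protocol override). Someone else's work is simulated by committing on develop in the upstream repository; the superproject's branch setting then steers git submodule update --remote to origin/develop.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Set up a submodule repository and a superproject that uses it.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/main
git -c protocol.file.allow=always submodule --quiet add "$work/sub" libs/sub
git commit -q -m "super: add libs/sub"
old=$(git -C libs/sub rev-parse HEAD)

# "Someone else" makes a new commit on develop in the upstream repository.
git -C "$work/sub" checkout -q -b develop
echo more >> "$work/sub/lib.txt"
git -C "$work/sub" commit -q -am "sub: work on develop"
tip=$(git -C "$work/sub" rev-parse develop)

# Tell the superproject which submodule branch --remote should follow.
git config -f .gitmodules submodule.libs/sub.branch develop

# Fetch in the submodule, then check out origin/develop as a detached HEAD.
git -c protocol.file.allow=always submodule --quiet update --remote libs/sub
new=$(git -C libs/sub rev-parse HEAD)
echo "submodule moved from $old to $new"
# The superproject now sees a modified submodule, not yet recorded.
git status --short
```

Note that nothing has been recorded in the superproject at this point: the gitlink in the index still names the old commit until you git add the submodule path.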
3The setting in .gitmodules
will be used if there is no setting in .git/config
and no override on the command line. I think this is yet another backwards-compatibility item.
4This assumes that origin/develop
is the remote-tracking name associated with branch develop
in the submodule repository, i.e., that things are set up as normal.
Preparing the updated submodule
If you are about to do your own work in your own submodule, none of this helps you at all. Instead, you should just cd
into the submodule repository and run git checkout branchname
. That will take you off your detached HEAD and put you on the given branch, and now you can work normally. Write code, git add
, and git commit
as you normally would. When everything is ready in the submodule, cd
back to the superproject. You will have your submodule on a branch (not in detached HEAD mode), on some particular commit.
If you are just picking up someone else's work, this git submodule update --remote --checkout
or whatever will git fetch
and then git checkout origin/main
or whatever, as appropriate, in the submodule. That will leave your submodule on no branch, in detached HEAD mode, on some particular commit. This is likely what you want.
Using the updated submodule within the superproject
Either way, from the superproject's point of view, what has happened is that the submodule is now on a different commit. The superproject does not care whether the submodule's HEAD is attached or detached; what matters is the current commit in the submodule.
Now that the submodule is on the desired commit, make any other changes you want in the superproject—maybe there is some file that should use some new feature of the submodule, for instance. When you are done making the required changes, git add
any updated files, and also run git add
on the submodule path (without a trailing slash):
git add features.ext # updated to use feature F of submodule sub/S
git add sub/S # record the new gitlink for sub/S!
This updates the superproject's index, so that now we have not only the updated file (features.ext
) but also the new correct hash ID for the submodule—the updated gitlink. Now we can run git commit
in the superproject as usual:
git commit
and this makes our new commit, which has a gitlink that records the fact that submodule sub/S
should be checked out with a detached HEAD at commit f37c219...
or whatever the current commit of sub/S
actually is. This new commit goes on whatever branch we have checked out in the superproject, whether that's main
or develop
or whatever.
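Putting the last two sections together, here is a sketch of doing your own work in the submodule and then recording it in the superproject. The repository names, the file lib.txt and the commit messages are made up; the identity is a throwaway and the protocol override is only needed because the demo uses local paths.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Set up a submodule repository and a superproject that uses it.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/main
git -c protocol.file.allow=always submodule --quiet add "$work/sub" sub/S
git commit -q -m "super: add sub/S"
old=$(git ls-tree HEAD sub/S | awk '{print $3}')

# Work inside the submodule, on a branch, like in any other repository.
cd sub/S
git checkout -q main
echo "feature F" >> lib.txt
git commit -q -am "sub: add feature F"
new=$(git rev-parse HEAD)

# Back in the superproject: record the new gitlink and commit.
cd "$work/super"
git add sub/S            # no trailing slash
git commit -q -m "super: use feature F of sub/S"
recorded=$(git ls-tree HEAD sub/S | awk '{print $3}')
echo "gitlink moved from $old to $recorded"
```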
Pushing
Let's say we did our own work in sub/S
, on its branch develop
, creating commit f37c219...
. Then we made our new commit in our superproject on the superproject's main
; by some strange chance its hash ID is abcdef1...
. Now that we have two repositories with updates, we can git push
them. But there is an order constraint!
Suppose we push our superproject now:
git push origin main
Our new commit abcdef1
goes to our upstream repository, and that Git's main
now names our new commit abcdef1
. Our new commit says that submodule sub/S
should be checked out at commit f37c219
. So Fred, over on Fred's computer, runs git clone
or git fetch
or whatever it is and gets our commit abcdef1
that says "use commit f37c219...
when using sub/S". Fred runs git submodule update
and his Git goes into his sub/S
and tries to check out f37c219
and, whoops, Fred doesn't have f37c219
. In fact, only we have f37c219
, because we just made it!
We'd best very quickly cd sub/S
and run git push origin develop
. (Remember, we made our f37c219
on our develop
in our submodule.) That way, when Fred tries to access f37c219
, it's at least available somewhere. It's better if we git push
that one first, then git push origin main
in the superproject, to push abcdef1
which refers to f37c219
. So this leads to a rule for updates: push the submodules first, starting with the most deeply nested ones. That way each superproject refers to a commit that Fred, or whoever, can get to.
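The push order can be sketched with two local bare repositories standing in for the upstream servers. Everything here is hypothetical scaffolding: the names seed, sub.git and super.git, the throwaway identity, and the protocol override needed for a local-path submodule. The part that matters is the last two commands.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Bare "upstream" repositories for the submodule and the superproject.
git init -q "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
echo hello > "$work/seed/lib.txt"
git -C "$work/seed" add lib.txt
git -C "$work/seed" commit -q -m "sub: initial commit"
git clone -q --bare "$work/seed" "$work/sub.git"
git init -q --bare "$work/super.git"

# Our superproject, with sub/S cloned from the bare submodule upstream.
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/main
git remote add origin "$work/super.git"
git -c protocol.file.allow=always submodule --quiet add "$work/sub.git" sub/S
git commit -q -m "super: add sub/S"

# Our own work in the submodule, on develop, then recorded in the superproject.
git -C sub/S checkout -q -b develop
echo "feature F" >> sub/S/lib.txt
git -C sub/S commit -q -am "sub: add feature F"
new=$(git -C sub/S rev-parse HEAD)
git add sub/S
git commit -q -m "super: use feature F"

# Push the submodule first, then the superproject that refers to it.
git -C sub/S push -q origin develop
git push -q origin main
```

If you would rather have Git enforce this, git push --recurse-submodules=check refuses to push the superproject when a submodule commit it refers to has not been pushed anywhere.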
There is still one more minor pain point for Fred
We introduced Fred above as the first guy to fetch (and merge or rebase or otherwise incorporate, perhaps even git pull
) our superproject commit that refers to new subproject commits. However, Fred here stands in for anyone who has cloned our superproject. They all have our superproject, and they all ran git submodule update --init --recursive
, perhaps as part of the very clone command that got them the superproject, so they have all the submodules already.
But they don't have any of the new commits in the submodules yet. When they update their superproject and tell their Git to git submodule update
, their Gits will go into their submodules and not find the right commit hashes. Fortunately, git submodule update
is smart enough to run git fetch
for you (or for Fred).
For this to work, though, whoever is updating has to be on line. This means you must run git submodule update
when connected. If you're always connected, that's no problem, but if not, there should be an easy way to fetch all submodules up front.
There's no git submodule fetch
, but there is a command that will do the trick:
git submodule foreach --recursive git fetch
This will run git fetch
in each submodule, to update it. That way a later git submodule update
, used with any commit in the superproject, will work even if you are offline and the submodules would have required updating.
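A sketch of the up-front fetch, with the same hypothetical setup as my earlier examples. A new commit appears upstream after the submodule was cloned; the foreach command brings it into the submodule without touching its work-tree or its HEAD. I quote the command and pass protocol.file.allow=always only because this demo fetches from a local path.

```shell
set -eu
# Throwaway identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# Set up a submodule repository and a superproject that uses it.
git init -q "$work/sub"
git -C "$work/sub" symbolic-ref HEAD refs/heads/main
echo hello > "$work/sub/lib.txt"
git -C "$work/sub" add lib.txt
git -C "$work/sub" commit -q -m "sub: initial commit"
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/main
git -c protocol.file.allow=always submodule --quiet add "$work/sub" libs/sub
git commit -q -m "super: add libs/sub"
head_before=$(git -C libs/sub rev-parse HEAD)

# A new commit appears upstream after we cloned the submodule.
echo more >> "$work/sub/lib.txt"
git -C "$work/sub" commit -q -am "sub: newer work"
new=$(git -C "$work/sub" rev-parse HEAD)

# Our copy of the submodule does not have that commit yet.
if git -C libs/sub cat-file -e "$new^{commit}" 2>/dev/null; then
    had_it=yes
else
    had_it=no
fi

# Fetch in every submodule, recursively, while we are still connected.
git -c protocol.file.allow=always submodule --quiet foreach --recursive \
    'git fetch -q'
echo "had commit before fetch: $had_it"
```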
git submodule add url path
: if the submodule already exists inside the Git superproject, always make sure that url and path are identical, e.g. git submodule add ./path/to/submodule ./path/to/submodule
. That way Git understands that the submodule is already there; otherwise it will try to clone it into itself. – Bergh