This is a hard problem in general. There are specific cases, or degenerate ways of applying the submodule contents, that make it easier. One compromise, which may or may not be good enough, is to simply combine the two commit histories into one repository, then make some slightly scary transformations using either git filter-branch or just an automated git replace (though using, or abusing, git replace like this is likely to result in performance issues).
Here's the basic situation, i.e., what you need to know, as mental tools, before you think about generalizing the problem. Each repository contains a commit graph: a DAG of commits, with various entry points into the graph found, and hence preserved, by branch names. Each superproject commit that uses the submodule holds a reference to one of the submodule's commits. These references are stored in tree objects, as entries of type gitlink. Git does not follow them when deciding which commits to retain, since they are assumed to identify commits in some other repository (the submodule).
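To make the gitlink idea concrete, here is a minimal sketch that picks the gitlink entries out of the text printed by git ls-tree -r for some commit. The function name find_gitlinks is made up for illustration, and the sketch assumes ordinary paths (unusual path names are quoted by ls-tree unless -z is used, which this does not handle).

```python
def find_gitlinks(ls_tree_output):
    """Return (path, commit_id) pairs for every gitlink in `git ls-tree -r` output."""
    links = []
    for line in ls_tree_output.splitlines():
        if not line.strip():
            continue
        # Each line has the form "<mode> <type> <object>\t<path>".
        meta, path = line.split("\t", 1)
        mode, obj_type, obj_id = meta.split()
        # Mode 160000 with type "commit" marks a gitlink, i.e. a submodule reference.
        if mode == "160000" and obj_type == "commit":
            links.append((path, obj_id))
    return links
```

Feeding it the output of git ls-tree -r run on a superproject commit should give the submodule paths together with the submodule commit each one records.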
You can easily use git fetch to fetch the entire submodule's graph into the superproject repository, changing the submodule's branch names into different names in the superproject. (The default for git fetch is to produce remote-tracking names, but with a bit of sneakiness, you can easily use an alternative namespace. For the solutions I propose, remote-tracking names are fine anyway.) The result, though, is just that you have two otherwise disconnected DAGs. The superproject commits still just have trees with gitlink entries that refer to commits in the other DAG. Those gitlink entries will not keep the commits reachable, so you must retain both sets of names. Except for having all commits contained in one repository database, this is really no improvement at all (and it might be worse as it is now hard to work with).
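As a rough sketch of that fetch step: the helper below builds a git fetch command whose refspec maps the submodule's branches to sub/* remote-tracking names. The function names, the default namespace, and the choice of --no-tags (to keep the submodule's tags out of the superproject's tag namespace) are my assumptions, not something the approach requires.

```python
import subprocess


def fetch_command(sub_url, namespace="sub"):
    """Build a git fetch command that maps the submodule's branches to <namespace>/*."""
    # The leading "+" allows non-fast-forward updates of the tracking names.
    refspec = f"+refs/heads/*:refs/remotes/{namespace}/*"
    # --no-tags is an assumption: it avoids importing the submodule's tags.
    return ["git", "fetch", "--no-tags", sub_url, refspec]


def fetch_submodule_history(superproject_dir, sub_url, namespace="sub"):
    """Run the fetch inside the superproject repository (needs git installed)."""
    subprocess.run(fetch_command(sub_url, namespace), cwd=superproject_dir, check=True)
```

Changing the right-hand side of the refspec (for instance to some other refs/ prefix) is the "alternative namespace" trick mentioned above.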
Here's the general problem: what Git stores is these commits. There is no separate item that we can call "the history"; the history in a Git repository is the set of commits in the repository. We can see the problem visually if we draw the commits. Let's simplify it to just five commits, A through E, in the superproject. The uppercase letters stand in for the actual hash IDs (which are useless to humans):
    A--B--C   <-- master
        \
         D--E   <-- dev
Now let's put six commits in the subproject, using lowercase letters since it's the subproject:
    a--b--c--d   <-- master
           \
            e--f   <-- issue213
Some superproject commits (maybe all of them, but for simplicity, let's say just C and E) have inside them references to some of the subproject commits, so if we yank all of the submodule's commits into the superproject, using names sub/* to remember the branch tips, we get this:
    A--B--C            <-- master
        \ :
         D÷-----E      <-- dev
          :     :
          :     :
          :     :
          :     :
       a--b--c--d      <-- sub/master
              \
               e--f    <-- sub/issue213
Suppose we now, somehow, replace commits C (with its gitlink to b) and E (with its gitlink to d) with commits whose trees have actual, direct references to the tree objects for commits b and d. Let's call these commits C' and E'. This is technically possible in Git: we just make the new commits C' and E' with the trees we want, incorporating the trees from b and d respectively, then change the names master and dev to refer to commits C' and E'. If we drop the sub/* names, we have this:
    A--B--C'   <-- master
        \
         D--E'   <-- dev
and if we now git checkout master we will get a nice work-tree full of what was in the original C plus what was from the submodule, obtained from its commit b that the original C used, as we can see from our diagram.
Similarly, if we now git checkout dev we will get a nice work-tree full of what was in the original E plus what was from the submodule, obtained from its commit d.
The trees in this new modified repository contain all the sources for the snapshot that you'd get by checking out C-and-submodule, or E-and-submodule. But the commits that were in the submodule, i.e., the history of d leading back to c leading back to b leading back to a, plus the entire issue213 branch consisting of f leading back to e leading back to c... well, those commits are gone! There is nothing to represent them any more.
Moreover, there is no place in which you could insert them. Where, in the graph that contains commits A through E (all uppercase), do commits a through f (all lowercase) fit? The only answer is "nowhere": there's no place they can go.
Now, in specific cases, we can invent an answer. We can insert new commits between existing commits, so that the new commits keep the superproject's files in place while updating submodule files. This is practical whenever there is a topological sort of the submodule graph that "fits inside" a topological sort of the superproject graph. (If there are multiple submodules, we need a complete topo-sort of the union of all graphs.) There is no guarantee that this situation exists, and it's easy to draw a case where it does not:
    A--B--C   <-- master
     :   :
      : :
       :
      : :
     :   :
    a--b--c   <-- sub/master
Here, superproject commit A refers to the last commit in the subproject, while superproject commit C refers to the first commit in the subproject. These graph topologies are not composable.¹ But it may be the case that your topologies are, in which case you can insert commit nodes as needed, if you want to make up a new graph that acts as the appropriate superset. There is no program that I know of to do this.
¹ I'm not sure if "composable" is a good term for this but I do not have time for a literature search. What I mean is that combining the DAGs could result in cycles, and I am calling such repositories "non-composable". See also "Efficient algorithm for merging two DAGs", for instance.
Doing the more complicated job with composable submodules
You will have to write some code. 😅 This is nontrivial and requires a bit of graph theory. It's not especially complicated, but I am definitely not going to do it here.
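As a starting point only, here is a sketch of the feasibility test, not of the rewrite itself. It assumes one particular reading of "composable": a submodule commit must come no later than the superproject commit that records it, and every child of that submodule commit must come after that superproject commit. If the combined constraints contain a cycle, no interleaving exists. The function name and the dictionary-based graph encoding are hypothetical; graphlib needs Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def interleaved_order(super_parents, sub_parents, gitlinks):
    """Return one combined topological order, or None if the graphs cannot interleave.

    super_parents and sub_parents map each commit to a list of its parents.
    gitlinks maps a superproject commit to the submodule commit it records.
    """
    # Predecessor sets for the combined graph; nodes are tagged so names cannot collide.
    preds = {}
    for commit, parents in super_parents.items():
        preds.setdefault(("super", commit), set()).update(("super", p) for p in parents)
    for commit, parents in sub_parents.items():
        preds.setdefault(("sub", commit), set()).update(("sub", p) for p in parents)

    # Children of each submodule commit, derived from the parent lists.
    sub_children = {}
    for commit, parents in sub_parents.items():
        for parent in parents:
            sub_children.setdefault(parent, []).append(commit)

    for sup, sub in gitlinks.items():
        # Assumption 1: the submodule commit exists before the commit that uses it.
        preds.setdefault(("super", sup), set()).add(("sub", sub))
        # Assumption 2: newer submodule commits come after that superproject commit.
        for child in sub_children.get(sub, []):
            preds.setdefault(("sub", child), set()).add(("super", sup))

    try:
        return list(TopologicalSorter(preds).static_order())
    except CycleError:
        return None
```

With the three-commit example above (A recording c, C recording a) this returns None, while the earlier five-commit example (C recording b, E recording d) yields an order. Turning such an order into actual inserted commits is the part that still has to be written.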
Doing the simpler job, if truncated history is acceptable
The simpler job, which in the above example consists of replacing commit C with C' and E with E', is automatable: iterate through all commits, find their submodule gitlinks, and use git replace to replace the tree object that has the submodule with a tree object that uses the submodule's tree. This actually replaces the tree object, rather than the commit object, so that the history really still is the way it was before, but you will now have a very large collection of replacement objects. Moreover, cloning the repository won't clone the replacement objects, so now it is time to rewrite all the commits, using git filter-branch.
I don't have a handy recipe for using git replace like this, but you would probably want to automate the git replace --edit by setting your GIT_EDITOR variable to a script that would find and replace the gitlink entry. (Writing such a script is going to be a bit tedious but not technically difficult.)
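A minimal sketch of such an editor script follows. It assumes that git replace --edit presents a tree in git ls-tree format (one "<mode> <type> <object>" entry, a tab, then the name) and that the submodule's commits have already been fetched into this repository so that git rev-parse can resolve their trees. The function names are made up for illustration, and the sketch has not been tried on a real repository.

```python
import subprocess
import sys


def tree_of_commit(commit_id):
    """Ask Git for the tree of a commit; the commit must exist in this repository."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", f"{commit_id}^{{tree}}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def replace_gitlinks(listing, lookup_tree=tree_of_commit):
    """Rewrite each gitlink line of an ls-tree style listing into a tree entry."""
    out = []
    for line in listing.splitlines():
        meta, sep, name = line.partition("\t")
        fields = meta.split()
        # A gitlink entry looks like "160000 commit <commit-id>\t<name>".
        if sep and len(fields) == 3 and fields[:2] == ["160000", "commit"]:
            line = f"040000 tree {lookup_tree(fields[2])}\t{name}"
        out.append(line)
    return "\n".join(out) + "\n" if out else ""


# When used as an editor, Git passes the temporary file name as the only argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as handle:
        original = handle.read()
    updated = replace_gitlinks(original)
    if updated != original:
        with open(sys.argv[1], "w") as handle:
            handle.write(updated)
```

Saved under some name of your choosing, it would be invoked along the lines of GIT_EDITOR="python3 /path/to/script.py" git replace --edit <tree-id>, once for each tree that directly contains a gitlink.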
Since git filter-branch respects replacements,² and no other changes are required, you can just run git filter-branch --tag-name-filter cat -- --branches --tags to perform all the commit replacements. (Note: do this on a clone you have made specifically for the purpose of experimenting with replace and filter-branch, so that you can start over if you mess it up.) You can then remove all the replacement references (git for-each-ref --format='delete %(refname)' refs/replace | git update-ref --stdin) as they are no longer needed and are just making Git slow now. Be sure to include the refs/replace pattern: without it, the command lists, and therefore deletes, every reference in the repository.
² Well, it does unless run as git --no-replace-objects filter-branch.