Convert a git submodule to a regular directory and preserve the history in the main tree?

I have a project that consists of many submodules. However, in hindsight some of these shouldn't be submodules: they aren't meant to be used in another project and never will be, and I'm occasionally transferring code between them. This project doubled as an experiment in submodules, so I got a little crazy with it.

I was wondering if there was a way to convert the submodules to regular directories, maintaining the history of changes but rewriting the main project's history so that they're treated as regular directories.

I've seen stuff about subtree merging but I was hoping for a way to rewrite the commits so the file paths are prefixed with that of the submodule.

Ruthy answered 7/9, 2018 at 14:5 Comment(1)
I have found an answer to a similar question to be most helpful: https://mcmap.net/q/13595/-how-to-un-submodule-a-git-submodule - Brave

It is quite easy using git subtree if you just want to keep the history of a single branch for each submodule:

git fetch <path/to/submodule> HEAD
git rm <path/to/submodule>
git commit -m "Prepare to integrate Git submodules' history into repository"
git subtree add --prefix=<path/to/submodule> FETCH_HEAD 

This will integrate the history of the submodule's currently checked-out revision. Make sure you are in a clean state beforehand, e.g. by running git submodule update and double-checking with git status.

You will get two commits: the first removes the submodule, the second integrates the prior history (now stored in FETCH_HEAD) into the repository. There is no easy way (at least none that I am aware of) to do it in a single "atomic" commit. You would need to fiddle with Git's plumbing commands to do so.

If you need to integrate the history of several submodules, I recommend putting all removal operations into the first commit, and all integrating operations into the second. In that case you need to remember the fetched heads by some other means, since FETCH_HEAD is overwritten by each fetch.
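One way to remember the fetched heads is to park each one under a private ref. The following is a minimal sketch that runs in a throwaway directory, not a definitive recipe: the repository names liba and libb and the refs/imported/ namespace are hypothetical, and it assumes git subtree is installed. As far as I can tell, each git subtree add creates its own commit, so the integration step ends up as one commit per submodule.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Two tiny stand-in repositories (hypothetical names) used as submodules.
for name in liba libb; do
    git init -q "$name"
    echo "$name" > "$name/$name.txt"
    git -C "$name" add .
    git -C "$name" commit -q -m "$name: initial"
done

# Stand-in superproject that uses both as submodules.
git init -q super
cd super
git commit -q --allow-empty -m "super: initial"
for name in liba libb; do
    # protocol.file.allow is only needed because the demo clones local paths.
    git -c protocol.file.allow=always submodule --quiet add "$work/$name" "$name"
done
git commit -q -m "super: add submodules"

# Step 1: remember each submodule's HEAD under refs/imported/<path>.
for path in liba libb; do
    git fetch -q "./$path" "HEAD:refs/imported/$path"
done

# Step 2: remove all submodules in a single commit.
git rm -q liba libb
git commit -q -m "Prepare to integrate Git submodules' history into repository"

# Step 3: re-add each one as a regular directory, history included.
for path in liba libb; do
    git subtree add --prefix="$path" "refs/imported/$path"
    git update-ref -d "refs/imported/$path"   # the helper ref is no longer needed
done

git log --format=%s
```

Afterwards liba/ and libb/ are ordinary directories, and git log shows the submodules' own commits alongside the superproject's.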


Note: Although git subtree lives within ./contrib in upstream Git, it seems to be available (at least) on Debian since v1.9.1 (March 2014).

Kirbie answered 20/12, 2018 at 14:35 Comment(1)
By the way, have a look at git filter-repo for more sophisticated rewrites of commits. But be aware that it is a very sharp tool: powerful and fast – and dangerous. It rewrites a complete Git repository, hence you have to work on a fresh and independent clone. - Kirbie

This is a hard problem in general. There are specific cases, or degenerate ways of applying the submodule contents, that make it easier. One compromise—which may or may not be good enough—is to simply combine the two commit histories into one repository, then make some slightly-scary transformations using either git filter-branch, or just an automated git replace (though using, or abusing, git replace like this is likely to result in performance issues).

Here's the basic situation, i.e., what you need to know, as mental tools, before you think about generalizing the problem. Each repository contains a commit graph: a DAG of commits, with various entry points into the graph found, and hence preserved, by branch names. The superproject's commits have, in each commit that uses the submodule, a reference to one of the submodule's commits. These references are stored in tree objects, as entries of type gitlink. Git does not actually inspect them when it comes to retaining commits, since they are assumed to identify commits in some other repository (the submodule).

You can easily use git fetch to fetch the entire submodule's graph into the superproject repository, changing the submodule's branch names into different names in the superproject. (The default for git fetch is to produce remote-tracking names, but with a bit of sneakiness, you can easily use an alternative namespace. For the solutions I propose, remote-tracking names are fine anyway.) The result, though, is just that you have two otherwise disconnected DAGs. The superproject commits still just have trees with gitlink entries that refer to commits in the other DAG. Those gitlink entries will not keep the commits reachable, so you must retain both sets of names. Except for having all commits contained in one repository database, this is really no improvement at all (and it might be worse as it is now hard to work with).
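As a rough illustration of that fetch, here is a minimal sketch run against two throwaway repositories. The names sub and super and the refs/remotes/sub/ namespace are assumptions made for the demo, not anything your project must use.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in submodule repository with one commit.
git init -q sub
echo lib > sub/lib.txt
git -C sub add lib.txt
git -C sub commit -q -m "sub: a"
sub_branch=$(git -C sub symbolic-ref --short HEAD)

# Stand-in superproject with one commit.
git init -q super
cd super
echo app > app.txt
git add app.txt
git commit -q -m "super: A"

# Map every submodule branch to refs/remotes/sub/<branch> in the superproject.
git fetch -q ../sub 'refs/heads/*:refs/remotes/sub/*'
git for-each-ref --format='%(refname)' refs/remotes/sub/

# The two histories share no commit, so merge-base finds nothing.
if git merge-base HEAD "refs/remotes/sub/$sub_branch" >/dev/null; then
    relation=connected
else
    relation=disconnected
fi
echo "$relation"
```

The refspec is the only interesting part: everything under refs/heads/ in the submodule lands under refs/remotes/sub/ in the superproject, which keeps the submodule's commits reachable without touching the superproject's own branches.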

Here's the general problem: What Git stores is (are?) these commits. There is no separate item that we can call "the history"; the history in a Git repository is (are?) the commits in the repository. We can see, visually, the problem if we draw the commits. Let's simplify it to just five commits, A through E, in the superproject. The uppercase letters stand in for the actual hash IDs (which are useless to humans):

A--B--C   <-- master
    \
     D--E   <-- dev

Now let's put six commits in the subproject, using lowercase letters since it's the subproject:

a--b--c--d   <-- master
       \
        e--f   <-- issue213

Some superproject commits—maybe all of them, but for simplicity, let's say just C and E—have inside them references to some of the subproject commits, so if we yank all of the submodule's commits into the superproject, using names sub/* to remember the branch tips, we get this:

A--B--C   <-- master
    \ :
     D÷-E   <-- dev
      : :
     :  :
    :    :
   :     :
a--b--c--d   <-- sub/master
       \
        e--f   <-- sub/issue213

Suppose we now, somehow, replace commits C (with its gitlink to b) and E (with its gitlink to d) with commits whose trees have actual, direct references to the tree objects for commits b and d. Let's call these commits C' and E'. This is technically possible in Git: we just make the new commits C' and E' with the trees we want, which use the trees in b and d respectively, then change the names master and dev to refer to commits C' and E'. If we drop the sub/* names, we have this:

A--B--C'  <-- master
    \
     D--E'  <-- dev

and if we now git checkout master we will get a nice work-tree full of what was in the original C plus what was from the submodule, obtained from its commit b that the original C used, as we can see from our diagram.

Similarly, if we now git checkout dev we will get a nice work-tree full of what was in the original E plus what was from the submodule, obtained from its commit d.
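To make the construction of a commit like C' concrete, here is a minimal sketch using plumbing commands in a throwaway directory. The repository names and the submodule path sub are hypothetical, the gitlink is created directly with git update-index rather than git submodule add, and only top-level gitlinks are handled.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in submodule repository; its only commit plays the role of "b".
git init -q sub
echo lib > sub/lib.txt
git -C sub add lib.txt
git -C sub commit -q -m "sub: b"
pinned=$(git -C sub rev-parse HEAD)

# Stand-in superproject: B has no submodule, C pins "b" via a gitlink.
git init -q super
cd super
echo app > app.txt
git add app.txt
git commit -q -m "super: B"
git update-index --add --cacheinfo 160000,"$pinned",sub   # mode 160000 = gitlink
git commit -q -m "super: C"

# Bring the submodule's objects into the superproject's object database.
git fetch -q ../sub 'refs/heads/*:refs/remotes/sub/*'

# Rebuild C's top-level tree, swapping each gitlink for that commit's tree.
new_tree=$(git ls-tree HEAD | while read -r mode type sha path; do
    if [ "$type" = commit ]; then
        mode=040000
        type=tree
        sha=$(git rev-parse "$sha^{tree}")
    fi
    printf '%s %s %s\t%s\n' "$mode" "$type" "$sha" "$path"
done | git mktree)

# C' keeps C's parent but points at the rewritten tree.
new_commit=$(git commit-tree "$new_tree" -p HEAD~1 -m "super: C'")
git ls-tree -r --name-only "$new_commit"
```

The listing for C' contains the superproject's app.txt plus sub/lib.txt as an ordinary file, which is the snapshot described above. The sketch stops short of moving any branch name to the new commit.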

The trees in this new modified repository contain all the sources for the snapshot that you'd get by checking out C-and-submodule, or E-and-submodule. But the commits that were in the submodule, i.e., the history of d leading back to c leading back to b leading back to a, plus the entire issue213 branch consisting of f leading back to e leading back to c ... well, those commits are gone! There is nothing to represent them any more.

Moreover, there is no place in which you could insert them. Where, in the graph that contains commits A through E (all uppercase), do commits a through f (all lowercase) fit? The only answer is "nowhere": there's no place they can go.

Now, in specific cases, we can invent an answer. We can insert new commits between existing commits, so that the new commits keep the superproject's files in place while updating submodule files. This is practical whenever there is a topological sort of the submodule graph that "fits inside" a topological sort of the superproject graph. (If there are multiple submodules, we need a complete topo-sort of the union of all graphs.) There is no guarantee that this situation exists, and it's easy to draw a case where it does not:

A--B--C   <-- master
 :   :
  : :
   :
  : :
 :   :
a--b--c   <-- sub/master

Here, superproject commit A refers to the last commit in the subproject, while superproject commit C refers to the first commit in the subproject. These graph topologies are not composable.[1] But it may be the case that your topologies are, in which case you can insert commit nodes as needed, if you want to make up a new graph that acts as the appropriate superset. There is no program that I know of to do this.
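A rough way to spot the reversed case is to walk the superproject oldest-first and check that each pinned submodule commit descends from the previous one. The sketch below does only that, along first-parent history, so it is a sanity check for linear histories rather than a full test of whether two DAGs can be combined. The helper name check_order and the repository names are hypothetical.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Submodule history a--b--c (empty commits are enough for the demo).
git init -q sub
for msg in a b c; do
    git -C sub commit -q --allow-empty -m "$msg"
done

git init -q super
cd super
git fetch -q ../sub 'refs/heads/*:refs/remotes/sub/*'
# Superproject commits A, B, C pin c, b, a in that order (the reversed case).
for rev in HEAD HEAD~1 HEAD~2; do
    git update-index --add --cacheinfo 160000,"$(git -C ../sub rev-parse "$rev")",sub
    git commit -q -m "pin sub at $rev"
done

# check_order REV PATH: report whether the gitlink at PATH only ever moves
# forward in the submodule's history along REV's first-parent chain.
check_order() {
    prev=
    for c in $(git rev-list --reverse --first-parent "$1"); do
        cur=$(git rev-parse -q --verify "$c:$2" || true)
        [ -n "$cur" ] || continue   # this commit has no entry at PATH
        if [ -n "$prev" ] && ! git merge-base --is-ancestor "$prev" "$cur"; then
            echo "not monotonic at $c"
            return 1
        fi
        prev=$cur
    done
    echo "monotonic"
}

result=$(check_order HEAD sub || true)
echo "$result"
```

For the reversed example this reports a non-monotonic step at the second superproject commit, because the submodule commit it pins is an ancestor of the previous one rather than a descendant.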


[1] I'm not sure if "composable" is a good term for this, but I do not have time for a literature search. What I mean is that combining the DAGs could result in cycles, and I am calling such repositories "non-composable". See also "Efficient algorithm for merging two DAGs", for instance.


Doing the more complicated job with composable submodules

You will have to write some code. 😅 This is nontrivial and requires a bit of graph theory. It's not especially complicated, but I am definitely not going to do it here.

Doing the simpler job, if truncated history is acceptable

The simpler job, which in the above example consists of replacing commit C with C' and E with E', is automatable: iterate through all commits, find their submodule gitlinks, and use git replace to replace the tree object that has the submodule with a tree object that uses the submodule's tree. This actually replaces the tree object, rather than the commit object, so that the history really still is the way it was before, but you will now have a very large collection of replacement objects. Moreover, cloning the repository won't clone the replacement objects, so now it is time to rewrite all the commits, using git filter-branch.

I don't have a handy recipe for using git replace like this, but you would probably want to automate the git replace --edit by setting your GIT_EDITOR variable to a script that would find and replace the gitlink entry. (Writing such a script is going to be a bit tedious but not technically difficult.)
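As an alternative to driving git replace --edit through GIT_EDITOR, the replacement tree can be built with git mktree and registered with plain git replace. The following is a minimal sketch under stated assumptions, not a tested recipe for real repositories: the helper inline_gitlinks and the repository names are hypothetical, the submodule's objects are assumed to be fetched already, and only gitlinks in the top-level tree are handled.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in submodule and superproject; the gitlink is created directly.
git init -q sub
echo lib > sub/lib.txt
git -C sub add lib.txt
git -C sub commit -q -m "sub: b"
pinned=$(git -C sub rev-parse HEAD)

git init -q super
cd super
echo app > app.txt
git add app.txt
git update-index --add --cacheinfo 160000,"$pinned",sub
git commit -q -m "super: C"
git fetch -q ../sub 'refs/heads/*:refs/remotes/sub/*'

# inline_gitlinks TREE: print the ID of a tree in which every top-level
# gitlink is swapped for the tree of the commit it points at.
inline_gitlinks() {
    git ls-tree "$1" | while read -r mode type sha path; do
        if [ "$type" = commit ]; then
            mode=040000
            type=tree
            sha=$(git rev-parse "$sha^{tree}")
        fi
        printf '%s %s %s\t%s\n' "$mode" "$type" "$sha" "$path"
    done | git mktree
}

# Visit each distinct top-level tree on the local branches once.
git rev-list --branches | while read -r c; do
    git rev-parse "$c^{tree}"
done | sort -u | while read -r old; do
    new=$(inline_gitlinks "$old")
    [ "$new" = "$old" ] || git replace "$old" "$new"
done

git ls-tree -r --name-only HEAD            # replaced view: sub/ is a directory
git --no-replace-objects ls-tree HEAD sub  # original view: still a gitlink
```

The commits themselves are untouched at this point; only the view through the replacement refs changes, which is why a rewriting pass is still needed to make the result permanent and clonable.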

Since git filter-branch respects replacements,[2] and no other changes are required, you can just run git filter-branch --tag-name-filter cat -- --branches --tags to perform all the commit replacements. (Note: do this on a clone you have made specifically for the purpose of experimenting with replace and filter-branch, so that you can start over if you mess it up.) You can then remove all the replacement references (git for-each-ref --format='delete %(refname)' refs/replace/ | git update-ref --stdin) as they are no longer needed and are just making Git slow now. Be sure to include the refs/replace/ pattern: without it, for-each-ref lists every reference, and the pipeline would delete all your branches and tags too.


[2] Well, it does unless run as git --no-replace-objects filter-branch.

Welkin answered 7/9, 2018 at 21:22 Comment(0)

I don't have much experience working with submodules, but this is what I would do:

  • Remove the submodule from the project**. Add the "original submodule" repo as a remote for your project and fetch.
  • Merge whatever branch you want to bring over into your project. If I wanted to have the files from that other project into a separate directory of my main project, what I would do is probably checkout the submodule branch (not a submodule anymore, it's a real remote branch now), rename the files there into the directory I mean for it (so that it doesn't clash with anything from the main project), then I would merge this new revision into my main project.

Perhaps not the best approach but that's what I would do if I wanted to bring over a different project into my project while keeping things separate.
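Roughly, those steps could look like the sketch below, run here against throwaway repositories. The names lib, lib-origin and import-lib are hypothetical, the superproject is assumed to have had its submodule removed already, and the merge needs --allow-unrelated-histories (Git 2.9 or later) because the two histories share no commit.

```shell
set -eu
work=$(mktemp -d)   # throwaway sandbox for the demo
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the former submodule's own repository.
git init -q lib
echo lib > lib/lib.txt
git -C lib add lib.txt
git -C lib commit -q -m "lib: initial"
lib_branch=$(git -C lib symbolic-ref --short HEAD)

# Stand-in main project, with the submodule already removed.
git init -q super
cd super
echo app > app.txt
git add app.txt
git commit -q -m "super: initial"
main_branch=$(git symbolic-ref --short HEAD)

# Add the other repository as a remote and fetch it.
git remote add lib-origin "$work/lib"
git fetch -q lib-origin

# On a local branch of the fetched history, move everything under lib/.
git checkout -q -b import-lib "lib-origin/$lib_branch"
mkdir lib
for entry in $(git ls-tree --name-only HEAD); do   # assumes no spaces in names
    git mv "$entry" lib/
done
git commit -q -m "Move files under lib/"

# Merge the relocated history into the main project.
git checkout -q "$main_branch"
git merge -q --allow-unrelated-histories -m "Merge lib history under lib/" import-lib

git log --format=%s
```

The old commits keep their original paths, so following a file's history across the move may need git log --follow.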

** Is that even possible? I definitely need to get some more hands-on experience with submodules and such.

Hartshorn answered 7/9, 2018 at 19:22 Comment(0)
