Convert a Git folder to a submodule retrospectively?
9

150

Quite often it is the case that you're writing a project of some kind, and after a while it becomes clear that some component of the project is actually useful as a standalone component (a library, perhaps). If you've had that idea from early on, then there's a fair chance that most of that code is in its own folder.

Is there a way to convert one of the sub directories in a Git project to a submodule?

Ideally this would happen such that all of the code in that directory is removed from the parent project, and the submodule project is added in its place, with all the appropriate history, and such that all the parent project commits point to the correct submodule commits.

Unbosom answered 20/9, 2012 at 13:55 Comment(3)
#1366041 may help some :) - Respite
This is not part of the original question, but what would be even cooler would be a way to keep the history of files that had started outside the folder, and were moved into it. At the moment, all of the answers lose all of the history prior to the move. - Unbosom
@ggll's link is down. Here's an archived copy. - Awhirl
D
104

To isolate a subdirectory into its own repository, use filter-branch on a clone of the original repository:

git clone <your_project> <your_submodule>
cd <your_submodule>
git filter-branch --subdirectory-filter 'path/to/your/submodule' --prune-empty -- --all

It's then nothing more than deleting your original directory and adding the submodule to your parent project.
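The parent-side step can be sketched as follows. This is a minimal, self-contained demo in throwaway repositories, not a definitive recipe: the names lib, parent and libs/util are hypothetical, and the `-c protocol.file.allow=always` override is only there because the demo adds the submodule from a local path (recent Git versions refuse that by default); with a normal remote URL it should not be needed.

```shell
#!/bin/sh
# Sketch: replace a tracked directory with a submodule in the parent repository.
# All repository and path names here are hypothetical stand-ins.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the repository produced by the extraction step.
git init -q lib
cd lib
echo 'hello' > util.txt
git add util.txt
git commit -q -m 'library: initial commit'
cd ..

# Parent project that still tracks its own copy of the directory.
git init -q parent
cd parent
mkdir -p libs/util
echo 'hello' > libs/util/util.txt
echo 'app' > main.txt
git add .
git commit -q -m 'parent: initial commit'

# 1. Delete the original directory from the parent and commit that.
git rm -r -q libs/util
git commit -q -m 'Remove libs/util before converting it to a submodule'

# 2. Add the extracted repository as a submodule at the same path.
git -c protocol.file.allow=always submodule --quiet add "$work/lib" libs/util
git commit -q -m 'Add libs/util as a submodule'

# A submodule is stored as a "gitlink" entry with mode 160000.
mode=$(git ls-files -s libs/util | cut -d' ' -f1)
echo "libs/util is recorded with mode $mode"
```

Note that this only changes the parent from the current commit onwards; older parent commits still contain the directory as ordinary files.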

Disconnection answered 20/9, 2012 at 15:13 Comment(13)
You probably also want to git remote rm <name> after the filter branch, and then perhaps add a new remote. Also, if there are ignored files, a git clean -xd -f may be useful. - Unbosom
-- --all can be replaced with the name of a branch if the submodule should only be extracted from this branch. - Foremost
Does git clone <your_project> <your_submodule> only download files for your_submodule? - Laryngo
@DominicTobias: git clone source destination simply tells Git the location of where to put your cloned files. The actual magic to filter your submodule's files then happens in the filter-branch step. - Disconnection
filter-branch is deprecated nowadays. You can use git clone --filter, but your Git server must be configured to allow filtering, otherwise you'll get warning: filtering not recognized by server, ignoring. - Claimant
@MatthiasBraun: thanks, I didn't know about the --filter option to git clone. But nowhere in your linked warning (or the page in which it is contained) does it mention anything of being deprecated. But yes, nevertheless, filter-branch can be a dangerous command. - Disconnection
I used "deprecated" in the sense that filter-branch shouldn't be used according to the Git developers ("its use is not recommended"). I didn't mean to imply that filter-branch will be removed from Git. - Claimant
How to use git clone --filter to get the same result? @MatthiasBraun - Armful
This command gives me "WARNING: git-filter-branch has a glut of gotchas generating mangled history rewrites...." What's the non-deprecated way to do this? Maybe with github.com/newren/git-filter-repo? - Woodhead
When I did this I had a load of ref changes but all of the directories I wanted to filter out were still there? - Underdone
@Underdone all directories which are not listed on the command line are removed. Only the one directory which is listed is kept. - Disconnection
@knittl, not when I tried it. They were still there. Is there another step I need to take here? - Underdone
@Underdone no, normally not. Have you executed the command exactly as written in the answer? Were there any errors? Have you provided the correct refs to be rewritten? - Disconnection
H
46

First, change into the folder that will become the submodule. Then:

git init
git remote add origin <repourl>
git add .
git commit -am 'first commit in submodule'
git push -u origin master
cd ..
rm -rf <folder> # the folder which will be a submodule
git commit -am 'deleting folder'
git submodule add <repourl> <folder> # add the submodule
git commit -am 'adding submodule'
Hollinger answered 8/4, 2016 at 12:43 Comment(3)
This will lose all of the history of that folder. - Unbosom
The folder's earlier history stays in the main repository; new commits are recorded in the submodule. - Hollinger
If you have untracked content in the soon-to-be-submodule directory (e.g. content listed in your main repository's .gitignore), then the rm -rf <folder> will cause it to be lost. Better to cp -r <folder> <some_other_absolute_path> first; then, after the rm -rf, you can cp -r <some_other_absolute_path> <folder> (and delete it at <some_other_absolute_path>). - Florey
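The concern about untracked or ignored files can be checked before running rm -rf. The sketch below uses a dry run of git clean (the -n flag only lists, it deletes nothing) to show what Git does not track inside the folder, then copies the folder aside. The repository layout and the names parent, folder and folder-backup are hypothetical.

```shell
#!/bin/sh
# Sketch: list content that "rm -rf <folder>" would destroy without Git
# having a copy, then back the folder up. All names are hypothetical.
set -e
export LC_ALL=C
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

git init -q parent
cd parent
mkdir folder
echo 'tracked' > folder/code.txt
echo '*.log' > .gitignore
git add .
git commit -q -m 'initial commit'

# Two files Git does not track: one ignored, one simply untracked.
echo 'local notes' > folder/debug.log
echo 'scratch' > folder/todo.txt

# Dry run: -n lists only, -d includes directories, -x includes ignored files.
at_risk=$(git clean -n -d -x -- folder | sed 's/^Would remove //' | sort)
echo "Not tracked by Git:"
echo "$at_risk"

# Keep a copy outside the repository before deleting anything.
backup="$work/folder-backup"
cp -R folder "$backup"
```

If the dry run prints nothing, the folder holds only tracked files and the backup is optional.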
M
16

I know this is an old thread, but the answers here squash any related commits in other branches.

A simple way to clone and keep all those extra branches and commits:

1 - Make sure you have this git alias

git config --global alias.clone-branches '! git branch -a | sed -n "/\/HEAD /d; /\/master$/d; /remotes/p;" | xargs -L1 git checkout -t'

2 - Clone the remote, pull all branches, change the remote, filter your directory, push

git clone <user@host>:user/existing-repo.git new-repo
cd new-repo
git clone-branches
git remote rm origin
git remote add origin <user@host>:user/new-repo.git
git remote -v
git filter-branch --subdirectory-filter my_directory/ -- --all
git push --all
git push --tags
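As an alternative to the alias, the same idea (one local branch per remote-tracking branch, so that git push --all has something to push) can be sketched with git for-each-ref. This is a swapped-in technique, not the alias from this answer, and the repository and branch names are hypothetical.

```shell
#!/bin/sh
# Sketch: create a local tracking branch for every branch on origin.
# Repository and branch names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# A source repository with a few branches.
git init -q existing-repo
cd existing-repo
git symbolic-ref HEAD refs/heads/master
echo 'a' > a.txt
git add a.txt
git commit -q -m 'initial commit'
git branch feature-x
git branch feature-y
cd ..

# A fresh clone only has a local branch for the checked-out branch.
git clone -q existing-repo new-repo
cd new-repo

git for-each-ref --format='%(refname)' refs/remotes/origin |
while read -r ref; do
    name=${ref#refs/remotes/origin/}
    # Skip the symbolic origin/HEAD entry.
    if [ "$name" = HEAD ]; then
        continue
    fi
    # Skip branches that already exist locally.
    if git show-ref --verify --quiet "refs/heads/$name"; then
        continue
    fi
    git branch -q --track "$name" "origin/$name"
done

git for-each-ref --format='%(refname:short)' refs/heads
```

Unlike the alias, this does not check out each branch, so the working tree stays where it was.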
Mob answered 22/6, 2017 at 12:10 Comment(1)
My original had a link to a gist instead of embedding the code here on SO. - Mob
E
9

The official git project now recommends using git-filter-repo

# install git-filter-repo, see [1] for install via pip, or other OS's.
sudo apt-get install git-filter-repo 

# copy your repo; everything EXCEPT the subdir will be deleted, and the subdir will become root.
# --no-local is required to prevent git from hard linking to files in the original, and is checked by `filter-repo`
git clone working-dir/.git working-dir-copy --no-local
cd working-dir-copy

# extract the desired subdirectory and its history.
git filter-repo --subdirectory-filter foodir

# foodir is now its own directory. Push it to github/gitlab etc
git remote add origin user@hosting/project.git
git push -u origin --all
git push -u origin --tags

Thanks to this gist as well.

EDIT: For LFS users (poor folks) git clone does NOT pull the entire lfs history of an image, which causes git push to fail.

# The original repository needs the history of all images
git lfs fetch --all

# The clone needs to copy that history
git lfs install --skip-smudge
git lfs pull working-dir --all

[1] https://github.com/newren/git-filter-repo/blob/main/INSTALL.md

Eighteen answered 4/9, 2022 at 10:3 Comment(1)
This works pretty well. To replace in the original repo: rm -rf <subdir>; git add .; git commit -m 'removed subdir'; git submodule add <submodule> <subdir> - Firenew
R
5

Status quo

Let's assume we have a repository called repo-old which contains a subdirectory sub that we would like to convert into a submodule with its own repo repo-sub.

It is further intended that the original repo repo-old should be converted into a modified repo repo-new where all commits touching the previously existing subdirectory sub shall now point to the corresponding commits of our extracted submodule repo repo-sub.
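To make "a parent commit pointing to a submodule commit" concrete: in the parent's tree a submodule is a single entry of mode 160000 (a gitlink) whose object id is a commit of the submodule repository. The sketch below records such an entry by hand with git update-index, the same plumbing call the filter script in this answer relies on. The tiny repositories and file names are hypothetical.

```shell
#!/bin/sh
# Sketch: what a parent commit "pointing at" a submodule commit looks like.
# Repository and file names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the extracted submodule repository.
git init -q repo-sub
cd repo-sub
echo 'lib' > lib.txt
git add lib.txt
git commit -q -m 'sub: initial commit'
sub_commit=$(git rev-parse HEAD)
cd ..

# Stand-in for the rewritten parent repository.
git init -q repo-new
cd repo-new
echo 'app' > app.txt
git add app.txt

# Record a gitlink at path "sub" that pins the submodule commit.
git update-index --add --cacheinfo 160000,"$sub_commit",sub
git commit -q -m 'parent: pin sub at a specific commit'

# The tree entry has mode 160000 and type "commit".
entry=$(git ls-tree HEAD sub)
mode=$(echo "$entry" | awk '{print $1}')
kind=$(echo "$entry" | awk '{print $2}')
pinned=$(echo "$entry" | awk '{print $3}')
echo "$entry"
```

A complete submodule also needs a matching .gitmodules entry, which this demo leaves out on purpose to isolate the gitlink itself.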

Let's change

It is possible to achieve this with the help of git filter-branch in a two step process:

  1. Subdirectory extraction from repo-old to repo-sub (already mentioned in the accepted answer)
  2. Subdirectory replacement from repo-old to repo-new (with proper commit mapping)

Remark: I know that this question is old and it has already been mentioned that git filter-branch is kind of deprecated and might be dangerous. But on the other hand it might help others with personal repositories that are easy to validate after conversion. So be warned! And please let me know if there is any other tool that does the same thing without being deprecated and is safe to use!

I'll explain how I realized both steps on Linux with Git version 2.26.2 below. Older versions might work to some extent, but that needs to be tested.

For the sake of simplicity I will restrict myself to the case where there is just a master branch and an origin remote in the original repo repo-old. Also be warned that I rely on temporary git tags with the prefix temp_, which are removed in the process. So if there are already tags named similarly, you might want to adjust the prefix below. And finally, please be aware that I have not extensively tested this and there might be corner cases where the recipe fails. So please back up everything before proceeding!
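One possible way to take that backup is a Git bundle, a single file holding all refs and history, which can be cloned from later. The sketch below is only an illustration in a throwaway repository; the file name repo-old.bundle, the tag v1 and the directory restored are hypothetical.

```shell
#!/bin/sh
# Sketch: back up a repository into one bundle file and restore from it.
# File, tag and directory names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

git init -q repo-old
cd repo-old
echo 'data' > file.txt
git add file.txt
git commit -q -m 'initial commit'
git tag v1

# Write every ref (branches and tags) plus their history into one file.
git bundle create "$work/repo-old.bundle" --all 2>/dev/null

# Check that the bundle is valid and complete.
git bundle verify "$work/repo-old.bundle" >/dev/null 2>&1
cd ..

# Restoring is an ordinary clone from the bundle file.
git clone -q "$work/repo-old.bundle" restored 2>/dev/null

original_tag=$(git -C repo-old rev-parse 'v1^{commit}')
restored_tag=$(git -C restored rev-parse 'v1^{commit}')
echo "original: $original_tag"
echo "restored: $restored_tag"
```

A plain copy of the whole repo-old directory works too; the bundle is just easier to move around.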

The following bash snippets can be concatenated into one big script which should then be executed in the same folder where the repo repo-old lives. It is not recommended to copy and paste everything directly into a command window (even though I have tested this successfully)!

0. Preparation

Variables

# Root directory where repo-old lives
# and a temporary location for git filter-branch
root="$PWD"
temp='/dev/shm/tmp'

# The old repository and the subdirectory we'd like to extract
repo_old="$root/repo-old"
repo_old_directory='sub'

# The new submodule repository, its url
# and a hash map folder which will be populated
# and later used in the filter script below
repo_sub="$root/repo-sub"
repo_sub_url='https://github.com/somewhere/repo-sub.git'
repo_sub_hashmap="$root/repo-sub.map"

# The new modified repository, its url
# and a filter script which is created as heredoc below
repo_new="$root/repo-new"
repo_new_url='https://github.com/somewhere/repo-new.git'
repo_new_filter="$root/repo-new.sh"

Filter script

# The index filter script which converts our subdirectory into a submodule
cat << EOF > "$repo_new_filter"
#!/bin/bash

# Submodule hash map function
sub ()
{
    local old_commit=\$(git rev-list -1 \$1 -- '$repo_old_directory')

    if [ ! -z "\$old_commit" ]
    then
        echo \$(cat "$repo_sub_hashmap/\$old_commit")
    fi
}

# Submodule config
SUB_COMMIT=\$(sub \$GIT_COMMIT)
SUB_DIR='$repo_old_directory'
SUB_URL='$repo_sub_url'

# Submodule replacement
if [ ! -z "\$SUB_COMMIT" ]
then
    touch '.gitmodules'
    git config --file='.gitmodules' "submodule.\$SUB_DIR.path" "\$SUB_DIR"
    git config --file='.gitmodules' "submodule.\$SUB_DIR.url" "\$SUB_URL"
    git config --file='.gitmodules' "submodule.\$SUB_DIR.branch" 'master'
    git add '.gitmodules'

    git rm --cached -qrf "\$SUB_DIR"
    git update-index --add --cacheinfo 160000 \$SUB_COMMIT "\$SUB_DIR"
fi
EOF
chmod +x "$repo_new_filter"

1. Subdirectory extraction

cd "$root"

# Create a new clone for our new submodule repo
git clone "$repo_old" "$repo_sub"

# Enter the new submodule repo
cd "$repo_sub"

# Remove the old origin remote
git remote remove origin

# Loop over all commits and create temporary tags
for commit in $(git rev-list --all)
do
    git tag "temp_$commit" $commit
done

# Extract the subdirectory and slice commits
mkdir -p "$temp"
git filter-branch --subdirectory-filter "$repo_old_directory" \
                  --tag-name-filter 'cat' \
                  --prune-empty --force -d "$temp" -- --all

# Populate hash map folder from our previously created tag names
mkdir -p "$repo_sub_hashmap"
for tag in $(git tag | grep "^temp_")
do
    old_commit=${tag#'temp_'}
    sub_commit=$(git rev-list -1 $tag)

    echo $sub_commit > "$repo_sub_hashmap/$old_commit"
done
git tag | grep "^temp_" | xargs -d '\n' git tag -d 2>&1 > /dev/null

# Add the new url for this repository (and e.g. push)
git remote add origin "$repo_sub_url"
# git push -u origin master

2. Subdirectory replacement

cd "$root"

# Create a clone for our modified repo
git clone "$repo_old" "$repo_new"

# Enter the new modified repo
cd "$repo_new"

# Remove the old origin remote
git remote remove origin

# Replace the subdirectory and map all sliced submodule commits using
# the filter script from above
mkdir -p "$temp"
git filter-branch --index-filter "$repo_new_filter" \
                  --tag-name-filter 'cat' --force -d "$temp" -- --all

# Add the new url for this repository (and e.g. push)
git remote add origin "$repo_new_url"
# git push -u origin master

# Cleanup (commented for safety reasons)
# rm -rf "$repo_sub_hashmap"
# rm -f "$repo_new_filter"

Remark: If the newly created repo repo-new hangs during git submodule update --init then try to re-clone the repository recursively once instead:

cd "$root"

# Clone the new modified repo recursively
git clone --recursive "$repo_new" "$repo_new-tmp"

# Now use the newly cloned one
mv "$repo_new" "$repo_new-bak"
mv "$repo_new-tmp" "$repo_new"

# Cleanup (commented for safety reasons)
# rm -rf "$repo_new-bak"
Radiate answered 26/5, 2020 at 10:8 Comment(2)
This is an amazing answer. Exactly what I needed, and clearly outlines the steps needed, in an easy to follow format. - Crites
Fantastic guide - thank you very much. - Hindquarter
H
2

The current answer by @knittl using filter-branch gets us quite close to the desired effect, but when tried, Git threw a warning at me:

WARNING: git-filter-branch has a glut of gotchas generating mangled history
         rewrites.  Hit Ctrl-C before proceeding to abort, then use an
         alternative filtering tool such as 'git filter-repo'
         (https://github.com/newren/git-filter-repo/) instead.  See the
         filter-branch manual page for more details; to squelch this warning,
         set FILTER_BRANCH_SQUELCH_WARNING=1.

Now 9 years after this question was first asked and answered, filter-branch is deprecated in favor of git filter-repo. Indeed, when I looked at my git history using git log --all --oneline --graph, it was full of irrelevant commits.

How to use git filter-repo then? GitHub has a pretty good article outlining that here. (Note that you will need to install it independently from Git. I used the Python version with pip3 install git-filter-repo.)

In case they decide to move/delete the article, I will summarize and generalize their procedure below:

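git clone <your_old_project_remote> <your_submodule>
cd <your_submodule>
# --path keeps the files under path/to/your/submodule; to make that directory
# the root of the new repository, use --subdirectory-filter instead (see comments)
git filter-repo --path path/to/your/submodule
# filter-repo normally removes the origin remote, so add it again
git remote add origin <your_new_submodule_remote>
git push -u origin <branch_name>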

From there, you just need to register the new repository as a submodule where you want it to be:

cd <path/to/your/parent/module>
git submodule add <your_new_submodule_remote>
git submodule update
git commit
Hemlock answered 2/12, 2021 at 10:24 Comment(2)
This leaves the code in the same sub directory (path/to/your/submodule) also in the new repo. How to let the sub directory in the old repo be the top-level directory in the new one (without simply moving it after the filtering is done)? - Melanie
git filter-repo --subdirectory-filter path/to/your/submodule seems to have done the trick. - Melanie
R
1

It can be done, but it's not simple. If you search for git filter-branch, subdirectory and submodule, there are some decent write-ups on the process. It essentially entails creating two clones of your project, using git filter-branch to remove everything except the one subdirectory in one, and removing only that subdirectory in the other. Then you can establish the second repository as a submodule of the first.
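The two-clone process described above can be sketched like this. It is a minimal demo on a throwaway repository, assuming git filter-branch is acceptable despite its warning (squelched here through FILTER_BRANCH_SQUELCH_WARNING); the names project, sub, only-sub and without-sub are hypothetical.

```shell
#!/bin/sh
# Sketch: two clones, one keeping only sub/, one dropping only sub/.
# All repository and path names are hypothetical.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Source project: one commit touching everything, one touching only sub/.
git init -q project
cd project
mkdir sub
echo 'lib v1' > sub/lib.txt
echo 'app v1' > app.txt
git add .
git commit -q -m 'Add app and sub'
echo 'lib v2' > sub/lib.txt
git commit -q -am 'Update sub only'
cd ..

# Clone 1: remove everything except sub/, which becomes the root.
git clone -q project only-sub
cd only-sub
git filter-branch --subdirectory-filter sub --prune-empty -- --all >/dev/null 2>&1
sub_files=$(git ls-tree -r --name-only HEAD)
sub_commits=$(git rev-list --count HEAD)
cd ..

# Clone 2: remove only sub/ from every commit, dropping emptied commits.
git clone -q project without-sub
cd without-sub
git filter-branch \
    --index-filter 'git rm -r -q --cached --ignore-unmatch sub' \
    --prune-empty -- --all >/dev/null 2>&1
rest_files=$(git ls-tree -r --name-only HEAD)
rest_commits=$(git rev-list --count HEAD)
cd ..

echo "only-sub:    $sub_commits commits, files: $sub_files"
echo "without-sub: $rest_commits commits, files: $rest_files"
```

The first clone then becomes the submodule repository, and the second can take it on with git submodule add.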

Rouleau answered 20/9, 2012 at 15:8 Comment(0)
A
0

This does the conversion in place; you can back it out as you would any filter-branch (I use git fetch . +refs/original/*:*).

I have a project with a utils library that has started to be useful in other projects, and I wanted to split its history off into a submodule. I didn't think to look on SO first, so I wrote my own. It builds the history locally, so it's a good bit faster. Afterwards, if you want, you can set up the helper command's .gitmodules file and such, and push the submodule histories themselves anywhere you want.

The stripped command itself is here; the documentation is in the comments of the unstripped version that follows. Run it as its own command, with subdir set, like subdir=utils git split-submodule if you're splitting the utils directory. It's hacky because it's a one-off, but I tested it on the Documentation subdirectory in the Git history.

#!/bin/bash
# put this or the commented version below in e.g. ~/bin/git-split-submodule
${GIT_COMMIT-exec git filter-branch --index-filter "subdir=$subdir; ${debug+debug=$debug;} $(sed 1,/SNIP/d "$0")" "$@"}
${debug+set -x}
fam=(`git rev-list --no-walk --parents $GIT_COMMIT`)
pathcheck=(`printf "%s:$subdir\\n" ${fam[@]} \
    | git cat-file --batch-check='%(objectname)' | uniq`)
[[ $pathcheck = *:* ]] || {
    subfam=($( set -- ${fam[@]}; shift;
        for par; do tpar=`map $par`; [[ $tpar != $par ]] &&
            git rev-parse -q --verify $tpar:"$subdir"
        done
    ))
    git rm -rq --cached --ignore-unmatch  "$subdir"
    if (( ${#pathcheck[@]} == 1 && ${#fam[@]} > 1 && ${#subfam[@]} > 0)); then
        git update-index --add --cacheinfo 160000,$subfam,"$subdir"
    else
        subnew=`git cat-file -p $GIT_COMMIT | sed 1,/^$/d \
            | git commit-tree $GIT_COMMIT:"$subdir" $(
                ${subfam:+printf ' -p %s' ${subfam[@]}}) 2>&-
            ` &&
        git update-index --add --cacheinfo 160000,$subnew,"$subdir"
    fi
}
${debug+set +x}

#!/bin/bash
# Git filter-branch to split a subdirectory into a submodule history.

# In each commit, the subdirectory tree is replaced in the index with an
# appropriate submodule commit.
# * If the subdirectory tree has changed from any parent, or there are
#   no parents, a new submodule commit is made for the subdirectory (with
#   the current commit's message, which should presumably say something
#   about the change). The new submodule commit's parents are the
#   submodule commits in any rewrites of the current commit's parents.
# * Otherwise, the submodule commit is copied from a parent.

# Since the new history includes references to the new submodule
# history, the new submodule history isn't dangling, it's incorporated.
# Branches for any part of it can be made casually and pushed into any
# other repo as desired, so hooking up the `git submodule` helper
# command's conveniences is easy, e.g.
#     subdir=utils git split-submodule master
#     git branch utils $(git rev-parse master:utils)
#     git clone -sb utils . ../utilsrepo
# and you can then submodule add from there in other repos, but really,
# for small utility libraries and such, just fetching the submodule
# histories into your own repo is easiest. Setup on cloning a
# project using "incorporated" submodules like this is:
#   setup:  utils/.git
#
#   utils/.git:
#       @if _=`git rev-parse -q --verify utils`; then \
#           git config submodule.utils.active true \
#           && git config submodule.utils.url "`pwd -P`" \
#           && git clone -s . utils -nb utils \
#           && git submodule absorbgitdirs utils \
#           && git -C utils checkout $$(git rev-parse :utils); \
#       fi
# with `git config -f .gitmodules submodule.utils.path utils` and
# `git config -f .gitmodules submodule.utils.url ./`; cloners don't
# have to do anything but `make setup`, and `setup` should be a prereq
# on most things anyway.

# You can test that a commit and its rewrite put the same tree in the
# same place with this function:
# testit ()
# {
#     tree=($(git rev-parse `git rev-parse $1`: refs/original/refs/heads/$1));
#     echo $tree `test $tree != ${tree[1]} && echo ${tree[1]}`
# }
# so e.g. `testit make~95^2:t` will print the `t` tree there and if
# the `t` tree at ~95^2 from the original differs it'll print that too.

# To run it, say `subdir=path/to/it git split-submodule` with whatever
# filter-branch args you want.

# $GIT_COMMIT is set if we're already in filter-branch, if not, get there:
${GIT_COMMIT-exec git filter-branch --index-filter "subdir=$subdir; ${debug+debug=$debug;} $(sed 1,/SNIP/d "$0")" "$@"}

${debug+set -x}
fam=(`git rev-list --no-walk --parents $GIT_COMMIT`)
pathcheck=(`printf "%s:$subdir\\n" ${fam[@]} \
    | git cat-file --batch-check='%(objectname)' | uniq`)

[[ $pathcheck = *:* ]] || {
    subfam=($( set -- ${fam[@]}; shift;
        for par; do tpar=`map $par`; [[ $tpar != $par ]] &&
            git rev-parse -q --verify $tpar:"$subdir"
        done
    ))

    git rm -rq --cached --ignore-unmatch  "$subdir"
    if (( ${#pathcheck[@]} == 1 && ${#fam[@]} > 1 && ${#subfam[@]} > 0)); then
        # one id same for all entries, copy mapped mom's submod commit
        git update-index --add --cacheinfo 160000,$subfam,"$subdir"
    else
        # no mapped parents or something changed somewhere, make new
        # submod commit for current subdir content.  The new submod
        # commit has all mapped parents' submodule commits as parents:
        subnew=`git cat-file -p $GIT_COMMIT | sed 1,/^$/d \
            | git commit-tree $GIT_COMMIT:"$subdir" $(
                ${subfam:+printf ' -p %s' ${subfam[@]}}) 2>&-
            ` &&
        git update-index --add --cacheinfo 160000,$subnew,"$subdir"
    fi
}
${debug+set +x}
Afire answered 26/5, 2020 at 18:27 Comment(0)
M
0

If it's acceptable to keep the previous history in the parent folder only, a simple solution is removing the subfolder from the index and starting a new repository or submodule in the same path. For example:

  1. Add subdir to .gitignore
  2. git rm -r --cached subdir
  3. git add .gitignore && git commit
  4. cd subdir && git init && git add .
  5. Commit initial files in the new subdir repository
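The steps above can be sketched end to end as follows, using a throwaway repository. The names parent, subdir, app.txt and lib.txt are hypothetical, and the commit messages are only examples.

```shell
#!/bin/sh
# Sketch: stop tracking subdir in the parent and start a fresh repository
# inside it, leaving the files on disk. All names are hypothetical.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

git init -q parent
cd parent
mkdir subdir
echo 'lib' > subdir/lib.txt
echo 'app' > app.txt
git add .
git commit -q -m 'initial commit'

# 1. Ignore the directory in the parent.
echo 'subdir/' >> .gitignore

# 2. Remove it from the index only; the working tree files stay.
git rm -r -q --cached subdir

# 3. Commit the change in the parent.
git add .gitignore
git commit -q -m 'Stop tracking subdir'

# 4. Start a new repository inside the directory.
cd subdir
git init -q
git add .

# 5. Commit the initial files there.
git commit -q -m 'Initial commit of subdir as its own repository'
cd ..

parent_tracked=$(git ls-files subdir)
inner_tracked=$(git -C subdir ls-files)
echo "parent tracks in subdir: ${parent_tracked:-nothing}"
echo "inner repo tracks: $inner_tracked"
```

The parent keeps the old history of subdir in its earlier commits, while the inner repository starts from a single new commit.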

From git help rm:

--cached: Use this option to unstage and remove paths only from the index. Working tree files, whether modified or not, will be left alone.

Having used submodules in production code, I can say that it's a nice solution, especially since it documents the project's dependencies.

For a simple project, or if there aren't other developers, or there isn't a strong dependency and the folder structure is more of a convenience, submodules may be a little too much. If you choose to go that route however, skip step 1 and proceed accordingly.

Myosin answered 18/1, 2023 at 11:47 Comment(0)
