GIT Split Repository directory preserving *move / renames* history
git
V

6

20

Let's say you have the repository:

myCode/megaProject/moduleA
myCode/megaProject/moduleB

Over time (months), you reorganise the project, refactoring the code to make the modules independent. Files in the megaProject directory get moved into their own directories. The emphasis is on move: the history of these files is preserved.

myCode/megaProject
myCode/moduleA
myCode/moduleB

Now you wish to move these modules into their own Git repositories, leaving the original with just megaProject.

myCode/megaProject
newRepoA/moduleA
newRepoB/moduleB

The filter-branch command is documented to do this, but it doesn't follow history when files were moved in from outside the target directory. The history therefore begins at the commit where the files were moved into their new directory and omits the history they had while they lived in the old megaProject directory.
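
To make the difference concrete, here is a small sketch that builds a throwaway repository, moves a file as described above, and counts the commits a path-limited log sees with and without --follow. The function name demo_follow, the file a.c, the identity settings and the commit messages are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical demo: runs entirely inside a temporary directory.
# The function body is a subshell, so the cd and variables do not leak out.
demo_follow() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q .
    # Identity is passed per command so the demo does not rely on global config.
    g="git -c user.name=demo -c user.email=demo@example.invalid"

    # First commit: the file lives inside megaProject.
    mkdir -p megaProject/moduleA
    echo 'int a;' > megaProject/moduleA/a.c
    git add . && $g commit -q -m 'add a.c inside megaProject'

    # Second commit: the file is moved to a top-level moduleA directory.
    mkdir moduleA
    git mv megaProject/moduleA/a.c moduleA/a.c
    $g commit -q -m 'move a.c to top-level moduleA'

    # Count the commits each form of git log reports for the new path.
    echo "plain:  $(git log --format=%s -- moduleA/a.c | wc -l | tr -d ' ')"
    echo "follow: $(git log --format=%s --follow -- moduleA/a.c | wc -l | tr -d ' ')"

    cd / && rm -rf "$repo"
)
```

Calling demo_follow should report one commit for the plain log and two with --follow; a split that filters only on the new path sees the same single commit as the plain log.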

How can I split a Git history based on a target directory while also following history outside of that path, leaving only the commit history related to these files and nothing else?

The numerous other answers on SO focus on splitting a repo in general, but none of them mention splitting while following the move history.

Valdivia answered 2/1, 2016 at 17:49 Comment(0)
H
10

This is a version based on @rksawyer's scripts, but it uses git-filter-repo instead. I found it much easier to use and much faster than git filter-branch, and the Git documentation now recommends it as a replacement.

# Run this script from the folder that contains the project folder.
# This script uses git-filter-repo (https://github.com/newren/git-filter-repo).
# List the files and folders that you want to keep in <your_repo_folder_name>_KEEP.txt. The file must end with a newline, otherwise the last file/folder will be skipped.
# The result will be a folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved; see the line below to preserve tags.
# Running the script again backs up the previous run in <your_repo_folder_name>_REWRITE_CLONE_BKP.

# Define here the name of the folder containing the repo: 
GIT_REPO="git-test-orig"

clone="$GIT_REPO"_REWRITE_CLONE
temp=/tmp/git_rewrite_temp
rm -Rf "$clone"_BKP
mv "$clone" "$clone"_BKP
rm -Rf "$temp"
mkdir "$temp"
git clone "$GIT_REPO" "$clone"
cd "$clone"
git remote remove origin
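# Open both folders in a file browser (as written this targets macOS; remove these lines elsewhere)
open .
open "$temp"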

# Comment line below to preserve tags
git tag | xargs git tag -d

echo 'Start logging file history...'
printf '# git log results:\n\n' > "$temp"/log.txt

# Read one path per line; -r keeps backslashes and IFS= keeps surrounding spaces
while IFS= read -r p
do
    shopt -s dotglob
    find "$p" -type f > "$temp"/temp
    while IFS= read -r f
    do
        echo "## " "$f" >> "$temp"/log.txt
        # Print every commit that touched the file, following renames.
        # Then remove blank lines, then drop every other line (the hashes) to end up with the list of filenames.
        git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt

        printf '\n\n\n' >> "$temp"/log.txt
    done < "$temp"/temp
done < ../"$GIT_REPO"_KEEP.txt > "$temp"/PRESERVE

mv "$temp"/PRESERVE "$temp"/PRESERVE_full
awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

sort -o "$temp"/PRESERVE "$temp"/PRESERVE

echo 'Starting filter-repo --------------------------'
git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
echo 'Finished filter-repo --------------------------'

The script logs the output of git log to /tmp/git_rewrite_temp/log.txt. If you don't need log.txt and want the script to run faster, remove the lines that write to it.
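
The core of the script is the pipeline that turns git log output into a list of paths. The sketch below isolates that step so it can be tried on its own; the function names paths_from_follow_log and historical_paths are hypothetical, and it assumes, as the script does, that each commit contributes exactly one hash line and one path line (which is what --follow on a single file produces).

```shell
#!/bin/sh
# Reduce `git log --pretty=format:%H --name-only --follow` output (on stdin)
# to the unique paths it mentions: drop blank lines, keep every second
# remaining line (the path after each hash), then de-duplicate.
paths_from_follow_log() {
    awk 'NF > 0' | awk 'NR % 2 == 0' | sort -u
}

# Hypothetical wrapper: print every path the file $1 has had in its history.
# Must be run inside a Git work tree.
historical_paths() {
    git log --pretty=format:'%H' --name-only --follow -- "$1" | paths_from_follow_log
}
```

For example, running historical_paths moduleA/a.c in a repository where that file used to live under megaProject/moduleA would be expected to print both the old and the new path, one per line, which is the form --paths-from-file reads.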

Holifield answered 21/1, 2020 at 0:43 Comment(6)
Awesome example of the use of an awesome tool! After a day of troubles with filter-branch, running for 40 minutes only not to work, this solved it correctly in about 5 seconds. – Mugger
I had some messy old, empty commits, so I ended up adding --prune-empty always to the filter-repo command. – Mugger
The auto setting will prune all commits that end up as empty when rewriting the repo. In my case, I guess I have actual empty commits. They seem to originate from the repo before it was git (svn), and probably wound up empty for some reason, either through svn being svn, or in the migration to git. Anyway, there is no reason to keep the commits, and they should probably just be removed from the original repo itself. – Mugger
I'm kind of new to git-filter-repo, but reading through the documentation, shouldn't git filter-repo --analyze be able to give you information on renames? – Rubicund
I found your shell script version a little too different from what I'd have implemented to feel comfortable with it, so I wrote one in Python which behaves more similarly to bare git-filter-repo, has --help, and has a bunch of safety guards. I'm not sure what would be the most appropriate way to make it its own answer in this particular case. (It's a Gist, but it's also too long to code-block here IMO.) – Lashawna
I'd add it as an answer. If it's an improvement it's better for the community, so deserves more visibility. Although I know my script works well, my shell skills are meagre so the code is ugly. – Holifield
C
4

Running git filter-branch --subdirectory-filter in your cloned repository will remove all commits that don't affect content in that subdirectory, which includes those affecting the files before they were moved.

Instead, you need to use the --tree-filter flag (or the faster --index-filter, which works on the index without checking out each commit) with a script that deletes all files you're not interested in, and the --prune-empty flag to drop any commits that only affected other content.

There's a blog post from Kevin Deldycke with a good example of this:

git filter-branch --prune-empty --tree-filter 'find ./ -maxdepth 1 -not -path "./e107*" -and -not -path "./wordpress-e107*" -and -not -path "./.git" -and -not -path "./" -print -exec rm -rf "{}" \;' -- --all

This command effectively checks out each commit in turn, deletes all uninteresting files from the working directory and, if anything has changed from the last commit then it checks it in (rewriting the history as it goes). You would need to tweak that command to delete all files except those in, say, /moduleA, /megaProject/moduleA and the specific files you want to keep from /megaProject.
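
For comparison, the same idea can be sketched with --index-filter, which edits the index directly instead of checking out every commit. This is only an illustration under stated assumptions: the KEEP_RE pattern reflects the hypothetical moduleA layout from the question, the function names are made up, and paths containing unusual characters (which git ls-files prints quoted) are not handled. Like the command above, it keeps paths by location and does not by itself detect renames.

```shell
#!/bin/sh
# Extended regex for the paths to keep (assumed layout; adjust to your repo).
KEEP_RE='^(moduleA|megaProject/moduleA)/'

# Read paths on stdin and print the ones that should be dropped.
paths_to_drop() {
    grep -vE "$KEEP_RE" || true   # grep exits 1 when nothing is left; not an error here
}

# Not called automatically: rewrites every ref of the current clone.
# Run it only in a disposable clone.
split_with_index_filter() {
    git filter-branch --prune-empty --index-filter "
        git ls-files | grep -vE '$KEEP_RE' |
        while IFS= read -r path; do
            git rm -q --cached --ignore-unmatch -- \"\$path\"
        done
    " -- --all
}
```

The helper paths_to_drop applies the same pattern to any path list, which makes it easy to check the regex before rewriting anything, for example with git ls-files | paths_to_drop.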

Climb answered 2/1, 2016 at 20:40 Comment(1)
It didn't work for me; for some reason it deletes .git/refs/heads, destroying my repo. Interestingly enough, not all files inside .git are deleted. Do you know why this may be happening? Also, I fail to see how this solution would preserve moves/renames. – Holifield
C
2

I'm aware of no simple way to do this, but it can be done.

The problem with filter-branch is that it works by

applying custom filters on each revision

If you can create a filter which won't delete your files they will be tracked between directories. Of course this is likely to be non-trivial for any repository which isn't trivial.

To start, let's assume it is a trivial repository: you have never renamed a file, and you have never had files with the same name in two modules. All you need to do is get a list of the files in your module:

find megaProject/moduleA -type f -printf "%f\n" > preserve

and then run your filter using those filenames and your directory:

preserve.sh

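#!/bin/sh
# Build the find arguments as positional parameters so that
# file names containing spaces survive intact.
set -- . -type f ! -name d1
while IFS= read -r f; do
  # Exclude every file name listed in the preserve file
  set -- "$@" ! -name "$f"
done < /path/to/myCode/preserve
# Delete every remaining file
find "$@" -exec rm -f -- {} +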

git filter-branch --prune-empty --tree-filter '/path/to/myCode/preserve.sh' HEAD

Of course, it's renames that make this difficult. One of the nice things about git filter-branch is that it gives you the $GIT_COMMIT environment variable. You can then get fancy and use things like:

# --follow only works on a single file, so loop over the files in the module
for f in megaProject/moduleA/*
do
 git log --pretty=format:'%H' --name-only --follow -- "$f" |  awk '{ if($0 != ""){ printf "%s:", $0; next; } print; }'
done > preserve

to build a filename history, with commits, that could be used in place of the simple preserve file in the trivial example, but the onus is going to be on you to keep track of what files should be present at each commit. This actually shouldn't be too hard to code out, but I haven't seen anybody who's done it yet.

Culicid answered 2/1, 2016 at 21:26 Comment(1)
That looks cool if polished, but doesn't work if applied as is. – Fellatio
A
1

Following on from the answer above: first, iterate through all of the files in the directory that is being kept, using git log --follow to get the old paths/names from prior moves/renames. Then use filter-branch to iterate through every revision, removing any files that are not on the list created in step 1.

#!/bin/bash
DIRNAME=dirD

# Catch all files including hidden files
shopt -s dotglob
for f in "$DIRNAME"/*
do
# print every file and follow to get any previous renames
# Then remove blank lines.  Then remove every other line to end up with the list of filenames
 git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0'
done > /tmp/PRESERVE

sort -o /tmp/PRESERVE /tmp/PRESERVE
cat /tmp/PRESERVE

Then create a script (preserve.sh) that filter-branch will call for each revision.

#!/bin/bash
DIRNAME=dirD

# List every file in this revision except .git and the directory being kept,
# one path per line with the leading ./ removed, sorted for comm.
# Double quotes are needed so that $DIRNAME is expanded.
find . -type f -not -path './.git/*' -not -path "./$DIRNAME/*" | cut -c3- | sort > /tmp/ALL2

# Everything that's not in the PRESERVE list gets deleted
comm -23 /tmp/ALL2 /tmp/PRESERVE > /tmp/DELETE_THESE
echo 'delete these:'
cat /tmp/DELETE_THESE

while IFS= read -r f; do
  rm -- "$f"
done < /tmp/DELETE_THESE

Now run filter-branch. If all files are removed in a revision, --prune-empty drops that commit and its message.

 git filter-branch --prune-empty --tree-filter '/FULL_PATH/preserve.sh' master
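
The step that decides what to delete is a set difference computed with comm -23. A minimal sketch of just that step follows; the function name files_to_delete is hypothetical. Since comm expects both inputs to be sorted the same way, the sketch sorts both lists under one fixed locale first.

```shell
#!/bin/sh
# Print the lines of file $1 (all files in the revision) that do not
# appear in file $2 (the preserve list).
files_to_delete() {
    all_sorted=$(mktemp) || return 1
    keep_sorted=$(mktemp) || return 1
    # comm needs both inputs sorted identically, so fix the locale.
    LC_ALL=C sort -u "$1" > "$all_sorted"
    LC_ALL=C sort -u "$2" > "$keep_sorted"
    # -2 hides lines only in the preserve list, -3 hides lines in both,
    # leaving the lines that are only in the first file.
    LC_ALL=C comm -23 "$all_sorted" "$keep_sorted"
    rm -f "$all_sorted" "$keep_sorted"
}
```

Sorting inside the function means the two lists can come from different tools or locales without comm silently missing matches.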
Acrosstheboard answered 14/9, 2019 at 20:50 Comment(3)
This works well! I only had to change a few things to make it work with paths that contain spaces. – Holifield
@Holifield Hi, by any chance, do you still have the version that fixes the spaces? – Flatten
@Flatten Hi. You have to add quotes when using the variables, like "$DIRNAME". I posted mine as a new answer. – Holifield
T
1

Here's my version of the script @Roberto posted, written for Linux/WSL. If you don't supply a "myrepo_KEEP.txt", the script creates one based on the current file structure. Pass in the repo to work on:

prune.sh MyRepo

# Run this script one level up from the git repo folder (i.e. in the containing folder).
# This script uses git-filter-repo (github.com/newren/git-filter-repo).
# The result will be a folder called <your_repo_folder_name>_REWRITE_CLONE. Your original repo won't be changed.
# Tags are not preserved; see the line below to preserve tags.
# Running the script again backs up the previous run in <your_repo_folder_name>_REWRITE_CLONE_BKP.
# Optionally, list the files and folders that you want to keep in the KEEP_FILE (<your_repo_folder_name>_KEEP.txt).
## The file must end with a newline, otherwise the last file/folder will be skipped.
## If this file is missing, this script creates it with all current folders listed.

echo "Prune git repo"

# User needs to pass in the repo name
GIT_REPO=$1

if [ -z "$GIT_REPO" ]; then
    echo "Pass in the directory to prune"
else
    KEEP_FILE="${GIT_REPO}"_KEEP.txt

    # Build up a list of current directories in the repo, if one hasn't been supplied
    if [ ! -f "${KEEP_FILE}" ]; then
        echo "Keeping all current files in repo (generating keep file)"
        cd "$GIT_REPO"
        find . -type d -not -path '*/\.*' > "../${KEEP_FILE}"
        cd ..
    fi

    echo "Pruning $GIT_REPO"

    clone="${GIT_REPO}_REWRITE_CLONE"
    
    # Shift backup
    bkp="${clone}_BKP"
    temp=/tmp/git_rewrite_temp
    echo $clone
    rm -Rf "$bkp"
    mv "$clone" "$bkp"
    
    # Setup temp
    rm -Rf "$temp"
    mkdir "$temp"   
    
    # Clone
    echo "Cloning repo...from $GIT_REPO to $clone"
    if git clone "$GIT_REPO" "$clone"; then
        cd "$clone"
        git remote remove origin

        # Comment line below to preserve tags
        git tag | xargs git tag -d

        echo 'Start logging file history...'
        printf '# git log results:\n\n' > "$temp"/log.txt

        # Follow the renames
        while IFS= read -r p
        do
            shopt -s dotglob
            find "$p" -type f > "$temp"/temp
            while IFS= read -r f
            do
                echo "## " "$f" >> "$temp"/log.txt
                # Print every commit that touched the file, following renames.
                # Then remove blank lines, then drop every other line (the hashes) to end up with the list of filenames.
                git log --pretty=format:'%H' --name-only --follow -- "$f" | awk 'NF > 0' | awk 'NR%2==0' | tee -a "$temp"/log.txt

                printf '\n\n\n' >> "$temp"/log.txt
            done < "$temp"/temp
        done < ../"${KEEP_FILE}" > "$temp"/PRESERVE

        mv "$temp"/PRESERVE "$temp"/PRESERVE_full
        awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE

        sort -o "$temp"/PRESERVE "$temp"/PRESERVE

        echo 'Starting filter-repo --------------------------'
        git filter-repo --paths-from-file "$temp"/PRESERVE --force --replace-refs delete-no-add
        echo 'Finished filter-repo --------------------------'
        cd ..
    fi
fi

Credit to @rksawyer & @Roberto.
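
The requirement that the keep file end with a newline comes from how a plain read loop treats an unterminated last line. A common shell idiom avoids it; the sketch below shows the idiom in isolation (the function name read_keep_list is hypothetical and not part of the script above).

```shell
#!/bin/sh
# Print every non-empty line of the file $1, including a final line that
# has no trailing newline. `read` returns non-zero on such a line but
# still fills $p, so the extra test keeps the loop going once more.
read_keep_list() {
    while IFS= read -r p || [ -n "$p" ]; do
        if [ -n "$p" ]; then
            printf '%s\n' "$p"
        fi
    done < "$1"
}
```

Using the same `|| [ -n "$p" ]` condition in the script's outer while loop would, under the same assumption, make the trailing-newline requirement unnecessary.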

Thug answered 16/6, 2021 at 17:18 Comment(1)
A few enhancements: 1) To generate the KEEP file I would use: find . -maxdepth 1 -type d -not -path '*/\.*' -not -path '.' > "../${KEEP_FILE}" 2) Instead of: done < ../"${KEEP_FILE}" > "$temp"/PRESERVE; mv "$temp"/PRESERVE "$temp"/PRESERVE_full; awk '!a[$0]++' "$temp"/PRESERVE_full > "$temp"/PRESERVE; sort -o "$temp"/PRESERVE "$temp"/PRESERVE you can simply do: done < ../"${KEEP_FILE}" | sort | uniq > "$temp"/PRESERVE – Manville
R
-2

We painted ourselves into a much worse corner, with dozens of projects across dozens of branches, with each project dependent on 1-4 others, and 56k commits total. filter-branch was taking up to 24 hours just to split a single directory off.

I ended up writing a tool in .NET using libgit2sharp and raw file system access to split an arbitrary number of directories per project, and only preserve relevant commits/branches/tags for each project in the new repos. Instead of modifying the source repo, it writes out N other repos with only the configured paths/refs.

You're welcome to see if this suits your needs, modify it, etc. https://github.com/CurseStaff/GitSplit

Roommate answered 19/1, 2016 at 17:27 Comment(2)
The linked repo doesn't exist or isn't public. – Ammunition
Sounds great; it would be nice to be able to see it. If you want this answer to be upvoted, you'll want to post some useful details rather than just a hyperlink, btw. – Thug
