How to delete the old history after running git filter-branch?
E

5

19

Suppose I have a tree like this:

... -- a -- b -- c -- d -- ...
             \
              e -- a -- k

and I want it to become just

... -- a -- b -- c -- d -- ...

I know how to attach a branch name to "e". I know that what I'm going to do will change history, and that this is bad. I also guess I need to use something like rebase or filter-branch, but I'm lost as to how exactly.

OK, the situation is as follows: I now have a rather big tree (like this)

                 s -- p -- r   
                /
a -- b -- c -- d -- e --- g -- w
           \               \
            t -- p -- l     y -- k

but in one of my first commits ("b", for example) I added binary files, which make the whole repo very heavy. So I decided to remove them, and I did it with filter-branch. Now I have two long branches of commits, identical to each other, starting from the second commit.

                 s -- p -- r   
                /
a -- b -- c -- d -- e --- g -- w
      \    \               \
       \    t -- p -- l     y -- k
        \
         \             s'-- p'-- r'  
          \           /
           b'-- c'-- d'-- e'--- g'-- w'
                 \               \
                  t'-- p'-- l'    y'-- k'

where b' is the commit without the binary files in it. So I can't do a merge, and I don't want this whole tree to be duplicated in the history like this.

Ever answered 12/5, 2011 at 21:1 Comment(2)
Why exactly do you want to do this? Couldn't you just use git merge instead?Voguish
@Christopher - I've added an explanation to the question.Ever
J
40

After importing a Subversion repository with multiple years of history, I ran into a similar problem with bloat from lots of binary assets. In "git: shrinking Subversion import", I describe trimming my git repo from 4.5 GiB to around 100 MiB.

Assuming you want to delete from all commits the files removed in “Delete media files” (6fe87d), you can adapt the approach from my blog post to your repo:

$ git filter-branch -d /dev/shm/git --index-filter \
  "git rm --cached -f --ignore-unmatch media/Optika.1.3.?.*; \
   git rm --cached -f --ignore-unmatch media/lens.svg; \
   git rm --cached -f --ignore-unmatch media/lens_simulation.swf; \
   git rm --cached -f --ignore-unmatch media/v.html" \
  --tag-name-filter cat --prune-empty -- --all

Your github repo doesn't have any tags, but I include a tag-name filter in case you have private tags.
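
If you aren't sure which paths to list in the index filter, one commonly used recipe is to ask git for the largest blobs reachable from any ref. The sketch below is not taken from the blog post. It builds a throwaway repository so that it can run anywhere, and the file names notes.txt and big.bin are made up for illustration; in a real repository only the pipeline in the middle would be needed.

```shell
# Throwaway repository; notes.txt and big.bin are hypothetical stand-ins.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "small" > notes.txt
head -c 200000 /dev/urandom > big.bin   # plays the role of a heavy media file
git add notes.txt big.bin
git commit -q -m "Add files"
git rm -q big.bin
git commit -q -m "Delete big.bin"       # deleted now, but still in history

# List blobs reachable from any ref as "size path", largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn | head -n 5)
printf '%s\n' "$largest"
```

The paths at the top of that list are the candidates for git rm --cached in the index filter. Note that this simple awk step assumes paths without spaces.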

The git filter-branch documentation covers the --prune-empty option.

--prune-empty
Some kinds of filters will generate empty commits that leave the tree untouched. This switch allows git-filter-branch to ignore such commits …

Using this option means your rewritten history will not contain a “Delete media files” commit because it no longer affects the tree. The media files are never created in the new history.

At this point, you'll see duplication in your repository due to another documented behavior.

The original refs, if different from the rewritten ones, will be stored in the namespace refs/original/.

If you're happy with the newly rewritten history, then delete the backup copies.

$ git for-each-ref --format="%(refname)" refs/original/ | \
  xargs -n 1 git update-ref -d

Git is vigilant about protecting your work, so even after all this intentional rewriting and deleting, the reflog is still keeping the old commits alive. Purge them with a sequence of two commands:

$ git reflog expire --verbose --expire=0 --all
$ git gc --prune=0
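
To convince yourself that the sequence above really drops the old commits, you can rehearse it in a throwaway repository first. This is only a sketch under assumptions: the repository, the file names notes.txt and big.bin, and the commit messages are all made up, and it spells the expiry as now rather than 0. FILTER_BRANCH_SQUELCH_WARNING only silences the warning that newer git versions print before running filter-branch.

```shell
# Throwaway repository; every file name here is hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

echo "text" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

head -c 100000 /dev/urandom > big.bin   # stands in for the heavy binary
git add big.bin
git commit -q -m "Add binary"

echo "more text" >> notes.txt
git commit -q -a -m "Edit notes"
old_tip=$(git rev-parse HEAD)           # tip of the history we want gone

# 1. Rewrite every ref so that big.bin never existed.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  "git rm --cached -f -q --ignore-unmatch big.bin" \
  --prune-empty -- --all >/dev/null 2>&1

# 2. Delete the backup refs that filter-branch left under refs/original/.
git for-each-ref --format="%(refname)" refs/original/ |
  xargs -n 1 git update-ref -d

# 3. Expire the reflog and prune the now-unreachable objects.
git reflog expire --expire=now --all
git gc -q --prune=now

# The old tip should no longer exist in the object database.
if git cat-file -e "$old_tip^{commit}" 2>/dev/null; then
  old_commit_status=present
else
  old_commit_status=gone
fi
echo "old tip is $old_commit_status"
```

If the old tip is still reported as present in a real repository, some other ref (a tag, another branch, a stash) is most likely still pointing into the old history.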

Now your local repository is ready, but you need to push the updates to GitHub. You could push the branches one at a time. For a local branch, say master, you'd run

$ git push -f origin master

Say you don't have a local issue5 branch any more. Your clone still has a ref called origin/issue5 that tracks where it is in your GitHub repository. Running git filter-branch modifies all the origin refs too, so you can update GitHub without a local branch.

$ git push -f origin origin/issue5:issue5

If all your local branches match their respective commits on the GitHub side (i.e., no unpushed commits), then you can perform a bulk update.

$ git for-each-ref --format="%(refname)" refs/remotes/origin/ | \
  grep -v 'HEAD$' | perl -pe 's,^refs/remotes/origin/,,' | \
  xargs -I '{}' git push -f origin 'refs/remotes/origin/{}:{}'

The output of the first stage is a list of refnames:

$ git for-each-ref --format="%(refname)" refs/remotes/origin/
refs/remotes/origin/HEAD
refs/remotes/origin/issue2
refs/remotes/origin/issue3
refs/remotes/origin/issue5
refs/remotes/origin/master
refs/remotes/origin/section_merge
refs/remotes/origin/side-media-icons
refs/remotes/origin/side-pane-splitter
refs/remotes/origin/side-popup
refs/remotes/origin/v2

We don't want the HEAD pseudo-ref, so we remove it with grep -v. For the rest, we use Perl to strip off the refs/remotes/origin/ prefix, and for each one we run a command of the form

$ git push -f origin refs/remotes/origin/BRANCH:BRANCH
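
Because a forced bulk push is hard to undo, it may be worth printing the commands before running them. The following is a minimal dry-run sketch, not part of the original pipeline: it uses sed instead of Perl and feeds in a short hard-coded sample of ref names, which in a real clone would come from git for-each-ref.

```shell
# Hypothetical sample input; in a real clone this would be the output of
#   git for-each-ref --format="%(refname)" refs/remotes/origin/
refs='refs/remotes/origin/HEAD
refs/remotes/origin/issue5
refs/remotes/origin/master'

# Drop the HEAD pseudo-ref, strip the prefix, and print (not run) each push.
commands=$(printf '%s\n' "$refs" |
  grep -v 'HEAD$' |
  sed 's,^refs/remotes/origin/,,' |
  while read -r branch; do
    printf 'git push -f origin refs/remotes/origin/%s:%s\n' "$branch" "$branch"
  done)

printf '%s\n' "$commands"
```

Once the printed list looks right, the real pipeline can be run with some confidence that it touches only the branches you expect.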
Jaundice answered 12/5, 2011 at 22:33 Comment(6)
+1 Nice article. Just what I was looking for (about to break up a large repo at work into several repos, using them as submodules instead) :)Eboni
By the way, when I cloned my local git repo to another location, this additional branch of commits wasn't there. BUT when I force-pushed it to my main remote, it was duplicated there alongside the original commits. =( So I again have 2 branches of commits.Ever
@Aleksandr What's keeping the old commits alive? Do you have other branch heads with the heavy commits in their histories? Did you use --tag-name-filter when you ran git filter-branch? Do you have shell access to the host where your main remote lives?Jaundice
@Greg, when I run it with --tag-name-filter and -- --all at the end (as shown in the link), I get a very mixed-up tree with a lot of duplication. The nicely split tree shown in the description is what I get if I don't use --tag-name-filter and put HEAD at the end.Ever
By the way, I don't have SSH access, and this is a little project on GitHub, github.com/soswow/e-textbook/network, so you can see what it is. Currently the untouched version is pushed there.Ever
@Aleksandr A few commits add files named media/lens.svg and media/lens_simulation.swf. Do you want to remove them from all commits or only from b?Jaundice
E
1

You can use git filter-branch again, but this time with the --parent-filter option. With it you can unlink commits by setting their parent references to nothing. I think you can use the --commit-filter option for the same purpose. This will leave a lot of loose objects in your repo, so you need to run git gc --prune=now afterwards.

Here's an example of how --parent-filter can be used to drop the parents: http://git.661346.n2.nabble.com/purging-unwanted-history-td1507638.html
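
As a rough illustration of the mechanism only (it is not a copy of the linked thread, which uses sed), the sketch below builds a hypothetical three-commit repository A, B, C and uses --parent-filter to make B a root commit. The repository and its file names are made up. The filter receives the parent string on stdin and prints the new one, so printing an empty line for the chosen commit removes its parent.

```shell
# Throwaway repository with three commits A, B, C (all names hypothetical).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git config user.name "Demo"
git config user.email "demo@example.com"

for name in A B C; do
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "$name"
done
cut_at=$(git rev-parse HEAD~1)   # commit B, which should become the new root

# For B, print an empty parent list; for every other commit, pass it through.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --parent-filter \
  "test \"\$GIT_COMMIT\" = \"$cut_at\" && echo || cat" \
  -- --all >/dev/null 2>&1

commit_count=$(git rev-list --count HEAD)
echo "commits reachable from HEAD: $commit_count"
```

After the rewrite only B and C are reachable from HEAD. The original chain is still referenced from refs/original/ and the reflog until those are cleaned up, so nothing is actually freed until then.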

Eboni answered 12/5, 2011 at 22:25 Comment(5)
I see no need to remove any parents here - we want to remove the superfluous (duplicated) children, not the parents.Jammiejammin
How exactly should I do it? I think it would be logical if I could set the parent of b' to null and then run GC on it. I tried some variants, but nothing has worked yet.Ever
@Alex: I added a link to a thread discussing this. As you can see they are using sed to replace the parent with an empty string.Eboni
@Paũlo: What I meant was the parent reference from a commit to its parent. If the history is A-B-C-D-E... and you want to remove C-D-E, then the parent references for C, D and E should be set to nothing, to make the commits dangling.Eboni
Whether a commit is dangling is not defined by whether it has a parent reference, but by whether some descendant of it has a tag or branch pointing to it.Jammiejammin
M
0

Try:

git branch -d name

You may need to use this instead:

git branch -D name

Mitten answered 12/5, 2011 at 21:9 Comment(3)
It will only remove the label. The commits will stay.Ever
@Aleksandr, commits with no label pointing to them are collected by git gc, which is run automatically from time to time, or you can run it yourself.Tetragram
Then you can delete the corresponding remote branches, if any.Exposed
W
0

You can delete the branches with git branch -D branch_name and delete remote branches with git push remote_name :branch_name.

The commits will remain in your repository, unreferenced, for some time (see the git gc documentation). Until they are collected they only take up disk space, which is useful in case you realize later that you made a mistake.

And since you deleted the remote branches, a new git clone should not retrieve the unreferenced commits.

Wanton answered 13/5, 2011 at 6:52 Comment(3)
No, this still only removes refs, not the commits. Since the commits are still referenced through their children.Eboni
I guess he will delete all the children branches (r, w, k and l in his schema). So if k is unreachable from any branch, git won't reach y because it will never consider k's parent.Accompany
Edited your post so I could remove my -1 :)Eboni
V
-2

From your example, you might be able to try git rebase b b'?

Voguish answered 12/5, 2011 at 21:17 Comment(0)
