Remove spam from git history

I have "inherited" a dirty git repository with about 5k valid commits and about 50k spam commits (this is the edit history for something that used to be a world-writable wiki). We're migrating formats so this is a good time to rewrite history. I don't want to lose the history entirely, but both by commit volume and raw content volume the spam is overwhelming. The old moderation technique of rolling back to the last good commit left a lot of junk.

I can find about 80% of the bad commits without too much trouble using git log -S and some regular expression work. Most of the spam content is pretty obvious. The problem is I'm not sure what to do with the massive list of commits I want to drop.

Note I'm quite familiar with git and use git rebase hourly (that would have been minutely except git revise has taken over a lot of the load), and I know how to accomplish this manually, but I need an automated solution. Normally I would turn to git filter-branch, but I'm not sure what tool to reach for to inspect the current diff.

I thought about writing a script to manipulate a rebase script, but I think that's going to get me in trouble with false positives. I can probably catch and drop both the original defacing and the rollback, but what happens when I miss one side of that equation? I want the REST of the possible matches to succeed, not fail, when one of them doesn't rebase cleanly.

Note I don't want to manipulate the contents of files or add/remove files based on my matches, I want to inspect the content of the patch and decide to pick or drop based on that.
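
As a rough sketch of that kind of per-commit decision, the patch text can be piped through a small classifier that prints pick or drop. The function name classify_patch, the "added lines containing a URL" heuristic, and the default threshold of 5 are all illustrative assumptions, not a tuned spam filter:

```shell
# classify_patch [THRESHOLD]
# Reads a unified diff on stdin. Prints "drop" when the number of ADDED lines
# that contain a URL reaches THRESHOLD (default 5, an arbitrary assumption),
# otherwise prints "pick".
classify_patch() {
  threshold=${1:-5}
  # Count lines starting with "+" that contain http:// or https://,
  # skipping the "+++ b/file" header lines of the diff.
  added_links=$(awk '/^\+/ && !/^\+\+\+ / && /https?:\/\// { n++ } END { print n + 0 }')
  if [ "$added_links" -ge "$threshold" ]; then
    echo drop
  else
    echo pick
  fi
}

# Usage sketch (left as comments so the block runs without a repository):
#   for c in $(git rev-list --reverse HEAD); do
#     printf '%s %s\n' "$(git show --pretty=format: "$c" | classify_patch)" "$c"
#   done
```

This only labels commits. It does not address what happens when a labelled commit fails to drop cleanly.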

What's the best git tool to reach for?

Reflectance answered 13/8, 2019 at 13:8 Comment(11)
What do you mean by "spam commits"? Am I correct in assuming that "rolling back to the last good commit" does not mean that the branch was reset, but rather that a revert commit was added after the fact, leaving both the broken commits and the reset as separate commits in linear history? In that case, you might try to search for commits with the same tree hashes and ignore all the history between those commits. (Supplication)
Is the git repository public? Being able to see the actual history might help us formulate an answer... (Morganite)
@cmaster No, unfortunately not. The "reverts" were done by manually removing garbage from the wiki page and saving (hence producing a new commit). The result is similar to a revert commit, but almost never identical. The reverts usually happened manually and the spam was happening automatically, so there are 10 small spam commits (say, adding ~10 links each) then a big rollback commit (removing ~100 links). (Reflectance)
@Morganite Unfortunately it is not public right now. It's headed that way eventually but there is some private stuff that needs redacting first. I can hardly work on that end of things for all the spam in the way though. (Reflectance)
Would git filter-branch help? (Lovelace)
@Lovelace I don't think you read the full question. (Reflectance)
@Reflectance You got me. I have to admit, I stopped reading after "I know how to use git rebase" because I was short on time. My bet is still on git filter-branch, and there's even an example in its manpage calling --commit-filter which sounds very similar to your use case ("skip commits authored by XY"). I'll create a demo repo and see if I can make it work, then post an answer. (Lovelace)
@Reflectance I provided a new answer which should be pretty straightforward to implement and I think it fits your requirements perfectly. (Lovelace)
I wonder if any of the answers ever helped in solving the problem that triggered this question. (Lovelace)
@Lovelace Not exactly. It's been a long time now, but if memory serves me neither answer really solved my problem because they both hand-waved over my real problem: validating my list of candidate commits and handling resolution when one didn't apply cleanly. I think I eventually worked it out with a dirty script that tried rebasing the next N commits after each step and backtracking to try other solutions (not dropping commits, dropping more commits) to find combinations that caused no later rebase errors. I also remember it was a long-running operation I had to baby-sit for a week or something. (Reflectance)
@Reflectance Thanks for the reply. It would be really awesome if the solution could be shared, in case anybody else faces a similar problem and finds this question. (Lovelace)

One possibility is to use Git's graftsfile or git replace. First, identify all "good" commits, i.e. the non-spam commits, including the "cleanup/revert" commits. You could do this, for instance, by filtering your history by committer email or a similar mechanism (you mentioned pickaxe/-S).

Once you have the list of "good" commits, a simple transformation with the paste command gives you the content of the graftsfile, where each line has the form:

commit parent1 parent2 parent3...

Say, your good commits are as follows (newest commits on top):

b3fb1155cd5352da674d93ce4b0a1567674f6d27
b460ef0aea564e587e5866107c0fc52adf552ca1
9f803dd18c89e13f47170e1ace1d0abb992cfeee

then you need the following content in your graftsfile:

b3fb1155cd5352da674d93ce4b0a1567674f6d27 b460ef0aea564e587e5866107c0fc52adf552ca1
b460ef0aea564e587e5866107c0fc52adf552ca1 9f803dd18c89e13f47170e1ace1d0abb992cfeee

Which is fairly easy to obtain via:

sed 1d commits | paste -d ' ' commits - | sed '$d'

Save this output as .git/info/grafts and verify the resulting history with git log or gitk. If you are satisfied with the result, run git filter-branch, which rewrites the history and makes the grafts permanent. You can then remove .git/info/grafts.

See https://mcmap.net/q/13533/-setting-git-parent-pointer-to-a-different-parent for how to use the non-deprecated replace mechanism. Using the graftsfile is easier to explain in this situation (and it still works with current Git versions, so why not use it? :))
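
For comparison, here is a minimal sketch of the same parent rewiring with git replace --graft instead of a graftsfile. It assumes the same commits file as above (one full hash per line, newest first). The helper name graft_commands is made up for illustration, and it only prints the commands so they can be reviewed before being run:

```shell
# graft_commands FILE
# FILE lists the commits to keep, one hash per line, newest first.
# Prints one "git replace --graft <commit> <new-parent>" command for every
# commit that has an older kept commit below it in the list.
graft_commands() {
  awk 'NR > 1 { printf "git replace --graft %s %s\n", prev, $1 } { prev = $1 }' "$1"
}

# Usage sketch (left as comments so the block runs without a repository):
#   graft_commands commits | sh     # create the replacement refs
#   git log --oneline               # inspect the shortened history
#   git filter-branch -- --all      # make the replacements permanent
```

As with the graftsfile, the oldest commit in the list is left pointing at its original parent.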

Lovelace answered 23/8, 2020 at 9:13 Comment(0)

One possible solution, involving git rebase:

You mentioned you are able to identify which commits to drop, and rebase expects a list of commits to pick (or even drop). But you cannot simply drop the spam commits, because then your "revert" commits would need to be dropped too (and they might contain unrelated changes?).

Consider the following rebase script:

pick A normal edit
pick B spam
pick C spam
pick D spam
pick E spam
pick F revert spam
pick G normal edit

I assume you want to "remove" all changes that were spam and the revert commits. This could be achieved with the following rebase script:

pick A normal edit
fixup B spam
fixup C spam
fixup D spam
fixup E spam
fixup F revert spam
pick G normal edit

If you have the list of commits you want to "drop" (including the "revert" commits), you should be able to feed it through sed or similar tools to replace all matching lines with fixup instead of pick.

It would be even easier if you could identify the faulty commits by their commit subject.
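
The pick-to-fixup substitution described above could be sketched as follows, using awk rather than sed so the list of hashes is loaded once. The function name mark_fixups is hypothetical, and the sketch assumes the drop list contains hashes in exactly the form in which they appear in the rebase todo (typically abbreviated):

```shell
# mark_fixups TODO DROP_LIST
# TODO is a rebase todo file. DROP_LIST has one commit hash per line, written
# the same way as in TODO (assumption). Prints the todo with "pick" changed
# to "fixup" for every listed commit. All other lines pass through unchanged.
mark_fixups() {
  awk '
    FILENAME == ARGV[1] { drop[$1] = 1; next }   # first file: hashes to fold away
    $1 == "pick" && ($2 in drop) { $1 = "fixup" }
    { print }
  ' "$2" "$1"
}

# Usage sketch: put the call into a small wrapper script (here edit-todo.sh,
# a made-up name) that rewrites the todo file git passes as "$1":
#   mark_fixups "$1" /path/to/drop-list > "$1.new" && mv "$1.new" "$1"
# and then run:
#   GIT_SEQUENCE_EDITOR=./edit-todo.sh git rebase -i <base>
```

Note that git rejects a todo whose first command is fixup, so the first commit in the range has to be one you keep.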

Lovelace answered 17/8, 2020 at 19:0 Comment(0)
