I have "inherited" a dirty git repository with about 5k valid commits and about 50k spam commits (this is the edit history for something that used to be a world-writable wiki). We're migrating formats, so this is a good time to rewrite history. I don't want to lose the history entirely, but by both commit volume and raw content volume the spam is overwhelming. The old moderation technique of rolling back to the last good commit left a lot of junk.
I can find about 80% of the bad commits without too much trouble using git log -S and some regular expression work. Most of the spam content is pretty obvious. The problem is I'm not sure what to do with the massive list of commits I want to drop.
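For illustration, the collection step could be scripted roughly like this. It is only a sketch: the helper names `pickaxe_argv` and `candidate_commits` are made up for this example, and the search terms are placeholders for whatever the real spam looks like. It relies on `git log -S<string>` (commits that change the number of occurrences of a string) and `git log -G<regex>` (commits whose diff contains a matching added or removed line).

```python
import subprocess


def pickaxe_argv(term, regex=False):
    """Build a `git log` command that prints the hashes of commits touching `term`."""
    # -S<string>: commits that change how many times the string occurs.
    # -G<regex>:  commits whose diff has an added/removed line matching the regex.
    flag = "-G" if regex else "-S"
    return ["git", "log", "--format=%H", flag + term]


def candidate_commits(terms, repo=".", regex=False):
    """Return the union of commit hashes matched by any term (hypothetical helper)."""
    found = set()
    for term in terms:
        result = subprocess.run(
            pickaxe_argv(term, regex),
            cwd=repo,
            check=True,
            capture_output=True,
            text=True,
        )
        # One full hash per output line.
        found.update(result.stdout.split())
    return found
```

Calling `candidate_commits(["some-spam-phrase"])` inside the repository would return a set of hashes to review. Note that a pickaxe search matches the rollback commits too, since they remove the same text.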
Note I'm quite familiar with git and use git rebase hourly (that would have been minutely, except git revise has taken over a lot of the load), and I know how to accomplish this manually, but I need an automated solution. Normally I would turn to git filter-branch, but I'm not sure what tool to reach for to inspect the current diff.
I thought about writing a script to manipulate a rebase script, but I think that's going to get me in trouble with false positives. I can probably catch and drop both the original defacing and the rollback, but what happens when I miss one side of that equation? I want the REST of the possible matches to succeed, not fail, when one of them doesn't rebase cleanly.
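To make the "missed one side" worry concrete, here is a small model of what happens if commits are skipped by snapshot rather than replayed as patches. As I read the git filter-branch manpage, this is how a `--commit-filter` built from `skip_commit` and `git_commit_non_empty_tree` behaves. The function `rewrite_linear` and the commit/tree names are invented for this sketch, and it assumes a strictly linear history.

```python
def rewrite_linear(history, drop):
    """Model snapshot-based skipping on a linear history.

    history: list of (commit_id, tree_id) pairs, oldest first.
    drop:    set of commit ids to skip.
    Returns the ids of the commits that survive the rewrite.
    """
    kept = []
    last_tree = None
    for commit_id, tree_id in history:
        if commit_id in drop:
            continue  # skipped: this snapshot is never recorded
        if tree_id == last_tree:
            continue  # same tree as the rewritten parent, so prune as empty
        kept.append(commit_id)
        last_tree = tree_id
    return kept


# A = good edit, S = spam, R = rollback to A's content, B = next good edit.
history = [("A", "t1"), ("S", "t2"), ("R", "t1"), ("B", "t3")]

print(rewrite_linear(history, {"S", "R"}))  # both sides caught
print(rewrite_linear(history, {"S"}))       # rollback missed
print(rewrite_linear(history, {"R"}))       # defacing missed
```

In this model nothing can fail to apply, because no patch is ever replayed. A missed rollback collapses into an empty commit and disappears, while a missed defacing simply stays in history. The trade-off is that spam which was never rolled back remains in every later snapshot, which is where a rebase-style drop behaves differently.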
Note I don't want to manipulate the contents of files or add/remove files based on my matches; I want to inspect the content of the patch and decide to pick or drop based on that.
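The kind of per-patch decision I have in mind looks roughly like the sketch below. The patterns are placeholders and `added_lines` / `decide` are hypothetical names. The input is assumed to be a unified diff as text, for example the output of `git show` for one commit.

```python
import re

# Placeholder patterns; the real list would come from looking at the actual spam.
SPAM_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"cheap\s+pills", r"casino-bonus")
]


def added_lines(patch):
    """Yield the text of lines a unified diff adds, skipping '+++ ' file headers."""
    for line in patch.splitlines():
        # Caveat: an added line whose own text starts with '++ ' is also skipped.
        if line.startswith("+") and not line.startswith("+++ "):
            yield line[1:]


def decide(patch, patterns=SPAM_PATTERNS):
    """Return 'drop' if any added line matches a spam pattern, otherwise 'pick'."""
    for line in added_lines(patch):
        if any(p.search(line) for p in patterns):
            return "drop"
    return "pick"
```

Because only added lines are checked, a rollback that removes spam comes out as "pick" under this rule. Catching rollbacks as well would need a second check on removed lines.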
What's the best git tool to reach for?
git filter-branch help? – Lovelace
git filter-branch and there's even an example in its manpage calling --commit-filter, which sounds very similar to your use case ("skip commits authored by XY"). I'll create a demo repo and see if I can make it work, then post an answer. – Lovelace