Is simplified semantics for the 'blame' command a good thing?

Asked 29/12, 2012 at 0:21 Answered 29/12, 2012 at 5:35

I'm working on a new weave-based data structure for storing version control history. This will undoubtedly cause some religious wars about whether it's The Right Way Of Doing Things when it comes out, but that isn't my question right now.

My question has to do with what output blame should give. When a line of code has been added, removed, and merged into itself a number of times, it isn't always clear what revision should get blame for it. Notably this means that when a section of code is deleted, all records of it having been there is gone, and there is no blame for the removal. Everyone I've gone over this issue with has said that trying to do better simply isn't worth it. Sometimes people put in the hack that the line after the section which got deleted has its blame changed from whatever it actually was to the revision when the section got deleted. Presumably if the section is at the end then the last line get its blame changed, and if the file winds up empty then the blame really does disappear into the aether, because there's literally nowhere left to put blame information. For various technical reasons I won't be using this hack, but assume that continuing but with this completely undocumented but de facto standard practice will be uncontroversial (but feel free to flame me and get it out of your system).

Moving on to my actual question. Usually in blame for each line you look at the complete history of where it was added and removed in the history and using three-way merge (or, in the case of criss-cross merges, random bullshit) and based on the relationships between those you determine whether the line should have been there based on its history, and if it shouldn't but is then you mark it as new with the current revision. In the case where a line occurs in multiple ancestors with different blames then it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.

Where my new system diverges is that rather than doing a complicated calculation of whether a given line should be in the current revision based on a complex calculation of the whole history, it simply looks at the immediate ancestors, and if the line is in any of them it picks an arbitrary one to inherit the blame from. I'm making this change for largely technical reasons (and it's entirely possible that other blame implementations do the same thing, for similar technical reasons and a lack of caring) but after thinking about it a bit part of me actually prefers the new behavior as being more intuitive and predictable than the old one. What does everybody think?

Epsomite answered 29/12, 2012 at 0:21 Comment(3)

All blame commands have the same semantics, so the question applies to all of them. – Epsomite 29/12, 2012 at 0:49

Re: open/close debates: I think this question is highly constructive and represents fundamental design research on version tracking systems and tools which is or will be highly relevant to several developers and quite interesting to many more. – Helmet 29/12, 2012 at 11:49

What I'd like to see is gitk-like history for any line of code. – Apfel 29/12, 2012 at 14:59

I actually wrote a one of the blame implementations out there (Subversion's current one I believe, unless someone replaced it in the past year or two). I helped with some others as well.

At least most implementations of blame don't do what you describe:

Usually in blame for each line you look at the complete history of where it was added and removed in the history and using three way merge (or, in the case of criss-cross merges, random bullshit) and based on the relationships between those you determine whether the line should have been there based on its history, and if it shouldn't but is then you mark it as new with the current revision. In the case where a line occurs in multiple ancestors with different blames then it picks which one to inherit arbitrarily. Again, I assume that continuing with this completely undocumented but de facto standard practice will be uncontroversial.

Actually, most blames are significantly less complex than this and don't bother trying to use the relationships at all, but they just walk parents in some arbitrary order, using simple delta structures (usually the same internal structure whatever diff algorithm they have uses before it turns it into textual output) to see if the chunk changed, and if so, blame it, and mark that line as done.

For example, Mercurial just does an iterative depth first search until all lines are blamed. It doesn't try to take into account whether the relationships make it unlikely it blamed the right one.

Git does do something a bit more complicated, but still, not quite like you describe.

Subversion does what Mercurial does, but the history graph is very simple, so it's even easier.

In turn, what you are suggesting is, in fact, what all of them really do:

Pick an arbitrary ancestor and follow that path down the rabbit hole until it's done, and if it doesn't cause you to have blamed all the lines, arbitrarily pick the next ancestor, continue until all blame is assigned.

Prostration answered 29/12, 2012 at 4:32 Comment(3)

Thanks, that was very helpful. The data structure I'm working on now will allow blame to be done with a single pass over the history, which might make a big performance difference for some use cases. – Epsomite 29/12, 2012 at 7:18

git blame (and I think also blame / annotate commands in other SCM) simply traverses history in some order and blames a last commit (counting from beginning) that has given line. Note that thanks to heuristic similarity-based rename detection it can detect when line was moved or copied and blame commit that changed / introduced line and not the one that merely moved it (you probably want to ignore whitespace to discard reindent). See e.g. git gui blame <file> in action (with two blame columns: one for change, one for movement). – Predestine 29/12, 2012 at 12:31

So, subversion also does blame with a single pass over history using a datastructure to track lines. Of course, SVN has a linear history graph, which makes life easier. I imagine your datastructure will end up similar. – Prostration 29/12, 2012 at 16:11

On a personal level, I prefer your simplified option.

Reason: Blame isn't used very much anyway.

So I don't see a point in wasting a lot of time doing a comprehensive implementation of it.

It's true. Blame has largely turned out to be one of those "pot of gold at the end of the rainbow" features. It looked really cool from those of us standing on the ground, dreaming about a day when we could just click on a file and see who wrote which lines of code. But now that it's widely implemented, most of us have come to realize that it actually isn't very helpful. Check the activity on the blame tag here on Stack Overflow. It is underwhemingly desolate.

I have run across dozens of "blame-worthy" scenarios in recent months alone, and in most cases I have attempted to use blame first, and found it either cumbersome or utterly unhelpful. Instead, I found the information I needed by doing a simple filtered changelog on the file in question. In some cases, I could have found the information using Blame as well, had I been persistent, but it would have taken much longer.

The main problem is code formatting changes. The first-tier blame for almost everything was listed as... me! Why? Because I'm the one responsible for fixing newlines and tabs, re-sorting function order, splitting functions into separate utility modules, fixing comment typos, and improving or simplifying code flow. And if it wasn't me, someone else had done a whitespace or block-move somewhere along-the-way as well. In order to get a meaningful blame on anything dating back to a time before I can already remember without the help of blame, I had to roll back revisions and re-blame. And re-blame again. And again.

So in order for a blame to actually be a useful time saver for more than the most lucky of situations, the blame has to be able to heuristicly make its way past newline, whitespace, and ideally block copy/move changes. That sounds like a very tall order, especially when scouring the changelog for a single file, most of the time, it won't yield many diffs anyway and you can just sift through by hand fairly quickly. (The notable exception being, perhaps, badly engineered source trees where 90% of the code is stuffed in one or two ginormous files... but who these days in a collaborative coding environment does much of that anymore?).

Conclusion: Give it a bare-bones implementation of blame, because some people like to see "it can blame!" on the features list. And then move on to things that matter. Enjoy!

Heedless answered 29/12, 2012 at 4:24 Comment(4)

I think that the blame tag on StackOverflow is desolate because the feature is very simple and I cannot see many questions coming up in day to day usage. – Stimulus 29/12, 2012 at 4:48

Thanks, that's been the general feedback. I'm adding blame because it can basically comes for free with what I'm working on, but wanted to make sure that I'm getting it right, because a non-working blame would be worse than no blame. – Epsomite 29/12, 2012 at 7:27

I like your post but I'm not so sure about your conclusion. From the same anecdotes, you could conclude, "What we need is a much smarter blame command, go for it!" – Copalite 29/12, 2012 at 12:8

As git blame is hooked into the same mechanism that git uses for diffs, it can ignore whitespace changes, and detect code movement and copying in the same file and across files. See also git gui blame GUI. – Predestine 29/12, 2012 at 12:33

The line-merge algorithm is stupider than the developer. If they disagree, that just indicates that the merger is wrong rather than indicating a decision point. So, the simplified logic should actually be more correct.

Shend answered 29/12, 2012 at 5:35 Comment(0)

Recommended topics

Hot tags