Smart Indent algorithm documentation?

Asked 23/8, 2009 at 16:41 Answered 3/9, 2009 at 3:30

algorithm ide indentation code-formatting

I'm a big fan of documenting the proper behavior of IDE features that have a subtle but significant impact on coding flow - things like auto-completion selection and commenting/uncommenting code you might not realize you take advantage of but at the end of the day you got just a bit more done than you might have. I do so in hopes that other language services I have to use incorporate the feature(s), subsequently improving my daily coding life. "Real" Smart Indent, i.e. the Visual Studio 2008 C# editor, is one of those features.

Basic block code indentation is reasonably straightforward and can be hacked together in a reasonable amount of time well enough to get the job done. True Smart Indent, on the other hand, is quite possibly the most technically challenging task I've had to implement in the IDE to date, and I've implemented my fair share. Even full-blown on-the-fly automatic code reformatting is easier; it just defers to Smart Indent for the heavy lifting.

I'm looking for high-level discussions of general purpose Smart Indent algorithms. In particular, I'm looking for either research on smart indent strategies, or an objective description of all normal and "edge" cases that could be tested to ensure repeatable, bug-free results. Eventually, I'd like to provide both a detailed workflow of the functionality, a concrete foundation for actually implementing the feature, and finally assembling a language-specific version from that and integrating it into my language services.

PS: Visual Studio 2010's C# editor has several small bugs in this feature. Having implemented it myself, I have a whole new respect for the work it takes to polish it.

Edit (8/25): I managed to write down a draft the rules for how I think things should be handled when the smart indent is inside a code comment. I'll probably be working from a C++/C# perspective on the rules, but later they should be able to be parameterized for aspects of other languages.

Formulism answered 23/8, 2009 at 16:41 Comment(9)

@Chris: Without the two links I put back, you can't tell the scope of the two features and how carefully I've thought them out. – Formulism 23/8, 2009 at 17:8

With the hyperlinks, this is spam imho. – Goodbye 23/8, 2009 at 17:9

@Chris: That's why I left one out, but not the two that are just code flow diagrams on my blog – Formulism 23/8, 2009 at 17:14

I read them. Although they may seem relevent to you, they weren't IMO relevent to being able to answer the question, i.e. to citing "academic discussions of general purpose Smart Indent algorithms". – Goodbye 23/8, 2009 at 18:15

I don't see why you would prefer academic papers over actual solutions. For example, have you looked at how Doxygen does this? – Esta 23/8, 2009 at 19:56

@Neil: Smart indent is where (as one single example) you press enter at the end of a line of code, and the cursor is placed on the next line at the correct indentation level. I'm looking for a precise, as language-agnostic as possible, description of how it should work so it can be objectively implemented and tested. – Formulism 23/8, 2009 at 20:46

@280Z28, I don't think you'll ever make this truly language agnostic, but assuming that you can at least approximately tokenize the language in question, what's wrong with incrementing the indent level for each opening brace/paren/bracket token, and decrementing it for each closing one? Note that this doesn't require the code to be grammatically correct, just tokenizable. – Hesterhesther 24/8, 2009 at 6:46

@Theran: aside from the performance problem, it produces "highly irritating results." It was truly awful. :) I'll try and point out sections in the documentation I'm working on that show where tracking the document indentation level causes poor results. – Formulism 26/8, 2009 at 2:18

hey Sam, I've been researching this recently (after being frustrated with Sublime Text and Atom's inability), do you possibly have any more conclusions on top of what's already in here? – Arnaud 17/12, 2015 at 8:6

Emacs CC Mode manual: Indentation Engine Basics.

Steve Yegge blog rant: js2-mode: a new JavaScript mode for Emacs.

Quote from the latter: "Amazingly, surprisingly, counterintuitively, the indentation problem is almost totally orthogonal to parsing and syntax validation."

Jukebox answered 24/8, 2009 at 12:59 Comment(3)

Steve's blog makes complete sense. I totally understand where he's coming from. One thing to note: his blog post is missing a ton of cases that need to be considered, but that's most likely because there are too many to list. – Formulism 24/8, 2009 at 14:54

Well I'll be. My only consolation is that Steve too believes that it was surprising to learn that the problem is orthagonal to parsing. :) +1 – Lombard 24/8, 2009 at 16:4

My personal opinion, based on building real tools, is that "smart indentation" requires you parse the code, and then prettyprint. See other ansswers. – Cannonball 3/9, 2009 at 3:18

The magic search phrase you are looking for might be "pretty print".

Klemens answered 26/8, 2009 at 2:18 Comment(1)

Amen. +1 (Apparantly I need to type 15 characters here, just to say I gave you a point). – Cannonball 3/9, 2009 at 3:19

Like another responder, the key idea for doing this right is prettyprinting, that is, generating text from the abstract syntax structure of the code.

Basically you take advantage of the nesting of the tree to produce nesting of the printed text. Key ideas are the notion of building primitive strings from leaves of the tree, gluing horizontal boxes [rectangles of text] together from other boxes from subttrees to provide horizontal composition, and gluing boxes on top of one another to get bigger vertical boxes.

Tricky parts: regenerating the langauge literals with formatting information from the tree leaves (just how many leading zeros did that binary float point number have?), handling right margin overflow by allowing alternative box layouts and backtracking, and pattern matching complex tree structures to prettyprint particular trees in nice ways (e.g., nested if-then-if-then-if....)

Here's a research paper on the topic (Full text PDF).

Here's what we did for prettyprinting with the DMS Software Reengineering Toolkit to prettyprint ASTs generated by large-scale metaprogramming.

Cannonball answered 3/9, 2009 at 3:30 Comment(1)

+1 That paper will definitely be helpful for documenting a generalized Smart Indent behavior. I do believe you are vastly underestimating the difficulty of relating Smart Indent to any normal parsing algorithm. – Formulism 3/9, 2009 at 3:53

Maybe I'm missing something, but the "smart indentation" would be entirely tied up in the grammar specification of the language. The closest thing to an academic paper I could find after a bit of google-fu was, in fact, another SO question pertaining to a particular language, here.

So, I'm afraid I can't technically provide an answer, as I did not find any academic papers, but as a sort of meta-answer (sadly, in the form of a question): is it any harder than parsing the language? I use the term "harder" in a vague computability/complexity sense, not referring to the actual time/effort/tears a person would actually put in.

Consider: indentation level changes, in my experience, within certain sub-clauses. If statements, loops, classes, structs, etc. etc. All of these are already detected by the parser. Just as one can decorate a parse tree to build a semantic tree (here's a shard of a random university website), can't you instead decorate the parse tree with "indent information"?

I guess I just don't see what the call for academic papers is all about. Unless if, of course, there's something I'm missing. Which is quite possible, as I've certainly never dared attempt this. :) But, from my vantage point, it would seem that this smart indenting is possible simply by running a modified parser, and instead of reporting "parse errors", it automatically reformats the code so that it is valid (assuming that the "real" parser already okays the block). Real-time running would certainly cause issues, and there are ambigous levels of indentation in whitespace-dependent language (as the indent level is the end of the block).

As a final (honestly, I'm almost done! :)) note: the Emacs text editory is shockingly good, in my experience. I have no idea how it works, but if I were to try this, that would be the first place I'd look... after SO, of course. :))

Lombard answered 24/8, 2009 at 3:31 Comment(2)

I changed up the question ever-so-slightly (or a lot). I'm having an enormous time creating a testing procedure to prevent regressions while I fix pesky bugs. It's significantly harder than parsing because 1) speed matters big time and 2) the document is almost never syntactically correct at the time Smart Indent is invoked. – Formulism 24/8, 2009 at 5:1

+1 for "document is almost never syntactically correct". That indeed makes it harder. You can still do well by parsing with error correction; the least cost repair tells you what should have been there, and then you can reduce the problem to prettyprinting a clean tree. – Cannonball 3/9, 2009 at 3:34

Recommended topics

Hot tags