Parsing "off-side" (indentation-based) languages

Asked 1/2, 2010 at 16:52 Answered 12/6, 2023 at 2:5

language-design indentation compiler-theory

An off-side language is the one where

...the scope of declarations (a block) in that language is expressed by their indentation.

Examples of such languages are Python, Boo, Nemerle, YAML and several more.

So my question is this: how do I actually parse these? How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)? Are parser generators of any help here or do I have to hand-code lexer/parser myself?

Glance answered 1/2, 2010 at 16:52 Comment(5)

Ill leave this as a comment, but it has been noted by a professor at MIT that "off-side" languages are a bad idea in real implementations. Just some food for thought – Intersection 19/3, 2010 at 20:20

@Woot Can you provide a link to an article or something of the kind? – Glance 26/3, 2010 at 8:8

I will search for the video link again, but it was from the intro to algorithms course at MIT that is posted on youtube. I'll find it later this evening again – Intersection 26/3, 2010 at 12:31

Oh jeez totally forgot about this. – Intersection 28/8, 2012 at 12:11

@MasBagol this was 5 years ago, so no. However, it seems that it is in the MIT open courseware for intro to algorithms. I would start there. – Intersection 6/10, 2017 at 14:40

Python has a lexer that generates Indent and Dedent tokens, that are equivalent to curly braces ("{", "}"). There is even an example on Stack Overflow, with a simple implementation of such a lexer.

For tabs vs. spaces, Python only has a coding convention: Use 4 spaces per indentation level. Tabs are legal syntax though.

Flatcar answered 10/8, 2010 at 23:13 Comment(0)

The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.

As for whether you need to abandon parser generators, probably not, but you will have to hack the offsides identification in there somewhere. This may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing offside language elements.

Xerography answered 12/2, 2010 at 15:56 Comment(0)

How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)?

It depends on how the editor's settings if two tabs will equal eight spaces.

The off-side rule, as expressed by the originator, mentions relative positioning of two successive lines of code and not the absolute number of whitespaces. Here's a nice read to help you better understand (and some quote):

"Whitespace is significant in Python source code."

No, not in general. Only the indentation level of your statements is significant (i.e. the whitespace at the very left of your statements). Everywhere else, whitespace is not significant and can be used as you like, just like in any other language. You can also insert empty lines that contain nothing (or only arbitrary whitespace) anywhere.

Also, the exact amount of indentation doesn't matter at all, but only the relative indentation of nested blocks (relative to each other). [...]

Lithometeor answered 1/2, 2010 at 17:2 Comment(0)

For what it's worth, Haskell is also indentation-based and optionally { foo; bar; etc } for when whitespace is inconvenient. I wrote a simple indentation-based parser with Parsec, it's indended to read like Lisp but indentation indicates operator application. Parentheses can be only used on one line.

(aaa bb) cc
         e fffff (ggg hhh) iii
                 jjj kkk
         ddd

Here aaa is applied to bb. What results is a ternary function. It is applied to arguments cc, e applied to one argument, and ddd. See how the application is based on column alignment, and not X spaces.

The parser could probably be a lot simpler, too.

Isis answered 5/10, 2010 at 8:50 Comment(0)

You've got several options w.r.t. tabs and spaces: either disallow mixing tabs and spaces, assume a fixed ratio of tabs to spaces, or permit the programmer to decide on a per project or per source file basis (some sort of "#pragma tab(4)" style directive to permit tabs and/or change the number of spaces they represent).

Parser generator such as ANTLR 3 can easily cope with this; I've been toying with an example myself, compiling to its C# target. The link in DirkGently's answer explains the Python algorithm, which translates directly into code. My approach was simply to define separate tokens for whitespace and newlines, and override the "emit token" function used by the lexer to insert additional indent/dedent tokens on the fly. This turned out to be simpler to implement than other approaches I've seen around which override the "get last token" function, but either works pretty well.

Winchell answered 18/3, 2010 at 21:32 Comment(0)

I am working on a parser for an indentation based language. It so far has involved a lot of manual code. I want to try to do the dedent/indent thing that Python was mentioned doing, but it's kind of hard TBH.

I divided it into 3 phases:

Tokenizing
Directing
Treeifying

The tokenization phase is about 100 lines, plus the definitions of the regexps. It spits out a sequence of helpful tokens.

The "directing" phase (or "folding" phase) takes the tokens and spits out push and pop basically, to create instructions on how to make a tree data structure. It essentially folds the list into a tree.

The final "treeifying" phase takes the tree instructions and actually builds the tree. It turns out to be pretty hard, mentally doing gymnastics to think about how the list of tokens becomes a tree. I spent all weekend trying to get it to work, but still have a ways to go to get the output tree to be properly aligned with push and pop.

It should serve as a real-world example of how to build an indentation-based parser, though it's not that great of code.

Hel answered 12/6, 2023 at 2:5 Comment(0)

-1

I got one solution by myself, which I analyse the code just like blocks for the nesting tree. For the part of brackets, I just used normal method.

Here's that parser: https://github.com/jiyinyiyong/cirru-parser

Bullring answered 10/12, 2013 at 12:29 Comment(0)

Recommended topics

Hot tags