I have a lexer built that streams out tokens from an input, but I'm not sure how to build the next step in the process: the parse tree. Does anybody have any good resources or examples on how to accomplish this?
I would really recommend http://www.antlr.org/ and, of course, the classic Dragon book (Compilers: Principles, Techniques, and Tools).
For an easy language like JavaScript it's not hard to hand-roll a recursive descent parser, but it's almost always easier to use a tool like yacc or ANTLR.
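For illustration, here is a minimal sketch of what a hand-rolled recursive descent parser for arithmetic might look like; the (kind, text) token tuples and the tuple-shaped tree nodes are assumptions for the example, not something from the question:

# Minimal recursive descent sketch for:  expr := term (('+'|'-') term)*
#                                        term := NUMBER (('*'|'/') NUMBER)*
# Assumes the lexer yields (kind, text) tuples such as ('NUMBER', '3') or ('OP', '+').

def parse_expr(tokens, pos=0):
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos][1] in ('+', '-'):
        op = tokens[pos][1]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)          # the operator becomes the parent node
    return node, pos

def parse_term(tokens, pos):
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos][1] in ('*', '/'):
        op = tokens[pos][1]
        right, pos = parse_factor(tokens, pos + 1)
        node = (op, node, right)
    return node, pos

def parse_factor(tokens, pos):
    kind, text = tokens[pos]
    if kind == 'NUMBER':
        return ('num', int(text)), pos + 1
    raise SyntaxError(f'unexpected token {text!r}')

# parse_expr([('NUMBER','1'), ('OP','+'), ('NUMBER','2'), ('OP','*'), ('NUMBER','3')])
# -> (('+', ('num', 1), ('*', ('num', 2), ('num', 3))), 5)

Each grammar rule becomes one function, so the parse tree quite literally falls out of the call structure.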
To step back to the basics of your question: you really want to study up on BNF-style grammar notation and pick (or write) a grammar for your target language. Once you have that, the parse tree should more or less fall out, being the 'instance' manifestation of that grammar.
Also, don't try to turn the creation of your parse tree into your final solution (like generating code, or what-not). It might seem doable and more efficient, but invariably there will come a time when you'll really wish you had that parse tree lying around 'as is'.
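As a rough illustration of that last point, here is a minimal sketch, assuming tree nodes are plain tuples like ('+', left, right) and ('num', 3) (an assumed shape, not anything from the answer): once the tree exists, you can write any number of independent passes over it, which is exactly what you lose if the parser emits its final output directly.

# Two separate passes over the same hypothetical parse tree: one evaluates it,
# one pretty-prints it. Neither would be possible later if the parser had
# thrown the tree away and, say, generated code on the fly.

def evaluate(node):
    if node[0] == 'num':
        return node[1]
    op, left, right = node
    a, b = evaluate(left), evaluate(right)
    return {'+': a + b, '-': a - b, '*': a * b, '/': a / b}[op]

def to_source(node):
    if node[0] == 'num':
        return str(node[1])
    op, left, right = node
    return f'({to_source(left)} {op} {to_source(right)})'

tree = ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
print(evaluate(tree))   # 7
print(to_source(tree))  # (1 + (2 * 3))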
You should investigate parser generator tools for your platform. A parser generator allows you to specify a context-free grammar for your language. The grammar consists of a number of rules which "reduce" a series of symbols into a new symbol. You can usually also specify precedence and associativity for different rules to eliminate ambiguity in the language. For instance, a very simple calculator language might look something like this:
%left PLUS, MINUS # low precedence, evaluated left-to-right
%left TIMES, DIV # high precedence, left-to-right
expr ::= INT
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIV expr
| LEFT_PAREN expr RIGHT_PAREN
Usually, you can associate a bit of code with each rule to construct a new value (in this case an expression) from the other symbols in that rule. The parser generator will take in the grammar and produce code in your language that translates a token stream to a parse tree.
Most parser generators are language-specific. ANTLR is well known and supports C, C++, Objective-C, Java, and Python, though I've heard it's hard to use. I've used bison for C/C++, CUP for Java, and ocamlyacc for OCaml, and they're all pretty good. If you are already using a lexer generator, you should look for a parser generator that is specifically compatible with it.
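As a hedged sketch of how the calculator grammar above might look with one such tool, here it is written with PLY, a yacc-like lexer/parser generator pair for Python; the token definitions and the tuple-shaped tree nodes are assumptions for illustration, not part of the grammar above:

import ply.lex as lex
import ply.yacc as yacc

# --- lexer: token names mirror the grammar above ---
tokens = ('INT', 'PLUS', 'MINUS', 'TIMES', 'DIV', 'LEFT_PAREN', 'RIGHT_PAREN')

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIV = r'/'
t_LEFT_PAREN = r'\('
t_RIGHT_PAREN = r'\)'
t_ignore = ' \t'

def t_INT(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    raise SyntaxError(f'illegal character {t.value[0]!r}')

# --- parser: the precedence table resolves the ambiguity in the binary-operator rules ---
precedence = (
    ('left', 'PLUS', 'MINUS'),   # low precedence, left-to-right
    ('left', 'TIMES', 'DIV'),    # high precedence, left-to-right
)

def p_expr_int(p):
    'expr : INT'
    p[0] = ('num', p[1])

def p_expr_binop(p):
    '''expr : expr PLUS expr
            | expr MINUS expr
            | expr TIMES expr
            | expr DIV expr'''
    p[0] = (p[2], p[1], p[3])    # each rule builds a tree node from its symbols

def p_expr_paren(p):
    'expr : LEFT_PAREN expr RIGHT_PAREN'
    p[0] = p[2]

def p_error(p):
    raise SyntaxError('syntax error')

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('1 + 2 * (3 - 4)'))
# ('+', ('num', 1), ('*', ('num', 2), ('-', ('num', 3), ('num', 4))))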
I believe a common approach is to use a finite state machine. For example, after you read an operand you go into a state where you next expect an operator, and you usually use the operator as the root node for its operands, and so on.
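A rough sketch of that idea, assuming (kind, text) token tuples (an assumption for the example): the parser flips between an "expect operand" and an "expect operator" state, and each operator read becomes the new root over what has been built so far (precedence is ignored to keep it short).

# Two-state scan over a flat token list: EXPECT_OPERAND / EXPECT_OPERATOR.
# Each operator read becomes the new root, with the tree built so far on its left.
# (No precedence handling here; every operator binds left-to-right.)

def parse_flat(tokens):
    state = 'EXPECT_OPERAND'
    root = None
    pending_op = None
    for kind, text in tokens:
        if state == 'EXPECT_OPERAND':
            if kind != 'NUMBER':
                raise SyntaxError(f'expected operand, got {text!r}')
            node = ('num', int(text))
            root = node if root is None else (pending_op, root, node)
            state = 'EXPECT_OPERATOR'
        else:
            if kind != 'OP':
                raise SyntaxError(f'expected operator, got {text!r}')
            pending_op = text
            state = 'EXPECT_OPERAND'
    if state == 'EXPECT_OPERAND':
        raise SyntaxError('expression ended after an operator')
    return root

# parse_flat([('NUMBER','1'), ('OP','+'), ('NUMBER','2'), ('OP','-'), ('NUMBER','3')])
# -> ('-', ('+', ('num', 1), ('num', 2)), ('num', 3))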
As described above by Marcos Marin, a state machine that uses your language rules in BNF to parse your token list will do the trick if you want to do it yourself. Only, as Paul Hollingsworth said in the comment above, the easier way is to use a pushdown automaton, which adds a simple LIFO memory stack. Every class of token has a set of expected next tokens in your grammar, and those expectations are also represented in your state machine. The stack is used to "remember" what the previous token class was, which reduces the number of required states (it could be done without the stack, but you would need a new state for every class and subclass split in the grammar tree). The accepting state(s) would be the starting state (in natural languages, and in most programming languages too), plus possibly some other states in particular cases.
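A rough sketch of the pushdown part, under the same assumed (kind, text) token tuples as above: the explicit stack saves the surrounding context when a parenthesised sub-expression opens and restores it when it closes, which is exactly what a plain finite state machine cannot remember.

# Pushdown version of the operand/operator scan: on '(' the partial tree and the
# pending operator are pushed, on ')' they are popped and recombined, so the
# machine "remembers" what it was doing before it descended into the nesting.
# (Still no precedence handling, to keep the sketch short.)

def parse_nested(tokens):
    stack = []                                  # saved (root, pending_op) frames
    root, pending_op = None, None
    expect_operand = True
    for kind, text in tokens:
        if kind == 'LPAREN':
            stack.append((root, pending_op))
            root, pending_op, expect_operand = None, None, True
        elif kind == 'RPAREN':
            outer_root, outer_op = stack.pop()
            inner = root                        # finished sub-expression
            root, pending_op = outer_root, outer_op
            root = inner if root is None else (pending_op, root, inner)
            expect_operand = False
        elif expect_operand:
            assert kind == 'NUMBER', f'expected operand, got {text!r}'
            node = ('num', int(text))
            root = node if root is None else (pending_op, root, node)
            expect_operand = False
        else:
            assert kind == 'OP', f'expected operator, got {text!r}'
            pending_op = text
            expect_operand = True
    return root

# parse_nested([('NUMBER','1'), ('OP','*'), ('LPAREN','('),
#               ('NUMBER','2'), ('OP','+'), ('NUMBER','3'), ('RPAREN',')')])
# -> ('*', ('num', 1), ('+', ('num', 2), ('num', 3)))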
ANTLR would be my suggestion if you want to use a tool (waaay faster, and much less work). Good luck!