antlr 4: Should all of these tokens be showing up in the AST?
Asked Answered
S

1

1

My ultimate goal is to parse a structured file as a tree of in-memory objects that I can then manipulate. The file format that I'm using is fairly sophisticated with about 200 keywords/tags, and this seemed like a good reason to learn about parser/lexer frameworks.

Unfortunately, there are so many concepts (and hundreds of tutorials and guides) that the learning process so far feels like trying to drink from a fire hose. So I'm taking some very meager baby steps, starting with this example.

I modified the grammar to create the following test, Nano.g4:

grammar Nano;

r  : root ;
root : START ROOT ID END ROOT;
START : 'StartBlock' ;
END : 'EndBlock' ;
ROOT : 'RootItem' ;
ID : [a-z]+ ;             // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines

Next, I created a simple input file, nano.txt:

StartBlock RootItem
   foo
EndBlock RootItem

I then loaded the code using the following commands:

del *.class
del *.java
java org.antlr.v4.Tool Nano.g4
javac nano*.java
java org.antlr.v4.runtime.misc.TestRig Nano r -gui < nano.txt

That gives me this result:

ANTLR output

The tree above is my first conceptual hangup about what to expect from a lexer and parser. The "StartBlock RootItem" and "EndBlock RootItem" tokens are necessary in terms of making the input file legal, but conceptually I don't need them after I've proven that the file is properly formatted. The only thing that I care about from that point on is that there's a RootItem that contains "foo", as shown here:

enter image description here

Again, I'm painfully new to parser/lexer concepts. Should I (or, is it even possible to) write the grammar so the output tree matches the image above? Or should I take care of that in some subsequent step that traverses the AST and only extracts the relevant data fields?

Standifer answered 12/9, 2013 at 15:50 Comment(0)
P
4

ANTLR 4 produces parse trees, not ASTs. This is an important distinction from the behavior of ANTLR 3, and was chosen to help with long-term maintenance of grammars. In particular, situations may arise where users do want access to the tokens, e.g. as part of a semantic highlighting component in an IDE. Rather than force users to write application-specific modified grammars in such a scenario, we chose to always include all tokens in the parse tree.

Phosphocreatine answered 12/9, 2013 at 22:51 Comment(5)
Sam, can you explain what the difference is between a parse tree and a syntax tree? Maybe a URL to read up?Cogwheel
@MikeLischke In ANTLR 3, the user defined the shape of the AST produced by a parser using the AST operators (^ and !) or through rewrite rules (-> operator). In ANTLR 4, the shape of the parse tree exactly matches the shape of the grammar, where rule nodes form the structure and terminal nodes are the leaves (although a rule that doesn't match anything will also not have any children in the tree).Phosphocreatine
Sam, thanks for making this point--it has saved me a lot of head-scratching (particularly my confusion over why the ^, !, and -> operators in ANTLR 3 tutorials didn't seem to have any effect!). Is there a best practice for turning an ANTLR 4 parse tree into an AST? Or do people typically drop the parse tree into a factory that creates the appropriate tree of objects?Standifer
@Standifer I generally keep the parse trees since it's so easy to work with the automatically generated listener and visitor classes/interfaces.Phosphocreatine
@MikeLischke the simplest example is the AST for arithmetic expressions. The AST has terminal nodes (numbers/variables) and non-terminal operator nodes, so you can think about it as having only two types of node. On the other hand, the parse tree mirrors the structure of the grammar. For the expression grammar to work with parenthesis and to have the correct operator precedence there need to be a few more productions, e.g., this, so the parse tree has more than two node types.Hostetter

© 2022 - 2024 — McMap. All rights reserved.