How to output the AST built using ANTLR?
Asked Answered
C

1

32

I'm making a static analyzer for C. I have done the lexer and parser using ANTLR in which generates Java code.

Does ANTLR build the AST for us automatically by options {output=AST;}? Or do I have to make the tree myself? If it does, then how to spit out the nodes on that AST?

I am currently thinking that the nodes on that AST will be used for making SSA, followed by data flow analysis in order to make the static analyzer. Am I on the right path?

Chevaldefrise answered 8/2, 2011 at 9:25 Comment(0)
A
62

Raphael wrote:

Does antlr build the AST for us automatically by option{output=AST;}? Or do I have to make the tree myself? If it does, then how to spit out the nodes on that AST?

No, the parser does not know what you want as root and as leaves for each parser rule, so you'll have to do a bit more than just put options { output=AST; } in your grammar.

For example, when parsing the source "true && (false || true && (true || false))" using the parser generated from the grammar:

grammar ASTDemo;

options { 
  output=AST; 
}

parse
  :  orExp
  ;

orExp
  :  andExp ('||' andExp)*
  ;

andExp
  :  atom ('&&' atom)*
  ;

atom
  :  'true'
  |  'false'
  |  '(' orExp ')'
  ;

// ignore white space characters
Space
  :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
  ;

the following parse tree is generated:

enter image description here

(i.e. just a flat, 1 dimensional list of tokens)

You'll want to tell ANTLR which tokens in your grammar become root, leaves, or simply left out of the tree.

Creating AST's can be done in two ways:

  1. use rewrite rules which look like this: foo : A B C D -> ^(D A B);, where foo is a parser rule that matches the tokens A B C D. So everything after the -> is the actual rewrite rule. As you can see, the token C is not used in the rewrite rule, which means it is omitted from the AST. The token placed directly after the ^( will become the root of the tree;
  2. use the tree-operators ^ and ! after a token inside your parser rules where ^ will make a token the root, and ! will delete a token from the tree. The equivalent for foo : A B C D -> ^(D A B); would be foo : A B C! D^;

Both foo : A B C D -> ^(D A B); and foo : A B C! D^; will produce the following AST:

enter image description here

Now, you could rewrite the grammar as follows:

grammar ASTDemo;

options { 
  output=AST; 
}

parse
  :  orExp
  ;

orExp
  :  andExp ('||'^ andExp)* // Make `||` root
  ;

andExp
  :  atom ('&&'^ atom)* // Make `&&` root
  ;

atom
  :  'true'
  |  'false'
  |  '(' orExp ')' -> orExp // Just a single token, no need to do `^(...)`, 
                            // we're removing the parenthesis. Note that
                            // `'('! orExp ')'!` will do exactly the same.
  ;

// ignore white space characters
Space
  :  (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
  ;

which will create the following AST from the source "true && (false || true && (true || false))":

enter image description here

Related ANTLR wiki links:

Raphael wrote:

I am currently thinking that the nodes on that AST will be used for making SSA, followed by data flow analysis in order to make the static analyzer. Am I on the right path?

Never did anything like that, but IMO the first thing you'd want is an AST from the source, so yeah, I guess your on the right path! :)

EDIT

Here's how you can use the generated lexer and parser:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String src = "true && (false || true && (true || false))";
    ASTDemoLexer lexer = new ASTDemoLexer(new ANTLRStringStream(src));
    ASTDemoParser parser = new ASTDemoParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}
Allout answered 8/2, 2011 at 14:0 Comment(9)
Thanks for the answer, but I still don't know how to decide which of the tokens that should be the root and leaves. Any suggestions?Chevaldefrise
@Raphael, whatever makes the most sense to you. Naturally, parenthesis, semicolons etc. can be easily left from your tree, and operators become the root. A while statement's root will probably be the keyword 'while' and will likely have 2 children: expression and statementBlock (which is zero or more statement's) which is executed while the expression evaluates to true. Again: whatever makes the most sense to you: you're the one who is going to "walk" the AST and do all the hard work. Good luck!Allout
@Samuel, see my EDIT (the CommonTree tree is what you want).Allout
@BartKiers: Excuse me! Do you know how to print out AST trees using Python? Currently I'm using Python runtime for antlr-3.1.3.Spoon
Why is parse tree flat in the example? The parse process could not be deduced from the flat parse tree.Electrophoresis
@Electrophoresis because that is how ANTLR3 does it. When no instructions on how to construct the AST are given, ANTLR "thinks" you don't want any structure at all, and produces a flat list of tokens.Allout
@Malinda they're usually available through archive.org/web (I've updated the links)Allout
@BartKiers thanks, I am trying to execute the above code. It gives me a compilation error at StringTemplate st = gen.toDOT(tree). The error is class file for org.antlr.stringtemplate.StringTemplate not found. I am using Antlr 3.5.2. Do you have an idea?Inconsequential
Some JAR don't include the string-template classes. Try downloading one that does: antlr3.org/download.htmlAllout

© 2022 - 2024 — McMap. All rights reserved.