Python AST from ANTLR Parse Tree?
Asked Answered
C

4

6

I found an ANTLRv4 Python3 grammer, but it generates a parse-tree, which generally has many useless nodes.

I'm looking for a known package to get a Python AST from that parse tree.

Does something like this exist?

EDIT: Clarification regarding use of the Python ast package: my project is in Java and I need to parse Python files.

EDIT 2: By 'AST' I mean http://docs.python.org/2/library/ast.html#abstract-grammar, while by 'parse tree' I mean http://docs.python.org/2/reference/grammar.html.

Cleary answered 15/7, 2014 at 19:25 Comment(6)
Are you set on that parse tree (and if so, why)? Otherwise there's a perfectly good ast module in the standard library.Foetor
That's true only if you code in Python :) I'm actually writing in Java, but I need to parse files which are written in a subset of Python.Cleary
I've looked for a Python 3 parser in Java but never found one (which does not mean there is no such thing, of course). What are you planning to do that requires an AST and cannot be done with a plain parse tree?Maharashtra
Oh, that was surprising (@BartKiers responding here) :). I suppose I can work directly with the PT, but it's a hassle. I need to extract function calls (and their arguments). Unfortunately "f(arg1='1')" generates a tree with 63 nodes, and I can't just visit the "atom" nodes...Cleary
Yeah, I see what you mean. I'll ponder this for a bit...Maharashtra
That would be absolutely fantastic :)Cleary
M
7

The following could be a start:

public class AST {

    private final Object payload;

    private final List<AST> children;

    public AST(ParseTree tree) {
        this(null, tree);
    }

    private AST(AST ast, ParseTree tree) {
        this(ast, tree, new ArrayList<AST>());
    }

    private AST(AST parent, ParseTree tree, List<AST> children) {

        this.payload = getPayload(tree);
        this.children = children;

        if (parent == null) {
            walk(tree, this);
        }
        else {
            parent.children.add(this);
        }
    }

    public Object getPayload() {
        return payload;
    }

    public List<AST> getChildren() {
        return new ArrayList<>(children);
    }

    private Object getPayload(ParseTree tree) {
        if (tree.getChildCount() == 0) {
            return tree.getPayload();
        }
        else {
            String ruleName = tree.getClass().getSimpleName().replace("Context", "");
            return Character.toLowerCase(ruleName.charAt(0)) + ruleName.substring(1);
        }
    }

    private static void walk(ParseTree tree, AST ast) {

        if (tree.getChildCount() == 0) {
            new AST(ast, tree);
        }
        else if (tree.getChildCount() == 1) {
            walk(tree.getChild(0), ast);
        }
        else if (tree.getChildCount() > 1) {

            for (int i = 0; i < tree.getChildCount(); i++) {

                AST temp = new AST(ast, tree.getChild(i));

                if (!(temp.payload instanceof Token)) {
                    walk(tree.getChild(i), temp);
                }
            }
        }
    }

    @Override
    public String toString() {

        StringBuilder builder = new StringBuilder();

        AST ast = this;
        List<AST> firstStack = new ArrayList<>();
        firstStack.add(ast);

        List<List<AST>> childListStack = new ArrayList<>();
        childListStack.add(firstStack);

        while (!childListStack.isEmpty()) {

            List<AST> childStack = childListStack.get(childListStack.size() - 1);

            if (childStack.isEmpty()) {
                childListStack.remove(childListStack.size() - 1);
            }
            else {
                ast = childStack.remove(0);
                String caption;

                if (ast.payload instanceof Token) {
                    Token token = (Token) ast.payload;
                    caption = String.format("TOKEN[type: %s, text: %s]",
                            token.getType(), token.getText().replace("\n", "\\n"));
                }
                else {
                    caption = String.valueOf(ast.payload);
                }

                String indent = "";

                for (int i = 0; i < childListStack.size() - 1; i++) {
                    indent += (childListStack.get(i).size() > 0) ? "|  " : "   ";
                }

                builder.append(indent)
                        .append(childStack.isEmpty() ? "'- " : "|- ")
                        .append(caption)
                        .append("\n");

                if (ast.children.size() > 0) {
                    List<AST> children = new ArrayList<>();
                    for (int i = 0; i < ast.children.size(); i++) {
                        children.add(ast.children.get(i));
                    }
                    childListStack.add(children);
                }
            }
        }

        return builder.toString();
    }
}

and can be used to create an AST for the input "f(arg1='1')\n" as follows:

public static void main(String[] args) {

    Python3Lexer lexer = new Python3Lexer(new ANTLRInputStream("f(arg1='1')\n"));
    Python3Parser parser = new Python3Parser(new CommonTokenStream(lexer));

    ParseTree tree = parser.file_input();
    AST ast = new AST(tree);

    System.out.println(ast);
}

which would print:

'- file_input
   |- stmt
   |  |- small_stmt
   |  |  |- atom
   |  |  |  '- TOKEN[type: 35, text: f]
   |  |  '- trailer
   |  |     |- TOKEN[type: 47, text: (]
   |  |     |- arglist
   |  |     |  |- test
   |  |     |  |  '- TOKEN[type: 35, text: arg1]
   |  |     |  |- TOKEN[type: 53, text: =]
   |  |     |  '- test
   |  |     |     '- TOKEN[type: 36, text: '1']
   |  |     '- TOKEN[type: 48, text: )]
   |  '- TOKEN[type: 34, text: \n]
   '- TOKEN[type: -1, text: ]

I realize this still contains nodes you might not want, but you could even add a set of token types you'd like to exclude. Feel free to hack away!

Here is a Gist containing a version of the code above with the proper import statements and some JavaDocs and inline comments.

Maharashtra answered 16/7, 2014 at 19:27 Comment(1)
That's awesome :) you basically contract chains into a single node, right? (EDIT: now I saw the gist - so yes)Cleary
C
0

The Eclipse DLTK project Python subproject implements a custom Python AST model in Java. It is built from from an AntlrV3 ast, but should not be too difficult to refit to build from an AntlrV4 parse tree.

The Eclipse PyDev project presumably also implements a Java-based AST for python source. Note, the layout of the source tree in both projects should be quite similar.

Naturally, you should check the licenses before using code from these sources, just to be sure.

Coenocyte answered 15/7, 2014 at 22:31 Comment(5)
A quick glance suggests that both generate a PT, not an AST. If there's a transformation as a second stage, I couldn't easily find it. Do you know your way around these code bases?Cleary
Not sure why you think it is a parse tree - the first link puts you in a directory "AST", which contains an AST model of Python. The second link puts you in the directory with the source that builds the the Antlr3 AST and drives the construction of the explicitly Python AST model. The grammar and generated parser/lexer are here: github.com/KDReleng/org.eclipse.dltk.python/tree/master/plugins/…Coenocyte
The difference between PT and AST is described in https://mcmap.net/q/174057/-what-39-s-the-difference-between-parse-trees-and-abstract-syntax-trees-asts . The link you gave sends me a parser which generates a PT, at least according to the fact that small_stmt contains simple_stmt, and nowhere there is a Call or Keyword node.Cleary
The AST represents calls as a type of expression, specifically a CallHolder. Keywords are represented as keyword specific nodes, such as ForEachStatement. It may not be the model you are looking for, but it is quite clearly an AST.Coenocyte
For posterity, by 'ast' I mean docs.python.org/2/library/ast.html#abstract-grammar, while by 'parse tree' I mean docs.python.org/2/reference/grammar.html . The difference is that the grammar is written with ease of definition in mind, while the ast is written with ease of use in mind.Cleary
C
0

I found a workaround:

Use Jython and ast (thanks @delnan for leading me there). Or, do everything you need directly in Python code, and just spit out the results back to Java.

PythonInterpreter interpreter = new PythonInterpreter();
interpreter.exec("import ast");
PyObject o = interpreter.eval(
    "ast.dump(ast.parse('f(arg1=\\'1\\')', 'filename', 'eval'))" + "\n");
System.out.print(o.toString());

Output is

Expression(body=Call(func=Name(id='f', ctx=Load()), args=[], keywords=[keyword(arg='arg1', value=Str(s='1'))], starargs=None, kwargs=None))

This doesn't strictly answer the question, and might not be applicable for all users, so I'm leaving this answer unselected.

Cleary answered 16/7, 2014 at 3:46 Comment(0)
T
0

ANTLR4 can generate a visitor, which you can use to traverse the parse tree and to construct an AST. Python has an ast package, so this should not be a problem (if you're using Python).

I have written a toy Python interpreter in Python 3 using ANTLR4 (as a part of my study). Visitor code is located in /tinypy/AST/builder/, so you can get an idea of how it's done.

Tempura answered 25/12, 2015 at 19:42 Comment(1)
Thanks for the suggestions - but I'm looking for something more complete. My project is only parsing Python, and I'd rather not spend lots of time writing a parser. There are 2 issues with this - (1) there's a big difference between a toy parser and something that is correct for all Python source code, and (2) it greatly increases my maintenance cost.Cleary

© 2022 - 2024 — McMap. All rights reserved.