Using ANTLR to analyze and modify source code; am I doing it wrong?

Asked 23/7, 2012 at 7:28 Answered 24/7, 2012 at 4:35

Solved parsing antlr code-analysis antlr3

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:

foo(['a', 'b', 'c']);

Extract 'a', 'b', and 'c' and rewrite the code as:

foo('bar', [0, 1, 2]);

I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.

The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.

My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:

a.b.foo();

To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:

Program
    CallExpression
        Identifier('foo')
        ArgumentList
            ArrayLiteral
                StringLiteral('a')
                StringLiteral('b')
                StringLiteral('c')

This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.

I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?

Hardenberg answered 23/7, 2012 at 7:28 Comment(1)

"Is there a more proper way for me to solve this problem?": No, not AFAIK. You parse your input, manipulate it, and then output it yourself. StringTemplate, can, as Dave mentions, help you with this. – Ezmeralda 23/7, 2012 at 9:47

So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:

var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;

var dependencies = new List<IModule>();

var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;

tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex, 
        i.ToString());
}

var rewrittenCode = tokenStream.ToString();

Hardenberg answered 24/7, 2012 at 4:35 Comment(0)

What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.

Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.

The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.

For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:

       domain ECMAScript.

       tag replace;  -- says this is a special kind of temporary tree


       rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
           expression->expression
        =   " \function_name ( '[' \list ']' ) "
        -> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";


       rule replace_unit_list(s:character_literal):
           expression_list -> expression_list
           replace(s) -> compute_index_for(s);

       rule replace_long_list(s:character_list, list:expression_list):
           expression_list -> expression_list
           "\replace\(\s\,\list)->  "compute_index_for\(\s\),\list";

with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.

Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.

This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).

Your specific example may vary somewhat since you really weren't precise about the details.

Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).

You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.

Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.

Receipt answered 23/7, 2012 at 13:14 Comment(4)

Thank you for the detailed answer. I was hoping to avoid having to write an entire pretty printer (or, in terms of DMS, I would want a fidelity printer). I'm only wanting to inject/replace portions of the code, so it would be unfortunate if I had to write code to output an entire program just to do this. – Hardenberg 23/7, 2012 at 15:17

I understand your desire to avoid work :-} Bad news: its pretty hard to do when you want to make changes to code. Good news: somebody has done all the essentials :-} – Receipt 23/7, 2012 at 15:35

It is not completely clear if you are in fact affiliated with this company that provides this product. If so can you preface your answer with disclosure of your personal affiliation with this company? – Overfly 25/1, 2019 at 14:45

Stack Overflow long ago resolved the issue of necessary declaration of relationships. The phrase "our" <tool> makes it clear. For your edification I will elaborate: this is a tool that I started thinking about in 1980 (yes, really), architectected and personally built (with some really expert helpers) starting in 1996, so it has 20 linear years of PhD level engineering invested. And yes, I'm the founder of the company that supplies the tool. You can learn more from my bio. You will note I provided references to other (noncommericial) tools that can do similar things. – Receipt 25/1, 2019 at 14:50

So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:

var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;

var dependencies = new List<IModule>();

var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;

tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex, 
        i.ToString());
}

var rewrittenCode = tokenStream.ToString();

Hardenberg answered 24/7, 2012 at 4:35 Comment(0)

Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.

Here is an article on translation via ANTLR

Sollie answered 23/7, 2012 at 8:45 Comment(1)

I'd hoped to avoid creating code to output an entire JavaScript program when all I need to do is inject and replace portions of JavaScript code. But if I do need to create an entire output engine, String Template looks promising. Thanks for the helpful link to the article. – Hardenberg 23/7, 2012 at 15:20

Recommended topics

Hot tags