How can I modify the text of tokens in a CommonTokenStream with ANTLR?

Asked 9/2, 2010 at 11:57 Answered 15/11, 2019 at 10:50

compiler-construction antlr antlr3 lexical-analysis

I'm trying to learn ANTLR and at the same time use it for a current project.

I've gotten to the point where I can run the lexer on a chunk of code and output it to a CommonTokenStream. This is working fine, and I've verified that the source text is being broken up into the appropriate tokens.

Now, I would like to be able to modify the text of certain tokens in this stream, and display the now modified source code.

For example I've tried:

import org.antlr.runtime.*;
import java.util.*;

public class LexerTest
{
    public static final int IDENTIFIER_TYPE = 4;

    public static void main(String[] args)
    {
        String input = "public static void main(String[] args) { int myVar = 0; }";
        CharStream cs = new ANTLRStringStream(input);

        JavaLexer lexer = new JavaLexer(cs);
        CommonTokenStream tokens = new CommonTokenStream();
        tokens.setTokenSource(lexer);

        int size = tokens.size();
        for(int i = 0; i < size; i++)
        {
            Token token = (Token) tokens.get(i);
            if(token.getType() == IDENTIFIER_TYPE)
            {
                token.setText("V");
            }
        }
        System.out.println(tokens.toString());
    }  
}

I'm trying to set the text of every identifier token to the string literal "V".

Why are my changes to the token's text not reflected when I call tokens.toString()?
How am I supposed to know the various token type IDs? I walked through with my debugger and saw that the ID for the IDENTIFIER tokens was "4" (hence my constant at the top). But how would I have known that otherwise? Is there some other way of mapping token type IDs to token names?

EDIT:

One thing that is important to me is I wish for the tokens to have their original start and end character positions. That is, I don't want them to reflect their new positions with the variable names changed to "V". This is so I know where the tokens were in the original source text.

Invoice answered 9/2, 2010 at 11:57 Comment(1)

Just wondering - is it a requirement that you use ANTLR for this? – Lightship 18/12, 2013 at 22:11

ANTLR has a way to do this in its grammar file.

Let's say you're parsing a string consisting of numbers and strings delimited by commas. A grammar would look like this:

grammar Foo;

parse
  :  value ( ',' value )* EOF
  ;

value
  :  Number
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

This should all look familiar to you. Let's say you want to wrap square brackets around all integer values. Here's how to do that:

grammar Foo;

options {output=template; rewrite=true;} 

parse
  :  value ( ',' value )* EOF
  ;

value
  :  n=Number -> template(num={$n.text}) "[<num>]" 
  |  String
  ;

String
  :  '"' ( ~( '"' | '\\' ) | '\\\\' | '\\"' )* '"'
  ;

Number
  :  '0'..'9'+
  ;

Space
  :  ( ' ' | '\t' ) {skip();}
  ;

As you see, I've added some options at the top, and added a rewrite rule (everything after the ->) after the Number in the value parser rule.

Now to test it all, compile and run this class:

import org.antlr.runtime.*;

public class FooTest {
  public static void main(String[] args) throws Exception {
    String text = "12, \"34\", 56, \"a\\\"b\", 78";
    System.out.println("parsing: "+text);
    ANTLRStringStream in = new ANTLRStringStream(text);
    FooLexer lexer = new FooLexer(in);
    CommonTokenStream tokens = new TokenRewriteStream(lexer); // Note: a TokenRewriteStream!
    FooParser parser = new FooParser(tokens);
    parser.parse();
    System.out.println("tokens: "+tokens.toString());
  }
}

which produces:

parsing: 12, "34", 56, "a\"b", 78
tokens: [12],"34",[56],"a\"b",[78]

Parulis answered 9/2, 2010 at 13:50 Comment(0)

In ANTLR 4 there is a new facility using parse tree listeners and TokenStreamRewriter (note the name difference) that can be used to observe or transform trees. (The replies suggesting TokenRewriteStream apply to ANTLR 3 and will not work with ANTLR 4.)

In ANTLR 4 an XXXBaseListener class is generated for you with callbacks for entering and exiting each non-terminal node in the grammar (e.g. enterClassDeclaration()).

You can use the Listener in two ways:

As an observer: simply override the methods to produce arbitrary output related to the input text, e.g. override enterClassDeclaration() and output a line for each class declared in your program.
As a transformer: use a TokenStreamRewriter to modify the original text as it passes through. In the callback methods you use the rewriter to make modifications (insert, delete, replace tokens), and at the end you ask the rewriter for the modified text.

See the following examples from the ANTLR 4 book for how to do transformations:

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialIDListener.java

and

https://github.com/mquinn/ANTLR4/blob/master/book_code/tour/InsertSerialID.java

Sluff answered 5/4, 2015 at 17:19 Comment(2)

The links to GitHub repo are dead now. – Ax 18/9, 2017 at 18:30

I found an example here: #41180899 – Amaro 1/10, 2018 at 11:17

The other given example of changing the text in the lexer works well if you want to replace the text globally in all situations. Often, however, you only want to replace a token's text in certain situations.

Using the TokenRewriteStream gives you the flexibility of changing the text only in certain contexts.

This can be done using a subclass of the token stream class you were using. Instead of using the CommonTokenStream class you can use the TokenRewriteStream.

So you'd have the TokenRewriteStream consume the lexer and then you'd run your parser.

In your grammar typically you'd do the replacement like this:

/** Convert "int foo() {...}" into "float foo();" */
function
:
{
    RefTokenWithIndex t(LT(1));  // copy the location of the token you want to replace
    engine.replace(t, "float");
}
type id:ID LPAREN (formalParameter (COMMA formalParameter)*)? RPAREN
    block[true]
;

Here we've replaced the token int that we matched with the text float. The location information is preserved but the text it "matches" has been changed.

To inspect your token stream afterwards, you would use the same code as before.
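
For readers wondering how the text can change while the location stays put, here is a minimal plain-Java sketch of the idea behind a rewrite stream. It is a toy model under my own assumptions, not ANTLR's implementation: the class RewriteSketch, its Tok type and the lex helper are hypothetical names invented for illustration. Replacements are recorded per token index and applied only when the text is rendered, so the original tokens and their start/stop offsets are never touched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a rewrite stream (hypothetical, not ANTLR code). */
public class RewriteSketch {

    /** A token that remembers where it sat in the original input. */
    public static final class Tok {
        public final String text;
        public final int start; // index of the first character in the original input
        public final int stop;  // index of the last character in the original input

        public Tok(String text, int start, int stop) {
            this.text = text;
            this.start = start;
            this.stop = stop;
        }
    }

    private final List<Tok> tokens;
    // token index -> replacement text; the tokens themselves are never modified
    private final Map<Integer, String> replacements = new TreeMap<>();

    public RewriteSketch(List<Tok> tokens) {
        this.tokens = tokens;
    }

    /** Records a replacement for the token at the given index. */
    public void replace(int tokenIndex, String newText) {
        replacements.put(tokenIndex, newText);
    }

    /** Builds the output text, applying the recorded replacements on the fly. */
    public String render() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String replacement = replacements.get(i);
            sb.append(replacement != null ? replacement : tokens.get(i).text);
        }
        return sb.toString();
    }

    /** Very naive lexer: runs of letters/digits are one token, anything else is a one-character token. */
    public static List<Tok> lex(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int j = i + 1;
            if (Character.isLetterOrDigit(input.charAt(i))) {
                while (j < input.length() && Character.isLetterOrDigit(input.charAt(j))) {
                    j++;
                }
            }
            out.add(new Tok(input.substring(i, j), i, j - 1));
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> toks = lex("int myVar = 0;");
        RewriteSketch rewriter = new RewriteSketch(toks);
        rewriter.replace(2, "V"); // token 2 is "myVar"
        System.out.println(rewriter.render()); // prints "int V = 0;"
        Tok original = toks.get(2);
        // The original token still reports its old text and offsets.
        System.out.println(original.text + " " + original.start + ".." + original.stop); // prints "myVar 4..8"
    }
}
```

As far as I know, the real ANTLR 3 counterparts are TokenRewriteStream.replace(...) for recording a change, toString() for the rewritten text and toOriginalString() for the untouched input.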

Sello answered 9/2, 2010 at 18:36 Comment(3)

Thanks for the info. Do you have any idea why calling setText on the individual tokens didn't work? – Invoice 9/2, 2010 at 18:59

@Simucal, did you try using a TokenRewriteStream instead of a CommonTokenStream? – Parulis 9/2, 2010 at 19:7

@Simucal, I haven't dug into the java source for antlr, as I normally use C++, but I'd imagine that you are modifying a copy of the token stream and not the actual stream. – Sello 9/2, 2010 at 19:10

I've used the sample Java grammar to create an ANTLR script to process an R.java file and rewrite all the hex values in a decompiled Android app with values of the form R.string.*, R.id.*, R.layout.* and so forth.

The key is using TokenStreamRewriter to process the tokens and then output the result.

The project (Python) is called RestoreR

The modified ANTLR listener for rewriting

I parse the R.java file with one listener to build a mapping from integer to string, and then replace the hex values as I parse the program's Java files with a different listener that holds a rewriter instance.

class RValueReplacementListener(ParseTreeListener):
    replacements = 0
    r_mapping = {}
    rewriter = None

    def __init__(self, tokens):
        self.rewriter = TokenStreamRewriter(tokens)

    # Code removed for the sake of brevity

    # Enter a parse tree produced by JavaParser#integerLiteral.
    def enterIntegerLiteral(self, ctx:JavaParser.IntegerLiteralContext):
        hex_literal = ctx.HEX_LITERAL()
        if hex_literal is not None:
            int_literal = int(hex_literal.getText(), 16)
            if int_literal in self.r_mapping:
                # print('Replace: ' + ctx.getText() + ' with ' + self.r_mapping[int_literal])
                self.rewriter.replaceSingleToken(ctx.start, self.r_mapping[int_literal])
                self.replacements += 1
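
The lookup step in enterIntegerLiteral (parse the hex text, look it up in the mapping, substitute the name) can be tried out in isolation. The sketch below is plain Java and deliberately swaps the parse tree for a regular expression, so unlike the listener it would also rewrite hex values that appear inside strings and comments. HexLiteralSketch, replaceKnownIds and the R.layout.main entry are hypothetical names made up for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based stand-in for the listener's hex-to-name substitution (hypothetical). */
public class HexLiteralSketch {

    private static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");

    /** Replaces every hex literal that has an entry in names; all other text is copied unchanged. */
    public static String replaceKnownIds(String source, Map<Integer, String> names) {
        Matcher m = HEX.matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(source, last, m.start());
            String name = lookup(m.group(), names);
            out.append(name != null ? name : m.group());
            last = m.end();
        }
        out.append(source, last, source.length());
        return out.toString();
    }

    /** Returns the mapped name, or null if the literal is unknown or does not fit in 32 bits. */
    private static String lookup(String literal, Map<Integer, String> names) {
        try {
            int value = Integer.parseUnsignedInt(literal.substring(2), 16);
            return names.get(value);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new HashMap<>();
        names.put(0x7f030001, "R.layout.main"); // made-up mapping entry
        String code = "setContentView(0x7f030001); int x = 0x10;";
        System.out.println(replaceKnownIds(code, names));
        // prints "setContentView(R.layout.main); int x = 0x10;"
    }
}
```

Literals without an entry in the mapping, such as 0x10 above, are left exactly as they were, which mirrors the "if int_literal in self.r_mapping" check in the Python listener.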

Sandell answered 15/11, 2019 at 10:50 Comment(0)
