How do I get the original text that an antlr4 rule matched?

Using the Java 7 grammar https://github.com/antlr/grammars-v4/blob/master/java7/Java7.g4 I want to find methods with a specific name and then just print out that method. I see that I can use the methodDeclaration rule when I match. So I subclass Java7BaseListener and override this listener method:

    @Override
    public void enterMethodDeclaration(Java7Parser.MethodDeclarationContext ctx) { }

How do I get the original text out? ctx.getText() gives me a string with all the whitespace stripped out. I want the comments and original formatting.

Midget answered 2/5, 2013 at 16:37 Comment(1)
See also this question and this question.Ihab

ANTLR's CharStream interface has a method getText(Interval interval) which returns the original source in the given range. The context object exposes its first and last tokens as ctx.start and ctx.stop, and each token knows its character offsets in the input. Assuming your listener has a field called input that holds the CharStream being parsed, you can do this:

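    // Requires: import org.antlr.v4.runtime.misc.Interval;
    int a = ctx.start.getStartIndex();      // offset of the first character of the first token
    int b = ctx.stop.getStopIndex();        // offset of the last character of the last token (inclusive)
    Interval interval = new Interval(a, b);
    String text = input.getText(interval);  // original text, whitespace and comments included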
Midget answered 2/5, 2013 at 16:37 Comment(5)
If you don't have access to or don't want to keep track of the CharStream, use ctx.start.getInputStream() to retrieve it.Mouthwash
CharStream input = ctx.start.getInputStream(); input.getText(interval); Gives me runtime errors .checkBoundsOffCount(String.java:3101)Pitcher
And where it doesn't fail it still removes whitespacePitcher
Your answer helped me fix a weird error where calling getText() on a Token failed because somehow my original CharStream got garbage collected. I kept a reference to it and getText works now.Digit
Totally nonintuitive as far as solutions go, this was a lifesaver for me.Kenay

Demo:

    SqlBaseParser.QueryContext queryContext = context.query();
    int a = queryContext.start.getStartIndex();
    int b = queryContext.stop.getStopIndex();
    Interval interval = new Interval(a, b);
    String viewSql = context.start.getInputStream().getText(interval);
Malpighiaceous answered 28/8, 2018 at 3:23 Comment(0)

Python implementation:

def extract_original_text(self, ctx):
    token_source = ctx.start.getTokenSource()
    input_stream = token_source.inputStream
    start, stop  = ctx.start.start, ctx.stop.stop
    return input_stream.getText(start, stop)
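
A slightly more defensive variant of the function above may be useful when a rule matches nothing or when ctx.stop is missing. Treat this as a sketch under assumptions: original_text is a hypothetical helper name, and FakeStream, FakeToken and FakeCtx are stand-ins invented here so the snippet runs without the ANTLR runtime. It relies only on the attributes already used above (ctx.start, ctx.stop, token.start, token.stop) plus token.getInputStream(), and assumes getText(start, stop) treats both offsets as inclusive.

```python
class FakeStream:
    """Stand-in for an ANTLR input stream; getText is inclusive on both ends."""
    def __init__(self, data):
        self.data = data

    def getText(self, start, stop):
        return self.data[start:stop + 1]


class FakeToken:
    """Stand-in for a token carrying inclusive character offsets."""
    def __init__(self, stream, start, stop):
        self._stream = stream
        self.start = start
        self.stop = stop

    def getInputStream(self):
        return self._stream


class FakeCtx:
    """Stand-in for a rule context with a first and a last token."""
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop


def original_text(ctx):
    """Return the source text covered by ctx, or "" if no sane range exists."""
    first, last = ctx.start, ctx.stop
    if first is None or last is None:
        return ""
    # Negative offsets or a stop before the start mean there is nothing to slice.
    if first.start < 0 or last.stop < first.start:
        return ""
    return first.getInputStream().getText(first.start, last.stop)


source = "void  foo ( ) { /* keep me */ }"
stream = FakeStream(source)
ctx = FakeCtx(FakeToken(stream, 6, 8), FakeToken(stream, 12, 12))
print(repr(original_text(ctx)))
```

With a real parse tree, the same function should accept the ctx passed to a listener method directly; returning an empty string for unusable ranges is a design choice of this sketch, and raising an error instead may suit some callers better.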
Gradualism answered 12/7, 2021 at 8:58 Comment(0)

The accepted answer doesn't work when errors occur during parsing and ANTLR recovers from them (which is the default behavior).
By default ANTLR uses DefaultErrorStrategy, which creates tokens with startIndex=stopIndex=-1 for missing tokens (here is the source code).
The code in the accepted answer will throw an exception when it encounters such tokens.

ANTLR's smart error handling can also delete some "extra" tokens.

As a result, "the matched text" can consist of multiple chunks of the original text plus some tokens that have no matching original text.

Some possible solutions for this problem:

  • either use an ANTLRErrorStrategy without smart error handling (e.g. BailErrorStrategy)
  • or iterate recursively over node's children and collect text from valid tokens only.
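
The second option could look roughly like this in Python. It is a sketch, not a tested recipe: valid_tokens and matched_span are hypothetical helper names, and Tok, Leaf and Rule are stand-ins so the example runs without ANTLR. It assumes terminal nodes expose their token as .symbol, rule contexts expose .children (possibly None), and tokens conjured by error recovery carry -1 offsets as described above.

```python
class Tok:
    """Stand-in for a token: inclusive character offsets, -1 when conjured."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop


class Leaf:
    """Stand-in for a terminal node wrapping one token."""
    def __init__(self, symbol):
        self.symbol = symbol


class Rule:
    """Stand-in for a rule context with child nodes."""
    def __init__(self, *children):
        self.children = list(children)


def valid_tokens(node):
    """Yield the tokens under node that map to real source characters."""
    symbol = getattr(node, "symbol", None)
    if symbol is not None:
        # Skip tokens without a usable range (for example conjured ones).
        if 0 <= symbol.start <= symbol.stop:
            yield symbol
        return
    for child in getattr(node, "children", None) or []:
        yield from valid_tokens(child)


def matched_span(node, source):
    """Slice source from the first valid token to the last one."""
    tokens = list(valid_tokens(node))
    if not tokens:
        return ""
    lo = min(t.start for t in tokens)
    hi = max(t.stop for t in tokens)
    return source[lo:hi + 1]


# "int x = /* answer */ 42" with the trailing ';' missing from the input.
source = "int x = /* answer */ 42"
tree = Rule(
    Leaf(Tok(0, 2)),                                            # int
    Rule(Leaf(Tok(4, 4)), Leaf(Tok(6, 6)), Leaf(Tok(21, 22))),  # x = 42
    Leaf(Tok(-1, -1)),                                          # conjured ';'
)
print(matched_span(tree, source))
```

Slicing from the first valid token to the last keeps the whitespace and comments in between, at the cost of also including any text that error recovery skipped inside that range; joining per-token text instead would drop both.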
Upraise answered 1/12, 2023 at 6:16 Comment(0)
