ANTLR 4 - How to access hidden comment channel from custom listener?

I'm writing a pretty-printer for legacy code in an older language. The plan is for me to learn parsing and unparsing before I write a translator to output C++. I kind of got thrown into the deep end with Java and ANTLR back in June, so I definitely have some knowledge gaps.

I've gotten to the point where I'm comfortable writing methods for my custom listener, and I want to be able to pretty-print the comments as well. My comments are on a separate hidden channel. Here are the grammar rules for the hidden tokens:

/* Comments and whitespace -- Nested comments are allowed, each is redirected to a specific channel */
COMMENT_1   :  '(*' (COMMENT_1|COMMENT_2|.)*? '*)'  -> channel(1)  ;
COMMENT_2   :  '{' (COMMENT_1|COMMENT_2|.)*? '}'    -> channel(1)  ;

NEWLINES    :  [\r\n]+                              -> channel(2)  ; 
WHITESPACE  :  [ \t]+                               -> skip        ;

I've been playing with the Cymbol CommentShifter example on p. 207 of The Definitive ANTLR 4 Reference and I'm trying to figure out how to adapt it to my listener methods.

public void exitVarDecl(ParserRuleContext ctx) {
    Token semi = ctx.getStop();
    int i = semi.getTokenIndex();
    List<Token> cmtChannel = tokens.getHiddenTokensToRight(i, CymbolLexer.COMMENTS);
    if (cmtChannel != null) {
        Token cmt = cmtChannel.get(0);
        if (cmt != null) {
            String txt = cmt.getText().substring(2);
            String newCmt = "// " + txt.trim(); // printing comments in original format
            rewriter.insertAfter(ctx.stop, newCmt); // at end of line
            rewriter.replace(cmt, "\n");
        }
    }
}

I adapted this example by using exitEveryRule rather than exitVarDecl, and it worked for the Cymbol example. But when I adapt it to my own listener, I get a null pointer exception whether I use exitEveryRule or exitSpecificThing.

I'm looking at this answer and it seems promising but I think what I really need is an explanation of how the hidden channel data is stored and how to access it. It took me months to really get listener methods and context in the parse tree.

It seems like CommonTokenStream.LT(), CommonTokenStream.LA(), and consume() are what I want to be using, but why is the example in that SO answer using completely different methods from the ANTLR book example? What should I know about the token index or token types?

I'd like to better understand the logic behind this.

Okay, so I can't answer how ANTLR stores its data internally, but I can tell you how to access your hidden tokens. I have tested this on my computer using ANTLR v4.1 for C# .NET v4.5.2.

I have a rule that looks like this:

LineComment
    :   '//' ~[\r\n]*
        -> channel(1)
    ;

In my code, I am getting the entire raw token stream like this:

IList<IToken> lTokenList = cmnTokenStream.Get( 0, cmnTokenStream.Size );

To test, I printed the token list using the following loop:

foreach ( IToken iToken in lTokenList )
{
    Console.WriteLine( "{0}[{1}] : {2}",
        iToken.Channel,
        iToken.TokenIndex,
        iToken.Text );
}

Running on this code:

void Foo()
{
    // comment
    i = 5;
}

Yields the following output (for the sake of brevity, please assume I have a complete grammar that is also ignoring whitespace):

0[0] : void
0[1] : Foo
0[2] : (
0[3] : )
0[4] : {
1[5] : // comment
0[6] : i
0[7] : =
0[8] : 5
0[9] : ;
0[10] : }

You can see the channel index is 1 only for the single comment token. So you can use this loop to access only the comment tokens:

int lCommentCount = 0;
foreach ( IToken iToken in lTokenList )
{
    if ( iToken.Channel == 1 )
    {
        Console.WriteLine( "{0} : {1}",
            lCommentCount++,
            iToken.Text );
    }
}
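Since the question is about the Java target, here is a minimal Java sketch of the same idea. It does not use the ANTLR runtime: `Tok`, `sampleTokens`, `onChannel`, and `hiddenToRight` are hypothetical stand-ins that only model the picture above, a single flat token list in which every token carries its own channel number and index. `hiddenToRight` imitates, as I understand it, what the book's `getHiddenTokensToRight` call is doing: walk right from a token index and collect tokens on the requested channel until the next default-channel token.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenChannelSketch {
    /** Hypothetical stand-in for an ANTLR token: channel, buffer index, text. */
    static final class Tok {
        final int channel;
        final int index;
        final String text;

        Tok(int channel, int index, String text) {
            this.channel = channel;
            this.index = index;
            this.text = text;
        }
    }

    static final int DEFAULT_CHANNEL = 0;
    static final int COMMENT_CHANNEL = 1;

    /** Builds the buffer for the Foo() example: one flat list, all channels included. */
    static List<Tok> sampleTokens() {
        String[] texts = {"void", "Foo", "(", ")", "{", "// comment", "i", "=", "5", ";", "}"};
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i < texts.length; i++) {
            int channel = texts[i].startsWith("//") ? COMMENT_CHANNEL : DEFAULT_CHANNEL;
            out.add(new Tok(channel, i, texts[i]));
        }
        return out;
    }

    /** Returns every token on the given channel, in buffer order. */
    static List<Tok> onChannel(List<Tok> tokens, int channel) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.channel == channel) {
                out.add(t);
            }
        }
        return out;
    }

    /** Tokens on the given channel between tokens[index] and the next default-channel token. */
    static List<Tok> hiddenToRight(List<Tok> tokens, int index, int channel) {
        List<Tok> out = new ArrayList<>();
        for (int i = index + 1; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.channel == DEFAULT_CHANNEL) {
                break; // stop at the next token the parser would actually see
            }
            if (t.channel == channel) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> tokens = sampleTokens();
        for (Tok t : onChannel(tokens, COMMENT_CHANNEL)) {
            System.out.println(t.index + " : " + t.text);
        }
        // Hidden tokens right after "{" (index 4) and right after ";" (index 9)
        System.out.println(hiddenToRight(tokens, 4, COMMENT_CHANNEL).size());
        System.out.println(hiddenToRight(tokens, 9, COMMENT_CHANNEL).size());
    }
}
```

One simplification to be aware of: this sketch returns an empty list when nothing is found, whereas the code in the question checks the result of `getHiddenTokensToRight` for null, so a listener written against the real API should keep that null check.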

Then you can do whatever you like with those tokens. This also works if you have multiple channels, though I will caution against using more than 65,536 of them. ANTLR gave the following error when I tried to compile a grammar with a token rule redirected to channel index 65536:

Serialized ATN data element out of range.

So I guess they're only using a 16-bit unsigned integer to index the channels. Weird.
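To make the 16-bit guess concrete, here is a small Java sketch of the arithmetic. It assumes, as guessed above, that the index has to fit in an unsigned 16-bit slot (Java's `char` has exactly that range); `fitsInUnsigned16` is a hypothetical helper for illustration, not part of ANTLR.

```java
public class ChannelRangeSketch {
    /** Hypothetical helper: true if value fits in an unsigned 16-bit slot (0..65535). */
    static boolean fitsInUnsigned16(int value) {
        return value >= Character.MIN_VALUE && value <= Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(fitsInUnsigned16(65535)); // largest index that fits
        System.out.println(fitsInUnsigned16(65536)); // one past the end
        // Narrowing 65536 to 16 bits wraps around instead of keeping the value
        System.out.println((int) (char) 65536);
    }
}
```

If the guess is right, that would explain why index 65535 is the last one the tool accepts and 65536 is rejected rather than silently wrapped.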
