Catching (and keeping) all comments with ANTLR

Asked 18/9, 2012 at 21:14 Answered 27/2 at 6:56

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc), I'm trying to keep all of the comments. This is difficult because comments can appear literally anywhere in the code. If a comment sits somewhere in the source code that the grammar doesn't allow for, ANTLR can't finish parsing the file.

Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.

Any ideas are welcome.

Vogler answered 18/9, 2012 at 21:14 Comment(0)

Is there a way to make ANTLR automatically add any comments it finds to the AST?

No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:

...

if_stat
 : 'if' comments '(' comments expr comments ')' comments ...
 ;

...

comments
 : (SingleLineComment | MultiLineComment)*
 ;

SingleLineComment
 : '//' ~('\r' | '\n')*
 ;

MultiLineComment
 : '/*' .* '*/'
 ;

Quadrireme answered 19/9, 2012 at 6:29 Comment(3)

That's what I figured. Oh well. The real problem is that comments can be anywhere in source code, so every rule has to have "comments?" in every part of it. – Vogler 19/9, 2012 at 14:0

@TSuds, yeah, that is correct. Note that since my comments rule matches zero or more comments, the ? is not needed after it. – Quadrireme 19/9, 2012 at 14:6

Depending on the use case, this might not be a good solution, see others. – Sebbie 22/12, 2020 at 22:16

Section 12.1 in "The Definitive ANTLR 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short, add this to the grammar file:

grammar Java;

@lexer::members {
    public static final int WHITESPACE = 1;
    public static final int COMMENTS = 2;
}

Then for your comments rules do this:

COMMENT
    : '/*' .*? '*/' -> channel(COMMENTS)
    ;

LINE_COMMENT
    : '//' ~[\r\n]* -> channel(COMMENTS)
    ;

Then, in your code, ask the token stream for the comment tokens through getHiddenTokensToLeft/getHiddenTokensToRight. Section 12.1 of the book shows how to do this.

Garver answered 31/7, 2013 at 2:43 Comment(1)

Does not work. warning(155): vhdl.g4:1645:24: rule SPACE contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output error(164): vhdl.g4:26:0: custom channels are not supported in combined grammars – Sogdian 31/8, 2015 at 8:47

First, direct all comments (and only comments) to a dedicated channel:

COMMENT
    : '/*' .*? '*/' -> channel(2)
    ;

LINE_COMMENT
    : '//' ~[\r\n]* -> channel(2)
    ;

Second, print out all comments:

      CommonTokenStream tokens = new CommonTokenStream(lexer);
      tokens.fill();
      for (int index = 0; index < tokens.size(); index++)
      {
         Token token = tokens.get(index);
         // substitute your own generated parser class for IDLParser
         if (token.getType() != IDLParser.WS)
         {
            String out = "";
            // Comments will be printed as channel 2 (configured in .g4 grammar file)
            out += "Channel: " + token.getChannel();
            out += " Type: " + token.getType();
            out += " Hidden: ";
            List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
            for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
            {
               if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
               {
                  out += "\n\t" + i + ":";
                  out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + "  Type: " + hiddenTokensToLeft.get(i).getType();
                  out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
               }
            }
            out += token.getText().replaceAll("\\s", "");
            System.out.println(out);
         }
      }

Abatement answered 16/5, 2016 at 11:27 Comment(1)

This is not the answer to the literal question Is there a way to make ANTLR automatically add any comments it finds to the AST?, but this was the solution that I needed :-) Thanks – Counterbalance 6/2, 2019 at 9:28

The "island grammars" feature can also be used. See the following section in the ANTLR4 book:

Island Grammars: Dealing with Different Formats in the Same File

Grandioso answered 3/8, 2015 at 11:0 Comment(0)

I did this in the lexer part of my grammar:

WS  :   ( [ \t\r\n] | COMMENT) -> skip
;

fragment
COMMENT
: '/*' .*? '*/' /* multi-line comment */
| '//' ~('\r' | '\n')* /* single-line comment */
;

This way the comments are removed automatically!

Lumpen answered 26/2, 2015 at 17:29 Comment(1)

This does not answer the OP's question. We want to pass all comments through to the output. So instead of skipping them, the goal is to have the lexer send the comments directly to the output file and skip parsing them altogether... but they need to keep their position relative to the token stream for context. – Billington 26/9, 2023 at 15:22

For ANTLR v3:

Whitespace tokens are usually not processed by the parser, but they are still captured on the HIDDEN channel.

If you use BufferedTokenStream, you can get the list of all tokens from it and post-process them, adding the hidden ones back as needed.
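To illustrate the post-processing idea, here is a minimal plain-Java sketch. It does not use the ANTLR runtime at all: the Tok class, the channel constants and the hiddenToLeft helper are all hypothetical stand-ins for a buffered token list in which comments sit on a non-default channel. It only shows the general shape of walking backwards from a parser-visible token to collect the off-channel tokens directly in front of it.

```java
import java.util.ArrayList;
import java.util.List;

public class HiddenTokenDemo {
    // Illustrative channel numbers; a real lexer defines its own.
    static final int DEFAULT_CHANNEL = 0;
    static final int COMMENT_CHANNEL = 2;

    // Hypothetical stand-in for a lexer token: just text plus a channel.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Collects the run of off-channel tokens that sits directly before
    // tokens.get(index), returned in source order.
    static List<String> hiddenToLeft(List<Tok> tokens, int index) {
        List<String> result = new ArrayList<>();
        for (int i = index - 1; i >= 0 && tokens.get(i).channel != DEFAULT_CHANNEL; i--) {
            result.add(0, tokens.get(i).text);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("// counter", COMMENT_CHANNEL));
        tokens.add(new Tok("int", DEFAULT_CHANNEL));
        tokens.add(new Tok("/* name */", COMMENT_CHANNEL));
        tokens.add(new Tok("x", DEFAULT_CHANNEL));

        // For every parser-visible token, print the comments attached to it.
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).channel == DEFAULT_CHANNEL) {
                System.out.println(tokens.get(i).text + " <- " + hiddenToLeft(tokens, i));
            }
        }
    }
}
```

With a real token stream the same loop would run over the buffered tokens after lexing, and the collected comments could be attached to whichever AST node owns the visible token.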

Jackass answered 27/3, 2018 at 9:27 Comment(0)

I have a project that has Kolasu as a dependency. To keep comments, I merged two answers (@baron.wang and @Bart Kiers) and now have a working solution. I sent the comments to channel(2) (@baron.wang's answer) and created a map. I didn't change the rest of the grammar (regarding @Bart Kiers's answer, I didn't want to sprinkle comment rules everywhere), and the input is parsed correctly.

Kolasu gives a Point for each comment, and I keep those in the map. When I need the comment for a class, I search for the comment located before the class's Point.
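A rough sketch of the map lookup, under simplifying assumptions: this is not Kolasu's API, the CommentIndex class is hypothetical, and a position is reduced to a plain line number instead of a full line/column Point. A sorted map makes "the nearest comment before this position" a single lookup.

```java
import java.util.Map;
import java.util.TreeMap;

public class CommentIndex {
    // Comments keyed by the line on which each one starts
    // (a simplification of a full line/column position).
    private final TreeMap<Integer, String> commentsByLine = new TreeMap<>();

    public void add(int line, String text) {
        commentsByLine.put(line, text);
    }

    // Returns the nearest comment that starts strictly before the given
    // line, or null when no comment precedes it.
    public String commentBefore(int line) {
        Map.Entry<Integer, String> entry = commentsByLine.lowerEntry(line);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        CommentIndex index = new CommentIndex();
        index.add(1, "/* file header */");
        index.add(4, "/** Documents class Foo. */");

        // Suppose class Foo is declared on line 5.
        System.out.println(index.commentBefore(5));
    }
}
```

A real implementation would probably also check that no other declaration lies between the comment and the class, so that a comment is not attributed to the wrong node.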

Nightgown answered 27/2 at 6:56 Comment(0)
