How to handle nested comments in an ANTLR lexer
T

5

6

How can I handle nested comments in an ANTLR4 lexer? That is, I need to count the number of "/*" occurrences inside the token and close it only after the same number of "*/" have been seen. As an example, the D language has nested comments of this kind, written "/+ ... +/".

For example, the following lines should be treated as one block of comments:

/* comment 1
   comment 2
   /* comment 3
      comment 4
   */
   // comment 5
   comment 6
*/
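
The counting behavior described above can be sketched outside ANTLR as a small hand-written scanner. This is only an illustration of the intended semantics, not an ANTLR rule; the function name find_comment_end and its opener/closer parameters are made up for this example.

```python
def find_comment_end(text, start=0, opener="/*", closer="*/"):
    """Return the index just past the balanced comment that opens at `start`.

    Returns None if no comment opens at `start` or if it is never closed.
    """
    if not text.startswith(opener, start):
        return None
    depth = 0
    i = start
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1               # one more level of nesting
            i += len(opener)
        elif text.startswith(closer, i):
            depth -= 1               # one level closed
            i += len(closer)
            if depth == 0:
                return i             # the outermost comment is closed
        else:
            i += 1                   # ordinary comment text
    return None                      # input ended while depth > 0


sample = """/* comment 1
   comment 2
   /* comment 3
      comment 4
   */
   // comment 5
   comment 6
*/ int x;"""

end = find_comment_end(sample)
print(repr(sample[end:]))  # ' int x;'
```

Passing opener="/+" and closer="+/" gives the D-style variant. Note that this sketch treats "//" inside a block comment as plain text.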

My current code is the following, and it does not work on the above nested comment:

COMMENT : '/*' .*? '*/' -> channel(HIDDEN)
        ;
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n'  -> channel(HIDDEN)
        ;
Thence answered 18/12, 2014 at 4:40 Comment(0)
T
16

Terence Parr uses these two lexer rules in his Swift ANTLR4 grammar to lex nested comments:

COMMENT : '/*' (COMMENT|.)*? '*/' -> channel(HIDDEN) ;
LINE_COMMENT  : '//' .*? '\n' -> channel(HIDDEN) ;
Temper answered 25/3, 2015 at 3:58 Comment(1)
Is there any way to also add a rule for unfinished block comments so that a custom error is raised? - Wavy
R
4

I'm using:

COMMENT: '/*' ('/'*? COMMENT | ('/'* | '*'*) ~[/*])*? '*'*? '*/' -> skip;

This forces any /* inside a comment to begin a nested comment and, likewise, any */ to end one. In other words, /* and */ can only be recognized at the beginning and end of the COMMENT rule.

This way, something like /* /* /* */ a */ is not recognized in its entirety as a (bad) comment with mismatched /* and */, as it would be with COMMENT: '/*' (COMMENT|.)*? '*/' -> skip;. Instead it is lexed as /, followed by *, followed by the correctly nested comment /* /* */ a */.
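
To make that splitting concrete, here is a rough Python sketch of the strict behavior: a comment token is produced only where a fully balanced comment starts, and anything else falls through one character at a time. It mimics the description above, not the exact ANTLR rule or ANTLR's matching algorithm. The names balanced_end and split_tokens, and the token labels COMMENT and CHAR, are invented for the example.

```python
def balanced_end(text, start):
    """Index just past a balanced /* ... */ that opens at `start`, else None."""
    if not text.startswith("/*", start):
        return None
    depth, i = 0, start
    while i < len(text):
        if text.startswith("/*", i):
            depth, i = depth + 1, i + 2
        elif text.startswith("*/", i):
            depth, i = depth - 1, i + 2
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unbalanced: more /* than */


def split_tokens(text):
    """Yield ('COMMENT', s) for balanced comments, ('CHAR', c) for the rest."""
    i = 0
    while i < len(text):
        end = balanced_end(text, i)
        if end is not None:
            yield ("COMMENT", text[i:end])
            i = end
        else:
            yield ("CHAR", text[i])  # stand-in for whatever other rule matches
            i += 1


tokens = list(split_tokens("/* /* /* */ a */"))
for kind, value in tokens:
    print(kind, repr(value))
```

The first /* cannot be balanced, so its two characters (and the space after it) come out as separate tokens, and the remaining /* /* */ a */ is a single comment.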

Reorientation answered 5/9, 2015 at 21:1 Comment(0)
Z
1

This works for ANTLR3.

Allows nested comments and '*' within a comment.

fragment
F_MultiLineCommentTerm
:
(   {LA(1) == '*' && LA(2) != '/'}? => '*'
|   {LA(1) == '/' && LA(2) == '*'}? => F_MultiLineComment
|   ~('*') 
)*
;   

fragment
F_MultiLineComment
:
'/*' 
F_MultiLineCommentTerm
'*/'
;   

H_MultiLineComment
:   r=  F_MultiLineComment
    {   $channel=HIDDEN;
        fprintf(stderr,"F_MultiLineComment[\%s]",$r->getText($r)->chars); 
    }
;
Zinnes answered 20/7, 2017 at 2:31 Comment(1)
Remember that ANTLR3-generated lexers only have one character of lookahead unless you use predicates, as in the example above. - Zinnes
F
0

I can give you an ANTLR3 solution, which you can adjust to work in ANTLR4:

I think you can use a recursive rule invocation: make a non-greedy comment rule for /* ... */ that calls itself. That should allow unlimited nesting without having to count opening and closing comment markers:

COMMENT option { greedy = false; }:
    ('/*' ({LA(1) == '/' && LA(2) == '*'} => COMMENT | .) .* '*/') -> channel(HIDDEN)
;

or maybe even:

COMMENT option { greedy = false; }:
    ('/*' .* COMMENT? .* '*/') -> channel(HIDDEN)
;

I'm not sure whether ANTLR picks the right path between the any-character alternative and the comment introducer. Try it out.
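
The idea of a rule that calls itself can be illustrated without ANTLR as a recursive-descent matcher. This is a plain Python sketch that loosely mirrors the shape '/*' (COMMENT | .)* '*/'. It says nothing about how ANTLR3 or ANTLR4 resolve the alternatives, and the name match_comment is hypothetical.

```python
def match_comment(text, pos=0):
    """Return the index just past the comment that opens at `pos`.

    Returns None if no properly closed comment starts there.
    """
    if not text.startswith("/*", pos):
        return None
    i = pos + 2
    while i < len(text):
        if text.startswith("*/", i):       # closing marker ends this level
            return i + 2
        if text.startswith("/*", i):       # nested opener: the rule calls itself
            i = match_comment(text, i)
            if i is None:                  # the nested comment was never closed
                return None
        else:
            i += 1                         # any other character
    return None                            # input ended before the closing marker


print(match_comment("/* a /* b */ c */ d"))  # 17
print(match_comment("/* a /* b */"))         # None
```

No explicit counter is needed because the call stack tracks the nesting depth. The trade-off is that very deep nesting can hit Python's recursion limit.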

Falsetto answered 18/12, 2014 at 9:1 Comment(1)
Thanks. I first tried your solution in ANTLR3 itself, before moving to ANTLR4, and it did not work. I just replaced my COMMENT rule with your rule and then ANTLR3 reported errors like: error(100): X.g:syntax error: antlr: MissingTokenException(inserted [@-1,0:0='<missing COLON>',<54>,1927:8] at option) - Thence
H
0
  1. This handles '/*/*/' and '/*.../*/', where the comment body is '/' and '.../' respectively.
  2. Multiline comments do not nest inside single-line comments, so you can neither start nor end a multiline comment inside a single-line comment.
    • This is not a valid comment: '/* // */'.
    • You need a newline to end the single-line comment before the '*/' can be consumed to end the multiline comment.
    • This is a valid comment: '/* // */ \n /*/'.
    • The comment body is ' // */ \n /'. As you can see, the complete single-line comment is included in the body of the multiline comment.
  3. Although '/*/' can end a multiline comment, if the preceding character is '*' the comment ends at the first '/', and the remaining '*/' must then close an enclosing comment; otherwise there is an error. The shortest path wins, because the rule is non-greedy!
    • This is not a valid comment: /****/*/
    • This is a valid comment: /*/****/*/. The comment body is /****/, which is itself a nested comment.
  4. The prefix and suffix will never be matched in the multiline comment body.
  5. If you want to implement this for the D language, change each '*' to '+'.

COMMENT_NEST : '/*' ( ('/'|'*'+)? ~[*/] | COMMENT_NEST | COMMENT_INL )*? ('/'|'*'+?)? '*/' ;

COMMENT_INL : '//' ( COMMENT_INL | ~[\n\r] )* ;

Handicraft answered 16/10, 2017 at 19:2 Comment(0)
