ANTLR4 Lexer Matching Start of Line End Of Line

How can I get the effect of the Perl regular expression anchors ^ and $ in the ANTLR4 lexer, i.e. match the start of a line and the end of a line without consuming any characters?

I am trying to use the ANTLR4 lexer to match a # character at the start of a line but not in the middle of a line. For example, I want to isolate and toss out all C++ preprocessor directives, regardless of which directive it is, while disregarding a # inside a string literal. (Normally we could tokenize C++ string literals to eliminate a # appearing in the middle of a line, but assume we're not doing that.) That means I only want to specify # .*? without bothering with #if, #ifndef, #pragma, etc.

Also, the C++ standard allows whitespace and multi-line comments right before and after the #, e.g.

   /* helo
world*/  #  /* hel
l
o
*/  /*world */ifdef .....

is considered a valid preprocessor directive appearing on a single line (the CRLFs inside the ML_COMMENTs are tossed).

This is what I am currently doing:

PPLINE: '\r'? '\n' (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+ -> channel(PPDIR); 

But the problem is that I have to rely on the existence of a CRLF before the # and toss out that CRLF together with the directive. I then need to replace the tossed CRLF with the CRLF that ends the directive line, so I have to make sure the directive is terminated by a CRLF.

However, that means my grammar cannot handle a directive appearing right at the start of the file (i.e. with no preceding CRLF) or one followed by EOF without a terminating CRLF.

If the Perl-style regex ^ and $ syntax were available, I could match the SOL/EOL instead of explicitly matching and consuming the CRLF.
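For readers less familiar with zero-width anchors, here is a minimal sketch of the behaviour being asked for, written with Python's re module rather than ANTLR. It is a simplification: it only allows spaces, tabs and form feeds before the #, and ignores the multi-line comment case described above. The names DIRECTIVE and directive_lines are made up for the illustration.

```python
import re

# Under re.MULTILINE, ^ and $ match at line boundaries without consuming
# any character. The lookahead allows an optional '\r' before the line end
# while keeping it out of the matched text.
DIRECTIVE = re.compile(r"^[ \t\f]*#[^\r\n]*(?=\r?$)", re.MULTILINE)


def directive_lines(source):
    """Return every line-initial '#...' run, without its line terminator."""
    return [m.group(0) for m in DIRECTIVE.finditer(source)]
```

Because nothing is consumed at either end, a directive on the very first line and a directive on the last line with no trailing CRLF are both matched, while a # in the middle of a line is not.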

Costello answered 5/5, 2013 at 8:3 Comment(0)

You can use semantic predicates for the conditions.

PPLINE
    :   {getCharPositionInLine() == 0}?   // only try this rule at the start of a line
        (ML_COMMENT | '\t' | '\f' |' ')* '#' (ML_COMMENT | ~[\r\n])+
        {_input.LA(1) == '\r' || _input.LA(1) == '\n'}?   // peek at the line break without consuming it
        -> channel(PPDIR)
    ;
Halpin answered 5/5, 2013 at 17:37 Comment(5)
In Terence Parr's book, semantic predicates are said to appear on the right edge of lexer rules. How should we interpret your example, which has a semantic predicate on the left edge? - Costello
In ANTLR 4, semantic predicates can appear anywhere in a lexer rule, and they'll be evaluated at the point where they appear. Parser rules are a bit more restrictive: predicates can only appear on the left edge of a decision. - Halpin
NameError: name 'getCharPositionInLine' is not defined - Uniformitarian
Same problem here: ReferenceError: getCharPositionInLine is not defined. Does this not exist in JavaScript? - Worthy
Actually, it seems you need to use 'this.column' instead (the documentation is not great). - Worthy
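To make the logic of the two predicates concrete independently of the ANTLR target, here is a rough hand-written scanner in Python. It is only a sketch of the idea, not the ANTLR runtime: the variable column plays the role of getCharPositionInLine(), and stopping in front of the line break stands in for the _input.LA(1) check. The function names match_directive and find_directives are hypothetical, and unlike the rule above the sketch also accepts a bare # and a directive that ends at EOF.

```python
from typing import List


def _skip_comment(text: str, i: int) -> int:
    """Return the index just past a /* ... */ comment at i, or i if none."""
    if text.startswith("/*", i):
        end = text.find("*/", i + 2)
        if end != -1:
            return end + 2
    return i


def match_directive(text: str, pos: int) -> int:
    """Return the end index of a directive starting at pos, or -1."""
    n = len(text)
    i = pos
    # Leading whitespace and multi-line comments before the '#'.
    while i < n:
        j = _skip_comment(text, i)
        if j != i:
            i = j
        elif text[i] in " \t\f":
            i += 1
        else:
            break
    if i >= n or text[i] != "#":
        return -1
    i += 1
    # Body: comments (which may span lines) or any non-newline character.
    while i < n and text[i] not in "\r\n":
        j = _skip_comment(text, i)
        i = j if j != i else i + 1
    # Stop in front of the line break (or at EOF); it is not consumed.
    return i


def find_directives(text: str) -> List[str]:
    """Collect directives that begin at column 0."""
    found = []
    pos = 0
    column = 0  # stands in for getCharPositionInLine()
    while pos < len(text):
        if column == 0:
            end = match_directive(text, pos)
            if end != -1:
                found.append(text[pos:end])
                column = end - pos  # non-zero, so no retry at this spot
                pos = end
                continue
        column = 0 if text[pos] == "\n" else column + 1
        pos += 1
    return found
```

Since the check is on the column rather than on a preceding CRLF, a directive on the first line of the file needs no special case.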

You could try having multiple rules gated by semantic predicates (different lexer rules in different states) or by modes (pushMode; see http://www.antlr.org/wiki/display/ANTLR4/Lexer+Rules): use an alternative rule for the beginning of the file, then switch to the core rules when the directives end. It could be a long job, though.

First, though, I would check whether there really is a problem parsing #pragma and other preprocessor directives without changing anything. For example, if the concern is that a # could appear inside strings and comments, then ordering the rules appropriately should be enough to route it to the right case (although this could be a problem for languages where directives can appear inside comments).

Janik answered 5/5, 2013 at 10:47 Comment(0)
