Introduction
Looking at the documentation, ANTLR 2 used to have something called predicated lexing, with examples like this one (inspired by Pascal):
RANGE_OR_INT
: ( INT ".." ) => INT { $setType(INT); }
| ( INT '.' ) => REAL { $setType(REAL); }
| INT { $setType(INT); }
;
The way I see it, that's essentially a positive look-ahead assertion at the beginning of the rule: If the look-ahead matches INT ".."
then the first rule will apply (and match the INT
portion of that input), and so on.
I haven't found something like this in ANTLR 4 yet. The 2 to 3 migration guide doesn't seem to mention this, while the 3 to 4 changes document states:
The biggest difference between ANTLR 3 and 4 is that ANTLR 4 takes any grammar you give it unless the grammar had indirect left recursion. That means we don't need syntactic predicates or backtracking so ANTLR 4 does not support that syntax; you will get a warning for using it.
This is in line with the error message I get if I leave this essentially as is:
(...)=> syntactic predicates are not supported in ANTLR 4
While I can understand how a more intelligent parser implementation would resolve these ambiguities, I fail to see how this would work for lexers.
Reproducing example
To be sure, let's try this out:
grammar Demo;
prog: atom (',' atom)* ;
atom: INT { System.out.println("INT: " + $INT.getText()); }
| REAL { System.out.println("REAL: " + $REAL.getText()); }
| a=INT RANGE b=INT { System.out.println("RANGE: " +
$a.getText() + " .. " + $b.getText()); }
;
WS : (' ' | '\t' | '\n' | '\r')+ -> skip ;
INT : ('0'..'9')+ ;
REAL: INT '.' INT? | '.' INT ;
RANGE: '..' ;
Save this to Demo.g
, then compile and run:
$ wget -nc http://www.antlr.org/download/antlr-4.5.2-complete.jar
$ java -jar antlr-4.5.2-complete.jar Demo.g
$ javac -cp antlr-4.5.2-complete.jar Demo*.java
$ java -cp .:antlr-4.5.2-complete.jar org.antlr.v4.gui.TestRig \
Demo prog <<< '1,2.,3.4,5 ..6,7..8'
INT: 1
REAL: 2.
REAL: 3.4
RANGE: 5 .. 6
REAL: 7.
line 1:17 extraneous input '.8' expecting {<EOF>, ','}
So it seems I was correct: while the removal of syntactic predecates might have been appropriate for the parser, the lexer won't suddenly guess the right token type.
Core Question
So how would one convert this specific example to ANTLR 4? Is there a way to express look-ahead conditions? Or perhaps a way to have a single rule like INT '..'
emit two distinct tokens?
References and possible solutions
Looking at the ANTLR 4 Pascal grammar, I notice that it doesn't allow real numbers to end in .
without a digit after that, so learning a solution from there doesn't appear to be an option.
I've seen Semantic predicates in ANTLR4? and syntactic predicates - Upgrading from Antlr 3 to Antlr 4. Both discuss syntactic predicates in parser rules. The latter also has an example with lexer rules, but the look-ahead is identical to the rule that follows it, which means the rules could get removed without adverse effects. This is not the case in my example above.
Answers to check previous/left token in lexer mention the emit
method of the lexer, with a comment referencing the How can I emit more than a single token per lexer rule? FAQ page in the ANTLR 3 wiki, so I guess that's one approach. I will turn this into an answer if noone beats me to it and if I can get it to work in my example.
The answer to ANTLR4 negative lookahead in lexer makes use of the _input.LA(int)
method to examine the look-ahead. The ANTLR 4 lexical analysis faq mentions _input.LA
without going into details. This should work for the example above as well, but would be hard for scenarios where there is more than a single character of look-ahead to consider.