In response to ...
Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?
... I will quote from ANTLR4's Wildcard Operator and Nongreedy Subrules documentation.
Here is how the lexer chooses token rules:
- The primary goal is to match the lexer rule that recognizes the most input characters.
INT : [0-9]+ ;
DOT : '.' ; // match period
FLOAT : [0-9]+ '.' ; // match FLOAT upon '34.' not INT then DOT
- If more than one lexer rule matches the same input sequence, the priority goes to the rule occurring first in the grammar file.
DOC : '/**' .*? '*/' ; // both rules match /** foo */, resolve to DOC
CMT : '/*' .*? '*/' ;
- Nongreedy subrules match the fewest number of characters that still allows the surrounding lexical rule to match.
/** Match anything except \n inside of double angle brackets */
STRING : '<<' ~'\n'*? '>>' ; // Input '<<foo>>>>' matches STRING then END
END : '>>' ;
- After crossing through a nongreedy subrule within a lexical rule, all decision-making from then on is "first match wins."
For example, literal ab in rule right-hand side (grammar fragment) .*? ('a'|'ab') is dead code and can never be matched. If the input is ab, the first alternative, 'a', matches the first character and therefore succeeds. ('a'|'ab') by itself on the right-hand side of a rule properly matches the second alternative for input ab. This quirk arises from a nongreedy design decision that’s too complicated to go into here.
If you understand rules 1, 2, and 3, you will likely be fine. The fourth rule is esoteric.
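To make rules 1 to 3 concrete, here is a rough Python sketch of my own. It is not ANTLR code and it is not how ANTLR's lexer is implemented. It only mimics the selection policy described in the quote: try every rule at the current position, keep the longest match, and break ties in favor of the rule listed first. The names RULES and tokenize are hypothetical, and the regular expressions are my hand translations of the grammar snippets quoted above. Python's re module is a backtracking engine, so this does not reproduce the rule 4 quirk, and it would behave differently from ANTLR for rules that contain alternations.

```python
import re

# Hand-translated from the quoted grammar snippets, kept in grammar order.
# re.DOTALL makes '.' match newlines too, like ANTLR's wildcard.
RULES = [
    ("INT", re.compile(r"[0-9]+")),
    ("DOT", re.compile(r"\.")),
    ("FLOAT", re.compile(r"[0-9]+\.")),
    ("DOC", re.compile(r"/\*\*.*?\*/", re.DOTALL)),
    ("CMT", re.compile(r"/\*.*?\*/", re.DOTALL)),
    ("STRING", re.compile(r"<<[^\n]*?>>")),
    ("END", re.compile(r">>")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using longest-match selection."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept (rule 2).
            if match and match.end() - pos > best_len:
                best_name, best_len = name, match.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("34."))         # rule 1: longest match
print(tokenize("/** foo */"))  # rule 2: tie goes to the earlier rule
print(tokenize("<<foo>>>>"))   # rule 3: nongreedy subrule stops early
```

With these inputs, FLOAT wins over INT followed by DOT even though it is listed later, DOC wins over CMT because both match the same ten characters and DOC comes first, and the nongreedy subrule in STRING stops at the first >> so that END picks up the rest.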
Based only on the information quoted above, I don't see a definitive answer as to where the implicit token rule applies. As I find more information, I will update this answer.
I encourage you to also review TomServo's answer, which talks more about the implicit token rule.
(Aside: in my opinion, the content quoted above would probably be more discoverable and understandable if it were incorporated into the lexer rules docs.)