How does ANTLR decide which lexer rule to apply? The longest matching lexer rule wins?

Asked 2/8, 2017 at 1:43 Answered 3/2, 2021 at 21:9

Solved compiler-construction antlr antlr4 lexer

The input content:

The grammar:

grammar test;

p : EOF;

Char : [a-z];

fragment Tab : '\t';
fragment Space : ' ';
T1 : (Tab|Space)+ ->skip;

T2 : '#' T1+ Char+;

The matching result is this:

[@0,0:6='#   abc',<T2>,1:0]    <<<<<<<< PLACE 1
[@1,7:6='<EOF>',<EOF>,1:7]
line 1:0 extraneous input '#   abc' expecting <EOF>

Please ignore the error in the last line. I am wondering why the token matched at PLACE 1 is T2.

In the grammar file, the T2 lexer rule goes after the T1 lexer rule. So I expect T1 rule should get applied first. So why the spaces in # abc is not skipped?

Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?

Torose answered 2/8, 2017 at 1:43 Comment(0)

Three rules apply, in this order:

Longest match wins first.
Rule matching implicit token (like # in your grammar) next.
Finally, in case of a tie (by match length), the rule listed earliest among the matching rules wins.

After much wee-hours searching, I found again most of this material in one lengthy quote from Sam Harwell in which he also expounds on the impact of greedy operators. I remember seeing it the first time and sketching the notes in my copy of TDAR, but without the reference.

ANTLR 4 lexers normally operate with longest-match-wins behavior, without any regard for the order in which alternatives appear in the grammar. If two lexer rules match the same longest input sequence, only then is the relative order of those rules compared to determine how the token type is assigned.

The behavior within a rule changes as soon as the lexer reaches a non-greedy optional or closure. From that moment forward to the end of the rule, all alternatives within that rule will be treated as ordered, and the path with the lowest alternative wins. This seemingly strange behavior is actually responsible for the non-greedy handling due to the way we order alternatives in the underlying ATN representation. When the lexer is in this mode and reaches the block (ESC|.), the ordering constraint requires it use ESC if possible.

The "implicit token" rule is from here.

Homeless answered 2/8, 2017 at 2:10 Comment(3)

Did I overlook these rules in TDAR: The Definitive ANTLR 4 Reference? I searched for "longest match" and only found "Lexers ... just look for the longest match and make decisions after they’ve seen the entire token." in Chapter 11. – Abijah 2/2, 2021 at 20:34

@DavidJ. The quote I'm referring to, if memory serves (it's been a long time) was one Sam made here on SO to clear up some confusion about lexing. I didn't even know there was an online version of the book; I have a paper copy on my shelf. I don't frequent this tag as much as I used to, working on other more compact lexers and parsers now for embedded systems. – Homeless 3/2, 2021 at 2:35

Thanks. I also found documentation about this tucked away in the wildcard section of the official docs: github.com/antlr/antlr4/blob/master/doc/… – Abijah 3/2, 2021 at 20:49

In response to ...

Does ANTLR uses some greedy strategy to match current character stream with the longest lexer rule?

... I will quote from ANTLR4's Wildcard Operator and Nongreedy Subrules documentation.

Here is how the lexer chooses token rules:

The primary goal is to match the lexer rule that recognizes the most input characters.
INT : [0-9]+ ;
DOT : '.' ; // match period
FLOAT : [0-9]+ '.' ; // match FLOAT upon '34.' not INT then DOT
If more than one lexer rule matches the same input sequence, the priority goes to the rule occurring first in the grammar file.
DOC : '/**' .*? '*/' ; // both rules match /** foo */, resolve to DOC
CMT : '/*' .*? '*/' ;
Nongreedy subrules match the fewest number of characters that still allows the surrounding lexical rule to match.
/** Match anything except \n inside of double angle brackets */
STRING : '<<' ~'\n'*? '>>' ; // Input '<<foo>>>>' matches STRING then END
END    : '>>' ;
After crossing through a nongreedy subrule within a lexical rule, all decision-making from then on is "first match wins."

For example, literal ab in rule right-hand side (grammar fragment) .*? ('a'|'ab') is dead code and can never be matched. If the input is ab, the first alternative, 'a', matches the first character and therefore succeeds. ('a'|'ab') by itself on the right-hand side of a rule properly matches the second alternative for input ab. This quirk arises from a nongreedy design decision that’s too complicated to go into here.

If you understand rules 1, 2, and 3, you will likely be fine. The fourth rule is esoteric.

Based only the information quoted above, I don't see a definitive answer as to where the implicit token rule applies. As I find more information, I will update this answer.

I encourage you to also review TomServo's answer, which talks more about the implicit token rule.

(Aside: in my opinion, the content quoted above probably would be more discoverable and understandable if incorporated into the lexer rules docs.)

Abijah answered 3/2, 2021 at 21:9 Comment(0)

I think it's because ANTLR first sees the # character in the lexer phase, so it will go along the T2 rule. And luckily, T2 matches. If ANTLR sees a first, it will go along the T1 rule.

Torose answered 2/8, 2017 at 1:54 Comment(0)

Recommended topics

Hot tags