Antlr rule priorities

Firstly, I know this grammar doesn't make sense, but it was created to test ANTLR's rule priority behaviour:

grammar test;

options 
{

output=AST;
backtrack=true;
memoize=true;

}

rule_list_in_order :
    (
    first_rule
    | second_rule
    | any_left_over_tokens)+
    ;


first_rule
    :
     FIRST_TOKEN
    ;


second_rule:     
    FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE;


any_left_over_tokens
    :
    NEW_LINE
    | FIRST_TOKEN
    | SECOND_TOKEN;



FIRST_TOKEN
    : 'First token here'
    ;   

SECOND_TOKEN
    : 'Second token here';

NEW_LINE
    : ('\r'?'\n')   ;

WS  : (' '|'\t'|'\u000C')
    {$channel=HIDDEN;}
    ;

When I give this grammar the input 'First token here\nSecond token here', it matches the second_rule.

I would have expected it to match first_rule and then any_left_over_tokens, because first_rule appears before second_rule in rule_list_in_order, which is the start rule. Can anyone explain why this happens?

Cheers

Winshell answered 4/2, 2011 at 15:11 Comment(0)

First of all, ANTLR's lexer tries its rules from top to bottom, so tokens defined first take precedence over the ones below them. When rules overlap, the rule that matches the most characters takes precedence (greedy match).
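
To make that precedence rule concrete, here is a minimal sketch in plain Java. It is a toy model of the rule as stated above (longest match wins, ties go to the rule defined first), not the lexer ANTLR generates. The class name ToyLexer and the IF, ID and WS rules are hypothetical and only serve as an illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyLexer {
    // Hypothetical token rules in definition order (LinkedHashMap keeps insertion order).
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-z]+"));
        RULES.put("WS", Pattern.compile("[ \\t]+"));
    }

    // Returns the token type names, skipping WS (a stand-in for the hidden channel).
    public static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the rule defined earlier is kept.
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestType = rule.getKey();
                    bestLength = m.end() - pos;
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestType.equals("WS")) {
                types.add(bestType);
            }
            pos += bestLength;
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if iffy")); // prints [IF, ID]
    }
}
```

In this model "if" becomes IF because IF and ID tie on length and IF is defined first, while "iffy" becomes ID because ID matches more characters.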

The same principle holds within parser rules: alternatives listed first are tried first. For example, in rule foo, sub-rule a is tried before b:

foo
  :  a
  |  b
  ;

Note that in your case, the 2nd rule isn't matched: the parser tries to match it and fails because there is no trailing line break, producing the error:

line 0:-1 mismatched input '<EOF>' expecting NEW_LINE

So nothing is matched at all, which is odd. Because you've set backtrack=true, the parser should at least backtrack and match:

  1. first_rule ("First token here")
  2. any_left_over_tokens ("line-break")
  3. any_left_over_tokens ("Second token here")

or simply match first_rule in the first place, without even trying second_rule.

A quick demo that adds the syntactic predicates by hand (with backtrack disabled in the options { ... } section) would look like:

grammar T;

options {
  output=AST;
  //backtrack=true;
  memoize=true;
}

rule_list_in_order
  :  ( (first_rule)=>  first_rule  {System.out.println("first_rule=[" + $first_rule.text + "]");}
     | (second_rule)=> second_rule {System.out.println("second_rule=[" + $second_rule.text + "]");}
     | any_left_over_tokens        {System.out.println("any_left_over_tokens=[" + $any_left_over_tokens.text + "]");}
     )+ 
  ;

first_rule
  :  FIRST_TOKEN
  ;

second_rule
  :  FIRST_TOKEN NEW_LINE SECOND_TOKEN NEW_LINE
  ;

any_left_over_tokens
  :  NEW_LINE
  |  FIRST_TOKEN
  |  SECOND_TOKEN
  ;

FIRST_TOKEN  : 'First token here';   
SECOND_TOKEN : 'Second token here';
NEW_LINE     : ('\r'?'\n');
WS           : (' '|'\t'|'\u000C') {$channel=HIDDEN;};

which can be tested with the class:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "First token here\nSecond token here";
        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);
        parser.rule_list_in_order();
    }
}

which produces the expected output:

first_rule=[First token here]
any_left_over_tokens=[
]
any_left_over_tokens=[Second token here]

Note that it doesn't matter if you use:

rule_list_in_order
  :  ( (first_rule)=>  first_rule 
     | (second_rule)=> second_rule
     | any_left_over_tokens
     )+ 
  ;

or

rule_list_in_order
  :  ( (second_rule)=> second_rule // <--+--- swapped
     | (first_rule)=>  first_rule  // <-/
     | any_left_over_tokens
     )+ 
  ;

Both will produce the expected output.
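
Why the order of the two predicated alternatives makes no difference for this input can be sketched in plain Java. This is a toy model of ordered alternatives guarded by lookahead checks, not ANTLR's actual prediction algorithm. The class name OrderedChoiceDemo and its helper methods are hypothetical; the rule and token names are taken from the grammar above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderedChoiceDemo {
    // Token sequences of the two guarded rules from the grammar above.
    private static final Map<String, List<String>> RULES = new LinkedHashMap<>();
    static {
        RULES.put("first_rule", Arrays.asList("FIRST_TOKEN"));
        RULES.put("second_rule",
                Arrays.asList("FIRST_TOKEN", "NEW_LINE", "SECOND_TOKEN", "NEW_LINE"));
    }

    // Stand-in for a syntactic predicate: does the whole rule match at pos?
    static boolean matchesAt(List<String> tokens, int pos, List<String> rule) {
        return pos + rule.size() <= tokens.size()
                && tokens.subList(pos, pos + rule.size()).equals(rule);
    }

    // Tries the guarded alternatives in the given order. If none applies,
    // any_left_over_tokens consumes exactly one token.
    static List<String> parse(List<String> tokens, List<String> order) {
        List<String> trace = new ArrayList<>();
        int pos = 0;
        while (pos < tokens.size()) {
            String chosen = "any_left_over_tokens";
            int consumed = 1;
            for (String name : order) {
                List<String> rule = RULES.get(name);
                if (matchesAt(tokens, pos, rule)) {
                    chosen = name;
                    consumed = rule.size();
                    break; // the first alternative whose check succeeds wins
                }
            }
            trace.add(chosen);
            pos += consumed;
        }
        return trace;
    }

    public static void main(String[] args) {
        // Token types for "First token here\nSecond token here" (no trailing line break).
        List<String> tokens = Arrays.asList("FIRST_TOKEN", "NEW_LINE", "SECOND_TOKEN");
        System.out.println(parse(tokens, Arrays.asList("first_rule", "second_rule")));
        System.out.println(parse(tokens, Arrays.asList("second_rule", "first_rule")));
        // Both lines print [first_rule, any_left_over_tokens, any_left_over_tokens]
    }
}
```

Without the trailing line break the check for second_rule can never succeed, so first_rule is chosen in either order. In this toy model the order would start to matter once the input ends with a NEW_LINE token, because both checks then succeed at the first token.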

So, my guess is that you may have found a bug.

You could try the ANTLR mailing list if you want a definitive answer (Terence Parr is there more often than he is here).

Good luck!

PS. I tested this with ANTLR v3.2

Ravenna answered 4/2, 2011 at 18:57 Comment(2)
Thanks Bart, insightful as always. Just for reference, the input was supposed to be 'First token here\nSecond token here\n'; the absence of the second \n was a typo. I'll try the mailing list as well. - Winshell
@Richard, ah, I see (about the line break). Yes, then the 2nd rule is matched. If memory serves me well, the 2nd one is matched before the 1st one because you enabled the backtrack option, which causes the parser to match as much as possible: it matches the 1st sub-rule, then backtracks to the 2nd sub-rule and sticks with that one because it matches more. I'm not 100% sure about that, though, so if you're posting to the mailing list, you might as well ask that too! :) And you're welcome, of course! - Ravenna
