ANTLR parses greedily even though it can match high priority rule

Asked 13/2, 2015 at 9:19 Answered 14/2, 2015 at 11:8

I am using the following ANTLR grammar to define a function.

definition_function
    : DEFINE FUNCTION function_name '[' language_name ']'
      RETURN attribute_type '{' function_body '}'
    ;

function_name
    : id
    ;

language_name
    : id
    ;

function_body
    : SCRIPT
    ;

SCRIPT
    :   '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' 
        { setText(getText().substring(1, getText().length()-1)); }
    ;

But when I try to parse two functions like below,

define function concat[Scala] return string {
  var concatenatedString = ""
  for(i <- 0 until data.length) {
     concatenatedString += data(i).toString
  }
  concatenatedString
};
define function concat[JavaScript] return string {
  var str1 = data[0];
  var str2 = data[1];
  var str3 = data[2];
  var res = str1.concat(str2,str3);
  return res;
};

Then ANTLR doesn't parse this like two function definitions, but like a single function with the following body,

  var concatenatedString = ""
  for(i <- 0 until data.length) {
     concatenatedString += data(i).toString
  }
  concatenatedString
};
define function concat[JavaScript] return string {
  var str1 = data[0];
  var str2 = data[1];
  var str3 = data[2];
  var res = str1.concat(str2,str3);
  return res;

Can you explain this behavior? The body of the function can have anything in it. How can I correctly define this grammar?

Spineless answered 13/2, 2015 at 9:19 Comment(0)

Your rule matches that much because '\u0020'..'\u007e' from the rule '{' ('\u0020'..'\u007e' | ~( '{' | '}' ) )* '}' matches both { and }.

Your rule should work if you define it like this:

SCRIPT
    :   '{' ( SCRIPT | ~( '{' | '}' ) )* '}' 
    ;

However, this will fail when the script block contains, says, strings or comments that contain { or }. Here is a way to match a SCRIPT token, including comments and string literals that could contain { and '}':

SCRIPT
 : '{' SCRIPT_ATOM* '}'
 ;

fragment SCRIPT_ATOM
 : ~[{}]
 | '"' ~["]* '"'
 | '//' ~[\r\n]*
 | SCRIPT
 ;

A complete grammar that properly parses your input would then look like this:

grammar T;

parse
 : definition_function* EOF
 ;

definition_function
 : DEFINE FUNCTION function_name '[' language_name ']' RETURN attribute_type SCRIPT ';'
 ;

function_name
 : ID
 ;

language_name
 : ID
 ;

attribute_type
 : ID
 ;

DEFINE
 : 'define'
 ;

FUNCTION
 : 'function'
 ;

RETURN
 : 'return'
 ;

ID
 : [a-zA-Z_] [a-zA-Z_0-9]*
 ;

SCRIPT
 : '{' SCRIPT_ATOM* '}'
 ;

SPACES
 : [ \t\r\n]+ -> skip
 ;

fragment SCRIPT_ATOM
 : ~[{}]
 | '"' ~["]* '"'
 | '//' ~[\r\n]*
 | SCRIPT
 ;

which also parses the following input properly:

define function concat[JavaScript] return string {
  for (;;) { 
    while (true) { } 
  }
  var s = "}"
  // }
  return s 
};

Respectable answered 14/2, 2015 at 11:8 Comment(0)

Unless you absolutely need SCRIPT to be a token (recognized by a lexer rule), you can use a parser rule which recognizes nested blocks (the block rule below). The grammar included here should parse your example as two distinct function definitions.

DEFINE : 'define';
FUNCTION : 'function';
RETURN : 'return';
ID : [A-Za-z]+;
ANY : . ;
WS : [ \r\t\n]+ -> skip ;

test : definition_function* ;

definition_function
    : DEFINE FUNCTION function_name '[' language_name ']'
      RETURN attribute_type block ';'
    ;

function_name : id ;
language_name : id ;
attribute_type : 'string' ;
id : ID;

block
    : '{' ( ( ~('{'|'}') )+ | block)* '}'
    ;

Intyre answered 13/2, 2015 at 19:45 Comment(1)

Note that lexer rules can also contain recursive (lexer) rules. Note that inside a parser rule ~('{'|'}') does not match any char other that { and }, but rather, any token other than the tokens that match { and }. – Respectable 14/2, 2015 at 11:10

Recommended topics

Hot tags