Ignoring whitespace (in certain parts) in Antlr4

Asked 8/4, 2015 at 20:41 Answered 10/4, 2015 at 2:39

Solved parsing antlr4 removing-whitespace

I am not very familiar with ANTLR. I am using version 4, and I have a grammar where whitespace is not important in some parts (but it might be in others, or rather its lack).

So say we have the following grammar

grammar Foo;
program : A* ;
A  : ID '@' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;

and a test input

foo@bar(X,Y);
foo@baz  ( z,Z) ;

The first line is parsed correctly, whereas the second one is not. I don't want to pollute my rules with explicit whitespace in every place where it is not relevant, since my actual grammar is more complicated than the toy example. In case it's not clear, the part ID '@' ID should not contain whitespace. Whitespace in any other position shouldn't matter at all.

Circumpolar asked 8/4, 2015 at 20:41

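Define ID '@' ID as a single lexer token rather than as a sequence of parser tokens. The rule that uses it must be a parser rule (lowercase name in ANTLR), because a lexer rule matches raw characters and would still reject whitespace between its parts.

a  : AID '(' idList ')' ';' ;
idList : ID (',' ID)* ;

AID : [a-zA-Z]+ '@' [a-zA-Z]+ ;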

Other options

enable/disable whitespace in your token stream, e.g. here
enable/disable whitespace with lexer modes (this may be a problem because lexer modes are switched based on context, which is not easy to determine in your case)
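
The first of these options can be illustrated outside ANTLR. The sketch below is plain Python, not ANTLR-generated code: it keeps whitespace tokens in the stream (roughly what `-> channel(HIDDEN)` does in place of `-> skip`) and then checks that ID, '@' and ID are adjacent by comparing character offsets. The names `tokenize`, `visible` and `is_tight_address` are hypothetical helpers made up for this illustration.

```python
import re

# One alternative per token kind; whitespace is kept as a real token here.
TOKEN_RE = re.compile(r"(?P<ID>[a-zA-Z]+)|(?P<WS>[ \t\r\n]+)|(?P<PUNCT>[@(),;])")

def tokenize(text):
    """Return (type, text, start, end) tuples, whitespace tokens included."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((m.lastgroup, m.group(), m.start(), m.end()))
        pos = m.end()
    return tokens

def visible(tokens):
    """Drop whitespace tokens, as a parser reading the default channel would."""
    return [t for t in tokens if t[0] != "WS"]

def is_tight_address(tokens, i):
    """True if visible tokens i..i+2 are ID '@' ID with no characters between them."""
    if i + 3 > len(tokens):
        return False
    left, at, right = tokens[i:i + 3]
    return (
        left[0] == "ID" and at[1] == "@" and right[0] == "ID"
        and left[3] == at[2]    # '@' starts exactly where the first ID ends
        and at[3] == right[2]   # second ID starts exactly where '@' ends
    )
```

With the question's input, `foo@baz  ( z,Z) ;` passes the adjacency check at index 0, while `foo @bar` does not. In a real ANTLR grammar a comparable check would have to live in a semantic predicate or a listener, which is not shown here.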

Shadowgraph answered 9/4, 2015 at 4:52

Even though you are skipping WS, lexer rules are still sensitive to whitespace characters in the input. Skip simply means that the matched text produces no token for the parser to consume. Thus, the lexer Addr rule below, which has no place for whitespace, does not permit any interior whitespace characters.

Conversely, the a and idList parser rules never see whitespace tokens, so they are insensitive to whitespace characters between the generated tokens.

grammar Foo;

program : a* EOF ; // EOF will require parsing the entire input

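a  : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ;  // simpler equivalent construct

Addr : ID '@' ID ;  // lexer rule: no whitespace allowed inside
ID : [a-zA-Z]+ ;
LParen : '(' ;
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
WS : [ \t\r\n]+ -> skip ;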
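
To see why this works, the lexing step can be emulated in a few lines of plain Python. This is only a sketch, assuming that trying the patterns in order (Addr before ID) is close enough to ANTLR's longest-match behavior for these inputs. `tokenize` and `token_types` are hypothetical names, not part of any ANTLR API, and the token names follow the grammar above.

```python
import re

# Patterns are tried in order at each position; Addr comes before ID.
TOKEN_SPEC = [
    ("Addr",   r"[a-zA-Z]+@[a-zA-Z]+"),  # no whitespace allowed around '@'
    ("ID",     r"[a-zA-Z]+"),
    ("LParen", r"\("),
    ("RParen", r"\)"),
    ("Comma",  r","),
    ("Semi",   r";"),
    ("WS",     r"[ \t\r\n]+"),           # matched, then dropped ("skip")
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (type, text) pairs; raise ValueError if no pattern matches."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":          # whitespace is consumed but emits no token
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

def token_types(text):
    """Return only the token type names, which is all the parser rules look at."""
    return [kind for kind, _ in tokenize(text)]
```

Both lines of the question's input produce the same sequence of token types, so a parser rule such as a accepts either. An input like `foo @ baz` fails in this sketch because no pattern matches a lone '@'.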

Absolutely answered 10/4, 2015 at 2:39

