ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

Asked 21/4, 2015 at 16:15 Answered 27/6, 2023 at 10:16

I have started using ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:

grammar output;

test: FILEPATH NEWLINE TITLE ;

FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;

This grammar will not match something like:

c:\test.txt
x

Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time with the error message "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace TITLE in the test rule with FILEPATH, the whole thing works (although FILEPATH matches more than I want to match, so in general it isn't a valid solution for me).

I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.

Horseshit answered 21/4, 2015 at 16:15 Comment(0)

This seems to be a common misunderstanding of ANTLR:

Language Processing in ANTLR:

Language processing is done in two strictly separate phases:

Lexing, i.e. partitioning the text into tokens
Parsing, i.e. building a parse tree from the tokens

Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser, and the parser cannot influence lexing.

Lexing

Lexing in ANTLR works as follows:

all rules with uppercase first character are lexer rules
the lexer starts at the beginning of the input and tries to find the rule that best matches the current input
the best match is the one of maximum length, i.e. the lexer consumes as many characters as any single lexer rule can match from the current position
tokens are generated from matches:
- if exactly one rule produces the maximum-length match, the corresponding token is pushed to the token stream
- if several rules produce the maximum-length match, the token of the rule defined first in the grammar is pushed to the token stream
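The two rules above (longest match wins, ties go to the rule defined first) can be sketched outside ANTLR. The following is only a rough Python simulation, not ANTLR's actual implementation: the lexer rules from the question are approximated as regular expressions, and the names RULES and tokenize are made up for this illustration.

```python
import re

# Lexer rules in grammar order, approximated as regular expressions.
# They mirror the question's FILEPATH, NEWLINE and TITLE rules.
RULES = [
    ("FILEPATH", re.compile(r"[A-Za-z0-9:\\/ \-_.]+")),
    ("NEWLINE", re.compile(r"\r?\n")),
    ("TITLE", re.compile(r"[A-Za-z ]+")),
]

def tokenize(text, rules):
    """Longest match wins; ties go to the rule listed first."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# The input from the question: a path, a newline, then "x".
for token in tokenize("c:\\test.txt\nx", RULES):
    print(token)
```

In this sketch the final "x" comes out as FILEPATH, not TITLE: both rules match exactly one character, and FILEPATH is listed first. The parser never gets a say in that decision.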

Example: What is wrong with your grammar

Your grammar has two rules that are critical:

FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;

Every string that is matched by TITLE is also matched by FILEPATH, and FILEPATH is defined before TITLE. So every token that you expect to be a TITLE is emitted as a FILEPATH.

There are three hints for that:

keep your lexer rules disjoint (no token should match a superset of another).
if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient).
if you need a parser-driven lexer, you have to switch to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems).

Anse answered 21/4, 2015 at 18:27 Comment(4)

That makes a lot of sense now, thanks for your response! It would be nice to have a more helpful error message though, but I know that may be difficult or unreasonable to do. – Horseshit 21/4, 2015 at 19:51

At runtime the parser must assume that the user is aware of its behaviour. Yet I agree that a warning would be fine if two lexer rules overlap in such a way. – Anse 21/4, 2015 at 19:55

Great summary from ANTLR reference! – Capper 30/3, 2017 at 11:35

I had the same problem for another reason. The token constants in the parser and the lexer were out of sync, resulting in different numbers for 'x' in both. The tokens were correctly recognized but the parser could not match. Cleaning the project helped. – Antislavery 11/8, 2018 at 5:52

This was not directly OP's problem, but for those who have the same error message, here is something you could check.

I got the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason for me was that I had placed the new keyword after my VARNAME lexer rule, so the lexer tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
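To illustrate the ordering effect, here is a small Python sketch that mimics the lexer's "longest match, then first rule" choice. It is not ANTLR code: the WHILE keyword, the VARNAME pattern, and the helper first_longest are assumptions made up for this example.

```python
import re

def first_longest(text, rules):
    """Return the name of the rule that wins for `text` at offset 0."""
    best_name, best_len = None, 0
    for name, pattern in rules:
        m = pattern.match(text)
        # Strict '>' means an earlier rule wins a tie on length.
        if m and m.end() > best_len:
            best_name, best_len = name, m.end()
    return best_name

# Hypothetical rules: an identifier rule and one keyword.
VARNAME = ("VARNAME", re.compile(r"[A-Za-z_][A-Za-z0-9_]*"))
WHILE = ("WHILE", re.compile(r"while"))

print(first_longest("while", [VARNAME, WHILE]))   # keyword listed second
print(first_longest("while", [WHILE, VARNAME]))   # keyword listed first
print(first_longest("whiles", [WHILE, VARNAME]))  # longer identifier
```

With the keyword listed second, "while" is classified as VARNAME; with the keyword first, it is classified as WHILE. A longer identifier such as "whiles" still becomes VARNAME either way, because the longer match takes priority over rule order.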

Tintometer answered 5/6, 2019 at 21:16 Comment(0)

Any input that matches TITLE is also matched by the FILEPATH token. The lexer settles on FILEPATH for that input, so the TITLE token is never produced. That is what causes the issue.

The workaround is to put TITLE before the FILEPATH token. For example:

grammar output;

test: FILEPATH NEWLINE TITLE ;

NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;

P.S. This solution works for inputs like

c:\test.txt
x

If your input is a simple file name without an extension, or a folder name, you'll get the same issue:

test
x
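Both cases can be checked with a rough Python simulation of the reordered grammar. This is a sketch under the assumption that the lexer picks the longest match and breaks ties by rule order; the regular expressions approximate the lexer rules, and REORDERED and token_types are names invented for this example.

```python
import re

# Rule order from the reordered grammar: NEWLINE, TITLE, FILEPATH.
REORDERED = [
    ("NEWLINE", re.compile(r"\r?\n")),
    ("TITLE", re.compile(r"[A-Za-z ]+")),
    ("FILEPATH", re.compile(r"[A-Za-z0-9:\\/ \-_.]+")),
]

def token_types(text, rules):
    """Token type names, using longest match with first-rule tie-break."""
    pos, types = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        types.append(best_name)
        pos += best_len
    return types

print(token_types("c:\\test.txt\nx", REORDERED))
print(token_types("test\nx", REORDERED))
```

In the first input the path contains characters that TITLE cannot match, so FILEPATH wins on length and the sequence is FILEPATH NEWLINE TITLE. In the second input "test" is matched equally well by both rules, TITLE is listed first, and the sequence becomes TITLE NEWLINE TITLE, which the test rule rejects.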

So I'd consider adding some restrictions to FILEPATH to make it distinct from TITLE, for example the pattern [A-Za-z][:][\\/][A-Za-z0-9]+'.'[A-Za-z0-9]+ (I'm not sure, because I don't know all your cases). The final solution might look like:

grammar output;

test: FILEPATH NEWLINE TITLE ;

fragment FILENAME: TITLE EXTENSION; // EXTENSION already starts with DOT
fragment LETTER: [a-zA-Z] ;
fragment DIGIT: [0-9] ;
fragment UNDERSCORE: '_' ;
fragment SPACE: ' ' ;
fragment ESCAPE: '\\' ;
fragment SLASH: '/' ;
fragment QUOTE: '"' ;
fragment PLUS: '+';
fragment MINUS: '-';
fragment COLON: ':' ;
fragment DOT: '.';

EXTENSION: DOT (LETTER | DIGIT)+;
SEPARATOR: ESCAPE | SLASH;
DISC: LETTER COLON;
TITLE: (LETTER | DIGIT | UNDERSCORE | MINUS)+ ;
FILEPATH: DISC?(SEPARATOR TITLE)+ EXTENSION ;
NEWLINE: '\r'? '\n' ;
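As a quick sanity check of the restricted rules, here is a Python sketch that approximates them with regular expressions (fragments inlined by hand) and applies the same "longest match, then first rule" selection. The names RESTRICTED and token_types are made up for this example, and the regexes are my transcription of the grammar, so treat the result as indicative, not as a replacement for running ANTLR.

```python
import re

# Token rules in grammar order, with fragment rules inlined.
RESTRICTED = [
    ("EXTENSION", re.compile(r"\.[A-Za-z0-9]+")),
    ("SEPARATOR", re.compile(r"[\\/]")),
    ("DISC", re.compile(r"[A-Za-z]:")),
    ("TITLE", re.compile(r"[A-Za-z0-9_-]+")),
    ("FILEPATH", re.compile(
        r"(?:[A-Za-z]:)?(?:[\\/][A-Za-z0-9_-]+)+\.[A-Za-z0-9]+")),
    ("NEWLINE", re.compile(r"\r?\n")),
]

def token_types(text, rules):
    """Token type names, using longest match with first-rule tie-break."""
    pos, types = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        types.append(best_name)
        pos += best_len
    return types

print(token_types("c:\\test.txt\nx", RESTRICTED))
print(token_types("test\nx", RESTRICTED))
```

Under these assumptions the full path is the longest match for FILEPATH, and a bare name such as "test" can only be a TITLE, so the two token types no longer compete for the same strings.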

Underexpose answered 27/6, 2023 at 10:16 Comment(0)
