Treat invalid chars as a single token in ANTLR4 lexer

Asked 28/3, 2016 at 20:32 Answered 18/4, 2016 at 22:15

I'm using the JSON grammar from the antlr4 grammar repository to parse JSON files for an editor plugin. It works, but reports invalid chars one by one. The following snippet results in 18 lexer errors:

{
   sometext-without-quotes : 42
}

I want to boil it down to 1-2 by treating consecutive, invalid single-char tokens of the same type as one bigger invalid token.

For a similar question, a custom lexer was suggested that glues "unknown" elements to larger tokens: In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?

I assume that this bypasses the usual lexer error reporting, which I would like to avoid, if possible. Isn't there a proper solution for that rather simple task? It seems to have worked by default in ANTLR3.

Maidamaidan answered 28/3, 2016 at 20:32 Comment(0)

The answer is in the link you provided. I don't want to copy the original answer completely so I'll try and paraphrase a bit...

In antlr4 lexer, How to have a rule that catches all remaining "words" as Unknown token?

Add unknowns to the lexer that will match multiples of these...

unknowns : Unknown+ ; 
...
Unknown  : . ;

There was an edit made to this post to cater for the case where you were only using a lexer and not using a parser. If using a parser then you do not need to override the nextToken method because the error can be handled in the parser in a much cleaner way ie unknowns are just another token type as far as the lexer is concerned. The lexer passes these to the parser which can then handle the errors. If using a parser I'd normally recognize all tokens as individual tokens and then in the parser emit the errors ie group them or not. The reason for doing this is all error handling is done in one place ie it's not in the lexer and in the parser. It also makes the lexer simpler to write and test ie it must recognize all text and never fail on any utf8 input. Some people would likely do it differently but this has worked for me with hand written lexers in C. The parser is in charge of determining what's actually valid and how to error on it. One other benefit is that the lexer is fairly generic and can be reused.

For lexer only solution...

Check the answer at the link and look for this comment in the answer...

... but I only have a lexer, no parsers ...

The answer states that you override the nextToken method and goes into some detail on how to do that

@Override
public Token nextToken() {

and the important part in the code is this...

Token next = super.nextToken();

if(next.getType() != Unknown) {
  return next;
}

The code that comes after this handles the case where you can match the bad tokens.

Pomology answered 16/4, 2016 at 23:57 Comment(1)

I was hoping for a simple solution (the problem is simple), but it looks like there isn't. Thank you for the confirmation. – Maidamaidan 19/4, 2016 at 15:23

What you could do is use lexer modes. For this you'd had to split grammar to parser and lexer grammar. Let's start with lexer grammar:

JSONLexer.g4

/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */

// Derived from http://json.org
lexer grammar JSONLexer;

STRING
   : '"' (ESC | ~ ["\\])* '"'
   ;


fragment ESC
   : '\\' (["\\/bfnrt] | UNICODE)
   ;


fragment UNICODE
   : 'u' HEX HEX HEX HEX
   ;


fragment HEX
   : [0-9a-fA-F]
   ;


NUMBER
   : '-'? INT '.' [0-9] + EXP? | '-'? INT EXP | '-'? INT
   ;


fragment INT
   : '0' | [1-9] [0-9]*
   ;

// no leading zeros

fragment EXP
   : [Ee] [+\-]? INT
   ;

// \- since - means "range" inside [...]

TRUE : 'true';
FALSE : 'false';
NULL : 'null';

LCURL : '{';
RCURL : '}';
COL : ':';
COMA : ',';
LBRACK : '[';
RBRACK : ']';

WS
   : [ \t\n\r] + -> skip
   ;

NON_VALID_STRING : . ->pushMode(MODE_ERR);

mode MODE_ERR;
WS1
   : [ \t\n\r] + -> skip
   ;
COL1 : ':' ->popMode;
MY_ERR_TOKEN : ~[':']* ->type(NON_VALID_STRING);

Basically I have added some tokens used in the parser part (like LCURL, COL, COMA etc) and introduced NON_VALID_STRING token, which is basically the first character that's nothing that already is (should be) matched. Once this token is detected, I switch the lexer to MODE_ERR mode. In this mode I go back to default mode once : is detected (this can be changed and maybe refined, but server the purpose here :) ) or I say that everything else is MY_ERR_TOKEN to which I assign NON_VALID_STRING token type. Here is what ATNLRWorks says to this when I run interpret lexer option with your input:

So s is NON_VALID_STRING type and so is everything else until :. So, same type but two different tokens. If you want them not to be of the same type, simply omit the type call in the lexer grammar.
Here is the parser grammar now
JSONParser.g4

/** Taken from "The Definitive ANTLR 4 Reference" by Terence Parr */

// Derived from http://json.org
parser grammar JSONParser;

options {
  tokenVocab=JSONLexer;
}

json
   : object
   | array
   ;

object
   : LCURL pair (COMA pair)* RCURL
   | LCURL RCURL
   ;

pair
   : STRING COL value
   ;

array
   : LBRACK value (COMA value)* RBRACK
   | LBRACK RBRACK
   ;

value
   : STRING
   | NUMBER
   | object
   | array
   | TRUE
   | FALSE
   | NULL
   ;

and if you run the test rig (I do it with ANTLRworks) you'll get a single error (see screenshot)

Also you could accumulate lexer errors by overriding the generated lexer class, but I understood in the question that this is not desired or I didn't understand that part :)

Cherisecherish answered 18/4, 2016 at 22:15 Comment(1)

Yes, you have it right. Thank you for the detailed explanation. I was hoping for a simple solution (the problem is simple), but it looks like there isn't. +1 – Maidamaidan 19/4, 2016 at 15:14

Recommended topics

Hot tags