ANTLR4 Lexer error reporting (length of offending characters)

I'm developing a small IDE for a language using ANTLR4 and need to underline erroneous characters when the lexer fails to match them. The built-in org.antlr.v4.runtime.ANTLRErrorListener implementation (ConsoleErrorListener) outputs a message to stderr in such cases, similar to this:

line 35:25 token recognition error at: 'foo\n'

I understand how the line and column of the error are obtained (they are passed as arguments to the syntaxError callback), but how do I get the 'foo\n' string inside the callback?

When a parser is the source of the error, it passes the offending token as the second argument of the syntaxError callback, so extracting the start and stop offsets of the erroneous input is trivial; this is also explained in the reference book. But what about the case when the source is a lexer? The second argument of the callback is null in this case, presumably because the lexer failed to form a token.

I need the length of the unmatched characters to know how much to underline, but while debugging my listener implementation I could not find this information anywhere in the supplied callback arguments (other than extracting it from the supplied error message through string manipulation, which would be just wrong). The 'foo\n' string is clearly obtained somehow, so what am I missing?

I suspect that I might be looking in the wrong place and that I should be looking at extending DefaultErrorStrategy where error messages get formed.

Excursion answered 13/9, 2013 at 9:5 Comment(0)

You should write your lexer such that a syntax error is impossible. In ANTLR 4, it is easy to do this by simply adding the following as the last rule of your lexer:

ErrorChar : . ;

By doing this, your errors are moved from the lexer to the parser.
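
Because ErrorChar matches exactly one character, a run of unmatched input reaches the parser as several adjacent one-character tokens. To underline the whole run, those tokens can be merged by their character offsets. The following is a minimal sketch in plain Java, not ANTLR API: the Span class and the mergeAdjacent method are hypothetical names, and the start/stop values are assumed to come from Token.getStartIndex() and Token.getStopIndex() of the ErrorChar tokens, in stream order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper type: an inclusive character range [start, stop] to underline.
final class Span {
    final int start;
    final int stop;

    Span(int start, int stop) {
        this.start = start;
        this.stop = stop;
    }

    // Number of characters covered by this range.
    int length() {
        return stop - start + 1;
    }
}

final class ErrorRuns {
    // Merges spans that touch each other (next.start == previous.stop + 1).
    // Assumes the input is sorted by start offset, as tokens in a stream are.
    static List<Span> mergeAdjacent(List<Span> errorChars) {
        List<Span> merged = new ArrayList<>();
        for (Span s : errorChars) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).stop + 1 == s.start) {
                // Extend the previous run instead of starting a new one.
                merged.set(last, new Span(merged.get(last).start, s.stop));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }
}
```

With this sketch, three ErrorChar tokens at offsets 25, 26 and 27 collapse into a single span of length 3, while an isolated ErrorChar elsewhere stays a span of length 1.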

In some cases, you can take additional steps to help users while they edit code in your IDE. For example, suppose your language supports double-quoted strings of the following form, which cannot span multiple lines:

StringLiteral : '"' ~[\r\n"]* '"';

You can improve error reporting in your IDE by using the following pair of rules:

StringLiteral : '"' ~[\r\n"]* '"';
UnterminatedStringLiteral : '"' ~[\r\n"]*;

You can then override the emit() method to treat the UnterminatedStringLiteral in a special way. As a result, the user sees a great error message and the parser sees a single StringLiteral token that it can generally handle well.

@Override
public Token emit() {
    switch (getType()) {
    case UnterminatedStringLiteral:
        setType(StringLiteral);
        Token result = super.emit();
        // you'll need to define this method
        reportError(result, "Unterminated string literal");
        return result;
    default:
        return super.emit();
    }
}
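
The reportError method above is left to the reader. One possible shape for what it records is sketched below in plain Java, under stated assumptions: Diagnostic and DiagnosticCollector are hypothetical names, and this version takes the inclusive start and stop offsets directly (as Token.getStartIndex() and Token.getStopIndex() would supply them) rather than a Token, so that it has no ANTLR dependency. It is only an illustration of turning a token's offsets into an underline range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of one problem to underline in the editor.
final class Diagnostic {
    final int offset;     // index of the first offending character
    final int length;     // number of characters to underline
    final String message;

    Diagnostic(int offset, int length, String message) {
        this.offset = offset;
        this.length = length;
        this.message = message;
    }
}

// Hypothetical collector that an IDE could query after lexing.
final class DiagnosticCollector {
    private final List<Diagnostic> diagnostics = new ArrayList<>();

    // startIndex and stopIndex are inclusive character offsets.
    void reportError(int startIndex, int stopIndex, String message) {
        int length = stopIndex - startIndex + 1;
        diagnostics.add(new Diagnostic(startIndex, length, message));
    }

    List<Diagnostic> getDiagnostics() {
        return diagnostics;
    }
}
```

Inside a lexer subclass, a reportError(Token, String) method could forward the token's start and stop indexes to a collector like this one.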
Abijah answered 14/9, 2013 at 2:43 Comment(6)
I actually know your ErrorChar trick from other parser generators, but for some reason I was under the impression that ANTLR4 lexers do this implicitly. Oh, well. Great answer, thanks. - Excursion
I may have prematurely labeled this answer as correct for my specific question. While the error message after adding the ErrorChar rule to my lexer grammar is definitely an improvement over the previous one, I still cannot underline the entire offending string. Only its first character is made into a token, leaving me with the same problem as before. I tried changing the definition to ErrorChar : .+? ; but that didn't work. - Excursion
You need to follow the "take additional steps" section of the answer, which is highly specific to a particular lexer. - Abijah
That would require me to predict the syntax of illegal input from users, would it not? Anyway, doing this in addition to what you suggested probably answers my question of how to obtain the length of offending strings, not just where they start. I just need to confirm it. - Excursion
I was able to solve my problem using a solution based on this answer (and the one I linked). Thank you. - Excursion
This answer saved me hours of time. - Rosalie
