Is there a way to easily adapt the error messages of ANTLR4?
Currently I'm working on my own grammar and I would like to have specific error messages for NoViableAlternative, InputMismatch, UnwantedToken, MissingToken and LexerNoViableAltException.

I have already extended the Lexer class and overridden notifyListeners to replace the default error message "token recognition error at:" with my own. I have also extended DefaultErrorStrategy and overridden all report methods: reportNoViableAlternative, reportInputMismatch, reportUnwantedToken and reportMissingToken.

The purpose of all that is to change the messages that are passed to the syntaxError() method of the ANTLRErrorListener.

Here's a small example from the extended Lexer class:

    @Override
    public void notifyListeners(LexerNoViableAltException lexerNoViableAltException) {
        String text = this._input.getText(Interval.of(this._tokenStartCharIndex, this._input.index()));
        String msg = "Operator " + this.getErrorDisplay(text) + " is unknown.";
        ANTLRErrorListener listener = this.getErrorListenerDispatch();
        listener.syntaxError(this, null, this._tokenStartLine, this._tokenStartCharPositionInLine, msg,
            lexerNoViableAltException);
    }

Or for the DefaultErrorStrategy:

    @Override
    protected void reportNoViableAlternative(Parser recognizer, NoViableAltException noViableAltException) {
        TokenStream tokens = recognizer.getInputStream();
        String input;
        if (tokens != null) {
            if (noViableAltException.getStartToken().getType() == Token.EOF) {
                input = "<EOF>";
            } else {
                input = tokens.getText(noViableAltException.getStartToken(), noViableAltException.getOffendingToken());
            }
        } else {
            input = "<unknown input>";
        }

        String msg = "Invalid operation " + input + ".";
        recognizer.notifyErrorListeners(noViableAltException.getOffendingToken(), msg, noViableAltException);
    }

I read the thread "Handling errors in ANTLR4" and was wondering whether there is an easier way to customise these messages.

Racemose answered 19/9, 2019 at 12:31 Comment(0)

My strategy for improving the ANTLR4 error messages is a bit different. I override syntaxError in my error listeners (I have one each for the lexer and the parser). By using the given values and a few other things, like the LL1Analyzer, you can create pretty precise error messages. The lexer error listener's handling is pretty straightforward (hopefully the C++ code is understandable for you):

    void LexerErrorListener::syntaxError(Recognizer *recognizer, Token *, size_t line,
                                         size_t charPositionInLine, const std::string &, std::exception_ptr ep) {
      // The passed in string is the ANTLR generated error message which we want to improve here.
      // The token reference is always null in a lexer error.
      std::string message;
      try {
        std::rethrow_exception(ep);
      } catch (LexerNoViableAltException &) {
        Lexer *lexer = dynamic_cast<Lexer *>(recognizer);
        CharStream *input = lexer->getInputStream();
        std::string text = lexer->getErrorDisplay(input->getText(misc::Interval(lexer->tokenStartCharIndex, input->index())));
        if (text.empty())
          text = " "; // Should never happen.

        switch (text[0]) {
          case '/':
            message = "Unfinished multiline comment";
            break;
          case '"':
            message = "Unfinished double quoted string literal";
            break;
          case '\'':
            message = "Unfinished single quoted string literal";
            break;
          case '`':
            message = "Unfinished back tick quoted string literal";
            break;

          default:
            // Hex or bin string?
            if (text.size() > 1 && text[1] == '\'' && (text[0] == 'x' || text[0] == 'b')) {
              message = std::string("Unfinished ") + (text[0] == 'x' ? "hex" : "binary") + " string literal";
              break;
            }

            // Something else the lexer couldn't make sense of (likely there is no rule that accepts this input).
            message = "\"" + text + "\" is no valid input at all";
            break;
        }
        owner->addError(message, 0, lexer->tokenStartCharIndex, line, charPositionInLine,
                        input->index() - lexer->tokenStartCharIndex);
      }
    }

This code shows that we don't use the original message at all and instead examine the token text to see what's wrong. Here we mostly deal with unclosed strings.
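For readers working with the Java target, as in the question, the classification step of that listener could be sketched roughly as follows. `LexerErrorText` and `describe` are hypothetical names, not part of the ANTLR runtime. The sketch assumes the unmatched text has already been read from the lexer's input, for example the way the question's notifyListeners override does it, and it leaves out the `owner->addError` bookkeeping.

```java
// Hypothetical helper (not an ANTLR class): turns the text the lexer
// could not match into a more specific error message.
final class LexerErrorText {
    private LexerErrorText() {}

    static String describe(String text) {
        if (text == null || text.isEmpty()) {
            text = " "; // Should never happen.
        }

        switch (text.charAt(0)) {
            case '/':
                return "Unfinished multiline comment";
            case '"':
                return "Unfinished double quoted string literal";
            case '\'':
                return "Unfinished single quoted string literal";
            case '`':
                return "Unfinished back tick quoted string literal";
            default:
                // Hex or binary string literal that was opened but never closed?
                if (text.length() > 1 && text.charAt(1) == '\''
                        && (text.charAt(0) == 'x' || text.charAt(0) == 'b')) {
                    return "Unfinished " + (text.charAt(0) == 'x' ? "hex" : "binary")
                            + " string literal";
                }
                // Anything else: most likely no lexer rule accepts this input.
                return "\"" + text + "\" is no valid input at all";
        }
    }
}
```

Keeping this logic free of runtime types makes it easy to unit test; the listener itself then only has to extract the text and forward the resulting message.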

The parser error listener is much more complicated and too large to post here. It's a combination of different sources to construct the actual error message:

  • Parser.getExpectedTokens(): uses the LL1Analyzer to get the next possible lexer tokens from a given position in the ATN (the so-called follow-set). Note that it looks through predicates, which might be a problem if you use them.

  • Identifiers & keywords: certain keywords are often allowed as normal identifiers in specific situations. This creates follow-sets containing a list of keywords that are actually meant as identifiers, so an extra check is needed to avoid showing them as expected values.

  • Parser rule invocation stack: during the call to the error listener the parser has the current parser rule context (Parser.getRuleContext()), which you can use to walk up the invocation stack and find rule contexts that give you more specific information about the error location (for example, walking up from a * match to a hypothetical expr rule tells you that an expression is actually expected at this point).

  • The given exception: if this is null, the error is about a missing or unwanted single token, which is pretty easy to handle. If the exception has a value, you can examine it for further details. Worth mentioning here is that the content of the exception is not used (it is pretty sparse anyway); instead we use the values that were collected previously. The most common exception types are NoViableAlt and InputMismatch, both of which you can translate to either "input is incomplete" when the error position is EOF or something like "input is not valid at this position". Both can then be enhanced with an expectation constructed from the rule invocation stack and/or the follow-set as mentioned above.
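How these sources might be combined can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual listener described above: `ParserErrorText`, its parameters and the `IDENTIFIER` token name are hypothetical, and the caller is assumed to have already converted the follow-set from Parser.getExpectedTokens() into token names and the contexts reachable from Parser.getRuleContext() into rule names, innermost rule first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not an ANTLR class): builds a parser error message
// from values collected in the error listener.
final class ParserErrorText {
    private ParserErrorText() {}

    static String describe(boolean atEof,
                           List<String> ruleStack,                // innermost rule first
                           Map<String, String> ruleDescriptions,  // e.g. "expr" -> "an expression"
                           List<String> followSet,                // expected token names
                           Set<String> identifierKeywords) {      // keywords usable as identifiers
        String base = atEof ? "Input is incomplete" : "Input is not valid at this position";

        // 1. Prefer a description taken from the rule invocation stack.
        for (String rule : ruleStack) {
            String description = ruleDescriptions.get(rule);
            if (description != null) {
                return base + ", expected " + description;
            }
        }

        // 2. Otherwise fall back to the follow-set. If an identifier is allowed,
        //    hide keywords that are only listed because they can act as identifiers.
        boolean identifierAllowed = followSet.contains("IDENTIFIER");
        List<String> expected = new ArrayList<>();
        for (String token : followSet) {
            if (identifierAllowed && identifierKeywords.contains(token)) {
                continue;
            }
            expected.add(token);
        }

        return expected.isEmpty() ? base : base + ", expected " + String.join(", ", expected);
    }
}
```

The order of the two steps is a design choice in this sketch: a rule-level description ("an expression") is usually more readable than a long token list, so the follow-set is only used when no enclosing rule has a description.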

Regulable answered 20/9, 2019 at 7:34 Comment(0)

After some research I came up with another solution. Chapter 9.4 of the book "The Definitive ANTLR 4 Reference" explains how to use error alternatives:

    fcall
        : ID '(' expr ')'
        | ID '(' expr ')' ')' {notifyErrorListeners("Too many parentheses");}
        | ID '(' expr         {notifyErrorListeners("Missing closing ')'");}
        ;

These error alternatives can make an ANTLR-generated parser work a little harder to choose between alternatives, but they don't in any way confuse the parser.

I adapted this to my grammar and extended BaseErrorListener. The exception that notifyErrorListeners(String) passes on to the listeners is null (from the Parser class):

    public final void notifyErrorListeners(String msg) {
        this.notifyErrorListeners(this.getCurrentToken(), msg, (RecognitionException)null);
    }

So I handled it in my extension of BaseErrorListener like this:

    if (recognitionException instanceof LexerNoViableAltException) {
        message = handleLexerNoViableAltException((Lexer) recognizer);
    } else if (recognitionException instanceof InputMismatchException) {
        message = handleInputMismatchException((CommonToken) offendingSymbol);
    } else if (recognitionException instanceof NoViableAltException) {
        message = handleNoViableAltException((CommonToken) offendingSymbol);
    } else if (Objects.isNull(recognitionException)) {
        // Handle errors specified in my grammar
        message = msg;
    } else {
        message = "Can't be resolved";
    }

I hope that helps a little bit

Racemose answered 26/9, 2019 at 10:46 Comment(0)
