Antlr4 discards remaining tokens instead of bailing out

I am using Antlr4, and here is a simplified grammar I wrote:

grammar BooleanExpression;

/*******************************
 *      Parser Rules
 *******************************/
booleanTerm
    : booleanLiteral (KW_OR booleanLiteral)+
    | booleanLiteral
    ;

id
    : IDENTIFIER
    ;

booleanLiteral
    : KW_TRUE
    | KW_FALSE
    ;

/*******************************
 *         Lexer Rules
 *******************************/
KW_TRUE
    : 'true'
    ;

KW_FALSE
    : 'false'
    ;

KW_OR
    : 'or'
    ;   

IDENTIFIER
    : (SIMPLE_LATIN)+
    ;

fragment 
SIMPLE_LATIN
    : 'A' .. 'Z'
    | 'a' .. 'z'
    ;

WHITESPACE
    : [ \t\n\r]+ -> skip
    ;

I used a BailErrorStrategy and a BailLexer, shown below:

public class BailErrorStrategy extends DefaultErrorStrategy {
    /**
     * Instead of recovering from exception e, rethrow it wrapped in a generic
     * IllegalArgumentException so it is not caught by the rule function catches.
     * Exception e is the "cause" of the IllegalArgumentException.
     */

    @Override
    public void recover(Parser recognizer, RecognitionException e) {
        throw new IllegalArgumentException(e);
    }

    /**
     * Make sure we don't attempt to recover inline; if the parser successfully
     * recovers, it won't throw an exception.
     */
    @Override
    public Token recoverInline(Parser recognizer) throws RecognitionException {
        throw new IllegalArgumentException(new InputMismatchException(recognizer));
    }

    /** Make sure we don't attempt to recover from problems in subrules. */
    @Override
    public void sync(Parser recognizer) {
    }

    @Override
    protected Token getMissingSymbol(Parser recognizer) {
        throw new IllegalArgumentException(new InputMismatchException(recognizer));
    }
}



public class BailLexer extends BooleanExpressionLexer {
    public BailLexer(CharStream input) {
        super(input);
        //removeErrorListeners();
        //addErrorListener(new ConsoleErrorListener());
    }

    @Override
    public void recover(LexerNoViableAltException e) {
        throw new IllegalArgumentException(e); // Bail out
    }

    @Override
    public void recover(RecognitionException re) {
        throw new IllegalArgumentException(re); // Bail out
    }
}

Everything works except for one case. I tried the following expression:

true OR false

I expected this expression to be rejected with an IllegalArgumentException, because the 'or' keyword must be lower case, not upper case. Instead, Antlr4 accepted it. The input is tokenized as "KW_TRUE IDENTIFIER KW_FALSE", which is expected, since upper-case 'OR' matches IDENTIFIER. However, the parser did not report an error while processing this token stream: it produced a tree containing only "true" and discarded the remaining "IDENTIFIER KW_FALSE" tokens. I tried different prediction modes, but all of them behaved the same way. After some debugging, I traced the behavior to this piece of code in Antlr:

ATNConfigSet reach = computeReachSet(previous, t, false);

if ( reach==null ) {
    // if any configs in previous dipped into outer context, that
    // means that input up to t actually finished entry rule
    // at least for SLL decision. Full LL doesn't dip into outer
    // so don't need special case.
    // We will get an error no matter what so delay until after
    // decision; better error message. Also, no reachable target
    // ATN states in SLL implies LL will also get nowhere.
    // If conflict in states that dip out, choose min since we
    // will get error no matter what.
    int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);
    if ( alt!=ATN.INVALID_ALT_NUMBER ) {
        // return w/o altering DFA
        return alt;
    }
    throw noViableAlt(input, outerContext, previous, startIndex);
}  

The call "int alt = getAltThatFinishedDecisionEntryRule(previousD.configs);" returned the second alternative of booleanTerm (because "true" matches the second alternative, "booleanLiteral"). Since that value is not equal to ATN.INVALID_ALT_NUMBER, noViableAlt is not thrown. According to the comment there, "We will get an error no matter what so delay until after decision", but no error was ever thrown.

I really have no idea how to make Antlr report an error in this case. Could someone shed some light on this? Any help is appreciated, thanks.

Fading answered 28/2, 2013 at 3:49 Comment(2)
Maybe not all tokens are consumed? What happens if you force the parser to parse all the way to the end of input: parse : booleanTerm EOF; - Ching
Why are you not using BailErrorStrategy? - Phrenic

If your top-level rule does not end with an explicit EOF, then ANTLR is not required to parse to the end of the input sequence. Rather than throw an exception, it simply parsed the valid portion of the sequence you gave it.

The following start rule would force it to parse the entire input sequence as a single booleanTerm.

start : booleanTerm EOF;

Also, BailErrorStrategy is provided by the ANTLR 4 runtime. It throws a ParseCancellationException, which is more informative than the exception shown in your example.
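
The effect of the trailing EOF can be illustrated with a small hand-written recognizer. This is only a sketch: it is not what ANTLR generates, the class and method names (EofDemo, tokenize, booleanTerm, start) are made up for illustration, and the tokenizer is a simplification that splits on whitespace instead of following ANTLR's lexer rules exactly.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy hand-written recognizer for the question's grammar; NOT ANTLR-generated code. */
class EofDemo {
    /** Splits on whitespace and classifies each word roughly as the question's lexer rules would. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) continue;          // empty input yields one empty word
            switch (word) {
                case "true":  tokens.add("KW_TRUE");  break;
                case "false": tokens.add("KW_FALSE"); break;
                case "or":    tokens.add("KW_OR");    break;   // keywords are case-sensitive
                default:
                    if (!word.matches("[A-Za-z]+")) {
                        throw new IllegalArgumentException("cannot tokenize: " + word);
                    }
                    tokens.add("IDENTIFIER");      // e.g. upper-case "OR" ends up here
            }
        }
        tokens.add("EOF");
        return tokens;
    }

    /** booleanLiteral : KW_TRUE | KW_FALSE ; returns the index just past the literal. */
    static int booleanLiteral(List<String> t, int pos) {
        String tok = t.get(pos);
        if (!tok.equals("KW_TRUE") && !tok.equals("KW_FALSE")) {
            throw new IllegalArgumentException(
                "expected boolean literal at token " + pos + " but found " + tok);
        }
        return pos + 1;
    }

    /**
     * booleanTerm, written as booleanLiteral (KW_OR booleanLiteral)*, which accepts the
     * same input as the two alternatives in the question. It stops, without error,
     * at the first token that cannot continue the rule.
     */
    static int booleanTerm(List<String> t, int pos) {
        pos = booleanLiteral(t, pos);
        while (t.get(pos).equals("KW_OR")) {
            pos = booleanLiteral(t, pos + 1);
        }
        return pos;
    }

    /** start : booleanTerm EOF ; the EOF check is what rejects trailing tokens. */
    static int start(List<String> t) {
        int pos = booleanTerm(t, 0);
        if (!t.get(pos).equals("EOF")) {
            throw new IllegalArgumentException(
                "expected EOF at token " + pos + " but found " + t.get(pos));
        }
        return pos;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("true OR false");
        System.out.println(tokens);
        System.out.println("booleanTerm consumed " + booleanTerm(tokens, 0) + " token(s)");
        try {
            start(tokens);
        } catch (IllegalArgumentException e) {
            System.out.println("start rejected input: " + e.getMessage());
        }
    }
}
```

Running main prints the token list [KW_TRUE, IDENTIFIER, KW_FALSE, EOF], then reports that booleanTerm consumed 1 token, and finally that start rejected the input because it found IDENTIFIER where EOF was expected. The rule without the EOF check is satisfied by the valid prefix, which mirrors the behavior described in the question.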

Phrenic answered 28/2, 2013 at 14:24 Comment(6)
Thanks a million. This is indeed the solution to the problem I ran into. I did some more searching and found this wiki page for Antlr 3, antlr.org/wiki/pages/viewpage.action?pageId=4554943, which describes exactly the same issue. - Fading
I think the problem is that there is no official documentation describing this. I read both the Antlr 4 online documentation (there is not much of it) and the Definitive ANTLR 4 Reference book, but I cannot recall reading anything that mentions using the 'EOF' token like this. There is no example in the Definitive ANTLR 4 Reference that has an EOF at the end of the start rule, and it is not mentioned in the "Error Reporting and Recovery" section either :( - Fading
I didn't realize there was a built-in BailErrorStrategy I could use, thanks for pointing this out. I will try that. - Fading
@280Z28 Hey, I've run into the same issue. My problem is that I sometimes need to parse a child rule (not the start rule), where the input contains only the content of that child rule. The parser also discards the remaining tokens. How can I solve this, given that it is not practical to add an EOF to every child rule? - Hellespont
I think I might be experiencing the same problem as @Stoneboy. #29834989 - Sumikosumma
@Sumikosumma My final solution was to add an additional rule ending in EOF for every child rule. It is not clean, but it works, since I couldn't find a better solution. If you find a better one, please share it in a comment here. Thanks in advance. - Hellespont
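
For the child-rule case raised in the comments, one alternative to adding an EOF-terminated wrapper per rule is to run the rule and then check, in the calling code, that nothing is left over. The sketch below shows the idea on a toy token list; WholeInputCheck and parseWhole are hypothetical names, not ANTLR API. With the real runtime, the analogous check would presumably compare the type of the token returned by the parser's getCurrentToken() against Token.EOF after the rule method returns, but treat that as an assumption to verify, since a rule parsed without a trailing EOF may also make different prediction decisions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

/** Toy illustration (not ANTLR API): require that a sub-rule consumed every token. */
class WholeInputCheck {
    static final String EOF = "EOF";

    /** booleanLiteral : KW_TRUE | KW_FALSE ; returns the index just past the literal. */
    static int booleanLiteral(List<String> tokens, int pos) {
        String tok = tokens.get(pos);
        if (!tok.equals("KW_TRUE") && !tok.equals("KW_FALSE")) {
            throw new IllegalArgumentException(
                "expected boolean literal at token " + pos + " but found " + tok);
        }
        return pos + 1;
    }

    /**
     * Runs any rule from position 0 and fails unless the rule stopped exactly at EOF.
     * The rule takes (tokens, startIndex) and returns the index after its last token.
     */
    static void parseWhole(List<String> tokens,
                           BiFunction<List<String>, Integer, Integer> rule) {
        int end = rule.apply(tokens, 0);
        if (!EOF.equals(tokens.get(end))) {
            throw new IllegalArgumentException(
                "unconsumed input starting at token " + end + ": " + tokens.get(end));
        }
    }

    public static void main(String[] args) {
        // The child rule alone, with input that contains only that rule's content.
        parseWhole(Arrays.asList("KW_FALSE", EOF), WholeInputCheck::booleanLiteral);
        System.out.println("accepted [KW_FALSE, EOF]");

        // The same child rule with trailing tokens is rejected by the leftover check.
        try {
            parseWhole(Arrays.asList("KW_TRUE", "IDENTIFIER", "KW_FALSE", EOF),
                       WholeInputCheck::booleanLiteral);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The design choice here is to keep the end-of-input check in one helper instead of duplicating it in the grammar, at the cost of the check living outside the rule itself.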
