Syntactic predicates in ANTLR lexer rules
Asked Answered
O

2

8

Introduction

Looking at the documentation, ANTLR 2 used to have something called predicated lexing, with examples like this one (inspired by Pascal):

RANGE_OR_INT
    :   ( INT ".." ) => INT  { $setType(INT); }
    |   ( INT '.' )  => REAL { $setType(REAL); }
    |   INT                  { $setType(INT); }
    ;    

The way I see it, that's essentially a positive look-ahead assertion at the beginning of the rule: If the look-ahead matches INT ".." then the first rule will apply (and match the INT portion of that input), and so on.

I haven't found something like this in ANTLR 4 yet. The 2 to 3 migration guide doesn't seem to mention this, while the 3 to 4 changes document states:

The biggest difference between ANTLR 3 and 4 is that ANTLR 4 takes any grammar you give it unless the grammar had indirect left recursion. That means we don't need syntactic predicates or backtracking so ANTLR 4 does not support that syntax; you will get a warning for using it.

This is in line with the error message I get if I leave this essentially as is:

(...)=> syntactic predicates are not supported in ANTLR 4

While I can understand how a more intelligent parser implementation would resolve these ambiguities, I fail to see how this would work for lexers.

Reproducing example

To be sure, let's try this out:

grammar Demo;
prog:   atom (',' atom)* ;
atom:   INT  { System.out.println("INT:   " + $INT.getText()); }
    |   REAL { System.out.println("REAL:  " + $REAL.getText()); }
    |   a=INT RANGE b=INT { System.out.println("RANGE: " +
                              $a.getText() + " .. " + $b.getText()); }
    ;
WS  :   (' ' | '\t' | '\n' | '\r')+ -> skip ;
INT :   ('0'..'9')+ ;
REAL:   INT '.' INT? | '.' INT ;
RANGE:  '..' ;

Save this to Demo.g, then compile and run:

$ wget -nc http://www.antlr.org/download/antlr-4.5.2-complete.jar
$ java -jar antlr-4.5.2-complete.jar Demo.g
$ javac -cp antlr-4.5.2-complete.jar Demo*.java
$ java -cp .:antlr-4.5.2-complete.jar org.antlr.v4.gui.TestRig \
  Demo prog <<< '1,2.,3.4,5 ..6,7..8'
INT:   1
REAL:  2.
REAL:  3.4
RANGE: 5 .. 6
REAL:  7.
line 1:17 extraneous input '.8' expecting {<EOF>, ','}

So it seems I was correct: while the removal of syntactic predecates might have been appropriate for the parser, the lexer won't suddenly guess the right token type.

Core Question

So how would one convert this specific example to ANTLR 4? Is there a way to express look-ahead conditions? Or perhaps a way to have a single rule like INT '..' emit two distinct tokens?

References and possible solutions

Looking at the ANTLR 4 Pascal grammar, I notice that it doesn't allow real numbers to end in . without a digit after that, so learning a solution from there doesn't appear to be an option.

I've seen Semantic predicates in ANTLR4? and syntactic predicates - Upgrading from Antlr 3 to Antlr 4. Both discuss syntactic predicates in parser rules. The latter also has an example with lexer rules, but the look-ahead is identical to the rule that follows it, which means the rules could get removed without adverse effects. This is not the case in my example above.

Answers to check previous/left token in lexer mention the emit method of the lexer, with a comment referencing the How can I emit more than a single token per lexer rule? FAQ page in the ANTLR 3 wiki, so I guess that's one approach. I will turn this into an answer if noone beats me to it and if I can get it to work in my example.

The answer to ANTLR4 negative lookahead in lexer makes use of the _input.LA(int) method to examine the look-ahead. The ANTLR 4 lexical analysis faq mentions _input.LA without going into details. This should work for the example above as well, but would be hard for scenarios where there is more than a single character of look-ahead to consider.

Otranto answered 1/3, 2016 at 13:20 Comment(0)
O
4

Here is a very short solution:

@lexer::members { private int _pos; }
INT_RANGE: INT  { _pos=_input.index(); setType(INT); emit(); }
           '..' { _input.seek(_pos); };

This matches the whole INT '..' expression, but then rewinds the input to just after the INT where we emit the token and save the position. That position is then used at the end of the rule to rewind the input in a more permanent manner.

There is, however, a problem: the resulting tokens will have incorrect position information since the _input.seek won't affect what getCharPositionInLine returns. In this case one could do

setCharPositionInLine(getCharPositionInLine() - 2)

at the end of the rule, but that approach would not work if instead of .. one were dealing with input of variable length. I had hoped that I would be able to save the result of getCharPositionInLine() in the first action, but unfortunately that already reflects the end of the whole expression.

Looking at LexerATNSimulator.evaluatePredicate I see that this method makes an effort to restore a given position state. So we can get at the correct state by abusing a semantic predicate for its side effects:

@lexer::members {
    private int _savedIndex, _savedLine, _savedColumn;
    private boolean remember() {
        _savedIndex = _input.index();
        _savedLine = getLine();
        _savedColumn = getCharPositionInLine();
        return true;
    }
    private void recall(int type) {
        _input.seek(_savedIndex);
        setLine(_savedLine);
        setCharPositionInLine(_savedColumn);
        setType(type);
    }
}
INT_RANGE: INT { remember() }? '..' { recall(INT); } ;

Keep in mind that the semantic predicate will get executed at a point in time where it isn't yet guaranteed that the whole expression will actually match. So if you use this trick in several places, you have to be careful that you don't get remember() calls from different rules overwriting the state. If in doubt, you could use multiple such functions, or an index into an array, to make each match unambiguous.

Otranto answered 1/3, 2016 at 15:9 Comment(3)
Took ages to find out the way to "rewind" the lexer. FINALLY, THANK YOU!Lurk
Declared victory too soon. Not really working with 4.13, but I'll try to debug.Lurk
Parser and Lexer token codes were not matching. Seems like on newer version you can't use the "hack" you suggest here but instead you need to overrides methods, and provide YourParser.INT/YourParser.RANGE to the lexer _factory.Lurk
O
3

The sources of the current (as of this writing) Lexer implementation contain several docstring entries regarding the emission of multiple tokens. These are of course represented in the Lexer API JavaDoc as well. According to these, one has to do the following:

  1. Override emit(Token):

    By default does not support multiple emits per nextToken invocation for efficiency reasons. Subclass and override this method, nextToken, and getToken (to push tokens into a list and pull from that list rather than a single variable as this implementation does).

  2. Override nextToken().

  3. Override getToken():

    Override if emitting multiple tokens.

  4. Make sure to set _token to non-null:

    If you subclass to allow multiple token emissions, then set this to the last token to be matched or something nonnull so that the auto token emit mechanism will not emit another token.

However, I don't see why overriding getToken would be important, since I see no calls to that method anywhere in the runtime library. And if you set _token, then that will be the output of getToken as well.

So what I did to emit two tokens from a single rule was this:

@lexer::members {

    private Token _queued;

    @Override public Token nextToken() {
        if (_queued != null) {
            emit(_queued);
            _queued = null;
            return getToken();
        }
        return super.nextToken();
    }

    @Override public Token emit() {
        if (_type != INT_RANGE)
            return super.emit();
        Token t = _factory.create(
            _tokenFactorySourcePair, INT, null, _channel,
            _tokenStartCharIndex, getCharIndex()-3,
            _tokenStartLine, _tokenStartCharPositionInLine);
        _queued = _factory.create(
            _tokenFactorySourcePair, RANGE, null, _channel,
            getCharIndex()-2, getCharIndex()-1, _tokenStartLine,
            _tokenStartCharPositionInLine + getCharIndex()-2 -
            _tokenStartCharIndex);
        emit(t);
        return t;
    }
}

INT_RANGE: INT '..' ;

All the position computation felt quite tedious, however, and gave me another (and at least for this application far better) idea which I'll post in a spearate answer.

Otranto answered 1/3, 2016 at 14:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.