In an ANTLR4 lexer, how can a rule catch all remaining "words" as Unknown tokens?

I have an ANTLR4 lexer grammar. It has many rules for words, but I also want it to create an Unknown token for any word that it cannot match with the other rules. I have something like this:

Whitespace : [ \t\n\r]+ -> skip;
Punctuation : [.,:;?!];
// Other rules here
Unknown : .+? ; 

Now the generated lexer catches '~' as Unknown, but for the input '~~~' it creates three '~' Unknown tokens instead of a single '~~~' token. How do I tell the lexer to generate one token per run of consecutive unknown characters? I also tried "Unknown : . ;" and "Unknown : .+ ;", without success.

EDIT: In current ANTLR versions, .+? now catches the remaining words, so this problem seems to be resolved.

Kreindler answered 5/2, 2013 at 12:39 Comment(1)
According to the comment from @CAD97 below, this works as intended in ANTLR 4.7: .+? catches words, not individual characters. - Kreindler

A non-greedy .+? at the end of a lexer rule will always match a single character. A greedy .+, on the other hand, will consume as much input as possible, which was illegal at the end of a rule in ANTLR v3 (and probably in v4 as well).
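
As a loose analogy only, the same reluctant-versus-greedy difference can be seen with java.util.regex. This is not ANTLR code, the ANTLR lexer is not a regex engine, and the class and method names (ReluctantDemo, matches) are made up for illustration. It just shows why a reluctant quantifier with nothing after it stops at the first character:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: this uses java.util.regex, not the ANTLR lexer.
class ReluctantDemo {

  // Collects every successive match of `regex` in `input`.
  static List<String> matches(String regex, String input) {
    List<String> found = new ArrayList<>();
    Matcher m = Pattern.compile(regex).matcher(input);
    while (m.find()) {
      found.add(m.group());
    }
    return found;
  }

  public static void main(String[] args) {
    // Reluctant: each match stops after one character.
    System.out.println(matches(".+?", "~~~")); // prints [~, ~, ~]
    // Greedy: one match takes the whole run.
    System.out.println(matches(".+", "~~~"));  // prints [~~~]
  }
}
```

Whether a given ANTLR version behaves the same way for a trailing .+? is best checked against that version; the comments on this page suggest 4.7 may differ.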

What you can do is just match a single char, and "glue" these together in the parser:

unknowns : Unknown+ ; 

...

Unknown  : . ; 

EDIT

... but I only have a lexer, no parsers ...

Ah, I see. Then you could override the nextToken() method:

lexer grammar Lex;

@members {

  public static void main(String[] args) {
    Lex lex = new Lex(new ANTLRInputStream("foo, bar...\n"));
    for(Token t : lex.getAllTokens()) {
      System.out.printf("%-15s '%s'\n", tokenNames[t.getType()], t.getText());
    }
  }

  private java.util.Queue<Token> queue = new java.util.LinkedList<Token>();

  @Override
  public Token nextToken() {    

    if(!queue.isEmpty()) {
      return queue.poll();
    }

    Token next = super.nextToken();

    if(next.getType() != Unknown) {
      return next;
    }

    StringBuilder builder = new StringBuilder();

    while(next.getType() == Unknown) {
      builder.append(next.getText());
      next = super.nextToken();
    }

    // The `next` will _not_ be an Unknown-token, store it in 
    // the queue to return the next time!
    queue.offer(next);

    return new CommonToken(Unknown, builder.toString());
  }
}

Whitespace  : [ \t\n\r]+ -> skip ;
Punctuation : [.,:;?!] ;
Unknown     : . ; 

Running it:

java -cp antlr-4.0-complete.jar org.antlr.v4.Tool Lex.g4 
javac -cp antlr-4.0-complete.jar *.java
java -cp .:antlr-4.0-complete.jar Lex

will print:

Unknown         'foo'
Punctuation     ','
Unknown         'bar'
Punctuation     '.'
Punctuation     '.'
Punctuation     '.'
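
If you would rather not override nextToken(), the same gluing could probably be done as a separate pass over the token list after lexing (for example over whatever getAllTokens() returns). Below is a minimal standalone sketch of that idea. It does not use the ANTLR runtime: Tok, GlueDemo and the two type constants are hypothetical stand-ins for the real Token interface and the generated token types.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch: Tok and the constants below are hypothetical
// stand-ins, not ANTLR classes or generated token types.
class GlueDemo {
  static final int PUNCTUATION = 1;
  static final int UNKNOWN = 2;

  static final class Tok {
    final int type;
    final String text;

    Tok(int type, String text) {
      this.type = type;
      this.text = text;
    }

    @Override
    public String toString() {
      return (type == UNKNOWN ? "Unknown" : "Punctuation") + " '" + text + "'";
    }
  }

  // Merges each run of adjacent UNKNOWN tokens into a single UNKNOWN token.
  static List<Tok> glue(List<Tok> in) {
    List<Tok> out = new ArrayList<>();
    StringBuilder run = new StringBuilder();
    for (Tok t : in) {
      if (t.type == UNKNOWN) {
        run.append(t.text); // still inside a run, keep collecting
        continue;
      }
      if (run.length() > 0) { // a run just ended, emit it first
        out.add(new Tok(UNKNOWN, run.toString()));
        run.setLength(0);
      }
      out.add(t);
    }
    if (run.length() > 0) { // input ended while inside a run
      out.add(new Tok(UNKNOWN, run.toString()));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Tok> in = new ArrayList<>();
    for (char c : "foo".toCharArray()) in.add(new Tok(UNKNOWN, String.valueOf(c)));
    in.add(new Tok(PUNCTUATION, ","));
    for (char c : "bar".toCharArray()) in.add(new Tok(UNKNOWN, String.valueOf(c)));
    for (Tok t : glue(in)) System.out.println(t);
  }
}
```

Like the nextToken() version, this only sees tokens that were actually emitted, so characters separated by skipped whitespace would still end up glued together.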
Parmentier answered 5/2, 2013 at 19:23 Comment(4)
Thanks, but I only have a lexer, no parsers. I am using this to tokenize texts for an NLP tool. What we want is an unknown or unrecognized token for whatever the lexer cannot identify as a token. Maybe I can intercept these unknown single-character tokens and concatenate them myself. - Kreindler
@mdakin, check out my EDIT. - Parmentier
This works for the limited example I gave, but now I understand my problem is deeper and not actually easy to solve. If I have another rule like " word: [a-zA-Z]+; ", the lexer would tokenize "~A~a" as "~":Unknown "A":Word "~":Unknown "a":Word. The reason is that I skip whitespace, so I have no context on where a word starts or ends. But anyway, your answer still helped me learn how the lexer works. - Kreindler
For posterity: I just made a lexer rule with .+? in ANTLR 4.7, and it worked (with the interpreter in the IntelliJ IDEA plugin). - Gisela

The accepted answer works, but its code is Java-only.

I converted the provided Java code for use with the C# ANTLR runtime. If anyone else needs it... here ya go! (The catch-all token is named UNKNOWN in this version; rename it to match your grammar.)

@members {
private IToken _NextToken = null;
public override IToken NextToken()
{
    if(_NextToken != null)
    {
        var token = _NextToken;
        _NextToken = null;
        return token;
    }

    var next = base.NextToken();
    if(next.Type != UNKNOWN)
    {
        return next;
    }

    var originalToken = next;
    var lastToken = next;
    var builder = new System.Text.StringBuilder();
    while(next.Type == UNKNOWN)
    {
        lastToken = next;
        builder.Append(next.Text);
        next = base.NextToken();
    }
    _NextToken = next;
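    return new CommonToken(
        originalToken
    )
    {
        Text = builder.ToString(),
        // StopIndex is a character index into the input, so take it from
        // the last merged token's StopIndex rather than from its Column.
        StopIndex = lastToken.StopIndex
    };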
}
}
Pokeberry answered 18/1, 2020 at 0:16 Comment(0)
