Can I add Antlr tokens at runtime?

I have a situation where my language contains some words that aren't known at build time but will be known at run time, which forces me to constantly rebuild and redeploy the program to take new words into account. I was wondering whether it is possible in ANTLR to generate some of the tokens from a config file?

E.g., in a simplified example, if I have a rule

rule : WORDS+;

WORDS : 'abc';

And my language comes across 'bcd' at runtime, I would like to be able to modify a config file to define bcd as a word rather than having to rebuild and then redeploy.

Portion answered 24/5, 2011 at 9:20 Comment(0)

You could add some sort of collection to your lexer class. This collection will hold all runtime-words. Then you add some custom code inside the rule that could possibly match these runtime-words and change the type of the token if it is present in the collection.

Demo

Let's say you want to parse the input:

"foo bar baz"

and at runtime, the words "foo" and "baz" should become special runtime words. The following grammar shows how to solve this:

grammar RuntimeWords;

tokens {
  RUNTIME_WORD;
}

@lexer::members {

  private java.util.Set<String> runtimeWords;

  public RuntimeWordsLexer(CharStream input, java.util.Set<String> words) {
    super(input);
    runtimeWords = words;
  }
}

parse
  :  (w=. {System.out.printf("\%-15s :: \%s \n", tokenNames[$w.type], $w.text);})+ EOF
  ;

Word
  :  ('a'..'z' | 'A'..'Z')+
     {
       if(runtimeWords.contains(getText())) {
         $type = RUNTIME_WORD;
       }
     }
  ;

Space
  :  ' ' {skip();}
  ;

And a little test class:

import org.antlr.runtime.*;
import java.util.*;

public class Main {
  public static void main(String[] args) throws Exception {
    Set<String> words = new HashSet<String>(Arrays.asList("foo", "baz"));
    ANTLRStringStream in = new ANTLRStringStream("foo bar baz");
    RuntimeWordsLexer lexer = new RuntimeWordsLexer(in, words);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RuntimeWordsParser parser = new RuntimeWordsParser(tokens);        
    parser.parse();
  }
}

which will produce the following output:

RUNTIME_WORD    :: foo 
Word            :: bar 
RUNTIME_WORD    :: baz
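
The test class above hard-codes the runtime words, while the question asks for them to come from a config file. A minimal sketch of that step, assuming a plain-text file with one word per line and '#' for comment lines (the file format and the RuntimeWordConfig helper are made up for illustration and are not part of ANTLR):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of ANTLR.
class RuntimeWordConfig {

  // Reads one word per line; blank lines and lines starting with '#' are skipped.
  // This file format is an assumption made for the sketch.
  static Set<String> load(Path file) throws IOException {
    Set<String> words = new LinkedHashSet<String>();
    for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
      String word = line.trim();
      if (!word.isEmpty() && !word.startsWith("#")) {
        words.add(word);
      }
    }
    return words;
  }

  public static void main(String[] args) throws IOException {
    // Write a throwaway config file so the sketch runs on its own.
    Path file = Files.createTempFile("runtime-words", ".txt");
    Files.write(file, Arrays.asList("# runtime words", "foo", "", "baz"), StandardCharsets.UTF_8);
    Set<String> words = load(file);
    System.out.println(words);
    Files.delete(file);
  }
}
```

The returned set could then be handed to the lexer in place of the hard-coded one, e.g. new RuntimeWordsLexer(in, RuntimeWordConfig.load(path)).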

Demo II

Here's another demo that is more tailored to your problem (I skimmed your question too quickly at first, but I'll leave my first demo in place because it might come in handy for someone). There aren't many comments in it, but my guess is that you won't have problems grasping what happens (if not, don't hesitate to ask for clarification!).

grammar RuntimeWords;

@lexer::members {

  private java.util.Set<String> runtimeWords;

  public RuntimeWordsLexer(CharStream input, java.util.Set<String> words) {
    super(input);
    runtimeWords = words;
  }

  private boolean runtimeWordAhead() {
    for(String word : runtimeWords) {
      if(ahead(word)) {
        return true;
      }
    }
    return false;
  }

  private boolean ahead(String word) {
    for(int i = 0; i < word.length(); i++) {
      if(input.LA(i+1) != word.charAt(i)) {
        return false;
      }
    } 
    return true; 
  }
}

parse
  :  (w=. {System.out.printf("\%-15s :: \%s \n", tokenNames[$w.type], $w.text);})+ EOF
  ;

Word
  :  {runtimeWordAhead()}?=> ('a'..'z' | 'A'..'Z')+
  |  'abc'
  ;

Space
  :  ' ' {skip();}
  ;

and the class:

import org.antlr.runtime.*;
import java.util.*;

public class Main {
  public static void main(String[] args) throws Exception {
    Set<String> words = new HashSet<String>(Arrays.asList("BBB", "CDEFG"));
    ANTLRStringStream in = new ANTLRStringStream("BBB abc CDEFG");
    RuntimeWordsLexer lexer = new RuntimeWordsLexer(in, words);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    RuntimeWordsParser parser = new RuntimeWordsParser(tokens);        
    parser.parse();
  }
}

will produce:

Word            :: BBB 
Word            :: abc 
Word            :: CDEFG 

Be careful if some of your runtime words start with another one. For example, if your runtime words contain "stack" and "stacker", you want the longer word to be checked first! Sorting the words by length, longest first, should take care of that.
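
A HashSet has no defined iteration order, so checking the longer word first means copying the words into an ordered collection. A minimal sketch, assuming a sorted List is acceptable in place of the Set (the RuntimeWordOrder class and its longestFirst method are hypothetical names, not part of the grammar above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of ANTLR.
class RuntimeWordOrder {

  // Returns the words ordered longest first; ties are broken alphabetically
  // so that the resulting order is deterministic.
  static List<String> longestFirst(Set<String> words) {
    List<String> sorted = new ArrayList<String>(words);
    Collections.sort(sorted, new Comparator<String>() {
      public int compare(String a, String b) {
        if (a.length() != b.length()) {
          return b.length() - a.length();
        }
        return a.compareTo(b);
      }
    });
    return sorted;
  }

  public static void main(String[] args) {
    Set<String> words = new HashSet<String>(Arrays.asList("stack", "stacker", "abc"));
    System.out.println(longestFirst(words)); // prints [stacker, stack, abc]
  }
}
```

The lexer could then keep this List instead of the Set and loop over it in runtimeWordAhead(), so that "stacker" is always tried before "stack".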

One final word of caution: if only "stack" is in your runtime word list and the lexer encounters "stacker", then you probably don't want to create a "stack"-token and leave "er" dangling. In that case, you'll want to check if the character after the last char in the word is not a letter:

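private boolean ahead(String word) {
  for(int i = 0; i < word.length(); i++) {
    if(input.LA(i+1) != word.charAt(i)) {
      return false;
    }
  }
  // LA is 1-based, so the char right after the word is LA(word.length() + 1)
  int charAfterWord = input.LA(word.length() + 1);
  // the word only matches if it is not followed by another letter;
  // note that charAfterWord could also be EOF (-1), which is not a letter
  boolean letterAfterWord = (charAfterWord >= 'a' && charAfterWord <= 'z')
                         || (charAfterWord >= 'A' && charAfterWord <= 'Z');
  return !letterAfterWord;
}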
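
The boundary check can be tried out without generating a lexer. The sketch below mimics ANTLR's 1-based input.LA(k) on a plain String; the WordBoundaryCheck class and its la helper are stand-ins invented for this illustration, and -1 is used for EOF as in the ANTLR 3 runtime. Because LA is 1-based, the character after the word sits at offset word.length() + 1.

```java
// Hypothetical stand-alone class, not part of ANTLR.
class WordBoundaryCheck {

  // Stand-in for input.LA(k): the k-th char ahead of pos (1-based), or -1 past the end.
  static int la(String input, int pos, int k) {
    int index = pos + k - 1;
    return index < input.length() ? input.charAt(index) : -1;
  }

  // Same letter definition as the Word rule: 'a'..'z' | 'A'..'Z'.
  static boolean isLetter(int c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  }

  // True if word starts at pos and is not followed by another letter.
  static boolean ahead(String input, int pos, String word) {
    for (int i = 0; i < word.length(); i++) {
      if (la(input, pos, i + 1) != word.charAt(i)) {
        return false;
      }
    }
    return !isLetter(la(input, pos, word.length() + 1));
  }

  public static void main(String[] args) {
    System.out.println(ahead("stack", 0, "stack"));    // true: followed by EOF
    System.out.println(ahead("stack er", 0, "stack")); // true: followed by a space
    System.out.println(ahead("stacker", 0, "stack"));  // false: followed by 'e'
  }
}
```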
Fevre answered 24/5, 2011 at 9:45 Comment(7)
That is a really good answer, I just wish I could upvote it more. - Portion
@Richard, your kind words are worth more than any number of up-votes. You're welcome. - Fevre
Excellent write-up, +1. You've not only addressed the question, but also provided some very valuable insights into how this type of problem can be addressed architecturally. Thanks! - Sportive
@BartKiers, in both demos I get an error in the lexer class at: public RuntimeWordsLexer(CharStream input, java.util.Set<String> words) { super(input); runtimeWords = words; }. It tells me "Return type for this method is missing" and "Constructor call must be the first statement in a constructor". Can you help me with that? - Guildsman
@Guildsman My guess is that you named your grammar something other than RuntimeWords, in which case the constructor RuntimeWordsLexer is seen as a plain method that is missing a return type. You need to copy and paste exactly what I have written above. If you run into more problems, better create a question of your own: these comment boxes are not suited for extensive Q&A. Good luck. - Fevre
@BartKiers I have the same problem. Instead of creating a brand new RuntimeWordsLexer, which is a very good idea, is there an API that adds a character string at runtime and converts it to a token? - Sportswear
@peter.cyc, no, there's no API for that. - Fevre
