How do I match Unicode characters in ANTLR?

Asked 17/1, 2010 at 17:19 Answered 18/1, 2010 at 21:6

I am trying to pick out all tokens in a text and need to match all ASCII and Unicode characters. Here is how I have laid them out:

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get these messages:

"Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input"

"Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input"

And nothing gets matched. Also, if I write it as

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this?

Inchoate answered 17/1, 2010 at 17:19 Comment(6)

'\u0000'..'\u00FF' does not cover "all Unicode characters", it only covers the first 256. – Garey 17/1, 2010 at 17:24

That too. I missed that! – Rimester 17/1, 2010 at 17:26

True, but I thought Java doesn't support five-digit Unicode yet. – Inchoate 17/1, 2010 at 19:24

With \u00FF, we're not in 5 digit Unicode country yet; that's only 2 so far. There's still all the characters from \u0100 to about \uF8FF. – Rimester 17/1, 2010 at 19:43

Actually, since the subset \u0000..\u00FF is an exact duplicate of ISO-8859-1, you could argue we haven't got any Unicode at all in there. :o) (And for the record: The highest valid codepoint in the BMP is \uFFFD, the replacement character, but not all codepoints up to that value are assigned. \uFFFE and \uFFFF are not characters.) – Garey 17/1, 2010 at 19:56

Ah, sorry, I just realised I meant \uFFFFF without the 0, but yes, that is wrong as well since it doesn't have a value. – Inchoate 17/1, 2010 at 22:31

One other thing to consider if you are planning on using Unicode: you should set the charVocabulary option so that the lexer allows any character in the Unicode range \u0000 through \uFFFE:

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is:

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z' and the Unicode range, you'd start the Unicode lexer rule above ASCII so that the two ranges don't overlap, like: '\u0080'..'\ufffe'
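A minimal sketch of that layout in ANTLR 3 syntax, reusing the fragment names from the question. The grammar name Tokens and the WS rule are hypothetical additions for illustration, not part of the original post. Because UNICODE starts at '\u0080', no character falls into more than one alternative of TOKEN, so the "multiple alternatives" messages should no longer be triggered by these rules.

```antlr
lexer grammar Tokens;   // hypothetical grammar name

// ASCII letters and digits, as in the question.
fragment CHAR    : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT   : '0'..'9' ;

// Starts above the ASCII range, so it cannot overlap CHAR or DIGIT.
fragment UNICODE : '\u0080'..'\uFFFE' ;

TOKEN : (CHAR | DIGIT | UNICODE)+ ;

// Assumption: whitespace separates tokens and is routed to the hidden channel.
WS    : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

Other ASCII characters, such as punctuation, are matched by no rule in this sketch and would need rules of their own.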

Femininity answered 18/1, 2010 at 21:6 Comment(1)

Note: The option "charVocabulary" is not available in antlr3 as it uses unicode by default. – Overtax 27/11, 2012 at 10:5

Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is BNF for an identifier, which shows how they took the trouble to group the characters:

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" }
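As a rough illustration, the BNF above could be written as ANTLR 3 lexer rules along these lines. The names Identifiers, ID_START and ID_PART are hypothetical, the uppercase range 'A'..'Z' is an assumption the BNF does not spell out, and "unicode character over 00C0" is approximated here as '\u00C0'..'\uFFFE'.

```antlr
lexer grammar Identifiers;   // hypothetical grammar name

// First character: a letter, '$' or '_' (uppercase letters are an assumption).
fragment ID_START : 'a'..'z' | 'A'..'Z' | '$' | '_' ;

// Later characters: anything allowed at the start, a digit,
// or a character from \u00C0 upwards (approximation of the BNF wording).
fragment ID_PART  : ID_START | '0'..'9' | '\u00C0'..'\uFFFE' ;

IDENTIFIER : ID_START ID_PART* ;
```

Grouping the characters this way is what lets the lexer stop an identifier at the first character outside the group, rather than swallowing the whole input as one token.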

Rimester answered 17/1, 2010 at 17:23 Comment(0)
