Java, JavaCC: How to parse characters outside the BMP?

I am referring to the XML 1.1 spec.

Look at the definition of NameStartChar:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]

If I interpret this correctly, the last range (#x10000-#xEFFFF) goes beyond the 16-bit range of Java's char type. So it must be UTF-32, right? So I need to check pairs of chars against this range instead of single chars, right?

My questions are:

  • How do I check for such character ranges using standard Java methods?
  • How is it possible to define such ranges in JavaCC?
    • JavaCC complains about \u10000 and \uEFFFF

Thank you!

NOTE: Don't worry, I am not trying to write my own XML parser.
EDIT: I am writing a parser that checks whether text input from miscellaneous (non-XML) text formats matches valid XML names.

Dryasdust answered 20/5, 2010 at 10:12 Comment(2)
Java's utterly broken char was conceived (to defend Gosling's SNAFU) when Unicode was not yet at 3.1, hence the 16-bit char SNAFU. It got messy once Unicode 3.1 came out, because the entire char[] "abstraction" isn't really abstracting much anymore. As Jon Skeet pointed out, the trick is to work with 32-bit code points (Java ints) and to figure out the char-to-code-point relation in the String class and others. The situation is not nice. It is one of the messier aspects of Java (because it affects a broken primitive type that is deeply ingrained in the language). - Tuff
This question has nothing to do with UTF-32. Notations like #x10FFFF and \u10FFFF represent characters in the abstract; UTF-16 and UTF-32 are encodings that tell the computer how to store the characters in memory. Java always uses UTF-16, so characters outside the BMP are stored using two char values, or a surrogate pair. Jon has already pointed out how to deal with those. - Uther

Have a look at Character.toCodePoint(char, char) which will convert a surrogate pair into a full range code point. String.codePointAt may well be useful to you, too.

There's a lot of other surrogate support within Character and String. To know exactly which methods to call, we'd need to know the exact details of your situation.
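
As a rough illustration of the methods mentioned above, here is a minimal sketch that tests code points against the NameStartChar ranges quoted in the question. The class name XmlNameStart and its method names are made up for this example; only Character.isSurrogatePair, Character.toCodePoint and String.codePointAt are standard library calls.

```java
class XmlNameStart {
    // Inclusive code point ranges copied from the NameStartChar production
    // quoted in the question (XML 1.1). The table layout is an assumption
    // of this sketch, not something prescribed by the spec.
    private static final int[][] RANGES = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F},
        {0x2C00, 0x2FEF}, {0x3001, 0xD7FF}, {0xF900, 0xFDCF},
        {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    // Works on a full code point (an int), so supplementary characters
    // beyond U+FFFF are handled the same way as BMP characters.
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Combines two UTF-16 code units into one code point before testing.
    // Returns false if the two chars do not form a valid surrogate pair.
    static boolean isNameStartChar(char high, char low) {
        return Character.isSurrogatePair(high, low)
                && isNameStartChar(Character.toCodePoint(high, low));
    }

    // codePointAt(0) reads a surrogate pair as a single code point; an
    // unpaired surrogate is returned as-is and matches none of the ranges.
    static boolean startsWithNameStartChar(String s) {
        return !s.isEmpty() && isNameStartChar(s.codePointAt(0));
    }
}
```

With this sketch, XmlNameStart.startsWithNameStartChar("\uD800\uDC00abc") returns true (the string starts with U+10000), while a string starting with a digit or with an unpaired surrogate returns false.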

Veritable answered 20/5, 2010 at 10:16 Comment(1)
Thank you. OK, I clarified my intentions at the bottom of my question (see EDIT). - Dryasdust

I've found http://www.fileformat.info/info/unicode/char/10000/index.htm to be a handy site for learning about Unicode characters.

For example, U+10000 and U+10FFFF are written in Java as the following surrogate pairs:

String first = "\uD800\uDC00"; // U+10000
String last = "\uDBFF\uDFFF"; // U+10FFFF
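
Pairs like these can also be derived in code rather than looked up. The following is a minimal sketch using Character.toChars; the class name SurrogateEscapes and the method toEscapes are hypothetical names chosen for this example.

```java
class SurrogateEscapes {
    // Renders a code point as the Java escape sequence(s) of its UTF-16
    // code units: one escape for a BMP character, two (a surrogate pair)
    // for a supplementary character.
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }
}
```

Applied to the bounds of the last NameStartChar range, toEscapes(0x10000) yields \uD800\uDC00 and toEscapes(0xEFFFF) yields \uDB7F\uDFFF. Assuming the JavaCC grammar sees the input as UTF-16 chars, that range could presumably be written as a high surrogate in \uD800-\uDB7F followed by a low surrogate in \uDC00-\uDFFF; this has not been tested against JavaCC here.
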
Revolutionize answered 31/1, 2014 at 18:47 Comment(0)
