I am referring to the XML 1.1 spec.
Look at the definition of NameStartChar:
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
If I interpret this correctly, the last range (#x10000-#xEFFFF) goes beyond the UTF-16 range of Java's char type. So it must be UTF-32, right? So I need to check pairs of char against this range, instead of single chars, right?
My questions are:
- How do I check for such character ranges using standard Java methods?
- How is it possible to define such ranges in JavaCC?
  - JavaCC complains about \u10000 and \uEFFFF
Thank you!
NOTE: Don't worry, I am not trying to write my own XML parser.
EDIT: I am writing a parser that checks whether text input from miscellaneous (non-XML) text formats matches valid XML names.
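For the JavaCC part of the question, one possible workaround is to express the supplementary range as a sequence of two four-digit escapes, a high surrogate followed by a low surrogate. This assumes the generated lexer matches UTF-16 char units rather than whole code points, which I have not verified for every JavaCC version. The sketch below only computes the surrogate bounds of the range with standard Java; the class and method names (SurrogateBounds, escape, pairOf) are made up for illustration.

```java
final class SurrogateBounds {

    /** Formats one UTF-16 code unit as a backslash-u escape with four hex digits. */
    static String escape(char unit) {
        return String.format("\\u%04X", (int) unit);
    }

    /** Returns the two escapes that encode a supplementary code point in UTF-16. */
    static String pairOf(int codePoint) {
        // Character.toChars returns two char values for code points at or above 0x10000.
        char[] units = Character.toChars(codePoint);
        return escape(units[0]) + escape(units[1]);
    }

    public static void main(String[] args) {
        // Lower and upper bound of the last NameStartChar range.
        System.out.println(pairOf(0x10000));
        System.out.println(pairOf(0xEFFFF));
    }
}
```

The first call yields \uD800\uDC00 and the second \uDB7F\uDFFF, so under the assumption above the range would correspond to a high surrogate in \uD800-\uDB7F followed by a low surrogate in \uDC00-\uDFFF.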
#x10FFFF and \u10FFFF represent characters in the abstract; UTF-16 and UTF-32 are encodings that tell the computer how to store the characters in memory. Java always uses UTF-16, so characters outside the BMP are stored using two char values, or a surrogate pair. Jon has already pointed out how to deal with those. – Uther
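To illustrate the point about surrogate pairs, here is a minimal sketch of checking a NameStartChar with standard Java methods. It is not necessarily the approach Jon describes. It reads a whole code point with String.codePointAt, which combines a valid surrogate pair into one int, and compares that int against the ranges quoted in the question. The class and method names (XmlNameStart, isNameStartChar, startsLikeXmlName) are hypothetical.

```java
final class XmlNameStart {

    // Inclusive code point ranges, copied from the NameStartChar production in the question.
    private static final int[][] RANGES = {
        {':', ':'}, {'A', 'Z'}, {'_', '_'}, {'a', 'z'},
        {0xC0, 0xD6}, {0xD8, 0xF6}, {0xF8, 0x2FF}, {0x370, 0x37D},
        {0x37F, 0x1FFF}, {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
        {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF}
    };

    /** Checks a full code point (not a single char) against the ranges. */
    static boolean isNameStartChar(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    /** True if the first code point of s, stored as one or two char values, is a NameStartChar. */
    static boolean startsLikeXmlName(String s) {
        // codePointAt joins a high and low surrogate into one code point;
        // an unpaired surrogate is returned as-is and falls outside every range.
        return !s.isEmpty() && isNameStartChar(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10000)) + "abc";
        System.out.println(supplementary.length());           // 5: the first character takes two char values
        System.out.println(startsLikeXmlName(supplementary)); // true
        System.out.println(startsLikeXmlName("1abc"));        // false
    }
}
```

To walk through a whole string code point by code point, the index can be advanced by Character.charCount(codePoint) after each codePointAt call, so that a surrogate pair is never split.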