Java reading in character streams with supplementary unicode characters

I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters in the supplementary set (code points above U+FFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return a single value for each supplementary character; instead, it seems to split them at the 16-bit boundary.

I saw some other questions about basic Unicode character streams, but none of them seem to deal with characters beyond 16 bits.

Here's some simplified sample code:

// file is an already-open InputStream
InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while (nextChar != -1) {
    ...
    nextChar = input.read();
}

Does anyone know what I need to do to correctly read in a UTF-8 encoded file that contains supplementary characters?

Angry answered 11/10, 2011 at 4:12 Comment(0)

Java represents text as UTF-16, and Reader.read() returns one 16-bit char (a UTF-16 code unit) at a time. So, if your input stream has astral (supplementary) characters, each one will appear as a surrogate pair, i.e., as two chars. The first char is the high surrogate, and the second char is the low surrogate.
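
As a rough sketch of how the two halves could be recombined while reading: the class name `CodePointReading` and the helpers `readCodePoint` and `readAll` below are made up for illustration, and the in-memory stream merely stands in for the asker's file. A `PushbackReader` is used so that a char following an unpaired high surrogate is not lost.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class CodePointReading {

    // Hypothetical helper: returns the next full code point, or -1 at end of stream.
    static int readCodePoint(PushbackReader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, BMP char, or stray low surrogate
        }
        int second = in.read();
        if (second == -1) {
            return first; // unpaired high surrogate at end of stream
        }
        if (Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint((char) first, (char) second);
        }
        in.unread(second); // not a pair: put the char back for the next call
        return first;
    }

    // Hypothetical helper: decodes a UTF-8 stream into a list of code points.
    static List<Integer> readAll(InputStream utf8) {
        List<Integer> codePoints = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(
                new InputStreamReader(utf8, StandardCharsets.UTF_8))) {
            for (int cp = readCodePoint(in); cp != -1; cp = readCodePoint(in)) {
                codePoints.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // Sample data: 'A', U+1D7CE (a supplementary character), 'b'.
        String text = "A" + new String(Character.toChars(0x1D7CE)) + "b";
        InputStream file = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        for (int cp : readAll(file)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The loop has the same shape as the one in the question, except that each iteration yields a whole code point rather than a single char.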

Krause answered 11/10, 2011 at 4:24 Comment(5)
That makes sense. Is there an easy way to tell if a character is the first of a surrogate pair? - Angry
Sure, use Character.isHighSurrogate(). (There's also Character.isLowSurrogate() for the second half of the surrogate pair.) - Krause
Or use the String.codePointAt() / Character.codePointAt() methods, if you know the index of the first char of the surrogate pair. - Lowry
Awesome, that seems to work. One last question... I have a regular expression that uses Unicode character classes (such as "\p{Nd}"), and these classes don't seem to work on these surrogate pairs. Is there an easy solution to that? - Angry
Here's a link with a detailed discussion of coding for supplementary characters and surrogate code points in Java: ibm.com/developerworks/java/library/j-unicode - Deannedeans
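
On the regex question above: java.util.regex matches at the level of code points, so a class like \p{Nd} should match a supplementary digit as long as both halves of the pair are in the input. The sketch below assumes the problem came from matching each char on its own, which is only a guess at the asker's setup; `SupplementaryRegex` and `isDecimalDigits` are illustrative names.

```java
import java.util.regex.Pattern;

final class SupplementaryRegex {

    // Hypothetical helper: true if the whole string consists of decimal digits.
    static boolean isDecimalDigits(String s) {
        return Pattern.matches("\\p{Nd}+", s);
    }

    public static void main(String[] args) {
        // U+1D7CE MATHEMATICAL BOLD DIGIT ZERO, a supplementary character in category Nd.
        String boldZero = new String(Character.toChars(0x1D7CE));

        // Both halves of the surrogate pair present: treated as one code point.
        System.out.println(isDecimalDigits(boldZero));

        // High surrogate alone: a lone surrogate is not a digit.
        System.out.println(isDecimalDigits(boldZero.substring(0, 1)));
    }
}
```

If that guess is right, collecting the chars of a pair into a String (or a code point) before applying the pattern would be one way around it.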

Though read() is declared to return int, and could theoretically return a supplementary character's code point "all at once", I believe the return type is int only so that -1 can be returned to signal the end of the stream.

The value you're getting from read() is basically a char by another name, and in Java a char is limited to 16 bits.

A Java char sequence can only represent a supplementary character as a UTF-16 surrogate pair; as far as Java is concerned, there is no such thing as a "single character" (at least in the char sense) once you get above 0xFFFF.
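
To make the char versus code point distinction concrete, here is a small sketch using only standard String and Character methods. The class name `CharVersusCodePoint`, the helper `describe`, and the sample code point U+1F600 are chosen purely for illustration.

```java
final class CharVersusCodePoint {

    // Hypothetical helper: reports how many chars and code points a string holds.
    static String describe(String s) {
        return s.length() + " chars, " + s.codePointCount(0, s.length()) + " code point(s)";
    }

    public static void main(String[] args) {
        // U+1F600 lies above 0xFFFF, so it needs a surrogate pair.
        String s = new String(Character.toChars(0x1F600));

        System.out.println(describe(s));                              // 2 chars, 1 code point(s)
        System.out.println(Character.charCount(0x1F600));             // 2
        System.out.println(Character.isHighSurrogate(s.charAt(0)));   // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));    // true
        System.out.println(s.codePointAt(0) == 0x1F600);              // true
    }
}
```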

Gillespie answered 11/10, 2011 at 4:26 Comment(1)
Mostly true, although the JDK does expose the concept of a "code point", which is the full Unicode value (held in a 32-bit int) decoded from UTF-16 chars. So while char is limited to 16 bits, Java is not oblivious to the fact that Unicode code points extend beyond 16 bits. - Lowry
