How to read non-BMP (astral) Unicode supplementary characters (code points)

Asked 28/6, 2013 at 9:14 Answered 28/6, 2013 at 12:9

Solved java unicode codepoint surrogate-pairs supplementary

The G-clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also holds only 16 bits. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?

Update

I asked how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, nor do I want to read a line.

It is possible to build a code point with Character.toCodePoint(), but this function requires chars. On the other hand, reading a char directly is not possible because read() returns an int. My best work around so far is this but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How to do it better?

Update 2

Another version returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

The remaining question: are the casts safe or how to avoid them?

Ostracod answered 28/6, 2013 at 9:14 Comment(5)

There is no need to add the major tag in the title. – Lynnalynne 28/6, 2013 at 9:15

possible duplicate of Java reading in character streams with supplementary unicode characters – Enervated 28/6, 2013 at 9:33

The possible duplicate does not contain an answer. – Ostracod 28/6, 2013 at 9:59

both of your answers are "correct" (although the first doesn't handle end of stream). nothing about your casts is unsafe. – Messy 28/6, 2013 at 22:17

Read The WTF-8 encoding decode from potentially ill-formed UTF-16 to code points and vice versa… – Layout 2/11, 2022 at 20:47

My best work around so far is this but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, the other (char) casts are guaranteed to be safe, because Reader.read() is specified to return either -1 or a value within the range of char (0 to 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
      return ch16; // or possibly throw an exception
    else 
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}

This still isn't ideal. You really need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate, in which case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use unread(secondChar).
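
A minimal sketch of that PushbackReader variant, in case it helps. The class name CodePointReader and the method name readCodePoint are made up for illustration; only PushbackReader.read(), PushbackReader.unread(int) and the Character helpers come from the JDK. It assumes an unpaired surrogate should be returned as-is rather than rejected.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Hypothetical helper class, not part of the JDK.
class CodePointReader {
    /**
     * Reads one code point, or returns -1 at end of stream.
     * An unpaired surrogate is returned unchanged (an assumption of this sketch).
     */
    static int readCodePoint(PushbackReader input) throws IOException {
        int first = input.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first; // EOF, a BMP character, or a lone low surrogate
        }
        int second = input.read();
        if (second < 0) {
            return first; // high surrogate immediately before EOF
        }
        if (!Character.isLowSurrogate((char) second)) {
            input.unread(second); // push it back so the next call sees it
            return first;
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        // "a", U+1D11E as a surrogate pair, a lone high surrogate, then "b"
        String text = "a\uD834\uDD1E\uD834b";
        // The default pushback buffer holds one char, which is all this needs.
        PushbackReader in = new PushbackReader(new StringReader(text));
        int cp;
        while ((cp = readCodePoint(in)) != -1) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Unlike mark()/reset(), this does not depend on markSupported(), since the wrapper supplies the one-char lookahead itself.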

Alonaalone answered 28/6, 2013 at 12:9 Comment(2)

what does converting this to a codepoint gain? if you want to do anything useful, you most likely want the data in a String. – Messy 28/6, 2013 at 21:36

@Messy Every parser needs the next character and not the next string. Would you say parsers are not useful? – Ostracod 25/7, 2014 at 8:30

-1

Full Unicode can be represented in both UTF-8 and UTF-16, as sequences of bytes and of 16-bit code units (Java chars), respectively. A full Unicode code point can be extracted from a String with:

int[] codePoints = { 0x1d11e };
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp);
}
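
On Java 8 or later, the same iteration can probably be written more compactly with String.codePoints(), which yields an IntStream of code points. A small sketch (the class name CodePointIteration and the method name codePointsOf are invented for this example):

```java
// Hypothetical demo class; only String.codePoints() and Character.charCount() are JDK APIs.
class CodePointIteration {
    /** Returns the code points of s as an int array (requires Java 8+). */
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'h' followed by U+1D11E, built from code points as in the snippet above
        String s = new String(new int[] { 0x68, 0x1d11e }, 0, 2);
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X uses %d char(s)%n", cp, Character.charCount(cp));
        }
    }
}
```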

For a file containing mostly Latin characters, UTF-8 would seem fine.

The following reads a full standard Unicode file (in UTF-8):

try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        // ... do something with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    // ...
}

A function that returns a Java String built from one or more Unicode code points:

String s1 = unicodeToString(0x1d11e);
String s2 = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

// The parameter name must match the name used in the body (codePoints).
public static String unicodeToString(int... codePoints) {
    return new String(codePoints, 0, codePoints.length);
}

Jericajericho answered 28/6, 2013 at 9:52 Comment(3)

Detailed it more; here I read from a file, a FileInputStream. Maybe the confusion is that Unicode in itself is not a format, but a standard numbering of symbols. UTF-8, UTF-16LE, UTF-16BE, UTF-16 are the actual binary formats. In effect Java uses Unicode in 2 formats: though char is UTF-16, String constants in .class files are stored as UTF-8. UTF-8 covers full Unicode. In the code above the array codePoints uses the Unicode numbers. – Jericajericho 28/6, 2013 at 10:33

The question asked for a single symbol, not a full line. Using readLine makes it necessary to unread the rest of the line. – Ostracod 28/6, 2013 at 10:34

Ahah, will add it to the answer. – Jericajericho 28/6, 2013 at 10:35
