How to read non-BMP (astral) Unicode supplementary characters (code points)
O

2

8

The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also holds only 16 bits. Which function reads complete Unicode symbols, including those in the SMP, SIP, TIP, SSP and PUA?
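
To make the problem concrete, here is a minimal sketch that feeds the G-Clef through a Reader and records what each read() call returns. The class name SurrogateDemo and the helper readUnits are made up for illustration, and a StringReader stands in for a real input stream:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class SurrogateDemo {
    // Returns the raw values that Reader.read() yields for the given text,
    // one entry per UTF-16 code unit.
    static int[] readUnits(String text) throws IOException {
        Reader in = new StringReader(text);
        int[] units = new int[text.length()];
        for (int i = 0; i < units.length; i++) {
            units[i] = in.read();
        }
        return units;
    }

    public static void main(String[] args) throws IOException {
        // Build the one-code-point string U+1D11E without relying on source encoding.
        String clef = new String(Character.toChars(0x1D11E));
        for (int unit : readUnits(clef)) {
            System.out.printf("0x%04X%n", unit); // 0xD834, then 0xDD1E
        }
    }
}
```

A single code point therefore arrives as two separate read() results, a high surrogate followed by a low surrogate.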

Update

I asked how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, nor do I want to read a whole line.

It is possible to build a code point with Character.toCodePoint(), but this function requires char arguments. On the other hand, reading a char directly is not possible because read() returns an int. My best workaround so far is this, but it still contains unsafe casts:

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How to do it better?

Update 2

Another version returning a String but still using casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

The remaining question: are the casts safe, and if not, how can they be avoided?

Ostracod answered 28/6, 2013 at 9:14 Comment(5)
There is no need to add the major tag in the title. - Lynnalynne
Possible duplicate of "Java reading in character streams with supplementary unicode characters". - Enervated
The possible duplicate does not contain an answer. - Ostracod
Both of your answers are "correct" (although the first doesn't handle end of stream). Nothing about your casts is unsafe. - Messy
Read "The WTF-8 encoding": decode from potentially ill-formed UTF-16 to code points and vice versa… - Layout
A
2

My best work around so far is this but it still contains unsafe casts

The only unsafe thing about the code you've presented is that ch16 might be -1 if input has reached EOF. If you check for this condition first, then the other (char) casts are guaranteed to be safe, because Reader.read() is specified to return either -1 or a value within the range of char (0 to 0xFFFF).

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
      return ch16; // or possibly throw an exception
    else 
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}

This still isn't ideal. You really need to handle the edge case where the first char read is a high surrogate but the second one isn't a matching low surrogate. In that case you probably want to return the first char as-is and back up the reader so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about:

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Or you could wrap the original reader in a PushbackReader and use unread(secondChar).
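
A rough sketch of that PushbackReader variant, assuming the caller is happy to wrap the source once and reuse the wrapper for every read. The class name CodePointReader and the method name readCodePoint are illustrative only:

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class CodePointReader {
    // Reads one code point, or returns -1 at end of stream.
    // An unpaired surrogate is returned as-is.
    static int readCodePoint(PushbackReader input) throws IOException {
        int first = input.read();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first;
        }
        int second = input.read();
        if (second < 0) {
            return first; // EOF right after a lone high surrogate
        }
        if (!Character.isLowSurrogate((char) second)) {
            input.unread(second); // push back so the next call sees it
            return first;
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        String text = "a" + new String(Character.toChars(0x1D11E)) + "b";
        // The default pushback buffer holds one char, which is all this needs.
        PushbackReader in = new PushbackReader(new StringReader(text));
        for (int cp = readCodePoint(in); cp >= 0; cp = readCodePoint(in)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Unlike mark()/reset(), this does not depend on markSupported(), but all reads have to go through the same PushbackReader instance, otherwise a pushed-back char would be lost.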

Alonaalone answered 28/6, 2013 at 12:9 Comment(2)
What does converting this to a codepoint gain? If you want to do anything useful, you most likely want the data in a String. - Messy
@Messy Every parser needs the next character and not the next string. Would you say parsers are not useful? - Ostracod
J
-1

Full Unicode can be represented in both UTF-8 and UTF-16, as sequences of bytes or of 16-bit code units (Java chars), respectively. From a String, full Unicode code points can be extracted with:

int[] codePoints = { 0x1d11e };
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp);
}
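
On Java 8 or later, the same iteration can also be written with String.codePoints(), which yields an IntStream of code points. A small sketch, with the class name CodePointStreamDemo and the helper codePointsOf chosen just for this example:

```java
public class CodePointStreamDemo {
    // Collects the code points of a string; a surrogate pair becomes one entry.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "h" + new String(Character.toChars(0x1D11E));
        // s.length() is 3 (chars), but there are only 2 code points.
        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```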

For a file with mostly Latin characters, UTF-8 would seem fine.

The following reads a full standard Unicode file (in UTF-8):

// Needs java.io.BufferedReader, FileInputStream, FileNotFoundException,
// InputStreamReader and IOException; file is a java.io.File.
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break; // end of file
        }
        // ... do something with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    // ...
}

A function that builds a Java String from one or more Unicode code points:

String s1 = unicodeToString(0x1d11e);
String s2 = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

public static String unicodeToString(int... codePoints) {
    // String(int[], int, int) interprets the array as Unicode code points
    return new String(codePoints, 0, codePoints.length);
}
Jericajericho answered 28/6, 2013 at 9:52 Comment(3)
Detailed it more; here I read from a file, a FileInputStream. Maybe the confusion is that Unicode in itself is not a format, but a standard numbering of symbols. UTF-8, UTF-16LE, UTF-16BE, UTF-16 are the actual binary formats. In effect Java uses Unicode in 2 formats: though char is UTF-16, in .class files String constants are stored as UTF-8. UTF-8 covers full Unicode. In the code above the array codePoints uses the Unicode numbers. - Jericajericho
The question asked for a single symbol, not a full line. Using readLine makes it necessary to unread the rest of the line. - Ostracod
Ahah, will add it to the answer. - Jericajericho
