Comparing a char to a code-point?

Asked 22/6, 2009 at 23:25 Answered 23/6, 2009 at 9:57

What is the "correct" way of comparing a code-point to a Java character? For example:

int codepoint = someString.codePointAt(0); // someString is any non-empty String
char token = '\n';

I know I can probably do:

if (codepoint==(int) token)
{ ... }

but this code looks fragile. Is there a formal API method for comparing codepoints to chars, or converting the char up to a codepoint for comparison?

Contracted answered 22/6, 2009 at 23:25 Comment(0)

A little bit of background: When Java appeared in 1995, the char type was based on the original "Unicode 88" specification, which was limited to 16 bits. A year later, when Unicode 2.0 was released, the concept of surrogate characters was introduced to go beyond the 16-bit limit.

Java internally represents all Strings in UTF-16 format. A code point above U+FFFF is represented by a surrogate pair, i.e., two chars, the first being the high-surrogate code unit (in the range \uD800-\uDBFF) and the second being the low-surrogate code unit (in the range \uDC00-\uDFFF).
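As a rough illustration of the surrogate-pair layout described above, the sketch below checks the two chars of U+1D50A and recombines them with Character.toCodePoint(char, char). The class name SurrogatePairDemo and the helper combine are made up for this example.

```java
// Hypothetical demo class; the names are illustrative only.
final class SurrogatePairDemo {

  // Recombines a high and a low surrogate into one code point,
  // rejecting anything that is not a valid surrogate pair.
  static int combine(char high, char low) {
    if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
      throw new IllegalArgumentException("not a surrogate pair");
    }
    return Character.toCodePoint(high, low);
  }

  public static void main(String[] args) {
    String s = "\uD835\uDD0A"; // U+1D50A stored as two chars
    int codePoint = combine(s.charAt(0), s.charAt(1));
    System.out.println(Integer.toHexString(codePoint)); // 1d50a
    System.out.println(codePoint == s.codePointAt(0));  // true
  }
}
```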

From the early days, all basic Character methods were based on the assumption that a code point could be represented in one char, so that's what the method signatures look like. Presumably to preserve backward compatibility, those signatures were not changed when Unicode 2.0 came around, so caution is needed when using them. To quote from the Java documentation:

The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).

Casting the char to an int, as you do in your sample, works fine though. The cast is optional, since Java widens a char to an int implicitly in a comparison.
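The difference between the char and int overloads in the documentation quote can be checked with a small sketch like this one. The class IsLetterDemo and its two helpers are invented for the example; only Character.isLetter, Character.toChars, charAt and codePointAt are real API calls.

```java
// Hypothetical demo class; the names are illustrative only.
final class IsLetterDemo {

  // Uses the char overload: sees only the first UTF-16 code unit.
  static boolean firstCharIsLetter(String s) {
    return Character.isLetter(s.charAt(0));
  }

  // Uses the int overload: sees the whole code point at index 0.
  static boolean firstCodePointIsLetter(String s) {
    return Character.isLetter(s.codePointAt(0));
  }

  public static void main(String[] args) {
    // U+2F81A is the CJK ideograph from the documentation quote.
    String ideograph = new String(Character.toChars(0x2F81A));
    System.out.println(firstCharIsLetter(ideograph));      // false (lone high surrogate)
    System.out.println(firstCodePointIsLetter(ideograph)); // true
  }
}
```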

Fellowship answered 23/6, 2009 at 0:30 Comment(3)

java.sun.com/developer/technicalArticles/Intl/Supplementary discusses the design decisions behind code-points in Java. – Contracted 24/6, 2009 at 21:43

Casting the char to an int is completely unnecessary. – Headstrong 9/8, 2023 at 22:33

Gili's link is no longer valid but I was able to copy it into wayback machine web.archive.org/web/20120605040611/http://java.sun.com/… – Flavone 7/12, 2023 at 12:53

The Character class contains many useful methods for working with Unicode code points. Note methods like Character.toChars(int) that return an array of chars. If your codepoint lies in the supplementary range, then the array will be two chars in length.
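For instance, a minimal sketch of Character.toChars(int) for a BMP and a supplementary code point might look like this (ToCharsDemo and unitCount are names made up for the example):

```java
// Hypothetical demo class; the names are illustrative only.
final class ToCharsDemo {

  // Number of UTF-16 code units that Character.toChars produces.
  static int unitCount(int codePoint) {
    return Character.toChars(codePoint).length;
  }

  public static void main(String[] args) {
    System.out.println(unitCount('A'));     // 1 (basic multilingual plane)
    System.out.println(unitCount(0x1D50A)); // 2 (supplementary range)

    // The two chars are the surrogate pair for U+1D50A.
    String g = new String(Character.toChars(0x1D50A));
    System.out.println(g.equals("\uD835\uDD0A")); // true
  }
}
```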

How you want to compare the values depends on whether you want to support the full range of Unicode values. This sample code can be used to iterate through a String's codepoints, testing to see if there is a match for the supplementary character MATHEMATICAL_FRAKTUR_CAPITAL_G (𝔊 - U+1D50A):

public final class CodePointIterator {

  private final String sequence;
  private int index = 0;

  public CodePointIterator(String sequence) {
    this.sequence = sequence;
  }

  public boolean hasNext() {
    return index < sequence.length();
  }

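  // Returns the code point at the current position and advances past it:
  // one char for a BMP code point, two for a supplementary one.
  public int next() {
    int codePoint = sequence.codePointAt(index);
    index += Character.charCount(codePoint);
    return codePoint;
  }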

  public static void main(String[] args) {
    String sample = "A" + "\uD835\uDD0A" + "B" + "C";
    int match = 0x1D50A;
    CodePointIterator pointIterator = new CodePointIterator(sample);
    while (pointIterator.hasNext()) {
      System.out.println(match == pointIterator.next());
    }
  }
}

For Java 8 onwards CharSequence.codePoints() can be used:

public static void main(String[] args) {
  String sample = "A" + "\uD835\uDD0A" + "B" + "C";
  int match = 0x1D50A;
  sample.codePoints()
        .forEach(cp -> System.out.println(cp == match));
}

I created a table to help get a handle on Unicode string length and comparison cases that sometimes need to be handled.

Spurious answered 23/6, 2009 at 9:57 Comment(2)

The body of next() could be written as int codePoint = sequence.codePointAt(index); index += Character.charCount(codePoint); return codePoint; which might read better and be a minuscule bit more efficient. – Elmaleh 26/5, 2013 at 7:14

To append code points to a string, use StringBuffer.appendCodePoint(int codePoint). – Cargo 12/1, 2017 at 11:14
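A short sketch of that suggestion, using StringBuilder (which offers the same appendCodePoint method as StringBuffer). The class AppendCodePointDemo and the helper fromCodePoints are hypothetical names for this example.

```java
// Hypothetical demo class; the names are illustrative only.
final class AppendCodePointDemo {

  // Builds a String from code points; supplementary ones become two chars.
  static String fromCodePoints(int... codePoints) {
    StringBuilder sb = new StringBuilder();
    for (int cp : codePoints) {
      sb.appendCodePoint(cp);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String s = fromCodePoints('A', 0x1D50A, 'B');
    System.out.println(s.length());                      // 4 chars
    System.out.println(s.codePointCount(0, s.length())); // 3 code points
  }
}
```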

For a character which can be represented by a single char (16 bits, basic multilingual plane), you can get the codepoint simply by casting the char to an integer (as the question suggests), so there's no need for a special method to perform a conversion.

If you're comparing a char to a codepoint, you don't need any special casing. Just compare the char to the int directly (as the question suggests). If the int represents a codepoint outside of the basic multilingual plane, the result will always be false.
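A minimal sketch of that direct comparison, assuming a helper named matches in a made-up class CompareDemo. The char operand is widened to int automatically, so no cast is needed:

```java
// Hypothetical demo class; the names are illustrative only.
final class CompareDemo {

  // Compares a code point to a char; token is implicitly widened to int.
  static boolean matches(int codePoint, char token) {
    return codePoint == token;
  }

  public static void main(String[] args) {
    int newline = "\nabc".codePointAt(0);
    System.out.println(matches(newline, '\n')); // true

    // A supplementary code point never equals any single char,
    // not even its own high surrogate.
    int fraktur = ("\uD835\uDD0A").codePointAt(0); // U+1D50A
    System.out.println(matches(fraktur, '\uD835')); // false
  }
}
```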

Bowleg answered 23/6, 2009 at 0:57 Comment(0)

For characters in the basic multilingual plane, casting the char to an int will get you the codepoint. This corresponds to all the Unicode values that can be encoded in a single 16-bit char value. Values outside this plane (with codepoints exceeding 0xffff) cannot be expressed as a single char. This is probably why there is no Character.toCodePoint(char value).

Kathleenkathlene answered 22/6, 2009 at 23:53 Comment(0)

Java uses a 16-bit (UTF-16) model for handling characters, so any characters with codepoints > 0xFFFF are stored in the strings as pairs of 16-bit characters using two surrogate characters to represent the plane and character within the plane.

If you want to handle characters and strings properly according to the full Unicode standard, you need to process strings taking this into account.
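One place where this shows up is string length. The sketch below (LengthDemo and codePointLength are names invented for the example) contrasts String.length(), which counts 16-bit chars, with String.codePointCount, which counts code points:

```java
// Hypothetical demo class; the names are illustrative only.
final class LengthDemo {

  // Length in Unicode code points rather than UTF-16 code units.
  static int codePointLength(String s) {
    return s.codePointCount(0, s.length());
  }

  public static void main(String[] args) {
    String sample = "A" + "\uD835\uDD0A" + "B" + "C";
    System.out.println(sample.length());         // 5 chars
    System.out.println(codePointLength(sample)); // 4 code points
  }
}
```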

XML cares a lot about this; it's useful to access the XMLChar class in Xerces (which comes with Java version 5.0 and higher) for character-related code.

It's also instructive to look at the Saxon XSLT/XQuery processor, since being a well-behaved XML application, it has to take into account how Java stores codepoints in strings. XQuery 1.0 and XPath 2.0 have functions for codepoints-to-string and string-to-codepoints; it might be instructive to get a copy of Saxon and play with them to see how they work.

Whitworth answered 23/6, 2009 at 0:28 Comment(0)
