Java Unicode encoding
E

7

42

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters. Does this mean that you can't handle certain Unicode characters in a Java application?

Does this boil down to what character encoding you are using?

Extractive answered 28/3, 2010 at 13:42 Comment(0)
I
40

You can handle them all if you're careful enough.

Java's char is a UTF-16 code unit. A character whose code point is greater than 0xFFFF is encoded as two chars (a surrogate pair).
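To make the surrogate-pair idea concrete, here is a minimal sketch that prints the UTF-16 code units the JDK produces for a code point. The class name `SurrogatePairDemo` and the helper `utf16Units` are made up for illustration; `Character.toChars` and `Character.charCount` are standard JDK methods.

```java
import java.util.Locale;

// Sketch only: shows how many chars (UTF-16 code units) one code point needs.
class SurrogatePairDemo {

    /** Hypothetical helper: the UTF-16 code units of one code point, as hex, space-separated. */
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format(Locale.ROOT, "%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x0041));            // 0041 (BMP: one code unit)
        System.out.println(utf16Units(0x1F408));           // D83D DC08 (supplementary: two code units)
        System.out.println(Character.charCount(0x1F408));  // 2
    }
}
```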

See http://www.oracle.com/us/technologies/java/supplementary-142654.html for how to handle those characters in Java.

(BTW, in Unicode 5.2 there are 107,154 assigned characters out of 1,114,112 slots.)

Igloo answered 28/3, 2010 at 13:45 Comment(4)
The linked page above is one of the clearest I've read in differentiating between the different encodings, what the JVM APIs use, the meaning of certain phraseology ("code point" vs. "code unit") and what the JNI provides. - Watchword
The following site is very clear yet quite detailed. It even goes beyond the definition of code points, and shows how to handle and count graphemes (a complete rendered character, which may consist of more than one code point when combining diacritical marks are used): illegalargumentexception.blogspot.jp/2009/05/… - Admonish
After @AllenGeorge's review I was excited to read the article only to discover the link is now broken :( Ruddy Oracle and their inability to 301 properly. Anyone able to update the link? - Openhanded
I searched the Oracle website and found this one: http://www.oracle.com/us/technologies/java/supplementary-142654.html - Claudell
F
14

Java uses UTF-16. A single Java char can only represent characters from the basic multilingual plane. Other characters have to be represented by a surrogate pair of two chars. This is reflected by API methods such as String.codePointAt().

And yes, this means that a lot of Java code will break in one way or another when used with characters outside the basic multilingual plane.
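As a rough illustration of how char-based code breaks, and one way around it, the sketch below compares `length()` with `codePointCount()` and uses `offsetByCodePoints()` to cut a string on code point boundaries. The class `CodePointSafeSubstring` and the helper `substringByCodePoints` are hypothetical names, not part of the JDK.

```java
// Sketch only: char-based indexing can split a surrogate pair; code point offsets avoid that.
class CodePointSafeSubstring {

    /** Hypothetical helper: substring addressed by code point positions instead of char positions. */
    static String substringByCodePoints(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDC08b"; // 'a', U+1F408 CAT, 'b'

        System.out.println(s.length());                       // 4 chars
        System.out.println(s.codePointCount(0, s.length()));  // 3 code points

        // A char-based cut lands in the middle of the pair and leaves a lone high surrogate.
        String broken = s.substring(0, 2);
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true

        // Cutting by code point position keeps the pair together.
        System.out.println(substringByCodePoints(s, 1, 2).equals("\uD83D\uDC08")); // true
    }
}
```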

Formant answered 28/3, 2010 at 13:50 Comment(2)
How do String.length, substring, etc. handle strings with these characters? - Lepper
@Bart: length() counts such characters as two chars, substring() also does and will happily break them up, resulting in invalid UTF-16. That's because such characters became part of Unicode only after Java was designed and Java doesn't do breaking changes. Thus, new methods were added to deal with surrogate pairs, but the old ones were left unchanged. - Formant
A
13

To add to the other answers, some points to remember:

  • A Java char always takes 16 bits.

  • A Unicode character, when encoded as UTF-16, takes "almost always" (not always) 16 bits, because there are more than 64K Unicode characters. Hence, a Java char is NOT necessarily a Unicode character (though it "almost always" is one).

  • "Almost always", above, means the 64K first code points of Unicode, range 0x0000 to 0xFFFF (BMP), which take 16 bits in the UTF-16 encoding.

  • A non-BMP ("rare") Unicode character is represented as two Java chars (surrogate representation). This applies also to the literal representation as a string: For example, the character U+20000 is written as "\uD840\uDC00".

  • Corollary: string.length() returns the number of Java chars, not of Unicode characters. A string that has just one "rare" Unicode character (e.g. U+20000) returns length() = 2. The same consideration applies to any method that deals with char sequences.

  • Java has little intelligence for dealing with non-BMP Unicode characters as a whole. There are some utility methods that treat characters as code points, represented as ints, e.g. Character.isLetter(int ch). Those are the real fully-Unicode methods.
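The U+20000 example in the points above can be checked by hand. The sketch below applies the standard UTF-16 surrogate arithmetic and compares the result with the literal "\uD840\uDC00". The class `SurrogateMath` and its two helpers are illustrative names only; the JDK offers the same computation as `Character.highSurrogate` and `Character.lowSurrogate` (Java 7 and later).

```java
// Sketch only: the UTF-16 surrogate arithmetic for one supplementary code point.
class SurrogateMath {

    /** Hypothetical helper: high (leading) surrogate for a code point above 0xFFFF. */
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    /** Hypothetical helper: low (trailing) surrogate for a code point above 0xFFFF. */
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x20000;
        String manual = "" + high(cp) + low(cp);

        System.out.println(manual.equals("\uD840\uDC00"));                   // true
        System.out.println(manual.length());                                 // 2
        System.out.println(manual.codePointCount(0, manual.length()));       // 1
        System.out.println(Character.toCodePoint(high(cp), low(cp)) == cp);  // true
        System.out.println(Character.isLetter(cp));                          // true (int overload)
    }
}
```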

Adaline answered 14/4, 2010 at 18:20 Comment(2)
Don't you mean 0x0000 to 0xFFFF? You only write 3 F's. - Heraclitus
"almost always"? More than half the characters in Unicode are defined with numbers above that 64K boundary: 137,994 total characters defined in Unicode 12.1 supported by Java 14. - Photothermic
P
6

You said:

A Java char is 2 bytes (max size of 65,536) but there are 95,221 Unicode characters.

Unicode grows

Actually, the inventory of characters defined in Unicode has grown dramatically. Unicode continues to grow — and not just because of emojis.

  • 143,859 characters in Unicode 13 (Java 15, release notes)
  • 137,994 characters in Unicode 12.1 (Java 13 & 14)
  • 136,755 characters in Unicode 10 (Java 11; Java 12 moved on to Unicode 11)
  • 120,737 characters in Unicode 8 (Java 9)
  • 110,182 characters in Unicode 6.2 (Java 8)
  • 109,449 characters in Unicode 6.0 (Java 7)
  • 96,447 characters in Unicode 4.0 (Java 5 & 6)
  • 49,259 characters in Unicode 3.0 (Java 1.4)
  • 38,952 characters in Unicode 2.1 (Java 1.1.7)
  • 38,950 characters in Unicode 2.0 (Java 1.1)
  • 34,233 characters in Unicode 1.1.5 (Java 1.0)

char is legacy

The char type is long outmoded, now legacy.

Use code point numbers

Instead, you should be working with code point numbers.


You asked:

Does this mean that you can't handle certain Unicode characters in a Java application?

The char type can address less than half of today's Unicode characters.

To represent any Unicode character, use code point numbers. Never use char.

Every character in Unicode is assigned a code point number. There are over a million possible code points: 1,114,112 of them, numbered from 0 to 1,114,111 (hex 0x10FFFF). Comparing that to the counts listed above, most of the numbers in that range have not yet been assigned to a character. Some of those numbers are reserved as Private Use Areas and will never be assigned.
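If you want to poke at that range yourself, here is a small sketch that classifies a number as invalid, unassigned, private use, surrogate, or assigned. The class `CodePointRange` and the method `describe` are invented for this example. What counts as "unassigned" depends on the Unicode version your JDK ships with, so no output is claimed for that case.

```java
// Sketch only: classify an int against the Unicode code point range known to this JDK.
class CodePointRange {

    /** Hypothetical helper: a coarse label for a candidate code point number. */
    static String describe(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";
        }
        switch (Character.getType(codePoint)) {
            case Character.UNASSIGNED:
                return "unassigned";
            case Character.PRIVATE_USE:
                return "private use";
            case Character.SURROGATE:
                return "surrogate";
            default:
                return "assigned";
        }
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT); // 1114111, i.e. 0x10FFFF
        System.out.println(describe(0x0041));         // assigned
        System.out.println(describe(0xE000));         // private use
        System.out.println(describe(0xD800));         // surrogate
        System.out.println(describe(0x110000));       // invalid
        System.out.println(describe(0x50000));        // result depends on the JDK's Unicode version
    }
}
```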

The String class has gained methods for working with code point numbers, as did the Character class.

Get the code point number for any character in a string, by zero-based index number. Here we get 97 for the letter a.

int codePoint = "Cat".codePointAt( 1 ) ; // 97 = 'a', hex U+0061, LATIN SMALL LETTER A.

For the more general CharSequence rather than String, use Character.codePointAt.

We can get the Unicode name for a code point number.

String name = Character.getName( 97 ) ; // letter `a`

LATIN SMALL LETTER A

We can get a stream of the code point numbers of all the characters in a string.

IntStream codePointsStream = "Cat".codePoints() ;

We can turn that into a List of Integer objects. See How do I convert a Java 8 IntStream to a List?.

List< Integer > codePointsList = codePointsStream.boxed().collect( Collectors.toList() ) ;

Any code point number can be changed into a String of a single character by calling Character.toString (the overload taking an int was added in Java 11).

String s = Character.toString( 97 ) ; // 97 is `a`, LATIN SMALL LETTER A. 

a

We can produce a String object from an IntStream of code point numbers. See Make a string from an IntStream of code point numbers?.

IntStream intStream = IntStream.of( 67 , 97 , 116 , 32 , 128_008 ); // 32 = SPACE, 128,008 = CAT (emoji).

String output =
        intStream
                .collect(                                     // Collect the results of processing each code point.
                        StringBuilder :: new ,                // Supplier<R> supplier
                        StringBuilder :: appendCodePoint ,    // ObjIntConsumer<R> accumulator
                        StringBuilder :: append               // BiConsumer<R,R> combiner
                )                                             // Returns a `StringBuilder`, which is a `CharSequence`.
                .toString();                                  // If you would rather have a `String` than `CharSequence`, call `toString`.

Cat 🐈
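As a possible alternative to the collector above, when the code points are already in an int array, the `String(int[] codePoints, int offset, int count)` constructor does the same job. This is only a sketch; the class `CodePointsToString` and the helper `fromCodePoints` are names made up here.

```java
import java.util.Arrays;

// Sketch only: build a String from code point numbers and read them back.
class CodePointsToString {

    /** Hypothetical helper: a String from any number of code points. */
    static String fromCodePoints(int... codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        int[] input = {67, 97, 116, 32, 128_008}; // C, a, t, SPACE, CAT emoji
        String s = fromCodePoints(input);

        System.out.println(s.length());                                  // 6 chars
        System.out.println(s.codePointCount(0, s.length()));             // 5 code points
        System.out.println(Arrays.equals(s.codePoints().toArray(), input)); // true
    }
}
```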


You asked:

Does this boil down to what character encoding you are using?

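Through its API, a String in Java is always a sequence of UTF-16 code units. (Since Java 9 the JVM may store strings containing only Latin-1 characters more compactly, but that is an internal detail.)

You only use other character encodings when importing text into, or exporting text out of, Java strings.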

So, to answer your question, no, character encoding is not directly related here. Once you get your text into a Java String, it is in UTF-16 encoding and can therefore contain any Unicode character. Of course, to see that character, you must be using a font with a glyph defined for that particular character.

When exporting text from Java strings, if you specify a legacy character encoding that cannot represent some of the Unicode characters used in your text, you will have a problem. So use a modern character encoding, which nowadays means UTF-8 as UTF-16 is now considered harmful.
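To see the export problem in practice, the sketch below encodes the same text with UTF-8 and with ISO-8859-1 and checks whether it survives a round trip. The class `ExportEncodingDemo` and the helper `survivesRoundTrip` are illustrative names; the charsets and `CharsetEncoder.canEncode` come from `java.nio.charset`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: a legacy charset silently loses characters it cannot represent.
class ExportEncodingDemo {

    /** Hypothetical helper: true if encoding then decoding with the charset gives back the same text. */
    static boolean survivesRoundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable characters are replaced, not rejected
        return new String(bytes, charset).equals(text);
    }

    public static void main(String[] args) {
        String text = "Cat \uD83D\uDC08"; // ends with U+1F408 CAT

        System.out.println(survivesRoundTrip(text, StandardCharsets.UTF_8));      // true
        System.out.println(survivesRoundTrip(text, StandardCharsets.ISO_8859_1)); // false

        // Asking the encoder up front avoids discovering the loss later.
        System.out.println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(text)); // false
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);             // 8 bytes
    }
}
```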

Photothermic answered 4/5, 2020 at 0:49 Comment(0)
T
3

Have a look at the Unicode 4.0 support in J2SE 1.5 article to learn more about the tricks invented by Sun to provide support for all Unicode 4.0 code points.

In summary, you'll find the following changes for Unicode 4.0 in Java 1.5:

  • char is a UTF-16 code unit, not a code point
  • new low-level APIs use an int to represent a Unicode code point
  • high level APIs have been updated to understand surrogate pairs
  • a preference towards char sequence APIs instead of char based methods

Since Java doesn't have 32 bit chars, I'll let you judge if we can call this good Unicode support.

Tarr answered 28/3, 2010 at 14:3 Comment(1)
"Unicode support" can be done a variety of ways, including (but not limited to) the UTF-8, UTF-16, and UTF-32 encodings. There are tradeoffs to be considered between the various encodings, but there's nothing "not good" about opting for UTF-16 instead of UTF-32. - Ventilator
I
3

Here's Oracle's documentation on Unicode Character Representations. Or, if you prefer, more thorough documentation is here.

The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. The Unicode standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, known as Unicode scalar value. (Refer to the definition of the U+n notation in the Unicode standard.)

The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Characters whose code points are greater than U+FFFF are called supplementary characters. The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes. In this representation, supplementary characters are represented as a pair of char values, the first from the high-surrogates range, (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).

A char value, therefore, represents Basic Multilingual Plane (BMP) code points, including the surrogate code points, or code units of the UTF-16 encoding. An int value represents all Unicode code points, including supplementary code points. The lower (least significant) 21 bits of int are used to represent Unicode code points and the upper (most significant) 11 bits must be zero. Unless otherwise specified, the behavior with respect to supplementary characters and surrogate char values is as follows:

  • The methods that only accept a char value cannot support supplementary characters. They treat char values from the surrogate ranges as undefined characters. For example, Character.isLetter('\uD840') returns false, even though this specific value if followed by any low-surrogate value in a string would represent a letter.
  • The methods that accept an int value support all Unicode characters, including supplementary characters. For example, Character.isLetter(0x2F81A) returns true because the code point value represents a letter (a CJK ideograph).
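The difference between the char and int overloads described in the quoted documentation can be sketched like this. The class `CharVersusIntApi` and both counting helpers are hypothetical names for illustration.

```java
// Sketch only: the same text judged char by char versus code point by code point.
class CharVersusIntApi {

    /** Hypothetical helper: counts letters one char at a time, so each surrogate is judged alone. */
    static int countLettersByChar(String s) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (Character.isLetter(c)) {
                count++;
            }
        }
        return count;
    }

    /** Hypothetical helper: counts letters one code point at a time, so a pair is judged as a whole. */
    static int countLettersByCodePoint(String s) {
        return (int) s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        String ideograph = new String(Character.toChars(0x2F81A)); // a supplementary CJK ideograph

        System.out.println(Character.isLetter('\uD840'));         // false (lone high surrogate)
        System.out.println(Character.isLetter(0x2F81A));          // true (int overload)
        System.out.println(countLettersByChar(ideograph));        // 0
        System.out.println(countLettersByCodePoint(ideograph));   // 1
    }
}
```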
Impudicity answered 12/4, 2012 at 16:34 Comment(0)
C
1

From the OpenJDK7 documentation for String:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
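Because index values refer to char code units, a loop that wants whole characters has to step by `Character.charCount` rather than by 1. A minimal sketch, with the class `CodePointIteration` and the method `describe` as made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: walk a String by code point while tracking the char index of each one.
class CodePointIteration {

    /** Hypothetical helper: one "charIndex:U+XXXX" entry per code point in the string. */
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format(Locale.ROOT, "%d:U+%04X", i, cp));
            i += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
        }
        return out;
    }

    public static void main(String[] args) {
        // The supplementary character occupies char positions 1 and 2, so 'b' sits at index 3.
        System.out.println(describe("a\uD83D\uDC08b")); // [0:U+0061, 1:U+1F408, 3:U+0062]
    }
}
```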

Cymophane answered 28/3, 2010 at 13:48 Comment(0)
