What is a "surrogate pair" in Java?

Asked 5/5, 2011 at 19:21 Answered 4/8, 2019 at 19:16

Solved java unicode utf-16 surrogate-pairs

193

I was reading the documentation for StringBuffer, in particular the reverse() method. That documentation mentions something about surrogate pairs. What is a surrogate pair in this context? And what are low and high surrogates?

Belak answered 5/5, 2011 at 19:21 Comment(4)

It's UTF-16 terminology, explained here: download.oracle.com/javase/6/docs/api/java/lang/… – Abduction 5/5, 2011 at 19:23

That method is buggy: it should reverse full characters ᴀᴋᴀ code points — not separate pieces of them, ᴀᴋᴀ code units. The bug is that that particular legacy method works only on individual char units instead of on code points, which is what you want Strings to be made up of, not just of char units. Too bad Java doesn’t allow you to use OO to fix that, but both the String class and the StringBuffer classes have been finalized. Say, isn’t that a euphemism for killed? :) – Vansickle 5/5, 2011 at 19:32

@Vansickle The documentation (and source) says that it does reverse as a string of code points. (Presumably 1.0.2 didn't do that, and you'd never get such a change of behaviour these days.) – Padlock 5/5, 2011 at 20:0

see also: unicode.org/faq/utf_bom.html#utf16-2 – Strath 19/3 at 4:57

160

The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.
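As a rough illustration of the two ranges, here is a small sketch that uses only `java.lang.Character` methods. The class name `SurrogateRanges` and the helper `describe` are made up for this example.

```java
public class SurrogateRanges {
    // Classifies a single UTF-16 code unit (a Java char).
    static String describe(char unit) {
        if (Character.isHighSurrogate(unit)) return "high surrogate";
        if (Character.isLowSurrogate(unit)) return "low surrogate";
        return "not a surrogate";
    }

    public static void main(String[] args) {
        // U+1F309 lies above 0xFFFF, so it needs two code units.
        char[] units = Character.toChars(0x1F309);
        for (char unit : units) {
            System.out.printf("0x%04X: %s%n", (int) unit, describe(unit));
        }
        // The two units combine back into the original code point.
        int codePoint = Character.toCodePoint(units[0], units[1]);
        System.out.printf("U+%X%n", codePoint);
    }
}
```

This should print 0xD83C as a high surrogate, 0xDF09 as a low surrogate, and then U+1F309.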

Byrne answered 5/5, 2011 at 19:28 Comment(0)

Early Java versions represented Unicode characters using the 16-bit char data type. This design made sense at the time, because all Unicode characters had values no greater than 65,535 (0xFFFF) and could be represented in 16 bits. Later, however, Unicode increased the maximum value to 1,114,111 (0x10FFFF). Because 16-bit values were too small to represent all of the Unicode characters in Unicode version 3.1, 32-bit values (called code points) were adopted for the UTF-32 encoding scheme. But 16-bit values are preferred over 32-bit values for efficient memory use, so Unicode introduced a new design to allow for the continued use of 16-bit values. This design, adopted in the UTF-16 encoding scheme, assigns 1,024 values to 16-bit high surrogates (in the range U+D800 to U+DBFF) and another 1,024 values to 16-bit low surrogates (in the range U+DC00 to U+DFFF). It uses a high surrogate followed by a low surrogate, a surrogate pair, to represent 1,048,576 (0x100000, the product of 1,024 and 1,024) values between 65,536 (0x10000) and 1,114,111 (0x10FFFF).
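The arithmetic behind that mapping can be sketched as follows. This is a hand-written illustration, not the JDK's code; the names `SurrogateMath`, `toSurrogatePair` and `toCodePoint` are hypothetical, and the last line cross-checks the result against `Character.toChars`.

```java
import java.util.Arrays;

public class SurrogateMath {
    // Splits a supplementary code point (0x10000..0x10FFFF) into its pair.
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Reverses the calculation above.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F309);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83C DF09
        System.out.printf("%X%n", toCodePoint(pair[0], pair[1]));       // 1F309
        // Cross-check against the JDK's own conversion.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F309))); // true
    }
}
```

Ten bits go into each half, which is where the 1,024 × 1,024 figure comes from.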

Tallahassee answered 27/11, 2017 at 7:1 Comment(7)

I like this better than the accepted answer, since it explains how Unicode 3.1 reserved 1024 + 1024 (high + low) values out of the original 65535 to gain 1024 * 1024 new values, with no added requirements that parsers start at the beginning of a string. – Homburg 18/12, 2017 at 21:48

I don't like this answer for implying UTF-16 is the most memory-efficient Unicode encoding. UTF-8 exists, and doesn't render most text as two bytes. UTF-16 is mostly used today because Microsoft picked it before UTF-32 was a thing, not for memory efficiency. About the only time you'd actually want UTF-16 is when you're doing a lot of file handling on Windows, and are therefore both reading and writing it a lot. Otherwise, UTF-32 for high speed (b/c constant offsets) or UTF-8 for low memory (b/c minimum 1 byte) – Binaural 18/9, 2019 at 12:59

@Nic more like, microsoft picked it because they didn't know UTF-32 was going to be necessary - so it was the high-speed constant-offset encoding, before unicode needed more space – Roswell 9/2, 2023 at 6:40

@Roswell Indeed. One might even choose to phrase this as "Microsoft picked it before UTF-32 was a thing", as UTF-32 only became a thing once it became clear UTF-16 was insufficient. A fact also reflected in the answer this comment is inextricably attached to. So... thanks for condescendingly repeating exactly what I and the answer already said. – Binaural 11/2, 2023 at 16:23

@Nic and thanks to you for condescendingly reading too hard into the answer - nowhere does it say anything close to "16 bits are the most efficient" - just that they are "preferred over 32-bit encodings", which does make a lot of sense... – Roswell 11/2, 2023 at 16:41

@Roswell Yes. The answer does say they are "preferred over 32-bit values". Weird how it says just that, and then ends, and there's coincidentally some squiggles in the shape of "for efficient memory use" right after that. I'm sure those squiggles mean nothing though. – Binaural 11/2, 2023 at 17:4

@Nic hi, just felt like pointing out that there's a difference between a positive and superlative adjective – Roswell 12/2, 2023 at 1:20

Adding some more info to the above answers from this post.

Tested in Java-12, should work in all Java versions above 5.

As mentioned here: https://mcmap.net/q/48008/-what-is-a-quot-surrogate-pair-quot-in-java,
any character whose code point is above U+FFFF is represented as a surrogate pair, which Java stores as a pair of char values, i.e. the single Unicode character is represented as two adjacent Java chars.
We can see this in the following examples.

Length:

"🌉".length()  // 2, although one might expect 1

"🌉".codePointCount(0,"🌉".length())  // 1, the number of Unicode code points in the String

Equality:
Write "🌉" as a String using the escapes \ud83c\udf09 as below and check equality.
```
"🌉".equals("\ud83c\udf09") // true
```
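Java string literals do not support a UTF-32 style escape. A \u escape takes exactly four hex digits, so "\u1F309" is U+1F30 followed by the digit 9, not U+1F309.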
```
"🌉".equals("\u1F309") // false
```

You can convert a Unicode code point to a Java String

"🌉".equals(new String(Character.toChars(0x0001F309))) //true

String.substring() works on char indices and does not take supplementary characters into account

"🌉🌐".substring(0,1) //"?"
"🌉🌐".substring(0,2) //"🌉"
"🌉🌐".substring(0,4) //"🌉🌐"

To solve this we can use String.offsetByCodePoints(int index, int codePointOffset)

"🌉🌐".substring(0,"🌉🌐".offsetByCodePoints(0,1)) // "🌉"
"🌉🌐".substring(2,"🌉🌐".offsetByCodePoints(2,1)) // "🌐"

Iterate over a Unicode string with BreakIterator.
Sort strings containing Unicode with java.text.Collator.
Avoid Character's toUpperCase() and toLowerCase() methods; use String's toUpperCase() and toLowerCase() with a specific locale instead.
Character.isLetter(char ch) does not support supplementary characters; use Character.isLetter(int codePoint) instead. For each methodName(char ch) method in the Character class there is a methodName(int codePoint) variant that can handle supplementary characters.
Specify the charset in String.getBytes(), when converting from bytes to String, and in InputStreamReader and OutputStreamWriter.
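A couple of the tips above can be sketched like this. The class and method names (`CodePointAware`, `lettersByCodePoint`, `lettersByChar`) are invented for the example, and U+1D400 (MATHEMATICAL BOLD CAPITAL A) was picked simply because it is a letter outside the BMP.

```java
import java.nio.charset.StandardCharsets;

public class CodePointAware {
    // Counts letters per code point, so supplementary letters are seen whole.
    static long lettersByCodePoint(String text) {
        return text.codePoints().filter(Character::isLetter).count();
    }

    // Counts letters per char; each half of a surrogate pair is tested alone.
    static long lettersByChar(String text) {
        return text.chars().filter(c -> Character.isLetter((char) c)).count();
    }

    public static void main(String[] args) {
        // "a" followed by U+1D400, a letter that needs a surrogate pair.
        String text = "a" + new String(Character.toChars(0x1D400));
        System.out.println(lettersByCodePoint(text)); // 2
        System.out.println(lettersByChar(text));      // 1

        // Naming the charset avoids depending on the platform default.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);              // 5 (1 byte + 4 bytes)
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The char-based count misses the second letter because a lone surrogate is never classified as a letter.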

New methods were added in Java 21: java.lang.Character.isEmoji and related methods, plus new regex properties, based on the emoji data from here. These new functions can be helpful if you are currently using a library as mentioned here.

// Requires Java 21 or later, plus: import java.util.regex.Pattern;
public static void main(String[] args) {
    System.out.println('☺' + " isEmoji : " + Character.isEmoji('☺')); // true
    System.out.println('❌' + " isEmoji : " + Character.isEmoji('❌')); // true
    System.out.println('ž' + " isEmoji : " + Character.isEmoji('ž')); // false

    emojiChecks("A");
    emojiChecks("©");
    emojiChecks("☺");
    emojiChecks("\uD83D\uDE0A");
}

private static void emojiChecks(String emoji) {
    // If a string is not an emoji then it cannot be Emoji_Component, Emoji_Presentation, Emoji_Modifier, or Emoji_Modifier_Base.
    // Ref: https://unicode.org/reports/tr51/#Emoji_Properties_and_Data_Files
    final Pattern emojiPattern = Pattern.compile("\\p{IsEmoji}");
    final Pattern emojiModifierBasePattern = Pattern.compile("\\p{IsEmoji_Modifier_Base}");
    final Pattern emojiComponentPattern = Pattern.compile("\\p{IsEmoji_Component}");
    final Pattern emojiPresentationPattern = Pattern.compile("\\p{IsEmoji_Presentation}");
    final Pattern isExtendedPictographicPattern = Pattern.compile("\\p{IsExtended_Pictographic}");
    System.out.println(emoji + " IsEmoji: " + emojiPattern.matcher(emoji).matches());
    System.out.println(emoji + " IsEmojiModifierBase: " + emojiModifierBasePattern.matcher(emoji).matches());
    System.out.println(emoji + " IsEmojiComponent: " + emojiComponentPattern.matcher(emoji).matches());
    System.out.println(emoji + " IsEmojiPresentation: " + emojiPresentationPattern.matcher(emoji).matches());
    System.out.println(emoji + " IsExtended_Pictographic: " + isExtendedPictographicPattern.matcher(emoji).matches());
    System.out.println("----------------------------------------");
}

// output
☺ isEmoji : true
❌ isEmoji : true
ž isEmoji : false
A IsEmoji: false
A IsEmojiModifierBase: false
A IsEmojiComponent: false
A IsEmojiPresentation: false
A IsExtended_Pictographic: false
----------------------------------------
© IsEmoji: true
© IsEmojiModifierBase: false
© IsEmojiComponent: false
© IsEmojiPresentation: false
© IsExtended_Pictographic: true
----------------------------------------
☺ IsEmoji: true
☺ IsEmojiModifierBase: false
☺ IsEmojiComponent: false
☺ IsEmojiPresentation: false
☺ IsExtended_Pictographic: true
----------------------------------------
😊 IsEmoji: true
😊 IsEmojiModifierBase: false
😊 IsEmojiComponent: false
😊 IsEmojiPresentation: true
😊 IsExtended_Pictographic: true
----------------------------------------

Ref:
https://coolsymbol.com/emojis/emoji-for-copy-and-paste.html#objects
https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
https://www.ibm.com/developerworks/library/j-unicode/index.html
https://www.oracle.com/technetwork/articles/javaee/supplementary-142654.html

Other terms worth exploring: Normalization, BiDi

Hebraize answered 28/2, 2019 at 9:39 Comment(0)

What that documentation is saying is that invalid UTF-16 strings may become valid after calling the reverse method since they might be the reverses of valid strings. A surrogate pair (discussed here) is a pair of 16-bit values in UTF-16 that encode a single Unicode code point; the low and high surrogates are the two halves of that encoding.

Shirleyshirlie answered 5/5, 2011 at 19:25 Comment(3)

Clarification. A string must be reversed on "true" characters (a.k.a "graphemes" or "text elements"). A single "character" code point could be one or two "char" chunks (surrogate pair), and a grapheme could be one or more of those code points (i.e. a base character code plus one or more combining character codes, each of which could be one or two 16-bit chunks or "chars" long). So a single grapheme could be three combining characters each two "chars" long, totaling 6 "chars". All 6 "chars" must be kept together, in order (i.e. not reversed), when reversing the entire string of characters. – Lighten 9/8, 2013 at 6:52

Hence the "char" data type is rather misleading. "character" is a loose term. The "char" type is really just the UTF-16 chunk size and we call it character because of the relative rarity of surrogate pairs occurring (i.e. it usually represents a whole character code point), so "character" really refers to a single Unicode code point, but then with the combining characters, you can have a sequence of characters that display as a single "character/grapheme/text element". This is not rocket science; the concepts are simple, but the language is confusing. – Lighten 9/8, 2013 at 6:58

At the time Java was being developed, Unicode was in its infancy. Java was around for about 5 years before Unicode got surrogate pairs, so a 16-bit char fit pretty well at the time. Now, you're much better off using UTF-8 and UTF-32 than UTF-16. – Erund 8/10, 2013 at 3:8

Small preface

Unicode represents characters as code points. Each code point can be encoded in 8-, 16-, or 32-bit blocks according to the Unicode standard.
Prior to version 3.1, the encodings mostly in use were the 8-bit encoding known as UTF-8 and the 16-bit encoding known as UCS-2, or “Universal Character Set coded in 2 octets”. UTF-8 encodes Unicode code points as a sequence of 1-byte blocks, while UCS-2 always takes 2 bytes:

A = 41 - one block of 8-bits with UTF-8
A = 0041 - one block of 16-bits with UCS-2
Ω = CE A9 - two blocks of 8-bits with UTF-8
Ω = 03A9 - one block of 16-bits with UCS-2
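The byte values above can be reproduced with a short sketch. Java has no UCS-2 charset constant, so this uses `StandardCharsets.UTF_16BE` as a stand-in, which produces the same bytes for BMP characters; the class name `EncodingSizes` and the `hex` helper are made up here.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Formats bytes as space-separated uppercase hex, e.g. "CE A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String omega = new String(Character.toChars(0x03A9)); // GREEK CAPITAL LETTER OMEGA
        for (String s : new String[] {"A", omega}) {
            System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
            System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));
        }
        // Prints, one per line: 41, 00 41, CE A9, 03 A9
    }
}
```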

Problem

The consortium thought that 16 bits, which give 2^16 = 65536 possible code values, would be enough to cover any human-readable language. This was true for Plane 0, also known as the BMP or Basic Multilingual Plane, which includes 55,445 of 65536 code points today. The BMP covers almost every human language in the world, including Chinese-Japanese-Korean symbols (CJK).

As time passed, new Asian character sets were added; Chinese symbols alone took more than 70,000 points. Now there are even Emoji points as part of the standard 😺. Sixteen "additional" planes were added. UCS-2 did not have enough room to cover anything beyond Plane 0.

Unicode decision

Limit Unicode to 17 planes × 65 536 characters per plane = 1 114 112 maximum points.
Introduce UTF-32, formerly known as UCS-4, which uses 32 bits for each code point and covers all planes.
Continue to use UTF-8 as a variable-length encoding, limited to 4 bytes maximum for each code point, i.e. from 1 up to 4 bytes per point.
Deprecate UCS-2.
Create UTF-16 based on UCS-2. Make UTF-16 variable-length, so it takes 2 bytes or 4 bytes per point. Assign 1024 points U+D800–U+DBFF, called High Surrogates, to UTF-16; assign 1024 points U+DC00–U+DFFF, called Low Surrogates, to UTF-16.

With those changes, the BMP is covered with 1 block of 16 bits in UTF-16, while all "Supplementary characters" are covered with Surrogate Pairs of 2 blocks of 16 bits each, for a total of 1024 × 1024 = 1 048 576 points.

A high surrogate precedes a low surrogate. Any deviation from this rule is considered a bad encoding. For example, a surrogate without a pair is incorrect, and a low surrogate standing before a high surrogate is incorrect.
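The ordering rule can be checked with a few lines of Java. This is only a sketch of one way to do it; `SurrogateCheck` and `isWellFormed` are hypothetical names, not JDK API.

```java
public class SurrogateCheck {
    // Returns true when every high surrogate is immediately followed by a
    // low surrogate and no low surrogate appears on its own.
    static boolean isWellFormed(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= text.length() || !Character.isLowSurrogate(text.charAt(i + 1))) {
                    return false; // high surrogate without a partner
                }
                i++; // skip the low half of a valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate with no preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char high = (char) 0xD834;
        char low = (char) 0xDD1E;
        System.out.println(isWellFormed("" + high + low)); // true
        System.out.println(isWellFormed("" + low + high)); // false
        System.out.println(isWellFormed("a" + high));      // false
    }
}
```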

𝄞, 'MUSICAL SYMBOL G CLEF', is encoded in UTF-16 as a pair of surrogates 0xD834 0xDD1E (2 by 2 bytes),
in UTF-8 as 0xF0 0x9D 0x84 0x9E (4 by 1 byte),
in UTF-32 as 0x0001D11E (1 by 4 bytes).
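To reproduce those values in Java, something like the following should work. The UTF-32 form is shown by printing the code point itself as eight hex digits rather than by calling a UTF-32 charset, and `GClefEncodings` with its `hex` helper are names invented for this sketch.

```java
import java.nio.charset.StandardCharsets;

public class GClefEncodings {
    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String clef = new String(Character.toChars(0x1D11E)); // MUSICAL SYMBOL G CLEF
        System.out.println(clef.length());                                 // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));         // 1 code point
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_16BE))); // D8 34 DD 1E
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));    // F0 9D 84 9E
        System.out.printf("%08X%n", clef.codePointAt(0));                  // 0001D11E
    }
}
```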

Current situation

Although according to the standard the surrogates are specifically assigned only to UTF-16, historically some Windows and Java applications used UTF-8 and UCS-2 points that are now reserved for the surrogate range.
To support legacy applications with incorrect UTF-8/UTF-16 encodings, a new standard WTF-8, Wobbly Transformation Format, was created. It supports arbitrary surrogate points, such as a non-paired surrogate or an incorrect sequence. Today, some products do not comply with the standard and treat UTF-8 as WTF-8.
The surrogate solution opened some security problems, as well as attempts to use "illegal surrogate pairs".

Many historic details were suppressed to follow the topic ⚖.
The latest Unicode Standard can be found at http://www.unicode.org/versions/latest

Bottomless answered 4/8, 2019 at 19:16 Comment(3)

Your 'security problems' link is broken. – Oribelle 20/10, 2020 at 1:54

Thank you @Indolering, I did not find the old link, it was based on the UCS2 to UTF16 blog series: archives.miloush.net/michkap/archive/2009/06/10/9723321.html. Updated the text with the link. – Bottomless 20/10, 2020 at 19:55

really nice answer, I think it should deserve more votes. One typo is BPM, it should be BMP (Basic Multilingual Plane)? – Goglet 28/11, 2021 at 1:38

Surrogate pairs refer to UTF-16's way of encoding certain characters, see http://en.wikipedia.org/wiki/UTF-16/UCS-2#Code_points_U.2B10000..U.2B10FFFF

Galatia answered 5/5, 2011 at 19:23 Comment(2)

"character" is such a loaded term. – Lighten 9/8, 2013 at 7:0

There are no characters in Unicode, but there are codepoints. Each codepoint can render as zero to several characters. – Babyblueeyes 16/12, 2016 at 16:12

A surrogate pair is two 'code units' in UTF-16 that make up one 'code point'. The Java documentation is stating that these 'code points' will still be valid, with their 'code units' ordered correctly, after the reverse. It further states that two unpaired surrogate code units may be reversed and form a valid surrogate pair. Which means that if there are unpaired code units, then there is a chance that the reverse of the reverse may not be the same!

Notice, though, the documentation says nothing about Graphemes -- which are multiple codepoints combined. Which means e and the accent that goes along with it may still be switched, thus placing the accent before the e. Which means if there is another vowel before the e it may get the accent that was on the e.
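Both effects are easy to see with `StringBuilder.reverse()`. The snippet below is just a demonstration under that assumption (StringBuffer documents the same behaviour); `ReverseDemo` and its `reverse` helper are made-up names.

```java
public class ReverseDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String bridge = new String(Character.toChars(0x1F309));

        // A valid pair survives the reverse with its code units in order.
        System.out.println(reverse("a" + bridge + "b").equals("b" + bridge + "a")); // true

        // Two unpaired surrogates (low, then high) become a valid pair...
        String unpaired = "" + bridge.charAt(1) + bridge.charAt(0);
        String once = reverse(unpaired);
        System.out.println(once.equals(bridge));            // true
        // ...so reversing again does not restore the original.
        System.out.println(reverse(once).equals(unpaired)); // false

        // Combining marks are not kept with their base letter: "e" plus
        // COMBINING ACUTE ACCENT (U+0301) plus "x" comes back as "x", accent, "e",
        // so the accent now attaches to the "x".
        String accented = "e" + (char) 0x0301 + "x";
        System.out.println(reverse(accented).equals("x" + (char) 0x0301 + "e")); // true
    }
}
```

Keeping graphemes together would need something like java.text.BreakIterator on top of this.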

Yikes!

Bitterweed answered 14/6, 2017 at 13:44 Comment(0)
