How do I create a string with a surrogate pair inside of it?

I saw this post on Jon Skeet's blog where he talks about string reversing. I wanted to try the example he showed myself, but it seems to work... which leads me to believe that I have no idea how to create a string that contains a surrogate pair which will actually cause the string reversal to fail. How does one actually go about creating a string with a surrogate pair in it so that I can see the failure myself?

Brosy answered 15/1, 2013 at 22:6 Comment(0)

The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values (code points) between 0x000000 and 0x10FFFF. Internally, .NET stores strings as UTF-16, a sequence of two-byte (16-bit) code units. Since two bytes can only hold values from 0x0000 to 0xFFFF, some additional complexity is needed to store values above this range (0x010000 to 0x10FFFF).

This is done using pairs of code units known as surrogates. The surrogates fall into two distinct ranges, high surrogates (0xD800 to 0xDBFF) and low surrogates (0xDC00 to 0xDFFF), depending on whether they are allowed at the start or the end of the two-unit sequence: a high surrogate always comes first and a low surrogate second.
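To make the "additional complexity" concrete, here is a minimal sketch of the arithmetic that turns a code point above 0xFFFF into its two surrogates. The class and method names (`SurrogateMath`, `Split`) are made up for illustration; in real code `Char.ConvertFromUtf32` does this for you.

```csharp
using System;

static class SurrogateMath
{
    // Splits a code point in the range 0x10000..0x10FFFF into its UTF-16 surrogate pair.
    // Hypothetical helper for illustration only.
    public static (char High, char Low) Split(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // leaves a 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits go into the high surrogate
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits go into the low surrogate
        return (high, low);
    }
}
```

For U+2A601 this yields 0xD869 followed by 0xDE01, which is the same pair of chars that `Char.ConvertFromUtf32(0x2A601)` returns.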

Try this yourself:

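using System;
using System.Globalization; // needed for NumberStyles

// U+2A601 lies above 0xFFFF, so it is stored as a surrogate pair (two Char values).
String surrogate = "abc" + Char.ConvertFromUtf32(Int32.Parse("2A601", NumberStyles.HexNumber)) + "def";

// Reversing Char by Char swaps the two halves of the pair, leaving an invalid sequence.
Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);

String surrogateReversed = new String(surrogateArray);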

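Or this, if you want to stick with the blog example. Note that U+0301 is a combining acute accent, not a surrogate pair; here the reversal fails because the accent gets detached from its base letter "e" and ends up on the "r":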

String surrogate = "Les Mise" + Char.ConvertFromUtf32(Int32.Parse("0301", NumberStyles.HexNumber)) + "rables";

Char[] surrogateArray = surrogate.ToCharArray();
Array.Reverse(surrogateArray);

String surrogateReversed = new String(surrogateArray);

And then check the string values with the debugger. Jon Skeet is damn right... strings and dates seem easy but they are absolutely NOT.
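If you would rather not rely on the debugger (or on how a particular console or font renders the result), a small sketch like the following dumps the raw UTF-16 code units so the swapped surrogates are visible as numbers. `CodeUnitDump`, `NaiveReverse` and `Dump` are names I made up for this example.

```csharp
using System;
using System.Linq;

static class CodeUnitDump
{
    // Reverses the string one UTF-16 code unit (Char) at a time, as in the snippets above.
    public static string NaiveReverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Formats every Char as four hex digits separated by spaces, e.g. "0061 0062".
    public static string Dump(string s)
    {
        return string.Join(" ", s.Select(c => ((int)c).ToString("X4")));
    }
}
```

For `"abc" + Char.ConvertFromUtf32(0x2A601) + "def"`, `Dump` gives `0061 0062 0063 D869 DE01 0064 0065 0066`, while the naively reversed string gives `0066 0065 0064 DE01 D869 0063 0062 0061`: the low surrogate now precedes the high surrogate, which is not valid UTF-16.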

Disbelieve answered 15/1, 2013 at 22:23 Comment(3)

Interesting that the example shows up exactly as described in LinqPad, but not in a Visual Studio console application. – Brosy 15/1, 2013 at 22:33

In C#, you can write a hexadecimal Int32 value like this: 0x2A601. So there's no need to use Int32.Parse with NumberStyles. But you can also just say "\U0002A601" to get the Unicode character. See my answer. – Mallorca 15/1, 2013 at 22:50

Regarding "Les Misérables", there's also another way to decompose it: string surrogate = "Les Misérables".Normalize(NormalizationForm.FormD); – Mallorca 15/1, 2013 at 23:25

The simplest way is to use \U######## where the U is capital and the # characters stand for exactly eight hexadecimal digits. If the value exceeds 0000FFFF hexadecimal, a surrogate pair will be needed:

string myString = "In the game of mahjong \U0001F01C denotes the Four of circles";

You can check myString.Length to see that the one Unicode character occupies two .NET Char values. Note that the char type has a couple of static methods that will help you determine if a char is a part of a surrogate pair.
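As a rough sketch of how those static methods can be used, the following counts code points rather than Char values by treating each valid surrogate pair as one. The `SurrogateInspector` class and `CountCodePoints` method are hypothetical names; `char.IsSurrogatePair` is the framework method doing the real work.

```csharp
using System;

static class SurrogateInspector
{
    // Counts Unicode code points, treating each valid surrogate pair as a single one.
    // Hypothetical helper for illustration only.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // True only when s[i] is a high surrogate immediately followed by a low surrogate.
            if (i + 1 < s.Length && char.IsSurrogatePair(s[i], s[i + 1]))
            {
                i++; // skip the low surrogate; the pair counts once
            }
            count++;
        }
        return count;
    }
}
```

With `"\U0001F01C"`, `Length` is 2 but `CountCodePoints` returns 1. `char.IsHighSurrogate`, `char.IsLowSurrogate` and `char.IsSurrogate` are available for checking a single Char.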

If you use a .NET language that does not have something like the \U######## escape sequence, you can use the method ConvertFromUtf32, for example:

string fourCircles = char.ConvertFromUtf32(0x1F01C);

Addition: If your C# source file has an encoding that allows all Unicode characters, like UTF-8, you can just put the character directly in the file (by copy-paste). For example:

string myString = "In the game of mahjong 🀜 denotes the Four of circles";

The character is UTF-8 encoded in the source file (in my example) but will be UTF-16 encoded (surrogate pairs) when the application runs and the string is in memory.
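To see the two encodings side by side, here is a small sketch that prints the bytes of the same string in UTF-8 and in big-endian UTF-16. `EncodingDemo` and its methods are names invented for this example; the `Encoding` and `BitConverter` calls are standard framework APIs.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Formats bytes as space-separated uppercase hex, e.g. "F0 9F".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    // Bytes of the string when encoded as UTF-8 (as it might be stored in a source file).
    public static string Utf8Hex(string s)
    {
        return Hex(Encoding.UTF8.GetBytes(s));
    }

    // Bytes of the string as big-endian UTF-16, so each Char reads as its hex value.
    public static string Utf16Hex(string s)
    {
        return Hex(Encoding.BigEndianUnicode.GetBytes(s));
    }
}
```

For `"\U0001F01C"`, `Utf8Hex` gives the four bytes `F0 9F 80 9C`, and `Utf16Hex` gives `D8 3C DC 1C`, i.e. the surrogate pair 0xD83C 0xDC1C.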

(Not sure if Stack Overflow software handles my mahjong character correctly. Try clicking "edit" to this answer and copy-paste from the text there, if the "funny" character is not here.)

Mallorca answered 15/1, 2013 at 22:43 Comment(3)

Stack Overflow software handles your mahjong character correctly. I did a copy-paste into an editor, and it shows the UTF-8 sequence 0xF0 0x9F 0x80 0x9C, which is a 4-byte sequence that encodes the Unicode code point 0x1F01C (decimal 127004), and this is indeed the "MAHJONG TILE FOUR OF CIRCLES" code point. But possibly (as in my case) we do not see the character because the font doesn't contain the glyph, so a replacement glyph/character is shown instead. – Tate 17/6, 2014 at 9:52

@Tate Yes, Stack Overflow seems to work flawlessly with characters from outside plane 0. Problems can arise from missing fonts or web browser support (old browsers). – Mallorca 17/6, 2014 at 11:41

Thanks to the "has a couple of static methods that will help you determine if a char is a part of a surrogate pair" I found Char.IsSurrogate(myString[i]), and was able to correctly identify surrogate pairs in a really easy and reliable way. – Millikan 2/9, 2020 at 2:36
