string and 4-byte Unicode characters

Asked 23/12, 2012 at 11:53 Answered 23/12, 2012 at 12:5

I have a question about strings and chars in C#. I found that a string in C# is a Unicode string and a char takes 2 bytes, so every char is one UTF-16 code unit. That's great, but I also read on Wikipedia that some characters take 4 bytes in UTF-16.

I'm writing a program that lets you draw characters for alphanumeric displays. The program also has a tester, where you can type a string and it draws it so you can see how it looks.

So how should I work with strings where the user types a character that takes 4 bytes, i.e. 2 chars? I need to go through the string char by char, find each char in the list, and draw it into the panel.

Interpretive answered 23/12, 2012 at 11:53 Comment(2)

Going char by char simply doesn't work. Even going codepoint by codepoint doesn't work, because there are combining characters, ligatures, control characters, etc. – Paramagnet 23/12, 2012 at 11:55

Correct display representation units are called grapheme clusters. Sometimes they are more than one code point. – Klondike 24/12, 2012 at 9:18
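As a rough illustration of the two comments above, here is a minimal sketch that splits a string into text elements with `System.Globalization.StringInfo`. The class name `TextElements` and the method `Split` are made up for this example. How closely a text element matches a full grapheme cluster depends on the .NET version, so treat this as an approximation and test it on your target runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElements
{
    // Hypothetical helper: returns each text element of the input as its own string.
    // A text element may be one char, a surrogate pair, or a base char plus combining marks.
    public static List<string> Split(string text)
    {
        var result = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            result.Add(enumerator.GetTextElement());
        }
        return result;
    }
}
```

For example, `TextElements.Split("a\u0301\U0001D11Eb")` should yield three elements: "a" with its combining acute accent, the G clef (a surrogate pair), and "b".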

You could do:

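for( int i = 0; i < str.Length; ++i ) {
    // Reads the code point that starts at index i (one or two chars).
    int codePoint = Char.ConvertToUtf32( str, i );
    if( codePoint > 0xffff ) {
        // Supplementary code point: skip its low surrogate.
        i++;
    }
}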

Then codePoint holds any possible code point as a 32-bit integer.
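The same idea can be wrapped in a reusable method. This is only a sketch: the names `CodePoints` and `Enumerate` are made up here, and passing a lone surrogate through as its raw code unit value (instead of throwing, as `Char.ConvertToUtf32` would) is an assumption about what a drawing tester would want.

```csharp
using System;
using System.Collections.Generic;

public static class CodePoints
{
    // Hypothetical helper: returns the code points of the input in order.
    public static List<int> Enumerate(string text)
    {
        var codePoints = new List<int>();
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsSurrogatePair(text, i))
            {
                // Two chars form one supplementary code point.
                codePoints.Add(char.ConvertToUtf32(text, i));
                i++; // skip the low surrogate
            }
            else
            {
                // BMP char, or a lone surrogate kept as its code unit value (assumption).
                codePoints.Add(text[i]);
            }
        }
        return codePoints;
    }
}
```

`char.ConvertFromUtf32(codePoint)` turns a code point back into a one- or two-char string, which is handy if the character list is keyed by string.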

Orest answered 23/12, 2012 at 11:57 Comment(3)

This looks pretty simple, and it's clear how it works. Thank you. But when I tried to find a 4-byte UTF-16 character I wasn't successful, or the character is represented as '𝄞', so this is an almost entirely pointless question. But thanks – Interpretive 23/12, 2012 at 12:27

Here's a character which definitely takes 2 chars: 𒀠 – Foliolate 7/1, 2021 at 1:17

Did some playing around: if you make a string "𒀠", it reports its Length as 2. If you call Char.ConvertToUtf32("𒀠", 0) you get 73760 (which exceeds char.MaxValue). But if you call Char.ConvertToUtf32("𒀠", 1) you get an error: Found a low surrogate char without a preceding high surrogate at index: 1. In other words, according to the spec it knows that this is the tail end of a 2-char character, so it's invalid to pass only the second char of the pair. Also, no overload of that method takes a single char: it takes either a string with an index or a pair of chars. – Foliolate 7/1, 2021 at 1:17

Work entirely with String objects; don't use Char at all. Example using IndexOf:

var needle = "𝒞";    // U+1D49E, outside the BMP, so two chars in UTF-16
var hayStack = "a code point outside basic multi lingual plane: 𝒞";
var index = hayStack.IndexOf(needle);

Most methods on the String class have overloads which accept Char or String. Most methods on Char have overloads which take a String (with an index) as well. Just don't use Char.
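Applied to the drawing tester from the question, the string-only approach might look like the sketch below. Everything named here is hypothetical: the `GlyphLookup` class, the `Resolve` method, the glyph table and its placeholder values are stand-ins for whatever the real program stores per character. Only `StringInfo`, `LengthInTextElements` and `SubstringByTextElements` come from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class GlyphLookup
{
    // Hypothetical glyph table. Keys are whole strings, so a key can span two chars.
    private static readonly Dictionary<string, string> Glyphs = new Dictionary<string, string>
    {
        { "A", "glyph-A" },
        { "\U0001D11E", "glyph-G-clef" }, // U+1D11E, stored as a surrogate pair
    };

    // Walks the input one text element at a time and looks each one up as a string.
    public static List<string> Resolve(string text)
    {
        var names = new List<string>();
        var info = new StringInfo(text);
        for (int i = 0; i < info.LengthInTextElements; i++)
        {
            string element = info.SubstringByTextElements(i, 1);
            names.Add(Glyphs.TryGetValue(element, out string name) ? name : "missing");
        }
        return names;
    }
}
```

Because the lookup key is a string, no code in this sketch ever handles half of a surrogate pair.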

Machmeter answered 23/12, 2012 at 12:5 Comment(1)

I'm going to profess ignorance about combining characters, control characters, etc. I don't know enough about them to handle them correctly. Read up about Unicode in .NET and write some tests! – Machmeter 23/12, 2012 at 12:7
