Return code point of characters in C#

How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

By code point I mean the actual code point according to Unicode, which is different from a code unit: UTF-8 has 8-bit code units, UTF-16 has 16-bit code units, and UTF-32 has 32-bit code units. In the last case the value of the code unit equals the code point, once endianness is taken into account.

Victim answered 15/12, 2012 at 16:32 Comment(4)

This question is severely misworded. “Returning the ‘Unicode’ of a character” has no meaning, and frankly, is nonsense. Your example makes clear what you actually want, but the title needs to be reworked. Please do so. – Inflectional 15/12, 2012 at 17:13

Thanks. I have given you my upvote in appreciation. – Inflectional 15/12, 2012 at 17:33

Note that "UTF8 has 8-bit code units, UTF16 has 16-bit code units" isn't quite correct. Both UTF-8 and UTF-16 are variable length. UTF-8 can be between 1-4 bytes depending on the code point. It is designed to be backwards compatible with ASCII, but for characters outside of that, it needs either 2 or 4 bytes. Similarly, UTF-16 uses either 2 or 4 bytes. See en.wikipedia.org/wiki/UTF-8, en.wikipedia.org/wiki/UTF-16. – Charlottcharlotta 12/3 at 2:5

"UTF8 has 8-bit code units, UTF16 has 16-bit code units" actually is correct. Code Unit is the term for the pieces that make up a Code Point. In the case of UTF-8 the Code Units are 8-bit bytes and one Code Point will need between 1 and 4 of them. In the case of UTF-16 the Code Units are 16-bit words and one Code Point will require either 1 or 2 of them. This is stated in the Wikipedia links. – Isolated 12/4 at 5:13

Easy, since chars in C# are actually UTF-16 code points:

char x = 'A';
Console.WriteLine("U+{0:X4}", (int)x); // X4: at least four uppercase hex digits, e.g. U+0041

To address the comments: a char in C# is a 16-bit number and holds a UTF-16 code point. Code points above the 16-bit space cannot be represented in a single C# char. Characters in C# are not variable width. A string, however, can have two chars following each other, each being a code unit, together forming a UTF-16 code point. If you have a string input with characters above the 16-bit space, you can use Char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

string input = ....
for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}
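The loop above can be wrapped in a small helper. This is only a sketch: `ToCodePointLabels` is a hypothetical name that does not appear in the answer, and the code is written as top-level statements, which assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not from the answer): returns one "U+XXXX" label per code point.
static List<string> ToCodePointLabels(string input)
{
    var labels = new List<string>();

    // Step by 2 over a surrogate pair, by 1 otherwise.
    for (int i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
    {
        // ConvertToUtf32 combines a surrogate pair into a single code point value.
        int codePoint = char.ConvertToUtf32(input, i);
        labels.Add($"U+{codePoint:X4}");
    }

    return labels;
}

Console.WriteLine(string.Join(" ", ToCodePointLabels("A\uD834\uDD61")));
// prints: U+0041 U+1D161
```

Note that Char.ConvertToUtf32(string, int) throws an ArgumentException when the index points at an unpaired surrogate, so input that may be malformed would need extra handling.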

Hypophosphite answered 15/12, 2012 at 16:33 Comment(11)

They are unicode code units, not code points. What about characters that require more than one code unit? – Megargee 15/12, 2012 at 16:37

@driis... Same as GregS comment – Victim 15/12, 2012 at 16:39

@GregS: Can a char actually hold a character that requires more than one code unit? – Houghton 15/12, 2012 at 16:41

@GregS: Please see updated answer. My solution yields exactly the same result as the other (upvoted) answer, it just doesn't jump through as many hoops to get there. – Hypophosphite 15/12, 2012 at 16:46

@driis: I didn't downvote you, I was just offering a clarifying point. – Megargee 15/12, 2012 at 17:2

@dtb: no. I meant Unicode characters, not Char characters. I hate the whole Unicode terminology as it seems designed to confuse people. I still think this answer has "point" and "unit" swapped. – Megargee 15/12, 2012 at 17:3

@Qaesar lower case a ('a') is U+0061, uppercase a ('A') is U+0041 – Extension 15/12, 2012 at 17:15

@GregS A codepoint is an abstract, logical character, one divorced from its low-level physical layout. 99.99% of programmers want to work only with logical characters, not individual physical constituent components that are laid out differently on different systems. That means that a code unit is the ugly thing you never want to deal with. You only want to deal with code points. – Inflectional 15/12, 2012 at 17:18

@All, I just want to know which of ASCII or code point the processor considers when it looks up the letters. I'm really getting confused. Thank you. – Victim 15/12, 2012 at 17:22

Sorry if we are confusing you. The problem is that Unicode encodings are actually a bit complex, even though they might not seem so at first glance. The code in this answer, or the one @Houghton posted, will work fine for you. I can recommend joelonsoftware.com/articles/Unicode.html if you want some more background. – Hypophosphite 15/12, 2012 at 17:31

@driis: I have to say sorry because I bothered you. Your kind action is really appreciated. Many thanks. – Victim 15/12, 2012 at 17:48

The following code writes the codepoints of a string input to the console:

string input = "\uD834\uDD61";

for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);

    Console.WriteLine("U+{0:X4}", codepoint);
}

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
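To make the distinction between the integer returned by Char.ConvertToUtf32 and an actual UTF-32 encoding concrete, here is a minimal sketch. It assumes C# 9 or later (top-level statements), and the variable names are illustrative only.

```csharp
using System;
using System.Text;

string input = "\uD834\uDD61"; // U+1D161 stored as a UTF-16 surrogate pair

// Despite the method's name, the result is the code point as a plain integer.
int codePoint = char.ConvertToUtf32(input, 0);
Console.WriteLine($"U+{codePoint:X4}"); // prints: U+1D161

// An actual UTF-32 encoding is a byte sequence; Encoding.UTF32 is little-endian.
byte[] utf32 = Encoding.UTF32.GetBytes(input);
Console.WriteLine(BitConverter.ToString(utf32)); // prints: 61-D1-01-00

// Decoding the four little-endian bytes recovers the same integer.
int decoded = utf32[0] | (utf32[1] << 8) | (utf32[2] << 16) | (utf32[3] << 24);
Console.WriteLine(decoded == codePoint); // prints: True
```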

Houghton answered 15/12, 2012 at 16:46 Comment(3)

That doesn't convert to UTF-32 but returns the code point as an integer. UTF-32 is an encoding, not an integer. This method naming propagates the same confusion as Microsoft labeling the UTF-16LE encoding as "unicode". – Extension 15/12, 2012 at 17:3

@Esailija: I wasn't sure what is more confusing: converting to a Unicode code point using a method named ConvertToUtf32, or converting to UTF-32 and treating the result as Unicode code point. In the end that's probably splitting hairs. – Houghton 15/12, 2012 at 17:8

you can't treat the result of converting to actual UTF-32 as code point, you need to decode the code points from the encoding, just like you would decode from UTF-16 or UTF-8, except simpler. But I can see why this would be seen nitpicky :P – Extension 15/12, 2012 at 17:12

In .NET Core 3.0 or later, you can use the Rune Struct:

// Note that 😉 and 👍 are encoded using surrogate pairs
// but A, B, C and ✋ are not
var runes = "ABC✋😉👍".EnumerateRunes();

foreach (var r in runes)
    Console.Write($"U+{r.Value:X4} ");
        
// Writes: U+0041 U+0042 U+0043 U+270B U+1F609 U+1F44D
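Rune also reports how many code units a code point needs in each encoding, which ties in with the code unit discussion in the comments on the question. A minimal sketch, assuming .NET 5 or later for the top-level statements; `Describe` is a hypothetical helper name.

```csharp
using System;
using System.Text;

// Hypothetical helper: formats a rune's code point and its code unit counts.
static string Describe(Rune r) =>
    $"U+{r.Value:X4} UTF-16 units: {r.Utf16SequenceLength}, UTF-8 units: {r.Utf8SequenceLength}";

// "A", U+270B and U+1F609, written with escapes to avoid source-encoding issues.
foreach (Rune r in "A\u270B\U0001F609".EnumerateRunes())
    Console.WriteLine(Describe(r));

// prints:
// U+0041 UTF-16 units: 1, UTF-8 units: 1
// U+270B UTF-16 units: 1, UTF-8 units: 3
// U+1F609 UTF-16 units: 2, UTF-8 units: 4
```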

Yung answered 5/3, 2021 at 18:32 Comment(0)
