Return code point of characters in C#
V

7

27

How can I return the Unicode Code Point of a character? For example, if the input is "A", then the output should be "U+0041". Ideally, a solution should take care of surrogate pairs.

By code point I mean the actual Unicode code point, which is different from a code unit (UTF-8 has 8-bit code units, UTF-16 has 16-bit code units and UTF-32 has 32-bit code units; in the last case the value equals the code point once endianness is taken into account).
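
To make the code point versus code unit distinction concrete, here is a minimal sketch that counts how many code units one code point needs in each encoding. The class and method names are made up for illustration; only `Encoding.UTF8`, `Encoding.UTF32` and `string.Length` are real .NET members.

```csharp
using System;
using System.Text;

static class CodeUnitDemo // hypothetical name, for illustration only
{
    // Number of 8-bit code units needed to encode s as UTF-8.
    public static int Utf8Units(string s) => Encoding.UTF8.GetByteCount(s);

    // Number of 16-bit code units; a .NET string is already UTF-16.
    public static int Utf16Units(string s) => s.Length;

    // Number of 32-bit code units (one per code point for well-formed input).
    public static int Utf32Units(string s) => Encoding.UTF32.GetByteCount(s) / 4;

    static void Main()
    {
        string s = "\U0001D161"; // a single code point outside the BMP
        Console.WriteLine($"{Utf8Units(s)} {Utf16Units(s)} {Utf32Units(s)}"); // prints "4 2 1"
    }
}
```

For "A" all three counts are 1, which is why the difference is easy to overlook with ASCII input.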

Victim answered 15/12, 2012 at 16:32 Comment(4)
This question is severely misworded. “Returning the ‘Unicode’ of a character” has no meaning, and frankly, is nonsense. Your example makes clear what you actually want, but the title needs to be reworked. Please do so.Inflectional
Thanks. I have given you my upvote in appreciation.Inflectional
Note that "UTF8 has 8-bit code units, UTF16 has 16-bit code units" isn't quite correct. Both UTF-8 and UTF-16 are variable length. UTF-8 can be between 1-4 bytes depending on the code point. It is designed to be backwards compatible with ASCII, but for characters outside of that, it needs either 2 or 4 bytes. Similarly, UTF-16 uses either 2 or 4 bytes. See en.wikipedia.org/wiki/UTF-8, en.wikipedia.org/wiki/UTF-16.Charlottcharlotta
"UTF8 has 8-bit code units, UTF16 has 16-bit code units" actually is correct. Code Unit is the term for the pieces that make up a Code Point. In the case of UTF-8 the Code Units are 8-bit bytes and one Code Point will need between 1 and 4 of them. In the case of UTF-16 the Code Units are 16-bit words and one Code Point will require either 1 or 2 of them. This is stated in the Wikipedia links.Isolated
H
14

Easy, since a char in C# is a UTF-16 code unit, and for any character in the Basic Multilingual Plane that value is the code point:

char x = 'A';
Console.WriteLine("U+{0:X4}", (int)x);

To address the comments: a char in C# is a 16-bit number and holds one UTF-16 code unit. Code points above the 16-bit range cannot be represented in a single C# char, because a char is not variable width. A string, however, can contain two chars in a row, each a code unit, that together form a surrogate pair encoding one code point. If you have a string input that may contain characters above the 16-bit range, you can use Char.IsSurrogatePair and Char.ConvertToUtf32, as suggested in another answer:

string input = ....
for(int i = 0 ; i < input.Length ; i += Char.IsSurrogatePair(input,i) ? 2 : 1)
{
    int x = Char.ConvertToUtf32(input, i);
    Console.WriteLine("U+{0:X4}", x);
}
Hypophosphite answered 15/12, 2012 at 16:33 Comment(11)
They are unicode code units, not code points. What about characters that require more than one code unit?Megargee
@driis... Same as GregS commentVictim
@GregS: Can a char actually hold a character that requires more than one code unit?Houghton
@GregS: Please see updated answer. My solution yields exactly the same result as the other (upvoted) answer, it just doesn't jump through as many hoops to get there.Hypophosphite
@driis: I didn't downvote you, I was just offering a clarifying point.Megargee
@dtb: no. I meant Unicode characters, not Char characters. I hate the whole Unicode terminology as it seems designed to confuse people. I still think this answer has "point" and "unit" swapped.Megargee
@Qaesar lower case a ('a') is U+0061, uppercase a ('A') is U+0041Extension
@GregS A codepoint is an abstract, logical character, one divorced from its low-level physical layout. 99.99% of programmers want to work only with logical characters, not individual physical constituent components that are laid out differently on different systems. That means that a code unit is the ugly thing you never want to deal with. You only want to deal with code points.Inflectional
@All, I just want to know: which does the processor consider when it looks up the letters, ASCII or the code point? I'm really getting confused. Thank you.Victim
Sorry if we are confusing you. The problem is that Unicode encodings are actually a bit complex, even though they might not seem so at first glance. The code in this answer, or the one @Houghton posted, will work fine for you. I can recommend joelonsoftware.com/articles/Unicode.html if you want some more background.Hypophosphite
@driis: Sorry for bothering you. Your kind help is really appreciated. Many thanks.Victim
H
16

The following code writes the codepoints of a string input to the console:

string input = "\uD834\uDD61";

for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1)
{
    var codepoint = char.ConvertToUtf32(input, i);

    Console.WriteLine("U+{0:X4}", codepoint);
}

Output:

U+1D161

Since strings in .NET are UTF-16 encoded, the char values that make up the string need to be converted to UTF-32 first.
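
For anyone curious what the conversion does with the two char values, the following is a rough sketch of the standard UTF-16 surrogate arithmetic. The helper name `SurrogateMath.Decode` is hypothetical; in real code `char.ConvertToUtf32` already does this for you.

```csharp
using System;

static class SurrogateMath // hypothetical helper, for illustration only
{
    // Combines a high and a low surrogate into the code point they encode.
    public static int Decode(char high, char low)
    {
        if (!char.IsHighSurrogate(high) || !char.IsLowSurrogate(low))
            throw new ArgumentException("Not a valid surrogate pair.");

        // Each surrogate carries 10 bits; the result is offset by 0x10000.
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    static void Main()
    {
        int codepoint = Decode('\uD834', '\uDD61');
        Console.WriteLine("U+{0:X4}", codepoint); // prints U+1D161
    }
}
```

Worked by hand: 0xD834 - 0xD800 = 0x34, shifted left by 10 bits gives 0xD000; 0xDD61 - 0xDC00 = 0x161; 0x10000 + 0xD000 + 0x161 = 0x1D161.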

Houghton answered 15/12, 2012 at 16:46 Comment(3)
That doesn't convert to UTF-32 but returns the code point as an integer; UTF-32 is an encoding, not an integer. This method naming propagates the same confusion as Microsoft labeling the UTF-16LE encoding as "Unicode"Extension
@Esailija: I wasn't sure what is more confusing: converting to a Unicode code point using a method named ConvertToUtf32, or converting to UTF-32 and treating the result as Unicode code point. In the end that's probably splitting hairs.Houghton
You can't treat the result of converting to actual UTF-32 as a code point; you need to decode the code points from the encoding, just like you would decode from UTF-16 or UTF-8, except simpler. But I can see why this would be seen as nitpicky :PExtension
Y
15

In .NET Core 3.0 or later, you can use the Rune Struct:

// Note that 😉 and 👍 are encoded using surrogate pairs
// but A, B, C and ✋ are not
var runes = "ABC✋😉👍".EnumerateRunes();

foreach (var r in runes)
    Console.Write($"U+{r.Value:X4} ");
        
// Writes: U+0041 U+0042 U+0043 U+270B U+1F609 U+1F44D
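
If you need the position of each code point within the string as well, a possible variation (a sketch only, with a made-up class and method name) is to walk the string by index using `Rune.GetRuneAt` and `Rune.Utf16SequenceLength`. Note that `GetRuneAt` throws on an unpaired surrogate.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class RuneDemo // hypothetical name, for illustration only
{
    // Returns one description per code point, e.g. "U+1F609 (2 chars)".
    public static List<string> Describe(string input)
    {
        var result = new List<string>();
        for (int i = 0; i < input.Length; )
        {
            Rune r = Rune.GetRuneAt(input, i);
            result.Add($"U+{r.Value:X4} ({r.Utf16SequenceLength} chars)");
            i += r.Utf16SequenceLength; // advance by 1 or 2 UTF-16 code units
        }
        return result;
    }

    static void Main()
    {
        foreach (var line in Describe("A\U0001F609"))
            Console.WriteLine(line);
        // prints:
        // U+0041 (1 chars)
        // U+1F609 (2 chars)
    }
}
```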
Yung answered 5/3, 2021 at 18:32 Comment(0)
R
4

A C# char cannot store every Unicode code point, as a char is only 2 bytes and any code point above U+FFFF does not fit in that range. The solution is to represent a code point either as a sequence of bytes (as a byte array or "flattened" into a 32-bit primitive) or as a string. The accepted answer converts to UTF-32, but that's not always ideal.

This is the code we use to split a string into its unicode codepoint components, but preserving the native UTF-16 encoding. The result is an enumerable that can be used to compare (sub)strings natively in C#/.NET:

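    using System;
    using System.Collections.Generic;

    public class InvalidEncodingException : System.Exception
    { }

    // Extension methods must live in a static class; the name StringExtensions is an assumption.
    public static class StringExtensions
    {
        public static IEnumerable<string> UnicodeCodepoints(this string s)
        {
            for (int i = 0; i < s.Length; ++i)
            {
                if (Char.IsSurrogate(s[i]))
                {
                    // A valid pair is a high surrogate followed by a low surrogate;
                    // IsSurrogatePair returns false when s[i] is the last char.
                    if (!Char.IsSurrogatePair(s, i))
                    {
                        throw new InvalidEncodingException();
                    }
                    // Emit both code units as one string and skip the low surrogate.
                    yield return string.Format("{0}{1}", s[i], s[++i]);
                }
                else
                {
                    yield return string.Format("{0}", s[i]);
                }
            }
        }
    }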
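
As a side note on representing a single code point as a string: the framework can already go from an integer code point to its UTF-16 string form and back. A small sketch, assuming the integer is a valid code point (the class name is made up for illustration):

```csharp
using System;

static class CodePointRoundTrip // hypothetical name, for illustration only
{
    static void Main()
    {
        int codePoint = 0x1F4A9;

        // Code point -> UTF-16 string (a surrogate pair for values above U+FFFF).
        string text = char.ConvertFromUtf32(codePoint);
        Console.WriteLine(text.Length); // prints 2

        // UTF-16 string -> code point again.
        Console.WriteLine("U+{0:X4}", char.ConvertToUtf32(text, 0)); // prints U+1F4A9
    }
}
```

`char.ConvertFromUtf32` throws an `ArgumentOutOfRangeException` for values that are not valid code points, such as a lone surrogate value.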
Ruddy answered 7/4, 2017 at 14:14 Comment(0)
S
2

Actually, there is some merit in @Yogendra Singh's answer, currently the only one with a negative score. The job can be done like this:

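    public static IEnumerable<int> Utf8ToCodePoints(this string s)
    {
        // Despite the name, this encodes the string as UTF-32 (little-endian)
        // and reads each 4-byte unit back as an int.
        var utf32Bytes = Encoding.UTF32.GetBytes(s);
        var bytesPerCharInUtf32 = 4;
        Debug.Assert(utf32Bytes.Length % bytesPerCharInUtf32 == 0);
        for (int i = 0; i < utf32Bytes.Length; i += bytesPerCharInUtf32)
        {
            // BitConverter uses the machine's byte order; this assumes a little-endian platform.
            yield return BitConverter.ToInt32(utf32Bytes, i);
        }
    }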

Tested with

    var surrogatePairInput = "abc💩";
    Debug.Assert(surrogatePairInput.Length == 5);
    var pointsAsString = string.Join(";" , 
        surrogatePairInput
        .Utf8ToCodePoints()
        .Select(p => $"U+{p:X4}"));
    Debug.Assert(pointsAsString == "U+0061;U+0062;U+0063;U+1F4A9");

The example is relevant because the pile of poo (U+1F4A9) is represented as a surrogate pair in UTF-16.

Sitton answered 21/6, 2017 at 15:12 Comment(3)
As a point of improvement rather than getting the utf8 bytes and then converting them to utf32 you could just get the utf32 bytes in the first place.Globin
Also the reason that the answer you mentioned has a negative score is that the method only accepts a char as a parameter which means it could never give you more than two bytes of information. Yours is a vast improvement because you actually parse a string and not a char.Globin
Thanks @Chris. I simplified the method.Digitoxin
G
-1

I found a little method on an MSDN forum. Hope this helps.

    public int get_char_code(char character){ 
        UTF32Encoding encoding = new UTF32Encoding(); 
        byte[] bytes = encoding.GetBytes(character.ToString().ToCharArray()); 
        return BitConverter.ToInt32(bytes, 0); 
    } 
Grim answered 15/12, 2012 at 16:39 Comment(2)
Does this ever return something different than (int)character? What happens if character is one half of a surrogate pair?Houghton
@Houghton (very late answer, I know). The interesting thing of this code is that it shows using UTF32Encoding, but since the method only takes a char, it has no effect and is the same as (int) character, though much slower than a cast. In fact, character.ToString().ToCharArray() will always return an array of one item (size 2 bytes), and the BitConverter will never return a value > 65535. Nice idea in principle, but useless in the way it is presented.Stat
E
-2
public static string ToCodePointNotation(char c)
{

    return "U+" + ((int)c).ToString("X4");
}

Console.WriteLine(ToCodePointNotation('a')); //U+0061
Extension answered 15/12, 2012 at 16:46 Comment(3)
@Qaesar lower case a ('a') is U+0061, uppercase a ('A') is U+0041Extension
You should throw an exception if Char.IsSurrogate(c) because such a code unit cannot be considered a codepoint value and therefore doesn't have a codepoint notation.Eelpout
This answer is simply not correct, you cannot presume there exists a one-to-one mapping between a C# char and a UTF-16 codepoint because there is none.Ruddy
