How would you get an array of Unicode code points from a .NET String?

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is a UTF-16 code unit, and therefore some characters become wacky (surrogate) pairs instead. Thus when enumerating all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons with high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

Circuity answered 26/3, 2009 at 20:3 Comment(0)
R
7

This answer is not correct. See @Virtlink's answer for the correct one.

static int[] ExtractScalars(string s)
{
  if (!s.IsNormalized())
  {
    s = s.Normalize();
  }

  List<int> chars = new List<int>((s.Length * 3) / 2);

  var ee = StringInfo.GetTextElementEnumerator(s);

  while (ee.MoveNext())
  {
    string e = ee.GetTextElement();
    chars.Add(char.ConvertToUtf32(e, 0));
  }

  return chars.ToArray();
}

Notes: Normalization is required to deal with composite characters.

Rhinoceros answered 26/3, 2009 at 20:28 Comment(10)
▼: Your solution discards any modifier characters, and you are dealing with text elements and not code points. For example, the result of ExtractScalars("El Ni\u006E\u0303o") converted back to a string would be "El Nino" instead of "El Niño". - Meyerhof
@Virtlink: Interesting. From the docs it must have sounded like char.ConvertToUtf32(string, int) should deal with it. Edit: The damn docs claim it should! msdn.microsoft.com/en-us/library/z2ys180b(v=vs.110).aspx - Rhinoceros
@Virtlink: Ok, it does not deal with composite characters, but does for surrogate pairs. - Rhinoceros
I realize you may be looking at my strange use of ConvertToUtf32 overloads. Yeah, that's fixed now, but that wasn't the issue. It's about the difference between surrogate pairs and composite characters, and text elements and code points. Your code indeed handles surrogate pairs. - Meyerhof
@Virtlink: Fixed. Just Normalize the input, if needed, to deal with composites. Your codepoints are in fact not normalized, not incorrect, but would be tricky :D Edit: The roundtrip works now. Thanks for pointing it out! - Rhinoceros
@Rhinoceros Only some combinations of base character and composite characters will turn into a single codepoint when normalized to FormC. So this answer is still incorrect. Something TextElement is simply not the right approach when you want a sequence of codepoints. - Agonist
Yeah, I was just looking into that. For example, the Devanagari syllable "ni" is a composable character \u0928\u093F that doesn't turn into one code point when normalized. Also, if you have a latin character with multiple modifiers (e.g. ^ and ~), that also doesn't get normalized into a single code point. You have to accept that your code deals with text elements (combinations of code points that represent a single grapheme) and you discard all code points except the first by doing ConvertToUtf32(e, 0). There is no way to make your code work with code points using text elements. - Meyerhof
An alternative strategy is this: var bytes = Encoding.UTF32.GetBytes(s); var ints = new int[bytes.Length / 4]; for (var idx = 0; idx < ints.Length; ++idx) { ints[idx] = BitConverter.ToInt32(bytes, 4 * idx); }. You can still normalize s first, of course. You can use new UTF32Encoding(...) if you want strange endianness. - Rephrase
@Virtlink: I see the issue now. Would have been nice if the second parameter was ref int to return the number of characters swallowed. - Rhinoceros
Yes, Virtlink is right, this is broken. If the string contains "\u0928\u093F", the latter of those code points is swallowed. Both code points are in the BMP (plane 0), no surrogate pairs there obviously, but they still constitute one "text element". - Rephrase
M
25

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

  1. The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
  2. The character is outside the BMP, and is encoded using a high-low surrogate pair of code units.

Therefore, assuming the string is valid, this returns an array of code points for a given string:

public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        codePoints.Add(Char.ConvertToUtf32(str, i));
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}

An example with a surrogate pair 🌀 and a composed character ñ:

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        // 🌀 El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } // 🌀   E l   N i n ̃◌ o

Here's another example. These two code points represent a thirty-second note with a staccato accent; each is encoded as a surrogate pair:

ToCodePoints("\U0001D162\U0001D181");              // 𝅘𝅥𝅰𝆁
// { 0x1d162, 0x1d181 }                            // 𝅘𝅥𝅰 𝆁◌

When normalized to Form C (the default for Normalize()), they are decomposed into a notehead, combining stem, combining flag and combining accent-staccato, each a surrogate pair:

ToCodePoints("\U0001D162\U0001D181".Normalize());  // 𝅘𝅥𝅰𝆁
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }          // 𝅘 𝅥 𝅰 𝆁◌

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
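To make the distinction concrete, here is a minimal sketch that counts code points and text elements for the same string. The class and method names are illustrative only and do not come from any answer here. It assumes the input is valid UTF-16 and uses only `char.IsSurrogatePair` and `StringInfo.LengthInTextElements`.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, for illustration only.
public static class CodePointDemo
{
    // Counts code points by stepping over surrogate pairs (assumes valid UTF-16).
    public static int CountCodePoints(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        int count = 0;
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            count++;
        return count;
    }

    // Counts text elements as StringInfo segments them.
    public static int CountTextElements(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        return new StringInfo(s).LengthInTextElements;
    }
}
```

For the Devanagari syllable "\u0928\u093F" mentioned in the comments, `CountCodePoints` returns 2, while `CountTextElements` should report fewer, since both code points form one text element. Any approach that takes one value per text element therefore loses a code point.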

Meyerhof answered 26/1, 2015 at 17:12 Comment(5)
I'd use var codePoint = Char.ConvertToUtf32(...); if(codePoint > 0xFFFF) i++; instead of Char.IsHighSurrogate. - Agonist
@CodesInChaos: I believe that would be equivalent. If and only if the first char is a high surrogate can you ever get a code point above 0xFFFF, but please tell me if I'm mistaken. - Meyerhof
It's equivalent. It was only a stylistic suggestion. - Agonist
You may want to add your Devanagari syllable "ni" example here as well, i.e. a single text element consisting of two code points that do not unite to a single code point under any normalization form. The tilde n, ñ, can turn into one code point through (suitable) normalization. - Rephrase
@JeppeStigNielsen I instead added an example of a single text element of two code points that are both surrogate pairs and expand into four code point surrogate pairs under normalization. - Meyerhof
I
4

Doesn't seem like it should be much more complicated than this:

public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
  bool      useBigEndian = !BitConverter.IsLittleEndian;
  Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
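  // Encoding.GetBytes has no IEnumerable<char> overload, so materialize first (ToArray needs System.Linq).
  byte[]    octets       = utf32.GetBytes( s.ToArray() ) ;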

  for ( int i = 0 ; i < octets.Length ; i+=4 )
  {
    int codePoint = BitConverter.ToInt32(octets,i);
    yield return codePoint;
  }

}
Irvine answered 26/1, 2015 at 18:11 Comment(3)
BitConverter uses native endianness, Encoding.UTF32 uses little endian. So this will break on a big endian system. - Agonist
I just want to say that I posted the same solution (virtually) as a comment to leppie's answer, six seconds before you submitted your answer. And mentioned endianness trouble as well. - Rephrase
@JeppeStigNielsen: Clearly, great minds think alike :) - Irvine
B
1

I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:

    public static IEnumerable<int> GetCodePoints(this string s) {
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }

The enumeration was all I needed, but getting an array is trivial:

int[] codePoints = myString.GetCodePoints().ToArray();
Bark answered 19/7, 2016 at 14:10 Comment(1)
This gave the same output as the accepted answer. Thanks! - Cataphyll
F
1

This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:

public static int[] ToCodePoints(string s)
{
    byte[] utf32bytes = Encoding.UTF32.GetBytes(s);
    int[] codepoints = new int[utf32bytes.Length / 4];
    // Encoding.UTF32 is little-endian, so this raw byte copy assumes a little-endian platform.
    Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
    return codepoints;
}
Fredricfredrick answered 12/6, 2020 at 6:44 Comment(1)
This gives the same output as the accepted answer even for ZWJ sequences. Thanks! - Cataphyll
C
0

Another solution from here:

    public static int[] GetCodePoints(string input)
    {
        // List<int> avoids the boxing and cast that ArrayList requires (needs System.Collections.Generic).
        var codePoints = new List<int>(input.Length);
        for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1) {
            // ConvertToUtf32 combines a surrogate pair starting at i into one code point.
            codePoints.Add(char.ConvertToUtf32(input, i));
        }
        return codePoints.ToArray();
    }
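
On newer runtimes, `System.Text.Rune` offers another route. This assumes .NET Core 3.0 or later, since the type is not available in .NET Framework. The following is a minimal sketch with a hypothetical class name, not a replacement for the answers above:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class RuneCodePoints
{
    // Each Rune is one Unicode scalar value; Rune.Value is that value as an int.
    // Unlike char.ConvertToUtf32, enumerating runes does not throw on a lone
    // surrogate: it yields U+FFFD (the replacement character) in its place.
    public static int[] ToCodePoints(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        return s.EnumerateRunes().Select(r => r.Value).ToArray();
    }
}
```

Whether substituting U+FFFD for invalid input is acceptable depends on the range checks being performed. If malformed strings must be rejected, the throwing behavior of `char.ConvertToUtf32` may be preferable.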
Cataphyll answered 9/3, 2023 at 4:3 Comment(0)
