It appears that you want to extract the first "atomic" character from the user's point of view (i.e. the first Unicode grapheme cluster) from the highUnicodeChar string, where an "atomic" character includes both halves of a surrogate pair.
You can use StringInfo.GetTextElementEnumerator() to do just this, breaking a string down into atomic chunks and then taking the first.
First, define the following extension method:
using System.Collections.Generic;
using System.Globalization;

public static class TextExtensions
{
    public static IEnumerable<string> TextElements(this string s)
    {
        // TextElementEnumerator is a .NET 1.1 class that implements only the non-generic IEnumerator,
        // so wrap it in an iterator to get an IEnumerable<string>.
        if (s == null)
            yield break;
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
            yield return enumerator.GetTextElement();
    }
}
Now, you can do:
var result2 = highUnicodeChar.TextElements().FirstOrDefault() ?? "";
Note that StringInfo.GetTextElementEnumerator() will also group Unicode combining characters, so that the first grapheme cluster of the string Ĥ=T̂+V̂ will be Ĥ, not H.
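To make the combining-character behavior concrete, here is a minimal sketch. It assumes the combining mark is U+0302 (COMBINING CIRCUMFLEX ACCENT), written as an escape so the result does not depend on file encoding. The names CombiningDemo and SplitTextElements are hypothetical, introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class CombiningDemo
{
    // Hypothetical helper: collects the text elements (grapheme clusters) of a string.
    public static List<string> SplitTextElements(string s)
    {
        var result = new List<string>();
        var enumerator = StringInfo.GetTextElementEnumerator(s ?? "");
        while (enumerator.MoveNext())
            result.Add(enumerator.GetTextElement());
        return result;
    }

    public static void Main()
    {
        // Each letter is followed by U+0302, a combining circumflex (an assumption for this demo).
        string s = "H\u0302=T\u0302+V\u0302";
        var elements = SplitTextElements(s);

        Console.WriteLine(s.Length);           // 8 UTF-16 code units
        Console.WriteLine(elements.Count);     // 5 text elements
        Console.WriteLine(elements[0].Length); // 2: the base letter plus its combining mark
    }
}
```

The first text element is two chars long, so taking s[0] alone would drop the accent.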
Update for .NET Core
In .NET 6 and later, you can use StringInfo.GetNextTextElementLength(ReadOnlySpan<Char>) to iterate through the text elements of a string as a sequence of slices like so:
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextExtensions
{
    // A null string is treated as empty.
    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this string s) => (s ?? "").AsMemory().TextElements();

    public static IEnumerable<ReadOnlyMemory<char>> TextElements(this ReadOnlyMemory<char> s)
    {
        // GetNextTextElementLength returns 0 for an empty span, which ends the loop.
        for (int index = 0, length = StringInfo.GetNextTextElementLength(s.Span);
             length > 0;
             index += length, length = StringInfo.GetNextTextElementLength(s.Span.Slice(index)))
            yield return s.Slice(index, length);
    }
}
This avoids allocating a string for each grapheme.
Or, if you just want the first grapheme, you can do:
var first = highUnicodeChar.AsSpan()
    .Slice(0, StringInfo.GetNextTextElementLength(highUnicodeChar));
And in .NET Core 3 and later, if you really only want to enumerate through the Unicode code points of a string, treating surrogate pairs as a single character but ignoring combining characters and other grapheme groupings, you may use String.EnumerateRunes() to enumerate it as a sequence of Rune structs:
var highUnicodeChar = "𝐀"; // U+1D400 MATHEMATICAL BOLD CAPITAL A, not the standard A
foreach (var rune in highUnicodeChar.EnumerateRunes())
{
    Console.WriteLine($"{rune} = {rune.Value:X}"); // Prints 𝐀 = 1D400
}
Per the documentation, the Rune struct "represents a Unicode scalar value, which means any code point excluding the surrogate range (U+D800..U+DFFF). The type's constructors and conversion operators validate the input, so consumers can call the APIs assuming that the underlying Rune instance is well formed."
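The difference between the three views of a string (UTF-16 code units, runes, and text elements) can be sketched as follows. This is an illustrative example only: it assumes .NET Core 3.0 or later for EnumerateRunes(), and the names CountDemo, CountRunes and CountTextElements are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class CountDemo
{
    // Number of Unicode scalar values (a surrogate pair counts once).
    public static int CountRunes(string s) => s.EnumerateRunes().Count();

    // Number of text elements (a base character plus its combining marks counts once).
    public static int CountTextElements(string s) => new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        // U+1D400 (needs a surrogate pair) followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
        string s = "\U0001D400\u0302";

        Console.WriteLine(s.Length);             // 3 UTF-16 code units
        Console.WriteLine(CountRunes(s));        // 2 runes
        Console.WriteLine(CountTextElements(s)); // 1 text element
    }
}
```

So EnumerateRunes() is the right tool when only surrogate pairs matter, while the StringInfo methods are the ones to reach for when the first user-perceived character is wanted.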