How to make charArray that doesn't separate diacritics?

Asked 11/5, 2022 at 12:41 Answered 12/5, 2022 at 6:33

c# unity-game-engine normalization diacritics hebrew

I'm trying to separate a Hebrew word into letters in C#, but ToCharArray() separates the diacritics as if they're separate letters (which they're not). I'm fine with either keeping the letters whole with their diacritics, or worst case getting rid of the diacritics altogether.

Example: כֶּלֶב is coming out as 6 different letters.

Goldfilled answered 11/5, 2022 at 12:41 Comment(11)

Does word.Length give you the correct value? – Gustavogustavus 11/5, 2022 at 13:2

Post an example. There are normalization functions but it's quite possible the diacritics are separate characters. Unicode allows specific characters to appear one over the other – Revivalist 11/5, 2022 at 13:2

@GeekyQuentin word.Length is counting the diacritics as separate letters – Goldfilled 11/5, 2022 at 13:8

@PanagiotisKanavos כֶּלֶב is an example. It's a 3-letter word, but it's coming out as 6 – Goldfilled 11/5, 2022 at 13:8

I have taken an example of a name with diacritics and it turned out that they are not getting separated. Can you take a look at this and tell me if I was mistaken anywhere? – Gustavogustavus 11/5, 2022 at 13:10

@GeekyQuentin try it with the word I just posted, it separates it into 6 letters – Goldfilled 11/5, 2022 at 13:12

They are separate letters, and when you use non-separate letters this is a denormalized string. Perhaps you can strip diacritics. Why do you need them like this anyway? – Guard 11/5, 2022 at 14:29

They are multiple char .. try writing char y = 'לֶ'; or char x = 'כֶּ'; -> No you simply can not force a symbol that is combined of 2 or 3 char into a single char ... nevertheless, interesting case though ;) – Avaria 11/5, 2022 at 14:34

@Guard one use case I can think of might e.g. be "letter by letter" animations ^^ – Avaria 11/5, 2022 at 14:36

@Avaria it's essentially "letter by letter animations" like Charlieface said. The problem is that if you use "Trim()" on the whole string it doesn't remove the diacritics, as if they're one letter, but when you split it into chars, all of a sudden you can iterate over them as different letters – Goldfilled 11/5, 2022 at 14:52

Trim is for whitespace, so I don't see how that's relevant. Yes, they are different letters, but the font will specify how to produce the glyph overlapping the previous letter. So you need to draw the string, then scroll the result – Guard 11/5, 2022 at 15:12

The StringInfo class knows about base characters and accents and can handle this:

string s = "כֶּלֶב";
System.Globalization.TextElementEnumerator charEnum = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (charEnum.MoveNext())
{
    Console.WriteLine(charEnum.GetTextElement());
}

will print 3 lines:

כֶּ
לֶ
ב
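If what you want is an array you can index into (the closest thing to a ToCharArray() that keeps diacritics attached), here is a minimal sketch built on the same enumerator. The class name TextElements and the method Split are made up for illustration and are not part of .NET; the word is written with \u escapes so the exact code points are unambiguous.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not a framework type): collects text elements into an array.
public static class TextElements
{
    // Each returned string is one text element: a base character
    // together with any combining marks that follow it.
    public static string[] Split(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result.ToArray();
    }
}

public static class TextElementsDemo
{
    public static void Run()
    {
        // The word from the question: kaf + dagesh + segol, lamed + segol, bet.
        string[] letters = TextElements.Split("\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1");
        Console.WriteLine(letters.Length); // 3
    }
}
```

Note that the elements are strings, not chars, since a letter with its diacritics cannot fit into a single char.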

Vienna answered 12/5, 2022 at 6:33 Comment(0)

Strings in C# are stored as arrays of char, that is, as sequences of UTF-16 code units. ToCharArray() just returns a copy of those code units, and it can take several of them to form a single "symbol": in this word, each diacritic is its own char following its base letter.

Would char.GetUnicodeCategory(char) be of any help? Maybe you could split that array on OtherLetter or something (not familiar with Hebrew)?

using System;
using System.Linq;

const string word = "כֶּלֶב";
Console.WriteLine(word.Length);
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(x => (int)x)));
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(char.GetUnicodeCategory)));

Output:

6
1499 1468 1462 1500 1462 1489
OtherLetter NonSpacingMark NonSpacingMark OtherLetter NonSpacingMark OtherLetter
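For the "worst case" from the question (dropping the diacritics altogether), the categories above suggest a simple filter. This is only a sketch: Diacritics.Strip is a hypothetical name, and it assumes the marks are already separate NonSpacingMark chars, as they are in this word. Precomposed characters would need a string.Normalize(NormalizationForm.FormD) pass first, which is not shown here.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (not a framework type): drops non-spacing marks.
public static class Diacritics
{
    public static string Strip(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            // Keep everything except combining marks such as Hebrew vowel points.
            if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With the word from the question, Diacritics.Strip should leave just the three base letters, so ToCharArray() on the result gives one char per letter.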

Cyb answered 11/5, 2022 at 18:2 Comment(2)

interesting! So basically you would need to combine them until reaching the next OtherLetter item – Avaria 12/5, 2022 at 7:51

@Avaria I think so. You might have to try it on for size. But I really like Hans's solution above and I +1'd it. Maybe you'd only need to resort to char.GetUnicodeCategory if you need really low-level control – Cyb 12/5, 2022 at 15:44
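For anyone who does want the low-level route discussed in these comments, a rough sketch of "combine them until the next base letter" could look like the following. MarkGrouping.Group is a made-up name, and the code assumes only NonSpacingMark chars need attaching; it does not handle surrogate pairs or other mark categories, which StringInfo covers for you.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper (not a framework type): attaches marks to the preceding char.
public static class MarkGrouping
{
    public static List<string> Group(string s)
    {
        var groups = new List<string>();
        var current = new StringBuilder();
        foreach (char c in s)
        {
            bool isMark = char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
            // A non-mark char starts a new group, so flush the previous one first.
            if (!isMark && current.Length > 0)
            {
                groups.Add(current.ToString());
                current.Clear();
            }
            current.Append(c);
        }
        // Flush the last group, if any.
        if (current.Length > 0)
        {
            groups.Add(current.ToString());
        }
        return groups;
    }
}
```

For the word in the question this should produce the same three groups as the StringInfo approach.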
