How to make charArray that doesn't separate diacritics?

I'm trying to separate a Hebrew word into letters in C#, but ToCharArray() separates the diacritics as if they're separate letters (which they're not). I'm fine with either keeping the letters whole with their diacritics, or worst case getting rid of the diacritics altogether.

Example: כֶּלֶב is coming out as 6 different letters.

Goldfilled answered 11/5, 2022 at 12:41 Comment(11)
Does word.Length give you the correct value? – Gustavogustavus
Post an example. There are normalization functions, but it's quite possible the diacritics are separate characters. Unicode allows specific characters to appear one over the other. – Revivalist
@GeekyQuentin word.Length is counting the diacritics as separate letters. – Goldfilled
@PanagiotisKanavos כֶּלֶב is an example. It's a 3-letter word, but it's coming out as 6. – Goldfilled
I have taken an example of a name with diacritics and it turned out that they are not getting separated. Can you take a look at this and tell me if I was mistaken anywhere? – Gustavogustavus
@GeekyQuentin Try it with the word I just posted; it separates it into 6 letters. – Goldfilled
They are separate letters, and when you use non-separate letters this is a denormalized string. Perhaps you can strip diacritics. Why do you need them like this anyway? – Guard
They are multiple chars. Try writing char y = 'לֶ'; or char x = 'כֶּ'; and you'll see that you simply cannot force a symbol that is composed of 2 or 3 chars into a single char. Interesting case, though ;) – Avaria
@Guard One use case I can think of might be "letter by letter" animations ^^ – Avaria
@Avaria It's essentially "letter by letter animations" like Charlieface said. The problem is that if you use Trim() on the whole string, it doesn't remove the diacritics, as if they're part of one letter, but when you split the string into chars, all of a sudden you can iterate over them as separate letters. – Goldfilled
Trim is for whitespace, so I don't see how that's relevant. Yes, they are different letters, but the font specifies how to produce the glyph overlapping the previous letter. So you need to draw the string, then scroll the result. – Guard

The StringInfo class knows about base characters and accents and can handle this:

string s = "כֶּלֶב";
System.Globalization.TextElementEnumerator charEnum = System.Globalization.StringInfo.GetTextElementEnumerator(s);
while (charEnum.MoveNext())
{
    Console.WriteLine(charEnum.GetTextElement());
}

will print 3 lines:

כֶּ
לֶ
ב
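
If a list is more convenient than an enumerator (closer to what ToCharArray() gives), the same enumeration can be collected into one. This is a minimal sketch, not part of the framework: SplitTextElements is a hypothetical helper name, and the word is written with \u escapes (kaf, dagesh, segol, lamed, segol, bet) so the code point order is unambiguous.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// The word from the question, spelled out as code points.
string word = "\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1";

List<string> letters = SplitTextElements(word);
Console.WriteLine(word.Length);     // 6 UTF-16 code units
Console.WriteLine(letters.Count);   // 3 text elements

// Hypothetical helper: each entry is a base character plus its combining marks.
static List<string> SplitTextElements(string s)
{
    var result = new List<string>();
    TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
    while (e.MoveNext())
    {
        result.Add(e.GetTextElement());
    }
    return result;
}
```

Each entry is a string rather than a char, because a letter with its diacritics does not fit into a single char.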

Vienna answered 12/5, 2022 at 6:33 Comment(0)

Strings in C# are stored as sequences of char, that is, UTF-16 code units. ToCharArray() just returns a copy of those code units. A single "symbol" as the reader sees it can take several of them: in this word, each diacritic is a separate combining character with its own char.

Would char.GetUnicodeCategory(char) be of any help? Maybe you could split that array on OtherLetter or something (not familiar with Hebrew)?

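using System;
using System.Linq;

const string word = "כֶּלֶב";
Console.WriteLine(word.Length);  // number of UTF-16 code units
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(x => (int)x)));  // numeric value of each char
Console.WriteLine(string.Join(" ", word.ToCharArray().Select(char.GetUnicodeCategory)));  // Unicode category of each char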

Output:

6
1499 1468 1462 1500 1462 1489
OtherLetter NonSpacingMark NonSpacingMark OtherLetter NonSpacingMark OtherLetter
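
The category output above also suggests a way to do the fallback mentioned in the question, dropping the diacritics altogether: keep only the chars whose category is not NonSpacingMark. A rough sketch under that assumption; RemoveMarks is a hypothetical helper name, and the word is written with \u escapes matching the code points printed above.

```csharp
using System;
using System.Globalization;
using System.Linq;

// kaf, dagesh, segol, lamed, segol, bet
string word = "\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1";

string bare = RemoveMarks(word);
Console.WriteLine(bare.Length);                                  // 3
Console.WriteLine(string.Join(" ", bare.Select(c => (int)c)));   // 1499 1500 1489

// Hypothetical helper: filters out every char classified as a non-spacing mark.
static string RemoveMarks(string s)
{
    return new string(
        s.Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
         .ToArray());
}
```

This only removes marks that are stored as separate chars. Text that uses precomposed characters (such as a single-code-point é) would need to be decomposed first, for example with string.Normalize(NormalizationForm.FormD).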
Cyb answered 11/5, 2022 at 18:2 Comment(2)
Interesting! So basically you would need to combine them until reaching the next OtherLetter item. – Avaria
@Avaria I think so. You might have to try it on for size. But I really like Hans's solution above and I +1'd it. Maybe you'd only need to resort to char.GetUnicodeCategory if you need really low-level control. – Cyb
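
The manual grouping discussed in these comments could look roughly like this: start a new group at every char that is not a mark, and append marks to the current group. This is only a sketch with a hypothetical helper name (GroupByBaseLetter). It ignores surrogate pairs and the other clustering rules that StringInfo already handles, so it is probably worth considering only when that low-level control is really needed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// kaf, dagesh, segol, lamed, segol, bet
string word = "\u05DB\u05BC\u05B6\u05DC\u05B6\u05D1";

List<string> groups = GroupByBaseLetter(word);
Console.WriteLine(groups.Count);                                         // 3
Console.WriteLine(string.Join(",", groups.ConvertAll(g => g.Length)));   // 3,2,1

// Hypothetical helper: glues non-spacing marks onto the preceding char.
static List<string> GroupByBaseLetter(string s)
{
    var groups = new List<string>();
    var current = new StringBuilder();
    foreach (char c in s)
    {
        bool isMark = char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark;
        // A non-mark char starts a new group, so flush the previous one first.
        if (!isMark && current.Length > 0)
        {
            groups.Add(current.ToString());
            current.Clear();
        }
        current.Append(c);
    }
    if (current.Length > 0)
    {
        groups.Add(current.ToString());
    }
    return groups;
}
```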
