Unicode comparison of Cyrillic 'С' and Latin 'C'

Asked 14/10, 2013 at 0:0 Answered 14/10, 2013 at 5:19

Solved unicode normalization collation unicode-normalization accent-insensitive

I have a dataset that mixes the Unicode characters \u0421 'С' (Cyrillic) and \u0043 'C' (Latin). Is there a Unicode comparison that treats those two characters as the same? So far I've tried several ICU collations, including the Russian one.

Unexpected answered 14/10, 2013 at 0:0 Comment(2)

Bad luck, I think, as Cyrillic С corresponds to Latin S. You could write a CharsetEncoder/Decoder. But why not a Comparator<String> that handles AaBCcEeHKMOoPpTUuXxYy (or so)? Mind that Serbian has a j and Belarusian an i. Alternatively, you could take the font glyphs of, say, Arial Unicode MS and derive a visual similarity table. – Ionium 14/10, 2013 at 0:11

Note that for UTF-16 it's practical to build a 65K-entry char array that translates from one character set to another. Go much beyond that into UTF-32, though, and the table gets too big to be practical. – Debark 14/10, 2013 at 0:59
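The translation-table idea from the comments above can be sketched in Python. This is a minimal illustration, not a complete solution: the table `CYRILLIC_TO_LATIN` and the helpers `fold_lookalikes` and `same_when_folded` are hypothetical names, and the mapping is a small hand-picked set of Cyrillic letters that usually look like Latin ones, not an exhaustive list.

```python
# Hand-picked, illustrative table (an assumption, not an official list):
# Cyrillic letters mapped to the Latin letters they usually resemble.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0410": "A", "\u0430": "a",
    "\u0412": "B",
    "\u0421": "C", "\u0441": "c",
    "\u0415": "E", "\u0435": "e",
    "\u041D": "H",
    "\u041A": "K",
    "\u041C": "M",
    "\u041E": "O", "\u043E": "o",
    "\u0420": "P", "\u0440": "p",
    "\u0422": "T",
    "\u0425": "X", "\u0445": "x",
    "\u0423": "Y", "\u0443": "y",
})


def fold_lookalikes(text: str) -> str:
    """Replace the listed Cyrillic letters with their Latin look-alikes."""
    return text.translate(CYRILLIC_TO_LATIN)


def same_when_folded(a: str, b: str) -> bool:
    """Compare two strings after folding look-alike letters."""
    return fold_lookalikes(a) == fold_lookalikes(b)


if __name__ == "__main__":
    print(same_when_folded("\u0421", "C"))
    # The folded form can also serve as a sort key.
    print(sorted(["\u0421b", "Ca"], key=fold_lookalikes))
```

A dictionary-based table avoids allocating one slot per code unit, so the same approach works for characters outside the 16-bit range. Folding is lossy, so it is safer to use the folded form only as a comparison key and keep the original data unchanged.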

There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables”: characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. ICU's collations do not take this data into account, so for comparison purposes you would probably need to code your own use of it. ICU does, however, provide a separate SpoofChecker API (uspoof.h in ICU4C, com.ibm.icu.text.SpoofChecker in ICU4J) that is based on UTS #39.
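For anyone coding their own use of the confusables data, here is a minimal Python sketch. It assumes the data uses the `source ; target ; type # comment` layout of the UTS #39 confusables.txt file; the two sample lines are written in that style for illustration and should be replaced by the real downloaded file. The names `SAMPLE`, `parse_confusables` and `skeleton` are hypothetical, and the skeleton step is a simplified reading of the UTS #39 procedure (NFD, map each character to its prototype, NFD again).

```python
import unicodedata

# Sample lines in the style of confusables.txt (assumed layout:
# "source ; target ; type # comment", code points in hex).
SAMPLE = """\
# source ; target ; type # comment
0421 ;\t0043 ;\tMA\t# CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C
0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def parse_confusables(text: str) -> dict:
    """Build a {source string: prototype string} table from the data."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text: str, table: dict) -> str:
    """Simplified skeleton: NFD, map characters to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


if __name__ == "__main__":
    table = parse_confusables(SAMPLE)
    print(skeleton("\u0421\u0441", table) == skeleton("Cc", table))
```

Two strings are then treated as confusable when their skeletons are equal. The skeleton is meant only for comparison, not for display or storage.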

Sponsor answered 14/10, 2013 at 5:19 Comment(0)

If you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated for code points that are similar in use; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when Punycode was introduced. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'.
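The search-then-fix approach might look like the following Python sketch. It assumes the data is expected to be Latin-script, so anything in the basic Cyrillic block (U+0400 to U+04FF) counts as unexpected; the names `CYRILLIC`, `report_unexpected` and `fix_known` are hypothetical, and the fixer list holds only the two replacements relevant to the question.

```python
import re
import unicodedata

# Assumption: text should be Latin-script, so basic Cyrillic is "unexpected".
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def report_unexpected(text: str) -> list:
    """Return (offset, character, Unicode name) for each Cyrillic character."""
    return [
        (m.start(), m.group(), unicodedata.name(m.group(), "UNKNOWN"))
        for m in CYRILLIC.finditer(text)
    ]


def fix_known(text: str) -> str:
    """Apply a short list of ad-hoc replacements, one pair at a time."""
    fixes = [("\u0441", "c"), ("\u0421", "C")]
    for bad, good in fixes:
        text = text.replace(bad, good)
    return text


if __name__ == "__main__":
    sample = "\u0421at"  # starts with Cyrillic ES, not Latin C
    print(report_unexpected(sample))
    print(fix_known(sample))
```

Running the report first shows which characters actually occur in the data, so the list of fixers can stay short and be reviewed by hand.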

Pocketbook answered 14/10, 2013 at 0:56 Comment(0)
