I have a dataset which mixes use of unicode characters \u0421, 'С' and \u0043, 'C'. Is there some sort of unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
when you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated for codepoints that are similar in use; however, i'm not aware of any extensive list that covers visual similarities across scripts. you might want to search for URL spoofing using intentional misspellings, which was discussed when they came up with punycode. other than that, your best bet might be to search the data for characters outside the expected using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'
.
© 2022 - 2024 — McMap. All rights reserved.
Comparator<String>
which tacklesAaBCcEeHKMOoPpTUuXxYy
(or so). Mind that Serbian knows aj
, Belarussian ai
. Maybe you could go with font glyphs of, say, Arial Unicode MS and derive a visual similarity table. – Ionium