Is there a list of characters that look similar to English letters?

Asked 29/2, 2012 at 0:20 Answered 25/5, 2022 at 1:18

I’m having a crack at profanity filtering for a web forum written in Python.

As part of that, I’m attempting to write a function that takes a word, and returns all possible mock spellings of that word that use visually similar characters in place of specific letters (e.g. s†å©køv€rƒ|øw).

I expect I’ll have to expand this list over time to cover people’s creativity, but is there a list floating around anywhere on the internet that I could use as a starting point?

Power answered 29/2, 2012 at 0:20 Comment(11)

I can't answer the question, but I wouldn't use a function that returns all possible mock spellings of a word. That can be awfully many. Instead, I'd normalize each word in the posts before looking it up in the list of bad words, i.e. transform "s†å©køv€rƒ|øw" to "stackoverflow" before the look-up. – Ranch 29/2, 2012 at 0:24

@SvenMarnach: oh dear, that would be a better approach wouldn’t it. The newbie programmer mind is like a mogwai — should not be fed after midnight. – Power 29/2, 2012 at 0:32

Related but not exact duplicate: #4846865 – Officialism 29/2, 2012 at 0:32

@SvenMarnach: pop that in as an answer, and I’d be delighted to accept it. – Power 29/2, 2012 at 0:37

@PaulD.Waite: No, I won't. It doesn't solve the problem, it's rather a side note. You will still need data on the character mapping, which is the main issue here. (And I think your question is perfectly valid and on-topic.) – Ranch 29/2, 2012 at 0:43

@SvenMarnach: I see your point, and cheers. – Power 29/2, 2012 at 0:45

There are scripts and programs that leetify a word (toggle case and replace o with zero, 3 with e, etc. I'd start by looking at those. – Enshrine 29/2, 2012 at 1:15

this idea just sprang to my mind - it's neither analysed thoroughly nor tested in any way. however, how about 1. choose a font 2. create bitmap renderings of all glyphs 3. define a similarity measure over bitmaps (simple one: proportion of equal vs. different bit values over all grid positions inside a std bounding box). 4. compute the similarity matrix for pairs of chars 5. cluster the glyphs accordingly 6. choose a rep for each cluster (ideally these would come out as a-zA-Z0-9). then filtering would amount to mapping each char onto the proper cluster rep and a dict lookup. – Miley 1/3, 2012 at 14:21

... obviously you'd have to apply a similar technique to normalize homophones (at least in languages like english with non-unique phonem-grapheme correspondences). rhite, dewd ? ;-) – Miley 1/3, 2012 at 14:23

@collapsar: that’s a good approach. One day. – Power 1/3, 2012 at 15:13

For normalizing homophones, look up "soundex" and its descendants. For the rest, you'll probably want to also look out for Cyrillic characters etc., right? "IDN homograph attacks" is the term here. There's probably a list of those already. – Inessa 5/3, 2012 at 20:16

This is probably both vastly more deep than you need, yet not wide enough to cover your use case, but the Unicode consortium have had to deal with attacks against internationalised domain names and came up with this list of homographs (characters with the same or similar rendering):

http://www.unicode.org/Public/security/latest/confusables.txt

Might make a starting point at least.

Gonzales answered 9/4, 2012 at 13:6 Comment(1)

Excellent. I needed something that was visually indistinguishable from a capital E, but would come after any ordinary English word. "MATHEMATICAL SANS-SERIF CAPITAL E" is perfect (for the font in my case.) – Apps 7/5, 2019 at 15:55

http://en.wikipedia.org/wiki/Letterlike_Symbols

It's much much much less comprehensive but is more comprehensible.

Alpenhorn answered 12/12, 2013 at 18:16 Comment(3)

Wait which one is it? – Wyant 20/7, 2016 at 14:40

I believe comprehensive and comprehensible are independent qualities? – Alpenhorn 25/7, 2016 at 21:37

Comprehensive: Containing much information and details from many sources. Comprehensible: Understandable. – Sikhism 10/2, 2017 at 20:41

I created a python class to do exactly this, based on Robin's unicode link for "confusables"

https://github.com/wanderingstan/Confusables

For example, "Hello" would get expanded into the following set of regexp character classes:

[H\Ｈ\ℋ\ℌ\ℍ\𝐇\𝐻\𝑯\𝓗\𝕳\𝖧\𝗛\𝘏\𝙃\𝙷\Η\𝚮\𝛨\𝜢\𝝜\𝞖\Ⲏ\Н\Ꮋ\ᕼ\ꓧ\𐋏\Ⱨ\Ң\Ħ\Ӊ\Ӈ] [e\℮\ｅ\ℯ\ⅇ\𝐞\𝑒\𝒆\𝓮\𝔢\𝕖\𝖊\𝖾\𝗲\𝘦\𝙚\𝚎\ꬲ\е\ҽ\ɇ\ҿ] [l\‎\|\∣\⏽\￨1\‎\۱\𐌠\‎\𝟏\𝟙\𝟣\𝟭\𝟷I\Ｉ\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\ｌ\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\‎\‎\‎\‎\‎\‎\‎\‎\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\‎\‎\ł\ɭ\Ɨ\ƚ\ɫ\‎\‎\‎\‎\ŀ\Ŀ\ᒷ\🄂\⒈\‎\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\ǉ\Ĳ\‖\∥\Ⅱ\ǁ\‎\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙] [l\‎\|\∣\⏽\￨1\‎\۱\𐌠\‎\𝟏\𝟙\𝟣\𝟭\𝟷I\Ｉ\Ⅰ\ℐ\ℑ\𝐈\𝐼\𝑰\𝓘\𝕀\𝕴\𝖨\𝗜\𝘐\𝙄\𝙸\Ɩ\ｌ\ⅼ\ℓ\𝐥\𝑙\𝒍\𝓁\𝓵\𝔩\𝕝\𝖑\𝗅\𝗹\𝘭\𝙡\𝚕\ǀ\Ι\𝚰\𝛪\𝜤\𝝞\𝞘\Ⲓ\І\Ӏ\‎\‎\‎\‎\‎\‎\‎\‎\ⵏ\ᛁ\ꓲ\𖼨\𐊊\𐌉\‎\‎\ł\ɭ\Ɨ\ƚ\ɫ\‎\‎\‎\‎\ŀ\Ŀ\ᒷ\🄂\⒈\‎\⒓\㏫\㋋\㍤\⒔\㏬\㍥\⒕\㏭\㍦\⒖\㏮\㍧\⒗\㏯\㍨\⒘\㏰\㍩\⒙\㏱\㍪\⒚\㏲\㍫\ǉ\Ĳ\‖\∥\Ⅱ\ǁ\‎\𐆙\⒒\Ⅲ\𐆘\㏪\㋊\㍣\Ю\⒑\㏩\㋉\㍢\ʪ\₶\Ⅳ\Ⅸ\ɮ\ʫ\㏠\㋀\㍙] [o\ం\ಂ\ം\ං\०\੦\૦\௦\౦\೦\൦\๐\໐\၀\‎\۵\ｏ\ℴ\𝐨\𝑜\𝒐\𝓸\𝔬\𝕠\𝖔\𝗈\𝗼\𝘰\𝙤\𝚘\ᴏ\ᴑ\ꬽ\ο\𝛐\𝜊\𝝄\𝝾\𝞸\σ\𝛔\𝜎\𝝈\𝞂\𝞼\ⲟ\о\ჿ\օ\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\‎\ഠ\ဝ\𐓪\𑣈\𑣗\𐐬\‎\ø\ꬾ\ɵ\ꝋ\ө\ѳ\ꮎ\ꮻ\ꭴ\‎\ơ\œ\ɶ\∞\ꝏ\ꚙ\ൟ\တ]

This regexp will match against "𝓗℮𝐥1೦"

Lankford answered 1/2, 2018 at 5:27 Comment(0)

I don't have solution per se, but I have some ideas.

@collapsar's approach in the comments sounds good to me in principle, but I think you'd want to use an off-the-shelf OCR library rather than try to analyze the images yourself. To make the images, I'd use a font like something in the DejaVu family, because it has good coverage of relatively obscure Unicode characters.

Another easy way to get data is to look at the decompositions of "precomposed" characters like "à"; if a character can be decomposed into one or more combining chapters followed by a base character that looks like an English letter, it probably looks like an English letter itself.

Nothing beats lots of data for a problem like this. You could collect a lot of good examples of character substitutions people have made by scraping the right web forums. Then you can use this procedure to learn new ones: first, find "words" containing mostly characters you can identify, along with some you can't. Make a regex from the word, converting everything you can to regular letters and replacing everything else with ".". Then match your regex against a dictionary, and if you get only one match, you have some very good candidates for what the unknown characters are supposed to represent. (I would not actually use a regex for searching a dictionary, but you get the idea.)

Instead of mining forums, you may be able to use Google's n-gram corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html) instead, but I'm not able to check right now if it contains the kind of pseudo-words you need.

Recommended topics

Hot tags