Find characters that are similar glyphically in Unicode?
I

3

18

Let's say I have the characters Ú, Ù, Ü. All of them are glyphically similar to the English U.

Is there some list or algorithm to do this:

  • Given a Ú or Ù or Ü return the English U
  • Given an English U, return the list of all U-similar characters

I'm not sure whether the code points of Unicode characters are the same across all fonts. If they are, I suppose there could be some easy and efficient way to do this?

UPDATE

If you're using Ruby, there is a gem available for this, unicode-confusable, that may help in some cases.

Illyria answered 30/1, 2011 at 23:39 Comment(7)
Yes, and so are ∪ U+222A UNION and ⋃ U+22C3 N-ARY UNION and ⩌ U+2A4C CLOSED UNION WITH SERIFS and Ｕ U+FF35 FULLWIDTH LATIN CAPITAL LETTER U and a whole lot more. What are you trying to do? U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ȕ, Ȗ, ᵁ, Ṳ, Ṵ, Ṷ, Ṹ, Ṻ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự, Ⓤ, ....Pessimism
possible duplicate of Converting Symbols, Accent Letters to English Alphabet.Dotdotage
Have you looked at the unidecode module? pypi.python.org/pypi/UnidecodeVoidable
The Unicode concept of "confusables" is also worth mentioning here; see a demo, full list, and the technical report.Rodent
@Rodent I was about to comment the same, but realized that the confusables are almost exact lookalikes and not the accented versions. (like: 𝐔𝑈𝑼𝒰𝖴𝚄)Enounce
@Enounce Correct. I agree that only looking at confusables might not cut it, but it provides Ù → U + ̀. An easier way to get such information might be normalization.Rodent
I think that if you are treating the result as if there is some kind of relationship between the input and the output (besides looks), then you need to know the language of the input. For example, in Swedish an Å is not an A with a ring modification, it is a separate character. A is as different from Å as A is from B. Contrast that with e and ê, where the latter is an e with a circumflex modification. Sorting in Swedish is like this: [A], [B], [C], [EÉÈÊË], [X], [Y], [Z], [Å], [Ä], [Ö], where all chars within the same brackets should be treated as the same.Oates
E
14

This won't work in every case, but one way to get rid of most accents is to convert the characters to their decomposed form (NFD), then throw away the combining accents:

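# coding: utf8
# Python 3
import unicodedata as ud
s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
# NFD splits each letter into base + combining marks; encoding to ASCII with
# errors='ignore' then drops the marks (and any other non-ASCII character).
print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))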

Output

U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U

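To find accented characters, use something like:

# Python 3
import unicodedata as ud
import string

def asc(ch):
    # ASCII text left after decomposing and dropping every non-ASCII character
    return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

# Every code point in the Basic Multilingual Plane
U = ''.join(chr(i) for i in range(65536))
for c in string.ascii_letters:
    print(''.join(u for u in U if asc(u) == c))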

Output

aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
 :
etc.
Erudition answered 31/1, 2011 at 8:4 Comment(1)
i sure hope your source text doesn't include æ, or em dashes, or CJK text, or emoji, or any of the tens of thousands of other characters in Unicode that aren't related to English letters.Grope
P
33

It is very unclear what you are asking to do here.

  • There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….

  • There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, e, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ſt, st, s, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, R, ….

  • There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….

  • Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….

  • Characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱𐄇𐅂𐅘𐅙𐅚𐌠𐏑𐒡𐡘𐤖𐩀𐩽𐭘𐭸𐹠𒐕𒐞𒐬𒐴𒑏𒑘𝍠𝟏𝟙𝟣𝟭𝟷 🄂 Ⅰⅰꛦ㆒㈠㊀𑁒𑁧.

  • Characters that all have the same primary collation strength, like all these that are the same as d: DdÐðĎďĐđ◌ͩᴰᵈᶞ◌ᷘ◌ᷙḊḋḌḍḎḏḐḑḒḓⅅⅆⅮⅾ Ⓓ ⓓ ꝹꝺDd𝐃𝐝𝐷𝑑𝑫𝒅𝒟𝒹𝓓𝓭𝔇𝔡𝔻𝕕𝕯𝖉𝖣𝖽𝗗𝗱𝘋𝘥𝘿𝙙𝙳𝚍 🄳 🅓 🅳 🇩 . Note that some of those are not accessible through any kind of decomposition, but only through the DUCET/UCA values; for example, the fairly common ð or the newish ꝺ can be equated to d only through a primary UCA strength comparison; same with ƶ and z, ȼ and c, etc.

  • Characters that are the same in certain locales, like æ and ae, or ä and ae, or ä and aa, or MacKinley and McKinley, …. Note that locale can make a really big difference, since in some locales both c and ç are the same character while in others they are not; similarly for n and ñ, or a and á and ã, ….

Some of these can be handled. Some cannot. All require different approaches depending on different needs.
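
As a rough illustration, four of the categories above (canonical decomposition, compatibility decomposition, case-insensitive equality, and numeric value) can be probed with Python 3's standard library alone. This is a sketch under that assumption; the function names are made up for the example. Primary collation strength and locale-specific equivalences need a collation library (for example ICU) and are not shown.

```python
import unicodedata as ud

def canonical_base(ch):
    """First character of the canonical (NFD) decomposition."""
    return ud.normalize('NFD', ch)[0]

def compat_form(ch):
    """Compatibility (NFKD) decomposition of a character."""
    return ud.normalize('NFKD', ch)

def same_ignoring_case(a, b):
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

def numeric_value(ch):
    """Numeric value of a character, or None if it has none."""
    return ud.numeric(ch, None)

print(canonical_base('é'))              # e
print(compat_form('ⓔ'))                 # e
print(compat_form('₨'))                 # Rs
print(same_ignoring_case('ß', 'SS'))    # True
print(same_ignoring_case('ſ', 'S'))     # True
print(numeric_value('①'), numeric_value('١'))  # 1.0 1.0
# ð has no decomposition at all, so normalization cannot relate it to d
print(canonical_base('ð') == 'ð')       # True
```

The last line shows why the collation-based category is separate: decomposition leaves ð untouched, so equating it with d takes a different mechanism.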

What is your real goal?

Pessimism answered 31/1, 2011 at 0:8 Comment(1)
+1 to all of it, but mostly "What is your real goal?"! Knowing that is necessary to find the correct approach!Dotdotage
A
6

Why not just compare glyphs with something like this?

package similarglyphcharacterdetector;

import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarGlyphCharacterDetector {

    static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
    static BufferedImage[] SAMPLES = null;

    public static BufferedImage drawGlyph(Font font, String string) {
        FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();

        Rectangle r= font.getMaxCharBounds(frc).getBounds();

        BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = (Graphics2D) res.getGraphics();
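        // fillRect paints with the current paint, not the background color,
        // so set the paint to white explicitly before clearing the canvas.
        g.setPaint(Color.WHITE);
        g.fillRect(0, 0, r.width, r.height);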
        g.setPaint(Color.BLACK);
        g.setFont(font);
        g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
        return res;
    }

    private static void drawSamples(Font f) {
        SAMPLES = new BufferedImage[TEST_CHARS.length];
        for (int i = 0; i < TEST_CHARS.length; i++)
            SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
    }

    private static int compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
            throw new IllegalArgumentException();
        int d = 0;
        for (int y = 0; y < img1.getHeight(); y++) {
            for (int x = 0; x < img1.getWidth(); x++) {
                if (img1.getRGB(x, y) != img2.getRGB(x, y))
                    d++;
            }
        }
        return d;
    }

    private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
        int best = Integer.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < SAMPLES.length; i++) {
            int diff = compareImages(image, SAMPLES[i]);
            if (diff < best) {
                best = diff;
                bestIdx = i;
            }
        }
        if (best > maxDistance)
            return -1;
        return bestIdx;
    }

    public static void main(String[] args) throws Exception {
        Font f = new Font("FreeMono", Font.PLAIN, 13);
        drawSamples(f);
        HashMap<Character, StringBuilder> res = new LinkedHashMap<Character, StringBuilder>();
        for (char c : TEST_CHARS)
            res.put(c, new StringBuilder(String.valueOf(c)));
        int maxDistance = 5;
        for (int i = 0x80; i <= 0xFFFF; i++) {
            char c = (char)i;
            if (f.canDisplay(c)) {
                int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                if (n != -1) {
                    char nc = TEST_CHARS[n];
                    res.get(nc).append(c);
                }
            }
        }
        for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
            if (entry.getValue().length() > 1)
                System.out.println(entry.getValue());
    }
}

Output:

AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
BƁƂΒБВЬḂḄḆ
CĆĈĊČƇΓЄГСὉℂⅭ
...
Amadoamador answered 29/2, 2012 at 20:25 Comment(2)
What a delicious hack. I doubt it's going to correctly say that I, l and 1 or 0, o and O are all different in most fonts, though, while still saying all those B-like things are the same.Roshelle
Hey, how did I stumble into Code Golf?!Lavettelavigne
