Get all Unicode variations of a Latin character
E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (not sure if that example list is complete...) (basically all unicode chars with names "Latin Small Letter A with *").

Is there a generic way to get this?

I'm asking for Python, but if the answer is more generic, this is also fine, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. But I guess you need to have access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine some solution like this:

def get_variations(char):
    import unicodedata
    name = unicodedata.name(char)
    chars = char
    for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
        try:
            chars += unicodedata.lookup("%s %s" % (name, variation))
        except KeyError:
            pass
    return chars
Mantelletta answered 23/7, 2019 at 17:38 Comment(3)
If you only mean accented characters, try iterating over all the Latin combining accents.Rosenzweig
At some level, you have to consider non-spacing combining marks (category Mn). If you want to list all strings with a Latin letter and any combining mark, you should know that any number of combining marks is allowed. In that case, the answer is an infinite set. But maybe you only want the ones that can be normalized, which means that for many such "characters" there are two distinct strings with the same semantic value.Normand
Depending on what you need that for, things may get more complicated if you take non-Latin scripts into account. A Cyrillic A is often indistinguishable from a Latin A or a Greek uppercase alpha. And then you have the Kelvin symbol, which looks like a K (no surprise). See guido-flohr.net/unicode-regex-pitfalls Luis
B
6

To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))

Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)

This has the advantage of not trying to manually perform string lookups in the unicodedata DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).
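As a minimal sketch of the adjustments discussed in the comments (keeping the letter itself and dropping duplicates), the function above could be varied roughly as follows. The name get_unicode_variations_unique and the include_self parameter are hypothetical, not part of the original answer; the deduplication relies on dict preserving insertion order, which is guaranteed from Python 3.7 (and true in CPython 3.6).

```python
import unicodedata

# Combining Diacritical Marks block, U+0300..U+036F (768..879 decimal)
COMBINING_CHARS = ''.join(map(chr, range(0x300, 0x370)))

def get_unicode_variations_unique(letter, include_self=True):
    """Hypothetical variant: optionally keeps the letter, removes duplicates."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    found = [letter] if include_self else []
    for combiner in COMBINING_CHARS:
        composed = unicodedata.normalize('NFKC', letter + combiner)
        # Length 1 means letter + mark composed into a single code point
        if len(composed) == 1:
            found.append(composed)
    # dict.fromkeys drops repeats while keeping first-appearance order
    return ''.join(dict.fromkeys(found))
```

Duplicates arise because some combining marks are equivalent to others under normalization, so two different marks can compose to the same character.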

If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # +1 so the scan includes the last code point, sys.maxunicode itself
    for testlet in map(chr, range(sys.maxunicode + 1)):
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
            variations.append(testlet)
    return ''.join(variations)

This looks for any character that decomposes into a form that includes the target letter. It does mean that the first search takes roughly a third of a second, and the result includes characters that aren't really just a modified version of the letter (e.g. the result for 'L' includes characters such as '℡', which isn't really a "modified 'L'"), but it's as exhaustive as you can get.
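If the mixed results are a problem, one possible way to sort them afterwards is to check whether a character's canonical decomposition (NFD, which ignores compatibility mappings) starts with the letter. This is only a sketch; split_by_decomposition is a hypothetical helper, and treating "NFD starts with the letter" as "modified letter" is an assumption rather than a Unicode-defined category.

```python
import unicodedata

def split_by_decomposition(letter, candidates):
    """Hypothetical helper: returns (canonical, compatibility_only) lists."""
    canonical, compatibility_only = [], []
    for ch in candidates:
        if ch == letter:
            continue  # the letter itself is not a variation
        # NFD applies canonical mappings only, e.g. base letter + combining marks
        if unicodedata.normalize('NFD', ch).startswith(letter):
            canonical.append(ch)
        else:
            compatibility_only.append(ch)
    return canonical, compatibility_only
```

Passing the string returned by the exhaustive search as candidates would put characters like LATIN CAPITAL LETTER L WITH ACUTE in the first list and symbols like TELEPHONE SIGN in the second.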

Bucket answered 23/7, 2019 at 18:47 Comment(11)
I feel like NFKD would be more useful, but the question is somewhat vague.Prelate
@JoshLee: NFKD? I can see NFKC being useful (if you want to accept non-ASCII inputs with ASCII compatible equivalents), but NFKD doesn't combine characters, so it's not super useful for finding the results of composition.Bucket
Maybe to find characters whose decomposition contains the desired character.Prelate
@JoshLee: Ah. That would work, but it would be much slower; while there are only 112 combining diacritical marks, there are around 138,000 characters (named + control characters), over 277,000 assigned code points (and the obvious approach would be to scan up to sys.maxunicode, which is over one million). Scanning 112 possibilities is fairly cheap (< 50 µs), scanning a million gets expensive (~350 ms on my machine).Bucket
I suppose it depends on the OP's needs; looking only at combining diacritics gets 'ĹĽḶĻḼḺĹ', which are all definitely "modified 'L's"; exhaustive search gets 'LĹĻĽĿLJLjᴸḶḸḺḼℒ℡ⅬⓁ㋏L𝐋𝐿𝑳𝓛𝔏𝕃𝕷𝖫𝗟𝘓𝙇𝙻🄛🄻', many of which aren't really 'L's, they're just things built out of 'L's (e.g. '℡').Bucket
@JoshLee: I've added the alternate solution; the OP can choose which works for them.Bucket
Thanks a lot! Your first solution is almost like what I wanted. Note that e.g. for "a" I get "àáâãāăȧäảåǎȁȃạḁąàá", i.e. there is not the "a" itself in it, and e.g. "à" is there twice. But it was easy to modify to cover that.Mantelletta
This seems to work nicely for "a" and many others, but it fails e.g. for "q", where it does not return any variations, although there are a couple, e.g. "q́q̄q̇q̣̇q̈q̣̈q̋q̣". Why is that?Mantelletta
Ah, maybe my Python unicodedata version. unicodedata.unidata_version == '9.0.0' for me.Mantelletta
@Albert: I intentionally left out the character itself in both implementations; you can always explicitly add it back in if you like (remove the test to prevent self insertion in the second case). If duplicates occur due to multiple combining characters producing the same character, you can change the final line to: return ''.join(dict.fromkeys(variations)), which on Python 3.6+ will dedup while preserving order of first appearance (on 3.5 and earlier, you'd use collections.OrderedDict.fromkeys).Bucket
@Albert: As for q, I've got version 11.0.0 and still get no output, while the exhaustive version returns '⒬ⓠ㏃q𝐪𝑞𝒒𝓆𝓺𝔮𝕢𝖖𝗊𝗾𝘲𝙦𝚚'. I suspect q with diacritics doesn't have a single code point composed form, it's only formed from q + a combining diacritic.Bucket
A
2

You can use the decomposition mappings of the Unicode database directly. The following code checks all mappings for characters with a decomposition starting with a certain letter:

import unicodedata

def get_unicode_variations(letter):
    letter_code = ord(letter)
    # For some characters, you might want to check all
    # code points up to 0x10FFFF
    for i in range(65536):
        decomp = unicodedata.decomposition(chr(i))
        # Mappings starting with '<...>' indicate a
        # compatibility mapping (NFKD, NFKC) which we ignore.
        while decomp != '' and not decomp.startswith('<'):
            first_code = int(decomp.split()[0], 16)
            if first_code == letter_code:
                print(chr(i), unicodedata.name(chr(i)))
                break
            # Try to decompose further
            decomp = unicodedata.decomposition(chr(first_code))

This is rather inefficient if you want to process multiple characters, though. For the letter a, the code above prints:

à LATIN SMALL LETTER A WITH GRAVE
á LATIN SMALL LETTER A WITH ACUTE
â LATIN SMALL LETTER A WITH CIRCUMFLEX
ã LATIN SMALL LETTER A WITH TILDE
ä LATIN SMALL LETTER A WITH DIAERESIS
å LATIN SMALL LETTER A WITH RING ABOVE
ā LATIN SMALL LETTER A WITH MACRON
ă LATIN SMALL LETTER A WITH BREVE
ą LATIN SMALL LETTER A WITH OGONEK
ǎ LATIN SMALL LETTER A WITH CARON
ǟ LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
ǡ LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
ǻ LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
ȁ LATIN SMALL LETTER A WITH DOUBLE GRAVE
ȃ LATIN SMALL LETTER A WITH INVERTED BREVE
ȧ LATIN SMALL LETTER A WITH DOT ABOVE
ḁ LATIN SMALL LETTER A WITH RING BELOW
ạ LATIN SMALL LETTER A WITH DOT BELOW
ả LATIN SMALL LETTER A WITH HOOK ABOVE
ấ LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ầ LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
ẩ LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
ẫ LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
ậ LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
ắ LATIN SMALL LETTER A WITH BREVE AND ACUTE
ằ LATIN SMALL LETTER A WITH BREVE AND GRAVE
ẳ LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
ẵ LATIN SMALL LETTER A WITH BREVE AND TILDE
ặ LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
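If several letters need to be processed, one option is to scan the code points once and group every character by the first code point of its canonical decomposition. The following is only a sketch of that idea: build_variation_index is a hypothetical name, and it uses unicodedata.normalize('NFD', ...) to follow the canonical mappings recursively instead of parsing unicodedata.decomposition() by hand.

```python
import sys
import unicodedata
from collections import defaultdict

def build_variation_index():
    """Hypothetical helper: map each base character to its canonical variants."""
    index = defaultdict(list)
    for code in range(sys.maxunicode + 1):
        ch = chr(code)
        # NFD fully applies canonical mappings and skips compatibility ones
        decomposed = unicodedata.normalize('NFD', ch)
        if decomposed != ch:
            index[decomposed[0]].append(ch)
    return index
```

After a single call, index['a'], index['e'] and so on are plain list lookups. Note that the index also contains entries for non-Latin base characters, since every canonically decomposable character is grouped.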
Autogamy answered 24/7, 2019 at 16:0 Comment(0)
T
1

I don't know of a built-in way, but you could build one yourself. Look up the start and end code points of the accented characters for each letter in a Unicode character table, then build a string for each letter from those ranges (the end value is exclusive):

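ranges = {
    'A': (192, 199),
    'B': (0, 0),
    'E': (200, 204),
    # ... (remaining letters elided)
}

# Named char_map so the built-in map() is not shadowed
char_map = {}
for char, (start, end) in ranges.items():
    # range() excludes `end`, so (192, 199) covers code points 192..198
    char_map[char] = char + ''.join(chr(i) for i in range(start, end))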

This would generate a map such that:

{
  'A': 'AÀÁÂÃÄÅÆ',
  'B': 'B',
  'E': 'EÈÉÊË',
  ...
}
Terribly answered 23/7, 2019 at 17:58 Comment(0)
H
0

With the unichars command-line tool (part of the Perl Unicode::Tussle distribution):

› unichars -a | grep -i 'Latin Small Letter A with'
 à  U+000E0 LATIN SMALL LETTER A WITH GRAVE
 á  U+000E1 LATIN SMALL LETTER A WITH ACUTE
 â  U+000E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
 ã  U+000E3 LATIN SMALL LETTER A WITH TILDE
 ä  U+000E4 LATIN SMALL LETTER A WITH DIAERESIS
 å  U+000E5 LATIN SMALL LETTER A WITH RING ABOVE
 ā  U+00101 LATIN SMALL LETTER A WITH MACRON
 ă  U+00103 LATIN SMALL LETTER A WITH BREVE
 ą  U+00105 LATIN SMALL LETTER A WITH OGONEK
 ǎ  U+001CE LATIN SMALL LETTER A WITH CARON
 ǟ  U+001DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
 ǡ  U+001E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
 ǻ  U+001FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
 ȁ  U+00201 LATIN SMALL LETTER A WITH DOUBLE GRAVE
 ȃ  U+00203 LATIN SMALL LETTER A WITH INVERTED BREVE
 ȧ  U+00227 LATIN SMALL LETTER A WITH DOT ABOVE
 ᶏ  U+01D8F LATIN SMALL LETTER A WITH RETROFLEX HOOK
 ◌ᷲ  U+01DF2 COMBINING LATIN SMALL LETTER A WITH DIAERESIS
 ḁ  U+01E01 LATIN SMALL LETTER A WITH RING BELOW
 ẚ  U+01E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
 ạ  U+01EA1 LATIN SMALL LETTER A WITH DOT BELOW
 ả  U+01EA3 LATIN SMALL LETTER A WITH HOOK ABOVE
 ấ  U+01EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
 ầ  U+01EA7 LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
 ẩ  U+01EA9 LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
 ẫ  U+01EAB LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
 ậ  U+01EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
 ắ  U+01EAF LATIN SMALL LETTER A WITH BREVE AND ACUTE
 ằ  U+01EB1 LATIN SMALL LETTER A WITH BREVE AND GRAVE
 ẳ  U+01EB3 LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
 ẵ  U+01EB5 LATIN SMALL LETTER A WITH BREVE AND TILDE
 ặ  U+01EB7 LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
 ⱥ  U+02C65 LATIN SMALL LETTER A WITH STROKE
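A rough Python counterpart of the grep above, using only the unicodedata module, might look like this. It is a sketch, not a drop-in replacement: chars_with_name_prefix is a hypothetical name, and it matches on a name prefix rather than a substring, so COMBINING LATIN SMALL LETTER A WITH DIAERESIS is not included. The exact result depends on the Unicode version bundled with the Python build (unicodedata.unidata_version).

```python
import sys
import unicodedata

def chars_with_name_prefix(prefix):
    """Hypothetical helper: all characters whose Unicode name starts with prefix."""
    prefix = prefix.upper()  # Unicode names are upper case
    found = []
    for code in range(sys.maxunicode + 1):
        ch = chr(code)
        # The second argument is returned for code points without a name
        if unicodedata.name(ch, '').startswith(prefix):
            found.append(ch)
    return ''.join(found)

small_a_variants = chars_with_name_prefix('Latin Small Letter A with ')
```

Unlike the normalization-based approaches, this also finds letters without a decomposition mapping, such as LATIN SMALL LETTER A WITH STROKE.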
Hymenopteran answered 24/7, 2019 at 7:41 Comment(0)
