To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:
# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))
Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:
import unicodedata
def get_unicode_variations(letter):
if len(letter) != 1:
raise ValueError("letter must be a single character to check for variations")
variations = []
# We could just loop over map(chr, range(768, 880)) without caching
# in combining_chars, but that increases runtime ~20%
for combiner in combining_chars:
normalized = unicodedata.normalize('NFKC', letter + combiner)
if len(normalized) == 1:
variations.append(normalized)
return ''.join(variations)
This has the advantage of not trying to manually perform string lookups in the unicodedata
DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache
if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).
If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache
would be nigh mandatory unless it's only ever called once per argument):
import functools
import sys
import unicodedata
@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter):
if len(letter) != 1:
raise ValueError("letter must be a single character to check for variations")
variations = []
for testlet in map(chr, range(sys.maxunicode)):
if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter:
variations.append(testlet)
return ''.join(variations)
This looks for any character that decomposes into a form that includes the target letter; it does mean that searching the first time takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character (e.g. 'L'
's result will include ℡
, which isn't really a "modified 'L'
), but it's as exhaustive as you can get.