How can I check if a Python unicode string contains non-Western letters?
G

8

41

I have a Python Unicode string. I want to make sure it only contains letters from the Roman alphabet (A through Z), as well as letters commonly found in European alphabets, such as ß, ü, ø, é, à, and î. It should not contain characters from other alphabets (Chinese, Japanese, Korean, Arabic, Cyrillic, Hebrew, etc.). What's the best way to go about doing this?

Currently I am using this bit of code, but I don't know if it's the best way:

def only_roman_chars(s):
    try:
        s.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:  # encode() raises UnicodeEncodeError, not UnicodeDecodeError
        return False

(I am using Python 2.5. I am also doing this in Django, so if the Django framework happens to have a way to handle such strings, I can use that functionality -- I haven't come across anything like that, however.)

Girandole answered 22/6, 2010 at 15:13 Comment(5)
What is your goal in filtering these characters? I can't think of a good reason to do this that isn't a symptom of something wrong elsewhere in the codebase.Educationist
Filtering mailing addresses. Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses.Girandole
Can't you filter on country instead then? (otherwise, interesting question +1)Liquidate
Not really. Someone could select "China" and still enter an appropriate address, for example.Girandole
github.com/EliFinkelshteyn/alphabet-detector/blob/master/…Quint
O
49
import unicodedata as ud

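latin_letters = {}  # memo: character -> whether its Unicode name contains 'LATIN'

def is_latin(uchr):
    try:
        return latin_letters[uchr]
    except KeyError:
        return latin_letters.setdefault(uchr, 'LATIN' in ud.name(uchr))

def only_roman_chars(unistr):
    # Only alphabetic characters are tested; digits, spaces and punctuation are skipped.
    return all(is_latin(uchr)
               for uchr in unistr
               if uchr.isalpha())  # isalpha suggested by John Machin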

>>> only_roman_chars(u"ελληνικά means greek")
False
>>> only_roman_chars(u"frappé")
True
>>> only_roman_chars(u"hôtel lœwe")
True
>>> only_roman_chars(u"123 ångstrom ð áß")
True
>>> only_roman_chars(u"russian: гага")
False
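
For readers on Python 3, the same memoized check can be sketched with functools.lru_cache standing in for the hand-rolled dictionary. This is an adaptation, not the code from the answer; the "" default passed to unicodedata.name is an added assumption so that unnamed code points do not raise ValueError.

```python
import unicodedata as ud
from functools import lru_cache

@lru_cache(maxsize=None)
def is_latin(uchr):
    # True when the character's Unicode name mentions LATIN.
    # The "" default avoids ValueError for unnamed code points.
    return "LATIN" in ud.name(uchr, "")

def only_roman_chars(unistr):
    # Only alphabetic characters are checked; digits, spaces and
    # punctuation pass through untested.
    return all(is_latin(uchr) for uchr in unistr if uchr.isalpha())
```

The cache grows by one entry per distinct character seen, so it stays small for typical address data.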
Oralla answered 22/7, 2010 at 12:33 Comment(5)
Consider uchr.isalpha() instead of unicodedata.category(uchr).startswith('L'). Consider using a set constructed at module load time: okletters = set(unichr(i) for i in xrange(sys.maxunicode+1) if unicodedata.name(unichr(i), "").startswith('LATIN ')) i.e. use uchr in okletters instead of 'LATIN' in unicodedata.name(uchr)Homeless
@John: uchr.isalpha is a better suggestion, thank you; I will update my answer. For the optimization suggestion, I'd go with a memoized-style function.Oralla
For the is_latin function, a subclass of defaultdict appropriately overriding __missing__ would also be a nice solution.Oralla
+1. This was extremely non-trivial. I'm trying to do something similar, where I want to transform Cyrillic to Latin without transforming anything else. Thanks!Checklist
This solution is still incomplete, since some very doubtful characters have 'LATIN' in their Unicode names. For example, see unicode-table.com/en/A7B7Owen
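
One way to address that last concern is to accept only letters from a fixed set of Unicode blocks, rather than anything whose name contains LATIN. The sketch below rests on an assumption about which blocks count as "Western" (Basic Latin through Latin Extended-B, plus Latin Extended Additional); the names COMMON_LATIN_RANGES, in_common_latin_block and only_common_latin_letters are hypothetical, and the ranges should be adjusted to your own definition.

```python
# Inclusive code point ranges; which blocks to allow is a judgment call.
COMMON_LATIN_RANGES = (
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x1E00, 0x1EFF),  # Latin Extended Additional
)

def in_common_latin_block(uchr):
    # True if the code point falls in one of the allowed ranges.
    cp = ord(uchr)
    return any(lo <= cp <= hi for lo, hi in COMMON_LATIN_RANGES)

def only_common_latin_letters(unistr):
    # As in the answer, only alphabetic characters are tested.
    return all(in_common_latin_block(c) for c in unistr if c.isalpha())
```

With these ranges, U+A7B7 (which lies in Latin Extended-D) is rejected, while 'œ' and 'ß' are still accepted.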
E
39

The top answer to this by @tzot is great, but IMO there should really be a library for this that works for all scripts. So, I made one (heavily based on that answer).

pip install alphabet-detector

and then use it directly:

from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()

ad.only_alphabet_chars(u"ελληνικά means greek", "LATIN") #False
ad.only_alphabet_chars(u"ελληνικά", "GREEK") #True
ad.only_alphabet_chars(u'سماوي يدور', 'ARABIC')
ad.only_alphabet_chars(u'שלום', 'HEBREW')
ad.only_alphabet_chars(u"frappé", "LATIN") #True
ad.only_alphabet_chars(u"hôtel lœwe 67", "LATIN") #True
ad.only_alphabet_chars(u"det forårsaker første", "LATIN") #True
ad.only_alphabet_chars(u"Cyrillic and кириллический", "LATIN") #False
ad.only_alphabet_chars(u"кириллический", "CYRILLIC") #True

Also, a few convenience methods for major languages:

ad.is_cyrillic(u"Поиск") #True  
ad.is_latin(u"howdy") #True
ad.is_cjk(u"hi") #False
ad.is_cjk(u'汉字') #True
Examinee answered 3/3, 2015 at 20:39 Comment(2)
I wanted to use the library to detect whether someone in my Twitch (IRC) chat types in Cyrillic, but I dislike that is_cyrillic returns True when the Unicode string contains a single '?'. I also dislike that it isn't triggered by something like "hello Поиск". I am using a regular expression with ranges now; I just wanted to give some feedback :)Dukey
Thanks for the feedback and trying to use the library! You bring up some great points. Would you mind opening up an issue on the github repo so we can discuss them?Examinee
C
4

The standard string module's printable constant contains the ASCII letters, digits, punctuation and whitespace. You can remove these characters from the text, and if anything is left, the text contains characters outside that set. Here is what I did:

In [1]: from string import printable

In [2]: def is_latin(text):
   ...:     return not bool(set(text) - set(printable))
   ...:

In [3]: is_latin('Hradec Králové District,,Czech Republic,')
Out[3]: False

In [4]: is_latin('Hradec Krlov District,,Czech Republic,')
Out[4]: True

I have no way to check all non-Latin characters and if anyone can do that, please let me know. Thanks.
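
Building on the same set difference, a small variation can report which characters fall outside string.printable, which makes it easier to see that the test is really an ASCII check. The helper name non_printable_ascii is hypothetical.

```python
from string import printable

def non_printable_ascii(text):
    # Characters in text that are not in string.printable
    # (ASCII letters, digits, punctuation and whitespace).
    return set(text) - set(printable)
```

For the first example above this returns the set containing 'á' and 'é'; for the second it returns an empty set.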

Carbonate answered 30/7, 2020 at 10:9 Comment(1)
ISTM you implemented something more like an is_ascii function than an is_latin one; é is a Latin character but not an ASCII one.Oralla
C
2

Checking for ISO-8859-1 would miss reasonable Western characters like 'œ' and '€'. The solution depends on how you define "Western", and how you want to handle non-letters. Here's one approach:

import unicodedata

def is_permitted_char(char):
    cat = unicodedata.category(char)[0]
    if cat == 'L': # Letter
        return 'LATIN' in unicodedata.name(char, '').split()
    elif cat == 'N': # Number
        # Only DIGIT ZERO - DIGIT NINE are allowed
        return '0' <= char <= '9'
    elif cat in ('S', 'P', 'Z'): # Symbol, Punctuation, or Space
        return True
    else:
        return False

def is_valid(text):
    return all(is_permitted_char(c) for c in text)
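
The claim that ISO-8859-1 misses 'œ' and '€' is easy to check. Here is a quick sketch in Python 3 syntax, with a hypothetical helper name missing_from_latin1, that lists the characters of a string that the codec cannot represent.

```python
def missing_from_latin1(text):
    # Collect, in order, every character that ISO-8859-1 cannot encode.
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing
```

Calling it on u"œuvre à 5 €" returns 'œ' and '€' but not 'à', which ISO-8859-1 does contain.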
Cleanshaven answered 23/6, 2010 at 5:8 Comment(1)
(1) return unicodedata.name(char, '').startswith('LATIN ') should suffice (2) memoising the function results might be a good idea, which could be made better by preloading the usual suspects [-A-Za-z0-9,./ '] etc into the memo (3) symbol/punctuation is rather wide (4) should category Space be replaced by '\x20'?Homeless
B
1

Check the code in django.template.defaultfilters.slugify:

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')

This is what you are looking for; you can then compare the resulting string with the original.
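
As a rough Python 3 illustration of what that normalization does (the helper name ascii_fold is hypothetical): NFKD splits accented letters into a base letter plus combining marks, and encoding to ASCII with 'ignore' then drops everything that is not ASCII. Note that letters with no decomposition, such as 'ø' or 'ß', are dropped entirely rather than transliterated.

```python
import unicodedata

def ascii_fold(value):
    # Decompose accented characters, then discard anything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

So u"frappé" folds to "frappe", while a purely Cyrillic string folds to an empty string. Any accented input differs from its folded form, so a plain equality test against the original behaves like an ASCII-only check.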

Baldachin answered 22/6, 2010 at 15:34 Comment(1)
I don't want to turn everything lowercase or convert spaces to dashes, just check to see if a string has the unwanted characters. I changed the question to avoid using the word "filter".Girandole
H
1

For what you say you want to do, your approach is about right. If you are running on Windows, I'd suggest using cp1252 instead of iso-8859-1. You might allow cp1250 as well -- this would pick up eastern European countries like Poland, Czech Republic, Slovakia, Romania, Slovenia, Hungary, Croatia, etc., where the alphabet is Latin-based. Other cp125x would include Turkish and Maltese ...
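
The code page suggestion could be sketched like this, assuming Python 3 and a hypothetical helper encodable_in_any that accepts a string if at least one of the listed code pages can represent all of it. Which code pages to include is a policy decision for your application.

```python
def encodable_in_any(text, encodings=("cp1252", "cp1250")):
    # Accept the text if any single code page can encode every character.
    for enc in encodings:
        try:
            text.encode(enc)
            return True
        except UnicodeEncodeError:
            continue
    return False
```

One consequence of testing each code page separately is that a string mixing characters from different code pages (for example 'œ' from cp1252 and 'Ł' from cp1250) is rejected, because no single code page covers both.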

You may also like to consider transcription from Cyrillic to Latin; as far as I know there are several systems, one of which may be endorsed by the UPU (Universal Postal Union).

I'm a little intrigued by your comment "Our shipping department doesn't want to have to fill out labels with, e.g., Chinese addresses" ... three questions: (1) do you mean "addresses in country X" or "addresses written in X-ese characters" (2) wouldn't it be better for your system to print the labels? (3) how does the order get shipped if it fails your test?

Homeless answered 22/6, 2010 at 22:26 Comment(3)
(1) The latter (addresses written in X-ese characters). (2) Perhaps. Right now, it doesn't. The form is part of a web app; the data gets shunted to another system entirely, which handles the management of orders, etc. (3) The form fails validation and prompts the user to enter an appropriate address.Girandole
I'll suggest not using cp125x whether or not you're on windows, as it is incompatible with proper standardized character sets and encodings. It puts legal western characters in what's reserved as the "C1" control character range in Unicode & ISO. The ancient "CP" code page encodings were a workaround for limited character set space and should be avoided in all modern code.Leoni
@StephenP: The OP already has Unicode strings; I was suggesting that he consider the strong possibility that the characters that he needs to watch out for could be found in the cp125x character sets; Windows USERS incorrigibly have data encoded in cp125x. This is a fact of life. The ancient ISO-8859-x encodings although sanctified by standards are even more limited, and should be avoided in code; use UTF-8, UTF-16, or GB18030. If one has Unicode data with code points 0080 to 009F, probability(C1 controls) == 0.1%, prob(cp125x-encoded data decoded as latin1) == 99.9%Homeless
A
0

Maybe this will do if you're a django user?

from django.template.defaultfilters import slugify 

def justroman(s):
  return len(slugify(s)) == len(s)
Africah answered 28/6, 2010 at 18:47 Comment(0)
P
0

To simplify tzot's answer using the built-in unicodedata library, this seems to work for me:

import unicodedata as ud

def is_latin(word):
    return all(['LATIN' in ud.name(c) for c in word])
Prudhoe answered 20/1, 2021 at 9:5 Comment(0)
