Match any unicode letter?

Asked 11/6, 2011 at 7:5 Answered 1/6, 2020 at 8:36

Solved python regex character-properties

In .NET you can use \p{L} to match any letter. How can I do the same in Python? Specifically, I want to match any uppercase, lowercase, or accented letter.

Everara answered 11/6, 2011 at 7:5 Comment(3)

See: #1833393 – Ectomere 11/6, 2011 at 7:8

You know that 'é' isn't a unicode object in Python 2.x, right? – Hammer 11/6, 2011 at 7:46

@Ignacio/Tim: Oh! Right. Forgot about that! Thanks :D It's a little confusing because it doesn't throw an error or anything either. – Everara 11/6, 2011 at 17:9

Python's re module doesn't support Unicode properties yet. But you can compile your regex using the re.UNICODE flag, and then the character class shorthand \w will match Unicode letters, too.

Since \w also matches digits, you then need to subtract those from your character class, along with the underscore:

[^\W\d_]

will match any Unicode letter.

>>> import re
>>> r = re.compile(r'[^\W\d_]', re.U)
>>> r.match('x')
<_sre.SRE_Match object at 0x0000000001DBCF38>
>>> r.match(u'é')
<_sre.SRE_Match object at 0x0000000002253030>
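For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3, where str patterns are Unicode-aware by default, so the re.U flag is implied; the sample string is only an illustration. The bytes case at the end reproduces the pitfall raised in the comments: a byte string is not Unicode text, so the class does not match it.

```python
import re

# In Python 3, str patterns are Unicode-aware by default (re.UNICODE is implied).
LETTERS = re.compile(r'[^\W\d_]+')

# Illustrative sample mixing Latin, Cyrillic and punctuation.
sample = 'Abc-++-Абв. It’s “Łąć”!'
chunks = LETTERS.findall(sample)
print(chunks)  # ['Abc', 'Абв', 'It', 's', 'Łąć']

# With a bytes pattern, \w only knows ASCII, so the UTF-8 bytes of 'é' do not match.
bytes_match = re.match(rb'[^\W\d_]', 'é'.encode('utf-8'))
print(bytes_match)  # None
```

Note that \w follows Python's own notion of "alphanumeric", so this class is an approximation of "letter" rather than an exact equivalent of \p{L}.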

Endora answered 11/6, 2011 at 7:9 Comment(7)

Clever, but it doesn't seem to work. See update. I copied that e off of en.wikipedia.org/wiki/List_of_Unicode_characters, it doesn't seem to recognize it. – Everara 11/6, 2011 at 7:44

It works perfectly, but 'é' is not a Unicode object, it's a string of bytes. – Thomsen 11/6, 2011 at 7:48

@rosh try u'é' – Palomo 9/3, 2017 at 19:55

^[a-zœéèâêçàñ ]+$ – Arachnoid 30/3, 2018 at 14:8

Python's re doesn't, but regex does: you can use \p{L} like you would in .NET. – Khalsa 2/7, 2019 at 14:18

What exactly works also depends on Unicode normalization. é can be represented as the single code point U+00E9 or as the pair e followed by COMBINING ACUTE ACCENT U+0301, which is two code points. To get consistent results, use the normalization functions from unicodedata. – Secund 4/12, 2023 at 17:31

@Arachnoid That's a very small sample of all the available accented characters, even just from the Latin-1 subset, even for a single language. – Secund 4/12, 2023 at 17:34
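The normalization point from the comments can be sketched as follows. This is a minimal illustration assuming Python 3 and the [^\W\d_] class from the answer above; the variable names are made up for the example. The decomposed form of é is two code points, so a single-letter match fails until the string is normalized to NFC.

```python
import re
import unicodedata

letter = re.compile(r'[^\W\d_]')

composed = '\u00e9'     # é as a single code point
decomposed = 'e\u0301'  # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))            # 1 2
print(letter.fullmatch(composed) is not None)    # True
print(letter.fullmatch(decomposed) is not None)  # False: two code points

# NFC composes the pair back into the single code point U+00E9.
normalized = unicodedata.normalize('NFC', decomposed)
print(normalized == composed)                    # True
print(letter.fullmatch(normalized) is not None)  # True
```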

The PyPI regex module supports the \p{L} Unicode property class and many more; see the "Unicode codepoint properties, including scripts and blocks" section in its documentation and the full list at http://www.unicode.org/Public/UNIDATA/PropList.txt. Using the regex module is convenient because you get consistent results across Python versions (bear in mind that the Unicode standard is constantly evolving and the number of supported letters grows).

Install the library using pip install regex (or pip3 install regex) and use

\p{L}        # To match any Unicode letter
\p{Lu}       # To match any uppercase Unicode letter
\p{Ll}       # To match any lowercase Unicode letter
\p{L}\p{M}*  # To match any Unicode letter and any amount of diacritics after it

See some usage examples below:

import regex
text = r'Abc-++-Абв. It’s “Łąć”!'
# Removing letters:
print( regex.sub(r'\p{L}+', '', text) ) # => -++-. ’ “”!
# Extracting letter chunks:
print( regex.findall(r'\p{L}+', text) ) # => ['Abc', 'Абв', 'It', 's', 'Łąć']
# Removing all but letters:
print( regex.sub(r'\P{L}+', '', text) ) # => AbcАбвItsŁąć
# Removing all letters but ASCII letters:
print( regex.sub(r'[^\P{L}a-zA-Z]+', '', text) ) # => Abc-++-. It’s “”!
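If installing a third-party module is not an option, a rough standard-library approximation of \p{L} is to check the Unicode general category of each character. This is only a sketch, not how the regex module works internally: the helper names is_letter and letter_chunks are hypothetical, and the results depend on the Unicode database bundled with your Python version.

```python
import unicodedata
from itertools import groupby

def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith('L')

def letter_chunks(text):
    # Group consecutive characters by "is a letter" and keep only the letter runs.
    return [''.join(group) for key, group in groupby(text, key=is_letter) if key]

text = 'Abc-++-Абв. It’s “Łąć”!'
print(letter_chunks(text))               # ['Abc', 'Абв', 'It', 's', 'Łąć']
print(''.join(filter(is_letter, text)))  # AbcАбвItsŁąć
```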


Research answered 1/6, 2020 at 8:36 Comment(1)

FYI, the PyPI regex module also supports POSIX character classes, so [:alpha:] (any letter), [:lower:] (any lowercase letter), and [:upper:] (any uppercase letter) can be used inside character classes to match various kinds of letters, too. Note that these POSIX character classes can be negated in a similar way to shorthand character classes. E.g. to match any char but a letter, you can use [:^alpha:]. The last regex.sub regex ([^\P{L}a-zA-Z]+) can be written as [^[:^alpha:]a-zA-Z]+. – Sulfonation 29/7, 2021 at 7:33
