Python regex matching Unicode properties

Perl and some other current regex engines support Unicode properties, such as the general category, in a regex. E.g. in Perl you can use \p{Ll} to match an arbitrary lower-case letter, or \p{Zs} for any space separator. I don't see support for this in either the 2.x or 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.

Sinclair answered 2/12, 2009 at 13:25 Comment(1)
Actually, Perl supports all Unicode properties, not just the general categories. Examples include \p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=ALetter}, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue}, and around 1,000 more. Only Perl's and ICU's regexes bother to cover the full complement of Unicode properties. Everybody else covers a tiny few, usually not even enough for minimal Unicode work. – Crosslink

Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian} to match Armenian characters. \p{Ll} or \p{Zs} work too.

Terrazas answered 2/12, 2009 at 22:22 Comment(2)
This module does not have the same API as Python's re module. – Sequestered
The last commit to the Ponyguruma module was apparently in 2010 (dev.pocoo.org/hg/sandbox/ponyguruma), whereas the Python regex module on PyPI is actively developed: pypi.python.org/pypi/regex – Theophrastus

The regex module (an alternative to the standard re module) supports Unicode codepoint properties with the \p{} syntax.
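For instance, a few property classes in action, assuming the third-party regex package is installed (pip install regex):

```python
import regex

# \p{Lu}: any Unicode uppercase letter; \p{Zs}: any space separator.
assert regex.match(r'\p{Lu}', 'Ä') is not None
assert regex.search(r'\p{Zs}', 'a\u00a0b') is not None   # NO-BREAK SPACE

# \P{...} negates a property: here, strip everything that is not a letter.
print(regex.sub(r'\P{L}+', '', 'ab1c2'))  # abc
```

Since regex mirrors re's API, existing code can usually switch with just `import regex as re`.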

Barbarossa answered 30/11, 2010 at 16:37 Comment(5)
Not sure how complete the \p{} support is, but this module is actively developed and should eventually replace the built-in re module: see pypi.python.org/pypi/regex – Theophrastus
+1: regex is a drop-in replacement for the stdlib's re module. If you know how to use re, you can immediately use regex. Do import regex as re and you have \p{} syntax support. Here's an example of how to remove all punctuation in a string using \p{P}. – Officialdom
@Theophrastus When you say that regex "should" eventually replace the built-in re module, did you mean that there are plans for it to do so? I would like to have access to Unicode \p{} properties without the dependency. (It causes problems with pypy and pyodide.) – Rhiana
@Rhiana That was just my personal opinion; not sure if anything's changed since 2013. – Theophrastus
@Theophrastus There are three approaches: 1) use the regex module; 2) use PyICU and icu.RegexMatcher(), which gives access to all Unicode properties; or 3) use PyICU and icu.UnicodeSet(), using set operations to create sets of characters that can be used with re or regex as patterns. – Elstan

You can painstakingly use unicodedata on each character:

import unicodedata

def strip_accents(x):
    # Decompose to NFD, then drop the combining marks (category Mn).
    return u''.join(c for c in unicodedata.normalize('NFD', x)
                    if unicodedata.category(c) != 'Mn')
Loiretcher answered 12/11, 2010 at 0:23 Comment(2)
Thanks. Although outside regexes, this might be a viable alternative for certain cases. – Sinclair
It seems that the Python unicodedata module doesn't presently contain information about e.g. the script or Unicode block of a character. See also stackoverflow.com/questions/48058402/… – Lattimore

Speaking of homegrown solutions, some time ago I wrote a small program to do just that: convert a Unicode category written as \p{...} into a range of values, extracted from the Unicode specification (v5.0.0). Only categories are supported (e.g. L, Zs), and it is restricted to the BMP. I'm posting it here in case someone finds it useful (although Oniguruma really seems a better option).

Example usage:

>>> from unicode_hack import regex
>>> pattern = regex(r'^\p{Lu}(\p{L}|\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>

Here's the source. There is also a JavaScript version, using the same data.

Dismast answered 4/3, 2012 at 5:12 Comment(3)
Nice one, although you're using hand-crafted literals for the ranges in the code. It would be nice to have those literals generated from some textual form of the spec, or from unicodedata (docs.python.org/library/unicodedata.html#module-unicodedata). You could run through all valid Unicode code points, pass each through unicodedata.category(), and use the output to populate the map. – Sinclair
Thanks for the tip, I may implement that someday. The code above was created for JavaScript first (for which there were few sensible alternatives at the time), then ported to Python. I ran some regexes on the specs and finished with a throwaway script, but I agree a repeatable procedure would have been better, so I could keep it up-to-date. – Dismast
I've hacked up a quick function that builds the map dynamically (with just lists of code points as values): def unicats(maxu): m = defaultdict(list); for i in range(maxu): try: cat = unicodedata.category(unichr(i)) except: cat = None; if cat: m[cat].append(i); return m, then m = unicats(0x10FFFF) (note the hex literal). Be aware that some categories get really big (e.g. len(m['Cn']) == 873882). – Sinclair
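The dynamic approach sketched in the comment above can be written out as a runnable Python 3 version (chr replaces Python 2's unichr, and 0x10FFFF is the last Unicode code point):

```python
import unicodedata
from collections import defaultdict

def unicats(maxu=0x10FFFF):
    """Map each Unicode general category to the list of its code points."""
    m = defaultdict(list)
    for i in range(maxu + 1):
        m[unicodedata.category(chr(i))].append(i)
    return m

cats = unicats()
# Category names sort like 'Cc', 'Cf', 'Cn', ...; Cn (unassigned) dominates,
# while Zs (space separators) contains only a handful of code points.
print(len(cats['Zs']))
```

The per-category lists could then be collapsed into ranges to build character classes for re.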

You're right that Unicode property classes are not supported by the Python regex parser.

If you wanted to do a nice hack that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M} or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M} would become [\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F], and \P{M} would become [^\u0300-\u036F\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F].

People would thank you. :)
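A minimal sketch of such a preprocessor, building the ranges from unicodedata at runtime rather than hard-coding them (BMP only for brevity, and it doesn't handle \p{} tokens that already sit inside a character class; the names _char_class and expand_props are made up for illustration):

```python
import re
import unicodedata

def _char_class(prefix):
    # Collect contiguous runs of BMP code points whose general category
    # starts with the given prefix (e.g. 'M' covers Mn, Mc, Me).
    runs, start, prev = [], None, None
    for cp in range(0x10000):
        if unicodedata.category(chr(cp)).startswith(prefix):
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            runs.append((start, prev))
            start = None
    if start is not None:
        runs.append((start, prev))
    return ''.join('\\u%04X-\\u%04X' % (a, b) if a != b else '\\u%04X' % a
                   for a, b in runs)

def expand_props(pattern):
    # Rewrite \p{...} / \P{...} tokens into explicit character classes.
    def repl(m):
        negate = '^' if m.group(1) == 'P' else ''
        return '[%s%s]' % (negate, _char_class(m.group(2)))
    return re.sub(r'\\([pP])\{(\w+)\}', repl, pattern)

# The expanded pattern now works with the stdlib re module:
print(re.findall(expand_props(r'\p{Zs}'), 'a b\u00a0c'))  # [' ', '\xa0']
```

Caching the expanded classes would make repeated use cheap, since _char_class scans the whole BMP each call.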

Scant answered 2/12, 2009 at 14:26 Comment(1)
Right, creating character classes crossed my mind. But with roughly 40 categories you end up producing 80 classes, and that's not counting Unicode scripts, blocks, planes and whatnot. Might be worth a little open-source project, but still a maintenance nightmare. I just discovered that re.VERBOSE doesn't apply to character classes, so no comments or whitespace there to help readability... – Sinclair

Note that while \p{Ll} has no equivalent in Python regular expressions, \p{Zs} is roughly covered by '(?u)\s'. The (?u) flag, as the docs say, makes "\w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database", and \s means any whitespace character.
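A small illustration of the difference (in Python 3, patterns on str are Unicode-aware by default, so (?u) is implicit):

```python
import re

text = 'a b\nc\u00a0d'  # space, newline, and a NO-BREAK SPACE (Zs)

# \s matches the Zs characters, but also the newline:
print(re.findall(r'\s', text))               # [' ', '\n', '\xa0']

# To approximate "space separators only", subtract the ASCII control
# whitespace with a double-negated character class:
print(re.findall(r'[^\S\n\t\r\f\v]', text))  # [' ', '\xa0']
```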

Tapia answered 5/12, 2009 at 15:24 Comment(2)
You're right. The problem is that '(?u)\s' is broader than '\p{Zs}', including e.g. the newline. So if you really want to match only space separators, the former overgenerates. – Sinclair
@ThomasH: To get "space except not newline" you can use the double-negated character class: (?u)[^\S\n] – Mariquilla
