Perl and some other current regex engines support Unicode properties, such as the category, in a regex. E.g. in Perl you can use \p{Ll}
to match an arbitrary lower-case letter, or p{Zs}
for any space separator. I don't see support for this in either the 2.x nor 3.x lines of Python (with due regrets). Is anybody aware of a good strategy to get a similar effect? Homegrown solutions are welcome.
Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian}
to match Armenian characters. \p{Ll}
or \p{Zs}
work too.
The regex module (an alternative to the standard re
module) supports Unicode codepoint properties with the \p{}
syntax.
\p{}
support is, but this module is actively developed and should eventually replace the built-in re
module: see pypi.python.org/pypi/regex –
Theophrastus regex
is a drop-in replacement for stdlib's re
module. If you know how to use re
; you immediately can use regex
. import regex as re
and you have \p{}
syntax support. Here's an example how to remove all punctuations in a string using \p{P}
–
Officialdom re
module, did you mean that there are plans for it to do so? I would like to have access to Unicode \p{}
properties without the dependency. (It causes problems with pypy and pyodide.) –
Rhiana icu.RegexMatcher()
which gives access to all Unicode properties, or 3) use PyICU and icu.UnicodeSet()
using set operations to create sets of characters that can be used with re or regex as patterns. –
Elstan Have you tried Ponyguruma, a Python binding to the Oniguruma regular expression engine? In that engine you can simply say \p{Armenian}
to match Armenian characters. \p{Ll}
or \p{Zs}
work too.
You can painstakingly use unicodedata on each character:
import unicodedata
def strip_accents(x):
return u''.join(c for c in unicodedata.normalize('NFD', x) if unicodedata.category(c) != 'Mn')
unicodedata
module doesn't presently contain information about e.g. the script or Unicode block of a character. See also stackoverflow.com/questions/48058402/… –
Lattimore Speaking of homegrown solutions, some time ago I wrote a small program to do just that - convert a unicode category written as \p{...}
into a range of values, extracted from the unicode specification (v.5.0.0). Only categories are supported (ex.: L
, Zs
), and is restricted to the BMP. I'm posting it here in case someone find it useful (although that Oniguruma really seems a better option).
Example usage:
>>> from unicode_hack import regex
>>> pattern = regex(r'^\\p{Lu}(\\p{L}|\\p{N}|_)*')
>>> print pattern.match(u'疂_1+2').group(0)
疂_1
>>>
Here's the source. There is also a JavaScript version, using the same data.
You're right that Unicode property classes are not supported by the Python regex parser.
If you wanted to do a nice hack, that would be generally useful, you could create a preprocessor that scans a string for such class tokens (\p{M}
or whatever) and replaces them with the corresponding character sets, so that, for example, \p{M}
would become [\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
, and \P{M}
would become [^\u0300–\u036F\u1DC0–\u1DFF\u20D0–\u20FF\uFE20–\uFE2F]
.
People would thank you. :)
Note that while \p{Ll}
has no equivalent in Python regular expressions, \p{Zs}
should be covered by '(?u)\s'
.
The (?u)
, as the docs say, “Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.” and \s
means any spacing character.
(?u)[^\S\n]
–
Mariquilla © 2022 - 2024 — McMap. All rights reserved.
\p{Block=Greek}, \p{Script=Armenian}, \p{General_Category=Uppercase_Letter}, \p{White_Space}, \p{Alphabetic}, \p{Math}, \p{Bidi_Class=Right_to_Left}, \p{Word_Break=A_Letter }, \p{Numeric_Value=10}, \p{Hangul_Syllable_Type=Leading_Jamo}, \p{Sentence_Break=SContinue},
and around 1,000 more. Only Perl’s and ICU’s regexes bother to cover the full complement of Unicode properties. Everybody else covers a tiny few, usually not even enough for minimal Unicode work. – Crosslink