python-re: How do I match an alpha character

Asked 10/1, 2010 at 23:43 Answered 3/3, 2016 at 16:4

Solved python regex unicode regex-negation

How can I match an alpha character with a regular expression? I want a character that is in \w but is not in \d. It must be Unicode-compatible, which is why I cannot use [a-zA-Z].

Villosity answered 10/1, 2010 at 23:43 Comment(3)

"unicode compatible" - does that mean that you want to match both e and é, for example? – Glycerin 10/1, 2010 at 23:46

In Python, remember that to indicate a unicode string you must use this: u'Unicode string here'. Given that, have you tried str.find(), where str is your unicode string? – Suffumigate 11/1, 2010 at 0:11

What I meant was that I wanted to match a,é,あ,日나 but not 1, . (dot), ９, 9, 。 etc. for example. – Villosity 11/1, 2010 at 8:23

Your first two sentences contradict each other. "in \w but is not in \d" includes underscore. I'm assuming from your third sentence that you don't want underscore.

Using a Venn diagram on the back of an envelope helps. Let's look at what we DON'T want:

(1) characters that are not matched by \w (i.e. don't want anything that's not alpha, digits, or underscore) => \W
(2) digits => \d
(3) underscore => _

So what we don't want is anything in the character class [\W\d_] and consequently what we do want is anything in the character class [^\W\d_]

Here's a simple example (Python 2.6).

>>> import re
>>> rx = re.compile(r"[^\W\d_]+", re.UNICODE)
>>> rx.findall(u"abc_def,k9")
[u'abc', u'def', u'k']
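
For readers on Python 3, here is a minimal sketch of the same character class run against the characters listed in the question's comments. It assumes Python 3, where str patterns are Unicode-aware by default; the names LETTERS, wanted and unwanted are illustrative only and not part of the original answer.

```python
import re

# Python 3 assumption: str patterns are Unicode-aware, so re.UNICODE is implied.
LETTERS = re.compile(r"[^\W\d_]+")

# Characters the asker wants to match: a, e-acute, hiragana A, CJK "day", hangul NA.
wanted = ["a", "\u00e9", "\u3042", "\u65e5", "\ub098"]
# Characters the asker wants to reject: 1, dot, fullwidth 9, 9,
# ideographic full stop, and the underscore.
unwanted = ["1", ".", "\uff19", "9", "\u3002", "_"]

matched_wanted = [ch for ch in wanted if LETTERS.fullmatch(ch)]
matched_unwanted = [ch for ch in unwanted if LETTERS.fullmatch(ch)]

print(matched_wanted)                 # every entry of `wanted`
print(matched_unwanted)               # empty list
print(LETTERS.findall("abc_def,k9"))  # same split as the session above
```

The sample characters are written as escapes so the result does not depend on how the source file is encoded or normalized.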

Further exploration reveals a few quirks of this approach:

>>> import unicodedata as ucd
>>> allsorts = u"\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"
>>> for x in allsorts:
...     print repr(x), ucd.category(x), ucd.name(x)
...
u'\u0473' Ll CYRILLIC SMALL LETTER FITA
u'\u0660' Nd ARABIC-INDIC DIGIT ZERO
u'\u06c9' Lo ARABIC LETTER KIRGHIZ YU
u'\u24e8' So CIRCLED LATIN SMALL LETTER Y
u'\u4e0a' Lo CJK UNIFIED IDEOGRAPH-4E0A
u'\u3020' So POSTAL MARK FACE
u'\u3021' Nl HANGZHOU NUMERAL ONE
>>> rx.findall(allsorts)
[u'\u0473', u'\u06c9', u'\u4e0a', u'\u3021']

U+3021 (HANGZHOU NUMERAL ONE) is treated as numeric (hence it matches \w) but it appears that Python interprets "digit" to mean "decimal digit" (category Nd) so it doesn't match \d

U+24E8 (CIRCLED LATIN SMALL LETTER Y) is category So (a symbol), so it doesn't match \w

All CJK ideographs are classed as "letters" and thus match \w

Whether any of the above 3 points are a concern or not, that approach is the best you will get out of the re module as currently released. Syntax like \p{letter} is in the future.
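
If the Nl quirk matters, one possible workaround is to filter on the Unicode general category directly instead of relying on \w. This is a different technique from the regex above, sketched here under the assumption of Python 3; is_letter_by_category is a hypothetical helper name, not something the re module provides.

```python
import re
import unicodedata

LETTER = re.compile(r"[^\W\d_]")


def is_letter_by_category(ch):
    """Stricter check: accept only general categories Lu, Ll, Lt, Lm, Lo."""
    return unicodedata.category(ch).startswith("L")


# Same sample characters as in the session above.
samples = "\u0473\u0660\u06c9\u24e8\u4e0a\u3020\u3021"

for ch in samples:
    # Code point, category, regex verdict, category verdict.
    print(
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        bool(LETTER.fullmatch(ch)),
        is_letter_by_category(ch),
    )

regex_hits = [ch for ch in samples if LETTER.fullmatch(ch)]
category_hits = [ch for ch in samples if is_letter_by_category(ch)]
```

The two lists differ only in U+3021: the regex accepts it because it is numeric but not a decimal digit, while the category check rejects it because Nl is not a letter category.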

Desecrate answered 11/1, 2010 at 1:41 Comment(0)

What about:

\p{L}

You can use this document as a reference: Unicode Regular Expressions

EDIT: It seems Python's re module doesn't handle Unicode property expressions such as \p{L}. Take a look at this link: Handling Accented Characters with Python Regular Expressions -- [A-Z] just isn't good enough (no longer active, link to internet archive)

For posterity, here are the examples on the blog:

import re
string = 'riché'
print string
riché

richre = re.compile('([A-Za-z]+)')
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('(\w+)',re.LOCALE)
match = richre.match(string)
print match.groups()
('rich',)

richre = re.compile('([é\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

richre = re.compile('([\xe9-\xf8\w]+)')
match = richre.match(string)
print match.groups()
('rich\xe9',)

string = 'richéñ'
match = richre.match(string)
print match.groups()
('rich\xe9\xf1',)

richre = re.compile('([\u00E9-\u00F8\w]+)')
print match.groups()
('rich\xe9\xf1',)

matched = match.group(1)
print matched
richéñ

Hephzipah answered 10/1, 2010 at 23:45 Comment(3)

Thank you, but with a range like \u00E9-\u00F8 I cannot know whether a character is a (CJK) punctuation symbol or a numeric symbol other than 0-9. – Villosity 10/1, 2010 at 23:55

You can work with letter ranges if you refer to a document like tamasoft.co.jp/en/general-info/unicode.html and pick out all the letter intervals (that could be tedious...). This link can also help you: kourge.net/projects/regexp-unicode-block – Hephzipah 11/1, 2010 at 0:3

An example of this in action would be helpful here. – Crowell 15/6, 2014 at 6:20

You can use one of the following expressions to match a single letter:

(?![\d_])\w

\w(?<![\d_])

Here I match \w, but assert that the same character does not match [\d_]: with a negative lookahead before consuming it, or with a negative lookbehind after consuming it.

From the docs:

(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length and shouldn’t contain group references. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.
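
A quick way to convince yourself that both lookaround forms pick the same characters as the negated class [^\W\d_] is to run all three on one string. This is only a sketch, assuming Python 3 str patterns; the variable names and the sample text are made up for illustration.

```python
import re

char_class = re.compile(r"[^\W\d_]")    # negated character class
lookahead = re.compile(r"(?![\d_])\w")  # reject digit/underscore, then take \w
lookbehind = re.compile(r"\w(?<![\d_])")  # take \w, then reject digit/underscore

# ASCII letters, underscore, comma, digit, space, e-acute, hiragana A, fullwidth 9.
text = "abc_def,k9 \u00e9\u3042\uff19"

results = [p.findall(text) for p in (char_class, lookahead, lookbehind)]
for found in results:
    print(found)
```

All three patterns match one character at a time, so wrap them in a group with + (for example (?:(?![\d_])\w)+) if you want whole runs of letters.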

Winny answered 3/3, 2016 at 16:4 Comment(0)
