Matching Unicode characters in Python regular expressions

Asked 17/2, 2011 at 12:8 Answered 25/10, 2012 at 5:46

Solved python regex unicode non-ascii-characters character-properties

I have read through the other questions on Stack Overflow, but I am still no closer. Sorry if this has already been answered, but I couldn't get anything proposed there to work.

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or something more Unicode-like):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters, like øæå? I'd like to be able to match those characters as well, in both the tag group above and the one for the filename.

Bodine answered 17/2, 2011 at 12:8 Comment(1)

Make sure you normalize your strings, because different code-point sequences can produce the same visual appearance. – Restrainer 26/8, 2016 at 17:25
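To illustrate the normalization point in the comment above, here is a minimal Python 3 sketch. The sample word and the variable names are illustrative assumptions, not taken from the question. It shows that "å" written as "a" plus a combining ring is not matched by \w until the string is normalized to NFC.

```python
import re
import unicodedata

# "å" written two ways: one precomposed code point (U+00E5)
# versus "a" followed by COMBINING RING ABOVE (U+030A).
composed = "p\u00e5ske"
decomposed = "pa\u030aske"

# The two strings render the same but are different code-point sequences.
same_before = composed == decomposed

# \w does not match the bare combining mark, so the decomposed form fails.
match_decomposed = re.fullmatch(r"\w+", decomposed)

# NFC normalization recombines "a" + U+030A into U+00E5, which \w matches.
normalized = unicodedata.normalize("NFC", decomposed)
match_normalized = re.fullmatch(r"\w+", normalized)

print(same_before, match_decomposed is not None, match_normalized is not None)
```

Normalizing input once, before it reaches the regex, is usually simpler than trying to make the pattern accept combining marks.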

You need to specify the re.UNICODE flag, and input your string as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

This is in Python 2. In Python 3, all strings are Unicode, so you can leave out the u prefix (it is a syntax error in Python 3.0 to 3.2, and accepted but redundant from 3.3 on), and you can leave off the re.UNICODE flag.
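For Python 3 readers, a short sketch of the same match using the pattern and path from the question. The names PATTERN, path, groups and ascii_match are only for illustration. It also shows re.ASCII, which restricts \w to ASCII and so reproduces the original failure.

```python
import re

# Pattern and path copied from the question.
PATTERN = r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$'
path = '/by_tag/påske/øyfjell.jpg'

# Python 3: str patterns use Unicode matching by default, no flag needed.
m = re.match(PATTERN, path)
groups = m.groupdict()
print(groups)

# re.ASCII limits \w to [a-zA-Z0-9_], so the same path no longer matches.
ascii_match = re.match(PATTERN, path, re.ASCII)
print(ascii_match)
```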

Giga answered 17/2, 2011 at 12:18 Comment(3)

+1 for: and input your string as a Unicode string by using the u prefix – Lite 18/12, 2013 at 15:56

I don't understand, why do we need to pass in re.UNICODE? (I'm using python 3) – Belsen 30/6, 2022 at 19:9

@CharlieParker Notice the date of this answer :) In Python 3, re.UNICODE does nothing. – Giga 30/6, 2022 at 20:1

You need the UNICODE flag:

m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg', re.UNICODE)

Peritoneum answered 17/2, 2011 at 12:12 Comment(3)

Is it required for Python3 too? – Polythene 4/10, 2016 at 7:36

@Polythene - you don't need the unicode flag with Python 3. "Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns..." - docs.python.org/3/howto/regex.html – Valuer 26/8, 2019 at 18:50

I don't understand, why do we need to pass in re.UNICODE? (I'm using python 3) – Belsen 30/6, 2022 at 19:9

In Python 2, you need the re.UNICODE flag and the unicode string constructor:

>>> re.sub(r"[\w]+","___",unicode(",./hello-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./cześć-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./привет-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好-=+","utf-8"),flags=re.UNICODE)
u',./___-=+'
>>> re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
u',./___\uff0c___-=+'
>>> print re.sub(r"[\w]+","___",unicode(",./你好，世界-=+","utf-8"),flags=re.UNICODE)
,./___，___-=+

(In the last two examples, the comma is the full-width Chinese comma, U+FF0C, which \w does not match.)
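As a rough Python 3 equivalent of the session above, the sketch below runs the same substitution on the same sample strings. Neither the unicode() constructor nor the re.UNICODE flag is needed there; the names samples and results are illustrative.

```python
import re

# Same sample strings as in the Python 2 session above.
samples = [",./hello-=+", ",./cześć-=+", ",./привет-=+", ",./你好，世界-=+"]

# In Python 3, \w on a str pattern already covers non-ASCII letters.
results = [re.sub(r"\w+", "___", s) for s in samples]

for r in results:
    print(r)
```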

Amphitrite answered 25/10, 2012 at 5:46 Comment(1)

I don't understand, why do we need to pass in re.UNICODE? (I'm using python 3) – Belsen 30/6, 2022 at 19:9
