Python and regular expression with Unicode

I need to delete some Unicode symbols from the string 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'

I know for sure that they are present in it. I tried:

re.sub('([\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+)', '', 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ')

but it doesn't work: the string stays the same. What am I doing wrong?

Hebdomadal answered 26/12, 2008 at 14:40 Comment(0)

Are you using Python 2.x or 3.0?

If you're using 2.x, make the pattern a unicode string with the 'u' prefix; in a plain 2.x string the \u escapes are not interpreted as code points. Since it's a regex, it's also good practice to make the pattern a raw string with 'r'. Also, putting your entire pattern in parentheses is superfluous.

re.sub(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', '', ...)

http://docs.python.org/tutorial/introduction.html#unicode-strings

Edit:

It's also good practice to use the re.UNICODE/re.U/(?u) flag for unicode regexes, but it only affects shorthand classes and assertions such as \w or \b. This pattern uses none of them, so the flag would make no difference here.
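To illustrate the point about the flag, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII is the opt-out, so it is not the 2.x setup discussed above. The sample strings and variable names are made up for the illustration.

```python
import re

# Explicit code point ranges, as in the pattern above (Python 3 syntax).
pattern = r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+'
sample = 'a\u064E\u0651b'  # 'a', FATHA, SHADDA, 'b'

# An explicit character class behaves the same with or without the flag.
plain = re.sub(pattern, '', sample)
flagged = re.sub(pattern, '', sample, flags=re.UNICODE)

# A shorthand class such as \w is what the Unicode/ASCII mode changes.
word = '\u00e9t\u00e9'  # "ete" with acute accents on both e's
unicode_words = re.findall(r'\w+', word)
ascii_words = re.findall(r'\w+', word, flags=re.ASCII)

print(plain == flagged)
print(unicode_words, ascii_words)
```

The two substitutions give the same result, while the two \w searches differ, which is the only kind of difference the flag can make.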

Quelpart answered 26/12, 2008 at 14:57 Comment(3)
Hmm, did not know you could concatenate both u and r prefixes. That's pretty cool! - Cryptogenic
@BalthazarRouberol I get SyntaxError: invalid syntax in Python 3.6 - Lepine
You can't use ur in Python 3. Just use r. - Jackinthepulpit
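For readers on Python 3, the same substitution can be sketched as follows. This assumes Python 3.3 or later, where the re module itself understands \uXXXX escapes, so a plain raw string is enough. The names DIACRITICS and strip_diacritics are illustrative and not part of the original answer.

```python
import re

# Raw string: the re module interprets the \uXXXX escapes itself (Python 3.3+).
DIACRITICS = re.compile(r'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+')


def strip_diacritics(text):
    """Return text with every character matched by DIACRITICS removed."""
    return DIACRITICS.sub('', text)


text = 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
result = strip_diacritics(text)

print(len(text), len(result))
print(result)
```

All str objects are Unicode in Python 3, so no prefix is needed on the subject string either.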

Use unicode strings. Use the re.UNICODE flag.

>>> myre = re.compile(ur'[\u064B-\u0652\u06D4\u0670\u0674\u06D5-\u06ED]+', 
                      re.UNICODE)
>>> myre
<_sre.SRE_Pattern object at 0xb20b378>
>>> mystr = u'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
>>> result = myre.sub('', mystr)
>>> len(mystr), len(result)
(38, 22)
>>> print result
بسم الله الرحمن الرحيم

Read Joel Spolsky's article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Cycling answered 26/12, 2008 at 15:55 Comment(4)
@nosklo, why don't the curly braces that set the number of characters, e.g. {5}, work with unicode characters? I'm having problems with them, yet + works fine. Do you have any idea? Thanks! - Vehement
@Vehement I have no idea, and without my magic crystal ball there's no way to help. I just tested it, and it works fine for me. If it doesn't work for you, I suggest you ask a new question, providing your code and the result you're getting. - Cycling
If you want to use re in Python, you should know that it doesn't support Unicode character properties (like \p{L}). The regex module (pypi.python.org/pypi/regex) does. - Pula
The re.UNICODE flag is useless here, since it only affects the shorthand character classes \w, \d, \s. - Yazzie
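Since re has no \p{...} support, one standard-library workaround is to filter on the Unicode general category with unicodedata instead of a regex. The sketch below is a different technique from the explicit ranges used in the answers: it drops every nonspacing mark (category Mn) in any script, and it keeps characters such as U+06D4, which is punctuation. The function name strip_nonspacing_marks is hypothetical, and Python 3 is assumed.

```python
import unicodedata


def strip_nonspacing_marks(text):
    """Drop every character whose Unicode general category is Mn."""
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Mn')


text = 'بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ'
stripped = strip_nonspacing_marks(text)

print(len(text), len(stripped))
print(stripped)
```

Whether this broader filter is acceptable depends on the text: for input that mixes scripts, the explicit ranges may be the safer choice.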
