Python raw strings and unicode : how to use Web input as regexp patterns?

Asked 17/1, 2010 at 16:17 Answered 17/1, 2010 at 17:10

EDIT : This question doesn't really make sense once you have picked up what the "r" flag means. More details here. For people looking for a quick answer, I added one below.

If I enter a regexp manually in a Python script, I can use 4 combinations of flags for my pattern strings :

p1 = "pattern"
p2 = u"pattern"
p3 = r"pattern"
p4 = ru"pattern"

I have a bunch of unicode strings coming from a Web form input and want to use them as regexp patterns.

I want to know what processing I should apply to the strings so that I can expect results similar to those of the manual forms above. Something like :

import re
assert re.match(p1, some_text) == re.match(someProcess1(web_input), some_text)
assert re.match(p2, some_text) == re.match(someProcess2(web_input), some_text)
assert re.match(p3, some_text) == re.match(someProcess3(web_input), some_text)
assert re.match(p4, some_text) == re.match(someProcess4(web_input), some_text)

What would someProcess1 to someProcessN be, and why?

I suppose that someProcess2 doesn't need to do anything while someProcess1 should do some unicode conversion to the local encoding. For the raw string literals, I am clueless.

Linseed answered 17/1, 2010 at 16:17 Comment(2)

Note that re.match creates a regexp match object, and you cannot use == to check whether two of them contain the same data (for match objects, == checks identity, which means that the references need to be the same). – Equinox 17/1, 2010 at 16:58

You're right, we should use something like a mix of .group() and .groups() to get the expected result. – Linseed 17/1, 2010 at 17:11
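The comparison suggested in these comments could be sketched as follows. This is only an illustration written for Python 3; the helper name match_summary and the sample text are hypothetical and not part of the original question.

```python
import re

def match_summary(match):
    """Return comparable data for a match object, or None if there was no match."""
    if match is None:
        return None
    # Whole match, captured groups, and position in the text.
    return (match.group(0), match.groups(), match.span())

some_text = "abc123"
m1 = re.match(r"([a-z]+)(\d+)", some_text)
m2 = re.match("([a-z]+)(\\d+)", some_text)

# Two separate match objects are distinct objects even when they
# describe the same match, so their extracted data is compared instead.
same_data = match_summary(m1) == match_summary(m2)
```

Comparing tuples of .group(), .groups() and .span() sidesteps the identity problem, and the None check covers patterns that do not match at all.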

Apart from possibly having to encode Unicode properly (in Python 2.*), no processing is needed because there is no specific type for "raw strings" -- it's just a syntax for literals, i.e. for string constants, and you don't have any string constants in your code snippet, so there's nothing to "process".
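A minimal sketch of that point, written for Python 3 as an assumption (the thread itself is about Python 2). The variable web_input is a hypothetical stand-in for a value received at runtime; it is built from individual characters so that no raw literal is involved.

```python
import re

# Hypothetical stand-in for a value read from a web form at runtime.
# The user typed: \d+-\w+   (the backslashes are ordinary characters)
web_input = "\\" + "d+-" + "\\" + "w+"

# The same pattern written as a raw string literal in source code.
literal = r"\d+-\w+"

# Both are plain string objects with the same content and the same type.
same_value = web_input == literal
same_type = type(web_input) is type(literal)

# The runtime string can be handed to re directly, with no processing.
result = re.match(web_input, "42-abc").group(0)
```

The raw prefix only changes how the source code is parsed; once the string object exists, re cannot tell which way it was produced.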

Grata answered 17/1, 2010 at 16:44 Comment(1)

Yeah, I should have asked the other question first. Now I got it, I understand this one makes no sense. – Linseed 17/1, 2010 at 16:58

The "r" prefix just prevents Python from interpreting "\" escapes in a string literal. Since the Web doesn't care about what kind of data it carries, your web input will be a bunch of bytes you are free to interpret the way you want.

So to address this problem :

- make sure you use Unicode (e.g. decode the input as UTF-8) all along the way
- when you get the string, it will be Unicode, and "\n", "\t" and "\a" will be literal backslash sequences rather than control characters, so you don't need to care about whether to escape them or not.
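The two points above can be sketched like this in Python 3. The bytes value raw_bytes is a hypothetical request payload, and UTF-8 is assumed as the form encoding; a real application would take the encoding from the request.

```python
import re

# Hypothetical raw form payload: the user typed  café\t\d+
# ("é" arrives as two UTF-8 bytes; "\t" and "\d" arrive as a
# backslash followed by a letter, not as control characters).
raw_bytes = b"caf\xc3\xa9\\t\\d+"

# Decode once at the boundary, then work with text everywhere.
pattern = raw_bytes.decode("utf-8")
pattern_length = len(pattern)  # 9 characters, backslashes included

# The regex engine itself interprets \t as a tab and \d as a digit.
matched = re.match(pattern, "café\t2010").group(0)
```

Because the backslash sequences reach re untouched, the user's pattern behaves as if it had been written as a raw literal in the source.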

Linseed answered 17/1, 2010 at 17:6 Comment(0)

Note the following in your first example:

>>> p1 = "pattern"
>>> p2 = u"pattern"
>>> p3 = r"pattern"
>>> p4 = ur"pattern" # it's ur"", not ru"" btw
>>> p1 == p2 == p3 == p4
True

While these constructs look different, they all do the same thing: they create a string object (p1 and p3 a str, and p2 and p4 a unicode object in Python 2.x) containing the value "pattern". The u, r and ur prefixes just tell the parser how to interpret the following quoted string, namely as unicode text (u) and/or as raw text (r), in which backslash escape sequences are not processed. In the end it doesn't matter how a string was created, whether from a raw literal or not; internally it is stored the same way.
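To make the effect of the r prefix visible, here is a small Python 3 sketch (an assumption, since the session above is Python 2; the ur prefix is not accepted by Python 3, so it is left out).

```python
# The prefix changes how the literal is parsed, not the kind of object.
newline = "\n"        # escape processed: one character (a line feed)
raw_newline = r"\n"   # raw literal: two characters, a backslash and "n"

lengths = (len(newline), len(raw_newline))

# A raw literal and a normal literal with doubled backslashes
# produce exactly the same string.
same_value = r"\d+" == "\\d+"

# Without any backslash in the text, the prefixes make no difference.
prefixes_equal = "pattern" == u"pattern" == r"pattern"
```

The difference only shows up when the literal contains a backslash, which is why raw literals are the usual way to write regexp patterns in source code.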

When you get unicode text as input, you have to distinguish (in Python 2.x) whether it is a unicode object or a str object. If you want to work with the unicode content, you should work only with unicode objects internally and convert all str objects to unicode objects (either with str.decode() or with the u'text' syntax for hard-coded texts). If you instead encode it to your local encoding, you will run into problems with characters that encoding cannot represent.
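A rough Python 3 sketch of that advice, where bytes plays the role of the Python 2 str type and str plays the role of unicode. The helper to_text is hypothetical, and Latin-1 merely stands in for a "local encoding".

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# UTF-8 bytes for the pattern  €\d+
pattern = to_text(b"\xe2\x82\xac\\d+")

# Encoding the text to a narrower local encoding can fail.
try:
    pattern.encode("latin-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False  # Latin-1 has no euro sign
```

Normalising every input to text at the boundary keeps the rest of the program free of mixed types, and avoids the lossy step of encoding to a local charset.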

A different approach would be to use Python 3, whose str type supports unicode directly and stores all text as unicode, so you mostly don't need to care about the encoding once the input has been decoded.

Equinox answered 17/1, 2010 at 17:10 Comment(0)
