Raw string and regular expression in Python
I'm confused about raw string in the following code:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

print (r'(\d+)/(\d+)/(\d+)') #output: (\d+)/(\d+)/(\d+)

As I understand raw string, without r, the \ is treated as an escape character; with r, the backslash \ is treated literally as itself (a backslash).

However, what I cannot understand in the above code is that:

  • In the re.sub call, even though there is an r, the "\d" inside is treated as one digit [0-9] instead of one backslash \ plus one letter d.
  • In the second print, all characters are printed literally, as in a raw string.

What is the difference?

Additional edit:

I made the following four variations, with or without r:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)
text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

print (text2_re)
print (text2_re1)
print (text2_re2)
print (text2_re3)

And get the following output:

Could you explain these four situations specifically?

Muscatel answered 11/5, 2015 at 9:30 Comment(0)

You're getting confused by the difference between a string and a string literal.

A string literal is what you write between " or ' in source code. The Python interpreter parses the literal and stores the resulting string in memory. If you mark the literal as raw (using the r prefix, as in r'...'), the interpreter does not process backslash escapes while parsing it. Once parsed, though, raw and non-raw literals are stored in exactly the same way.

This means that in memory there is no such thing as a raw string. Both the following strings are stored identically in memory with no concept of whether they were raw or not.

r'a regex digit: \d'  # a regex digit: \d
'a regex digit: \\d'  # a regex digit: \d

Both these strings contain \d and there is nothing to say that this came from a raw string. So when you pass this string to the re module it sees that there is a \d and sees it as a digit because the re module does not know that the string came from a raw string literal.
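A small sketch of that point, using the date text from the question (the variable names are only illustrative): the raw literal and the escaped literal produce the same str object contents, so re cannot tell them apart.

```python
import re

raw_pattern = r'\d+'        # raw literal: backslash kept as written
escaped_pattern = '\\d+'    # normal literal: \\ collapses to one backslash

# Both literals produce the same three characters in memory.
same_string = raw_pattern == escaped_pattern
chars = list(raw_pattern)

# re only ever sees the resulting string, so both behave identically.
matches_raw = re.findall(raw_pattern, 'PyCon starts 3/13/2013.')
matches_escaped = re.findall(escaped_pattern, 'PyCon starts 3/13/2013.')

print(same_string)       # True
print(chars)             # ['\\', 'd', '+']
print(matches_raw)       # ['3', '13', '2013']
print(matches_escaped)   # ['3', '13', '2013']
```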

In your specific example, to get a literal backslash followed by a literal d you would use \\d like so:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\3-\1-\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

Alternatively, without using raw strings:

import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text_re = re.sub('(\\d+)/(\\d+)/(\\d+)', '\\3-\\1-\\2', text)
print (text_re) #output: Today is 2012-11-27. PyCon starts 2013-3-13.

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub('(\\\\d+)/(\\\\d+)/(\\\\d+)', '\\3-\\1-\\2', text2)
print (text2_re) #output: Today is 11/27/2012. PyCon starts 3/13/2013.

I hope that helps somewhat.

Edit: I didn't want to complicate things, but because \d is not a valid escape sequence Python does not change it, so '\d' == r'\d' is true. (Newer Python versions emit a warning for unrecognized escapes such as '\d' in a non-raw literal, so the raw form is preferable.) Since \\ is a valid escape sequence it gets changed to \, so you get the behaviour '\d' == '\\d' == r'\d'. Strings get confusing sometimes.
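One way to check this equality without putting an unrecognized escape directly into your own source file is to evaluate the literal from a string. This is only a sketch: it assumes a Python version that still accepts '\d' in a non-raw literal (possibly with a warning, which is silenced here), and the variable names are made up for illustration.

```python
import warnings

with warnings.catch_warnings():
    # Silence the warning some Python versions emit for the unrecognized escape.
    warnings.simplefilter("ignore")
    # The text handed to eval is the four characters '\d' (a non-raw literal).
    unrecognized = eval(r"'\d'")

# Python left the backslash alone, so all three spellings are the same string.
print(unrecognized == r'\d' == '\\d')   # True
print(len(unrecognized))                # 2

# A recognized escape behaves differently: \n becomes a single character.
print(len('\n'), len(r'\n'))            # 1 2
```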

Edit2: To answer your edit, let's look at each line specifically:

text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Hopefully this behaves as you expect now.

text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)

Again (because \d is not a valid string escape it doesn't get changed, see my first edit) re.sub receives the two strings (\d+)/(\d+)/(\d+) and \3-\1-\2. Since \d doesn't get changed by the python interpreter r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)'. If you understand my first edit then hopefully you should understand why these two cases behave the same.

text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This case is a bit different because \1, \2 and \3 are all valid escape sequences: each is replaced with the character whose octal value is given by the number. That's quite complex but it basically boils down to:

\1  # stands for the ascii start-of-heading character
\2  # stands for the ascii start-of-text character
\3  # stands for the ascii end-of-text character

This means that re.sub receives the first string as it has done in the first two examples ((\d+)/(\d+)/(\d+)) but the second string is actually <end-of-text>-<start-of-heading>-<start-of-text>. It contains no backslashes, so re.sub inserts it literally in place of each match. Since none of the three characters is printable, your terminal typically shows a stock place-holder character instead.
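To make the invisible characters visible, here is a short sketch that inspects the non-raw replacement string and prints the result with repr(). The variable names are illustrative, not from the question.

```python
import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
pattern = r'(\d+)/(\d+)/(\d+)'

non_raw_repl = '\3-\1-\2'    # octal escapes: three control characters and two hyphens
raw_repl = r'\3-\1-\2'       # backslashes survive, so re sees group references

code_points = [ord(c) for c in non_raw_repl]
raw_len = len(raw_repl)

result_non_raw = re.sub(pattern, non_raw_repl, text2)
result_raw = re.sub(pattern, raw_repl, text2)

print(code_points)            # [3, 45, 1, 45, 2]
print(raw_len)                # 8
print(repr(result_non_raw))   # 'Today is \x03-\x01-\x02. PyCon starts \x03-\x01-\x02.'
print(result_raw)             # Today is 2012-11-27. PyCon starts 2013-3-13.
```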

text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

This behaves like the third example because r'(\d+)/(\d+)/(\d+)' == '(\d+)/(\d+)/(\d+)', as explained in the second example.

Leuko answered 11/5, 2015 at 10:2 Comment(4)
Could you also explain the additional part in the question? - Muscatel
I've had a go at explaining them. This is quite complex behaviour so hopefully I haven't just confused you more. - Leuko
From the docs: "repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth." docs.python.org/3/library/re.html#re.sub - Overhead
You can test your regex here (in Python and others): regex101.com (check the Python flavor) - Overhead

There is a distinction you have to make between the python interpreter and the re module.

In python, a backslash followed by a character can mean a special character if the string is not rawed. For instance, \n will mean a newline character, \r will mean a carriage return, \t will mean the tab character, \b represents a nondestructive backspace. By itself, \d in a python string does not mean anything special.

In regex, however, there are a number of escape sequences that usually mean nothing special in Python. The catch is 'usually'. One that can be misinterpreted is \b, which is a backspace in Python but a word boundary in regex. If you pass an unrawed \b in the pattern, Python substitutes the backspace character before the pattern reaches the regex function, where it then only matches a literal backspace rather than a word boundary. So to get a word boundary you must pass the b together with its backslash, and to do that you either escape the backslash or raw the string.
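A quick sketch of the \b pitfall, with a made-up sample text: the non-raw pattern contains backspace characters, so it finds nothing, while the raw and the escaped patterns both reach re as a word boundary.

```python
import re

text = 'cat concat catalog'

non_raw = '\bcat\b'       # Python turns each \b into a backspace (\x08)
raw = r'\bcat\b'          # backslashes kept, re sees word boundaries
escaped = '\\bcat\\b'     # same string as the raw literal

lengths = (len(non_raw), len(raw))
found_non_raw = re.findall(non_raw, text)
found_raw = re.findall(raw, text)
found_escaped = re.findall(escaped, text)

print(lengths)          # (5, 7)
print(found_non_raw)    # []
print(found_raw)        # ['cat']
print(found_escaped)    # ['cat']
```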

Back to your question regarding \d, \d has no special meaning whatsoever in python, so it remains untouched. The same \d passed as a regular expression gets converted by the regex engine, which is a separate entity to the python interpreter.


Per question's edit:

import re

text2 = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
text2_re = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re1 = re.sub('(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text2)
text2_re2 = re.sub(r'(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)
text2_re3 = re.sub('(\d+)/(\d+)/(\d+)', '\3-\1-\2', text2)

print(text2_re)
print(text2_re1)
print(text2_re2)
print(text2_re3)

The first two should be straightforward. re.sub does its thing by matching the numbers and forward slashes and replacing them in a different order with hyphens instead. Since \d does not have any special meaning in Python, \d is passed on to re.sub unchanged whether the expression is rawed or not.

The third and fourth happen because you have not rawed the strings for the replacement expression. In a non-raw Python string, \1, \2 and \3 are octal escapes for the control characters with code points 1, 2 and 3. Some consoles (for example ones using code page 437) draw these as a white (unfilled) smiley face, a black (filled) smiley face and a heart; where the characters cannot be displayed, you get 'character boxes' instead. So instead of replacing by the captured groups, you are replacing the matches with those specific characters.


Heterogony answered 11/5, 2015 at 9:59 Comment(0)

I feel like the above answers are way overcomplicating it. If you're running re.search(), the string you send is parsed through two layers:

  1. Python interprets the \ characters in your string literal first.

  2. Then the regular expression engine interprets the \ characters left in the resulting string through its own filter.

They happen in that order.

The "raw" string syntax r"\nlolwtfbbq" is for when you want to bypass Python's own escape processing; it doesn't affect re:

>>> print("\nlolwtfbbq")

lolwtfbbq
>>> print(r"\nlolwtfbbq")
\nlolwtfbbq
>>>

Note that a newline is printed in the first example, but the actual characters \ and n are printed in the second, because it's raw.

Any string you send to re goes through the regular expression interpreter, so to answer your specific question: \d means "digit 0-9" in a regular expression.
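The two layers can be seen by trying to match a literal backslash followed by d. This is a sketch with a made-up sample string: the regex layer needs \\d, which you can write as r'\\d' or, going through the Python layer as well, as '\\\\d'.

```python
import re

# Raw literal, so the sample really contains a backslash followed by d.
sample = r'literal \d and digit 7'

# Regex layer sees \d: the digit class.
digits = re.findall(r'\d', sample)

# Regex layer sees \\d: an escaped backslash, then the letter d.
literal_raw = re.findall(r'\\d', sample)

# Same regex without a raw string: Python halves the backslashes first.
literal_escaped = re.findall('\\\\d', sample)

print(digits)            # ['7']
print(literal_raw)       # ['\\d']
print(literal_escaped)   # ['\\d']
```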

Expectation answered 11/5, 2018 at 17:8 Comment(0)

Not every \ will cause problems. Python has its own escape sequences, such as \b. So if the r is not there, Python treats \b as its own escape (a backspace) rather than as a word boundary for regex. With the r (raw string) prefix, \b is left as it is. That's the layman's version, without going into technicalities. \d is not a Python escape sequence, so it is safe even without the r prefix.

Here you can see the list. This is the list which Python understands and will interpret, like \b and \n, but not \d.

In the first print, the \d interpretation is done by the regex module, not by Python. In the second print, the string is handled by Python alone; as it is a raw string, it is printed as it is.

Glassware answered 11/5, 2015 at 9:34 Comment(2)
What do you mean interpretation is being done by regex or by python? What is the difference? - Muscatel
@fluency_03 \d means nothing to Python. It's the regex module which knows \d is [0-9]. Along the same lines, Python knows \b and \n, so when it finds these it will interpret them. So if you don't want Python to interpret these, you put everything in r mode. - Glassware
