Python regex - r prefix

Can anyone explain why example 1 below works, when the r prefix is not used? I thought the r prefix must be used whenever escape sequences are used. Example 2 and example 3 demonstrate this.

# example 1
import re
print (re.sub('\s+', ' ', 'hello     there      there'))
# prints 'hello there there' - not expected as r prefix is not used

# example 2
import re
print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello     there      there'))
# prints 'hello     there' - as expected as r prefix is used

# example 3
import re
print (re.sub('(\b\w+)(\s+\1\b)+', '\1', 'hello     there      there'))
# prints 'hello     there      there' - as expected as r prefix is not used
Sunglasses answered 11/2, 2010 at 1:18 Comment(1)
I think I explain well why raw strings are needed here: https://mcmap.net/q/25945/-escaping-regex-string - Upswing

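Because a backslash begins an escape sequence only when what follows forms a valid one. '\s' is not a recognized escape sequence, so the backslash stays in the string and example 1 works without the r prefix.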

>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print('\n')


>>> print(r'\n')
\n
>>> '\s'
'\\s'
>>> r'\s'
'\\s'
>>> print('\s')
\s
>>> print(r'\s')
\s

Unless an 'r' or 'R' prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C. The recognized escape sequences are:

Escape Sequence   Meaning
\newline          Ignored
\\                Backslash (\)
\'                Single quote (')
\"                Double quote (")
\a                ASCII Bell (BEL)
\b                ASCII Backspace (BS)
\f                ASCII Formfeed (FF)
\n                ASCII Linefeed (LF)
\N{name}          Character named name in the Unicode database (Unicode only)
\r                ASCII Carriage Return (CR)
\t                ASCII Horizontal Tab (TAB)
\uxxxx            Character with 16-bit hex value xxxx (Unicode only)
\Uxxxxxxxx        Character with 32-bit hex value xxxxxxxx (Unicode only)
\v                ASCII Vertical Tab (VT)
\ooo              Character with octal value ooo
\xhh              Character with hex value hh
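
A few rows of the table can be checked directly. This is only a minimal sketch, assuming Python 3, where every str is Unicode and the "Unicode only" rows therefore apply to ordinary string literals. The variable names are made up for illustration.

```python
# Each recognized escape sequence collapses to a single character.
newline = '\n'
backspace = '\b'
tab = '\t'

# Numeric escapes: octal 101, hex 41 and U+0041 all encode 'A'.
octal_a = '\101'
hex_a = '\x41'
unicode_a = '\u0041'

# A raw literal keeps the backslash, so it is two characters long.
raw_newline = r'\n'

print(len(newline), len(backspace), len(tab))   # 1 1 1
print(octal_a == hex_a == unicode_a == 'A')     # True
print(len(raw_newline), raw_newline == '\\n')   # 2 True
print(ord(backspace))                           # 8
```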

Never rely on raw strings for path literals. Raw strings have a peculiar restriction on trailing backslashes that is known to have bitten people:

When an "r" or "R" prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string. For example, the string literal r"\n" consists of two characters: a backslash and a lowercase "n". String quotes can be escaped with a backslash, but the backslash remains in the string; for example, r"\"" is a valid string literal consisting of two characters: a backslash and a double quote; r"\" is not a valid string literal (even a raw string cannot end in an odd number of backslashes). Specifically, a raw string cannot end in a single backslash (since the backslash would escape the following quote character). Note also that a single backslash followed by a newline is interpreted as those two characters as part of the string, not as a line continuation.

To better illustrate this last point:

>>> r'\'
SyntaxError: EOL while scanning string literal
>>> r'\''
"\\'"
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\''
"'"
>>> 
>>> r'\\'
'\\\\'
>>> '\\'
'\\'
>>> print(r'\\')
\\
>>> print(r'\')
SyntaxError: EOL while scanning string literal
>>> print('\\')
\
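
The trailing-backslash restriction matters mostly for Windows-style directory paths. The sketch below assumes Python 3 and uses a made-up path, C:\temp, purely as an illustration. It checks that the raw literal is rejected at compile time and shows two possible ways to get the trailing backslash anyway.

```python
# The source text r'\' is rejected by the compiler: the backslash
# "escapes" the closing quote, so the literal never terminates.
try:
    compile("r'\\'", '<example>', 'eval')
    raw_literal_is_valid = True
except SyntaxError:
    raw_literal_is_valid = False

# Workaround 1: append the trailing backslash as a normal literal.
path_a = r'C:\temp' + '\\'

# Workaround 2: skip raw strings and double every backslash.
path_b = 'C:\\temp\\'

print(raw_literal_is_valid)   # False
print(path_a == path_b)       # True
print(path_a)                 # C:\temp\
```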
Waksman answered 11/2, 2010 at 1:24 Comment(3)
As a minor fix, '\s' (like r'\s') is also represented as '\\s', due to '\s' not being a recognized escape sequence. - Toastmaster
@MassoodKhaari I'd swear that the output was correct back when I wrote this answer... Fixed. - Fortuna
8 years certainly justify the magical change in Python behavior. :D - Toastmaster

The 'r' means that the following is a "raw string", i.e. backslash characters are treated literally instead of signifying special treatment of the following character.

http://docs.python.org/reference/lexical_analysis.html#literals

So '\n' is a single newline,
and r'\n' is two characters: a backslash and the letter 'n'.
Another way to write it would be '\\n', because the first backslash escapes the second.

An equivalent way of writing this

print (re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', 'hello     there      there'))

is

print (re.sub('(\\b\\w+)(\\s+\\1\\b)+', '\\1', 'hello     there      there'))

Because of the way Python treats sequences that are not valid escape sequences, not all of those double backslashes are necessary. For example, '\s' == '\\s'; however, the same is not true for '\b' and '\\b'. My preference is to be explicit and double all the backslashes.
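
The raw prefix only changes how the literal is parsed; the re module receives the same characters either way. The sketch below, assuming Python 3, checks that equivalence. It also evaluates the literal '\s' from source text, because newer Python versions emit a warning for unrecognized escapes in non-raw literals (a DeprecationWarning since 3.6 and a SyntaxWarning since 3.12, as far as I know), which is one more reason to double the backslashes or use a raw string.

```python
import re
import warnings

raw_pattern = r'(\b\w+)(\s+\1\b)+'
doubled_pattern = '(\\b\\w+)(\\s+\\1\\b)+'

# Both literals produce the same characters, so re sees the same regex.
print(raw_pattern == doubled_pattern)   # True

text = 'hello' + ' ' * 5 + 'there' + ' ' * 6 + 'there'
same_result = re.sub(raw_pattern, r'\1', text) == re.sub(doubled_pattern, '\\1', text)
print(same_result)                      # True

# \b is a recognized escape, so single and doubled forms differ.
print('\b' == '\\b')                    # False

# Evaluate the literal '\s' from source text so that any warning the
# compiler issues about the unrecognized escape can be recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    single = eval(r"'\s'")

print(single == '\\s')                  # True
print([w.category.__name__ for w in caught])
```

The last line prints whichever warning categories the running interpreter reported, so its output depends on the Python version.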

Reisfield answered 11/2, 2010 at 1:30 Comment(1)
If r or R means raw string, then how does the plus sign even make sense? It would literally search for that exact string: "(\b\w+)(\s+\1\b)+". I don't understand your choice of string. Or are regular expressions somehow a part of raw strings? In which case, I am utterly confused. - Welldone

Not all sequences involving backslashes are escape sequences. \t and \f are, for example, but \s is not. In a non-raw string literal, any \ that is not part of an escape sequence is seen as just another \:

>>> "\s"
'\\s'
>>> "\t"
'\t'

\b is an escape sequence, however, so example 3 fails. (And yes, some people consider this behaviour rather unfortunate.)
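
To see concretely why example 3 fails, here is a small sketch assuming Python 3. The names text, broken_pattern, broken_result and working_result are only illustrative. The non-raw pattern is written with the \w and \s backslashes doubled, which yields the same characters as the literal in example 3 without relying on unrecognized escapes.

```python
import re

text = 'hello' + ' ' * 5 + 'there' + ' ' * 6 + 'there'

# In a non-raw literal, \b is a backspace and \1 is the character U+0001.
print('\b' == '\x08', '\1' == '\x01')   # True True

# The characters that the non-raw literals of example 3 contain:
# the regex now looks for literal backspace and U+0001 characters.
broken_pattern = '(\b\\w+)(\\s+\1\b)+'
broken_result = re.sub(broken_pattern, '\1', text)

# The raw-string version from example 2.
working_result = re.sub(r'(\b\w+)(\s+\1\b)+', r'\1', text)

print(broken_result == text)                            # True: nothing matched
print(working_result == 'hello' + ' ' * 5 + 'there')    # True
```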

Histoid answered 11/2, 2010 at 1:24 Comment(6)
Exactly. Although, @JT, I recommend using '\\s' or r'\s', or you'll probably inadvertently hit some escape sequences that you didn't mean to. - Grindery
Indeed: always use raw string literals when you want the string to contain backslashes (as opposed to actually wanting the escape sequences.) - Histoid
@Thomas: r still escapes some sequences when they appear at the end of the string: r"\" is invalid, to do that you have to do "\\". If you do r"\\", you get a \\ printed ("\\\\" string). Be careful with that. - Fortuna
Yes, raw string literals can't end in a single `\`. - Histoid
@Blair/Thomas: thanks - this was the general rule I was following that got me confused in the first place! ... all is clear now, thanks all. Though in following this rule ... when reading the pattern from a plain text file, how would the pattern be passed on as a raw literal string? - Sunglasses
@JT, if I understand the question correctly, you would just put \s in the plain text file - no interpretation of the string contents would occur when you read it in. - Grindery
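
A rough sketch of that last point, assuming Python 3: escape processing applies only to string literals in source code, so a pattern read from a file arrives with its backslashes intact. The temporary file and the names handle, pattern_path and pattern are just for illustration.

```python
import os
import re
import tempfile

# Write the three characters \s+ plus a newline to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
    handle.write('\\s+\n')
    pattern_path = handle.name

try:
    with open(pattern_path) as pattern_file:
        # Strip the line ending so it does not become part of the regex.
        pattern = pattern_file.readline().rstrip('\n')
finally:
    os.remove(pattern_path)

print(pattern == r'\s+')   # True
result = re.sub(pattern, ' ', 'hello     there      there')
print(result)              # hello there there
```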

Try this:

a = '\''
print(a)   # '
a = r'\''
print(a)   # \'
a = "\'"
print(a)   # '
a = r"\'"
print(a)   # \'
Bolshevik answered 29/7, 2019 at 9:39 Comment(0)

Check the example below:

print(r"123\n123")
# output:
# 123\n123

print("123\n123")
# output:
# 123
# 123
Darien answered 29/9, 2019 at 12:2 Comment(0)
