Python Regex escape operator \ in substitutions & raw strings

I don't understand how the escape operator \ works in Python regex together with the r' prefix of raw strings. Some help is appreciated.

code:

import re
text=' esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation'
print('text0=',text)
text1 = re.sub(r'(\s+)([;:\.\-])', r'\2', text)
text2 = re.sub(r'\s+\.', '\.', text)
text3 = re.sub(r'\s+\.', r'\.', text)
print('text1=',text1)
print('text2=',text2)
print('text3=',text3)

The theory says: backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.

And as far as the link provided at the end of this question explains, r' marks a raw string, i.e. the symbols have no special meaning and the string is taken exactly as written.

So in the above regex I would expect text2 and text3 to be different, since the substitution text is '\.' in text2, i.e. (I assumed) just a period, whereas (in principle) the substitution text in text3 is r'\.', which is a raw string, i.e. the string should appear as it is, backslash and period. But they give the same result:

The result is:

text0=  esto  .es  10  . er - 12 .23 with [  and.Other ] here is more ; puntuation
text1=  esto.es  10. er- 12.23 with [  and.Other ] here is more; puntuation
text2=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
text3=  esto\.es  10\. er - 12\.23 with [  and.Other ] here is more ; puntuation
#text2=text3 but substitutions are not the same r'\.' vs '\.'

It looks to me that the r' does not work the same way in substitution part, nor the backslash. On the other hand my intuition tells me I am missing something here.

EDIT 1: Following @Wiktor Stribiżew's comment, which pointed out this example (from his link):

import re
print(re.sub(r'(.)(.)(.)(.)(.)(.)', 'a\6b', '123456'))
print(re.sub(r'(.)(.)(.)(.)(.)(.)', r'a\6b', '123456'))
# in my example the substitutions were not the same and the result were equal
# here indeed r' changes the results

which gives:

ab
a6b

that puzzles me even more.

Note: I read this stack overflow question about raw strings which is super complete. Nevertheless it does not speak about substitutions

Telephony answered 10/6, 2019 at 9:15 Comment(1)
It does not "speak" about substitutions because the replacement patterns are not regular expressions. '\.' = r'\.', it is a \ and . char combination. And since it is a replacement pattern, you get this text as is in the result. However, you are using \ in your tests and it is even trickier: it is special in the regex replacement pattern. re.sub(r'\s+\.', r'\\.', text) will result in the same string as text2 and text3. See this Python demo. - Jaynejaynell

First and foremost,

replacement patterns ≠ regular expression patterns

We use a regex pattern to search for matches; we use a replacement pattern to replace the matches that the regex found.

NOTE: The only special character in a substitution pattern is a backslash, \. Only the backslash must be doubled.
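
To illustrate the note above, here is a minimal sketch (the sample strings are made up for illustration) showing that characters which are special in a regex pattern are plain text in a replacement pattern:

```python
import re

# '.', '*', '$', '^' and brackets are metacharacters in a search pattern,
# but in the replacement they are inserted literally.
out = re.sub(r'\d+', '.*$^[x]', 'a1b')
print(out)  # a.*$^[x]b
```

Only a backslash in the replacement would need extra care.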

Replacement pattern syntax in Python

The re.sub docs are confusing as they mention both string escape sequences that can be used in replacement patterns (like \n, \r) and regex escape sequences (\6) and those that can be used as both regex and string escape sequences (\&).

I am using the term regex escape sequence to denote an escape sequence consisting of a literal backslash + a character, that is, '\\X' or r'\X', and the term string escape sequence to denote a \ followed by a character (or several characters) that together form a valid string escape sequence. String escape sequences are only recognized in regular string literals. In raw string literals, the only thing a backslash can escape is the quote character (which is why a raw string literal cannot end with a single backslash), and even then the backslash remains part of the string.

So, in a replacement pattern, you may use backreferences:

re.sub(r'\D(\d)\D', r'\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\\1', 'a1b')    # => 1
re.sub(r'\D(\d)\D', '\g<1>', 'a1b')  # => 1
re.sub(r'\D(\d)\D', r'\g<1>', 'a1b') # => 1

You can see that r'\1' and '\\1' are the same replacement pattern, \1. If you use '\1', it gets parsed as a string escape sequence: a single character with octal value 001. If you forget the r prefix with the unambiguous backreference \g<1>, nothing breaks, because \g is not a valid string escape sequence, so the \ escape character remains in the string (recent Python versions do emit a warning for such unrecognized escapes, so the raw form is preferable). Read on in the docs I linked to:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.

So, when you pass '\.' as a replacement string, you actually pass the two-character combination \. as the replacement, exactly as with r'\.'. Since re.sub leaves an unknown escape of a non-letter character alone, you get \. in the result.
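
The two layers (Python string literal first, replacement pattern second) can be checked directly. This is a small sketch that reuses the six-group example from the question:

```python
import re

# Layer 1: the Python string literal.
assert '\1' == '\x01'                       # octal escape: one control character
assert r'\1' == '\\1'                       # backslash + digit: two characters
assert len('\1') == 1 and len(r'\1') == 2

# Layer 2: the replacement pattern as seen by re.sub.
pattern = r'(.)(.)(.)(.)(.)(.)'
octal = re.sub(pattern, 'a\6b', '123456')    # no backslash reaches re.sub
backref = re.sub(pattern, r'a\6b', '123456') # \6 reaches re.sub as a backreference

print(repr(octal))    # 'a\x06b'  (the control character is invisible when printed)
print(repr(backref))  # 'a6b'
```

That invisible control character is why the first call appears to print just ab.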

\ is a special character in Python replacement pattern

If you use re.sub(r'\s+\.', r'\\.', text), you will get the same result as in text2 and text3 cases, see this demo.

That happens because \\, two literal backslashes, denotes a single backslash in the replacement pattern. Conversely, a single backslash followed by a digit is always read as a backreference: if your regex pattern has no Group 2 but you pass r'\2' hoping to insert the \ and 2 characters literally, you get an error (an invalid group reference).
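
A short sketch of both points, using a shortened version of the text from the question (the variable names are only for illustration):

```python
import re

text = 'esto  .es  10  . er'

same_a = re.sub(r'\s+\.', r'\.', text)    # unknown escape \. is kept as is
same_b = re.sub(r'\s+\.', r'\\.', text)   # \\ in the replacement means one backslash
print(same_a == same_b)                   # True

raised = False
try:
    re.sub(r'\s+\.', r'\2', text)         # the pattern has no group 2
except re.error as exc:
    raised = True
    print('re.error:', exc)
```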

Thus, when you have dynamic, user-defined replacement patterns you need to double all backslashes in the replacement patterns that are meant to be passed as literal strings:

re.sub(some_regex, some_replacement.replace('\\', '\\\\'), input_string)
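
As a sketch of how that could be wrapped up: escape_repl below is a hypothetical helper name (it is not part of the re module), and the sample strings are made up. Because only the literal part is escaped, it can still be concatenated with a real backreference:

```python
import re

def escape_repl(replacement: str) -> str:
    """Hypothetical helper: double backslashes so re.sub inserts the text literally."""
    return replacement.replace('\\', '\\\\')

user_repl = r'C:\new\1'   # user text that happens to contain \n and \1

# Fully literal replacement.
literal_only = re.sub(r'PATH', escape_repl(user_repl), 'dir=PATH')

# Real backreference \1 combined with the escaped literal text.
combined = re.sub(r'(\w+)=PATH', r'\1=' + escape_repl(user_repl), 'dir=PATH')

print(literal_only)  # dir=C:\new\1
print(combined)      # dir=C:\new\1
```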
Vilayet answered 10/6, 2019 at 10:19 Comment(4)
Thanks for your answer, which solved this question! Side-remark: funny that we have to use a replace(...) to do a re.sub replace. Recursive! - Calysta
@Calysta Actually, it is common practice to pre-process dynamic replacement patterns that should stay literal in the result. In Java, there is the Matcher.quoteReplacement method designed specifically for that purpose. However, the set of special characters in replacement patterns differs from language to language. - Jaynejaynell
This shows Python should have re.escape_repl or re.escape(..., repl=True) or re.escape(..., mode='repl') to escape replacement patterns, in addition to re.escape, which escapes regex search patterns. What do you think @WiktorStribiżew? - Calysta
@Calysta Yes, that would be a good addition to the re API. - Jaynejaynell

A simple way to work around all these string escaping issues is to use a function/lambda as the repl argument, instead of a string. For example:

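import re

# find_pattern, replacement and input_string are assumed to be defined elsewhere
output = re.sub(
    pattern=find_pattern,
    repl=lambda _: replacement,  # the returned text is inserted as is
    string=input_string,         # renamed so the built-in input() is not shadowed
)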

The replacement string won't be parsed at all, just substituted in place of the match.
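
A runnable sketch of this approach, with made-up sample values. The second call is an assumption about how group contents could be mixed in: the function receives the match object, so m.group(1) can be used instead of a \1 backreference:

```python
import re

literal = r'\d+\n'   # would be rejected or misread if passed as a string repl

# Callable repl: the return value is inserted without any escape processing.
out = re.sub(r'X', lambda _: literal, 'aXb')
print(out)           # a\d+\nb

# Group contents are available through the match object.
out_with_group = re.sub(r'(\w+)=X', lambda m: m.group(1) + '=' + literal, 'key=X')
print(out_with_group)  # key=\d+\n
```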

Coucher answered 14/11, 2021 at 13:37 Comment(2)
Unfortunately this cannot be used if you need to combine the replacement string with backreferences. - Opportunist
This is a great alternative to escaping the escape sequences. Thx - Prat

From the doc (my emphasis):

re.sub(pattern, repl, string, count=0, flags=0) Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

The repl argument is not just plain text. It can also be a function, or it can contain references to groups of the pattern (e.g. \g<quote>, \g<1>, \1).
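
The three cases from the quoted paragraph can be tried out as follows. This is a small sketch with made-up inputs; raw strings are used so that the backslash reaches re.sub unchanged:

```python
import re

# Known escape: re.sub itself converts backslash + n into a newline.
newline = re.sub(r'-', r'\n', 'a-b')
print(repr(newline))   # 'a\nb'

# Unknown escape of a non-letter: left alone, backslash included.
ampersand = re.sub(r'-', r'\&', 'a-b')
print(ampersand)       # a\&b

# Unknown escape of an ASCII letter: treated as an error.
letter_rejected = False
try:
    re.sub(r'-', r'\q', 'a-b')
except re.error:
    letter_rejected = True
print(letter_rejected)  # True
```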

Also, from here:

Unlike Standard C, all unrecognized escape sequences are left in the string unchanged, i.e., the backslash is left in the result.

Since \. is not a recognized string escape sequence, '\.' is the same as r'\.'.
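
This can be verified with a short sketch. Newer Python versions warn about the unrecognized escape in a literal such as '\.', so here the literal is evaluated at runtime with warnings silenced (that detour is only for the demonstration):

```python
import warnings

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    plain = eval(r"'\.'")   # evaluates the source text '\.' as a normal string literal

raw = r'\.'
print(plain == raw, len(plain))  # True 2
```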

Yarber answered 10/6, 2019 at 9:33 Comment(0)
