Multiple regex substitutions using a dict with regex expressions as keys

Asked 19/2, 2021 at 0:21 Answered 19/2, 2021 at 0:38

Solved python regex string substitution python-re

I want to make multiple substitutions to a string using multiple regular expressions. I also want to make the substitutions in a single pass to avoid creating multiple instances of the string.

Let's say for the sake of argument that I want to make the substitutions below, while avoiding repeated calls to re.sub(), whether explicit or in a loop:

import re

text = "local foals drink cola"
text = re.sub("(?<=o)a", "w", text)
text = re.sub("l(?=a)", "co", text)

print(text) # "local fowls drink cocoa"

The closest solution I have found for this is to compile a regular expression from a dictionary of substitution targets and then to use a lambda function to replace each matched target with its value in the dictionary. However, this approach does not work when using metacharacters, thus removing the functionality needed from regular expressions in this example.

Let me demonstrate first with an example that works without metacharacters:

import re

text = "local foals drink cola"

subs_dict = {"a":"w", "l":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)

print(text) # "coocwco fowcos drink cocow"

Now observe that adding the desired metacharacters to the dictionary keys results in a KeyError:

import re

text = "local foals drink cola"

subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join(subs_dict.keys()))
text = re.sub(subs_regex, lambda match: subs_dict[match.group(0)], text)

>>> KeyError: "a"

The reason for this is that sub() correctly finds a match for the expression "(?<=o)a", which must then be looked up in the dictionary to find its substitution, but the value that match.group(0) supplies for the lookup is the matched string "a", not the expression that matched it. Looking up match.re (i.e. the expression that produced the match) does not work either, because its pattern is the whole alternation that was compiled from the dictionary keys (i.e. "(?<=o)a|l(?=a)").
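A minimal sketch of that diagnosis, reusing the dictionary from the failing example: it prints what the match object exposes, so it is easy to see that neither value is a key of subs_dict. It assumes Python 3.7+ so that the dict keeps insertion order.

```python
import re

subs_dict = {"(?<=o)a": "w", "l(?=a)": "co"}
subs_regex = re.compile("|".join(subs_dict))

# The first match in the sample text is the "a" in "foals".
match = subs_regex.search("local foals drink cola")

print(match.group(0))               # a  (the matched text, not the regex)
print(match.re.pattern)             # (?<=o)a|l(?=a)  (the whole alternation)
print(match.group(0) in subs_dict)  # False, hence the KeyError
```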

EDIT: In case anyone would benefit from seeing thejonny's solution implemented with a lambda function as close to my originals as possible, it would work like this:

import re

text = "local foals drink cola"

subs_dict = {"(?<=o)a":"w", "l(?=a)":"co"}
subs_regex = re.compile("|".join("("+key+")" for key in subs_dict))

group_index = 1
indexed_subs = {}
for target, sub in subs_dict.items():
    indexed_subs[group_index] = sub
    group_index += re.compile(target).groups + 1

text = re.sub(subs_regex, lambda match: indexed_subs[match.lastindex], text)

print(text) # "local fowls drink cocoa"
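The index bookkeeping above only matters when a key contains capturing groups of its own. A small sketch of why, using a made-up key "(fo)(a)" that is not part of the original example: the compiled pattern's groups attribute counts its capturing groups, and match.lastindex reports the wrapping group because that group is the last one to close.

```python
import re

# .groups counts capturing groups; lookarounds do not capture.
print(re.compile("(?<=o)a").groups)  # 0
print(re.compile("(fo)(a)").groups)  # 2

# Two keys, each wrapped in its own group: the wrappers are groups 1 and 4,
# because groups 2 and 3 belong to the first key.
combined = re.compile("((fo)(a))|(l(?=a))")
hits = [(m.group(0), m.lastindex) for m in combined.finditer("foal cola")]
print(hits)  # [('foa', 1), ('l', 4)]
```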

Anya answered 19/2, 2021 at 0:21 Comment(1)

This is similar to question: https://mcmap.net/q/2035996/-multiple-specific-regex-substitutions-in-python/13968392 – Ludovico 6/11, 2021 at 20:42

If none of the expressions you want to use matches the empty string (which is a valid assumption if you want to replace something), you can wrap each expression in a group before joining them with |, and then check which group found a match:

(exp1)|(exp2)|(exp3)

Alternatively, use named groups so you don't have to count the subgroups inside the subexpressions.

The replacement function can then look up which group matched and choose the replacement from a list.

I came up with this implementation:

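import re

def dictsub(replacements, string):
    """replacements has the form {"regex1": "replacement1", "regex2": "replacement2", ...}"""
    # Wrap each expression in its own capturing group and join them with |.
    exprall = re.compile("|".join("("+x+")" for x in replacements))
    # Map the index of each wrapping group to its replacement, skipping
    # over any capturing groups inside the expression itself.
    gi = 1
    replacements_by_gi = {}
    for (expr, replacement) in replacements.items():
        replacements_by_gi[gi] = replacement
        gi += re.compile(expr).groups + 1

    def choose(match):
        # lastindex is the index of the wrapping group that matched.
        return replacements_by_gi[match.lastindex]

    return re.sub(exprall, choose, string)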


text = "local foals drink cola"
print(dictsub({"(?<=o)a":"w", "l(?=a)":"co"}, text))

This prints local fowls drink cocoa.
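The named-group variant mentioned earlier in this answer could look roughly like the sketch below. The function name dictsub_named and the generated group names r0, r1, ... are made up for illustration; it relies on match.lastgroup returning the name of the wrapping group. It assumes the expressions do not already use those group names and do not contain numbered backreferences, which would shift once the expressions are wrapped.

```python
import re

def dictsub_named(replacements, string):
    """Apply {"regex": "replacement", ...} in one pass using named groups."""
    by_name = {}
    parts = []
    for i, (expr, replacement) in enumerate(replacements.items()):
        name = f"r{i}"  # hypothetical generated group name
        by_name[name] = replacement
        parts.append(f"(?P<{name}>{expr})")
    pattern = re.compile("|".join(parts))
    # lastgroup is the name of the wrapping group that matched, so no
    # counting of inner capturing groups is needed.
    return pattern.sub(lambda m: by_name[m.lastgroup], string)

text = "local foals drink cola"
print(dictsub_named({"(?<=o)a": "w", "l(?=a)": "co"}, text))
# local fowls drink cocoa
```

Because the replacement comes from a function, it is inserted literally and backslashes in it are not interpreted.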

Payton answered 19/2, 2021 at 0:29 Comment(0)

You could do this by keeping the expected match text as the key and storing both the replacement and the regex in a nested dict. Since each of your patterns matches one specific character, this definition should work.

import re

text = "local foals drink cola"
subs_dict = {"a": {'replace': 'w', 'regex': '(?<=o)a'}, 'l': {'replace': 'co', 'regex': 'l(?=a)'}}
subs_regex = re.compile("|".join(v['regex'] for v in subs_dict.values()))
re.sub(subs_regex, lambda match: subs_dict[match.group(0)]['replace'], text)

'local fowls drink cocoa'

Vassallo answered 19/2, 2021 at 0:38 Comment(2)

With this approach, how would I deal with replacements that have more than one regex condition? For example, if I want "a" replaced by "w" when preceded by "o" but by "x" when preceded by "c"? – Anya 19/2, 2021 at 0:48

@BarnabyClunge The other answer (which has now been edited) accounts for this. Have tested and your example above works when using the answer provided by thejonny. – Vassallo 19/2, 2021 at 1:42
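A sketch of the case raised in these comments, under the assumption that "a" should become "w" after "o" and "x" after "c". Keying on the matched text cannot express this, since a dict holds only one entry per key, whereas keying on the regex and selecting by group index can. The helper name sub_all is made up for illustration.

```python
import re

# Keyed on the matched text: the second "a" entry silently replaces the first.
by_text = {
    "a": {"replace": "w", "regex": "(?<=o)a"},
    "a": {"replace": "x", "regex": "(?<=c)a"},
}
print(len(by_text))  # 1

def sub_all(rules, string):
    """rules maps regex -> replacement; chosen by which wrapping group matched."""
    pattern = re.compile("|".join(f"({expr})" for expr in rules))
    by_index = {}
    index = 1
    for expr, replacement in rules.items():
        by_index[index] = replacement
        index += re.compile(expr).groups + 1  # skip the key's own groups
    return pattern.sub(lambda m: by_index[m.lastindex], string)

result = sub_all({"(?<=o)a": "w", "(?<=c)a": "x"}, "local foals drink cola")
print(result)  # locxl fowls drink cola
```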
