Given a string and a list of substrings that should be replaced with placeholders, e.g.
import re
from copy import copy
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
The first goal is to replace the substrings from phrases in the original_text with indexed placeholders, e.g.
text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen
Then there'll be some functions to manipulate the text with the placeholders, e.g.
cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)
that outputs:
MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2
The last step is to reverse the replacement and put back the original phrases, i.e.
' '.join([backplacement[tok] if tok in backplacement else tok for tok in cleaned_text.split()])
[out]:
"'s_morgen ik 's-Hertogenbosch depository_financial_institution"
The questions are:
- If the list of substrings in phrases is huge, the first replacement and the last backplacement would take a very long time. Is there a way to do the replacement/backplacement with a regex?
- Using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful, especially if some of the phrases match only part of a word, e.g.
phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
we get an awkward output:
Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen
I've tried using r'\b{}\b'.format(phrase), but that didn't work for the phrases that start with punctuation, i.e.
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)
[out]:
Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen
Is there some way to denote the word boundary for the phrases in the re.sub regex pattern?
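To make the failure with \b concrete, here is a small sketch that checks a single phrase against the example text. The variable names with_b and without_leading_b are only for illustration.

```python
import re

text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# \b only matches between a word character and a non-word character.
# Before "'s morgen" there is a space followed by an apostrophe, both of them
# non-word characters, so the leading \b can never match at that position.
with_b = re.search(r"\b's morgen\b", text)

# Without the leading \b the phrase is found; the trailing \b still holds
# because "n" (word character) is followed by "," (non-word character).
without_leading_b = re.search(r"'s morgen\b", text)

print(with_b)                      # None
print(without_leading_b.group(0))  # 's morgen
```

So the problem is limited to phrases whose first or last character is not a word character; "depository financial institution" starts and ends with letters, which is why it is the only phrase replaced in the output above.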
phrases are removed, except for ik. Why is that? – Gibbous
"Then there'll be some functions to manipulate the text with the placeholders." So, you have a function to work on the text after adding the placeholders. And that function must do a split on whitespace or something. So, now you have an array where you manipulate all the elements except the placeholders, then you want to join the array into a string, then substitute the placeholders back using the real words. Is that correct? – Uriah
((?:(?!phrase1|phrase2|phrase3)[\S\s])+)|(phrase1|phrase2|phrase3). Where capture group 1 is a non-phrase string part, capture group 2 is a phrase. – Uriah
r"(?<!\w){}(?!\w)".format(phrase). Since some of your keywords start with non-word chars, you cannot use \b. Could you please provide some more logic that you need to implement? It looks like you might need to pass a callback/lambda as the second argument to re.sub to replace each match just once. – Jeffreys
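Putting the suggestions from the comments together, a minimal sketch could look like the following. It assumes the phrases should match literally (hence re.escape), that longer phrases should win over shorter ones, and that the placeholder format MWEPHRASE<n> from the question is kept. The helper names replace_phrases and restore are hypothetical, not part of the question.

```python
import re

# Matches any indexed placeholder produced below.
PLACEHOLDER = re.compile(r"MWEPHRASE\d+")

def replace_phrases(text, phrases):
    """Swap every phrase for an indexed placeholder in a single pass."""
    if not phrases:
        return text, {}
    placeholder = {p: "MWEPHRASE{}".format(i) for i, p in enumerate(phrases)}
    backplacement = {ph: p.replace(" ", "_") for p, ph in placeholder.items()}
    # Longest first, so a longer phrase is tried before a shorter one.
    ordered = sorted(placeholder, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    # The lookarounds require that no word character touches the match,
    # which also works for phrases that start or end with punctuation.
    pattern = re.compile(r"(?<!\w)(?:{})(?!\w)".format(alternation))
    # The callback looks up the placeholder for whichever phrase matched.
    new_text = pattern.sub(lambda m: placeholder[m.group(0)], text)
    return new_text, backplacement

def restore(text, backplacement):
    """Put the phrases back; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: backplacement.get(m.group(0), m.group(0)), text)

phrases = ["org", "'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

text, backplacement = replace_phrases(original_text, phrases)
print(text)      # Something, MWEPHRASE1, ik MWEPHRASE2 im das MWEPHRASE3 gehen

restored = restore("MWEPHRASE1 ik MWEPHRASE2 MWEPHRASE3", backplacement)
print(restored)  # 's_morgen ik 's-Hertogenbosch depository_financial_institution
```

Here "org" is no longer replaced inside "morgen", and the text is scanned once instead of once per phrase. Whether a single large alternation is fast enough for a huge phrase list would still need to be measured.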