Replacing string with placeholder and replacing them back after a function.
Asked Answered
S

4

6

Given a string and a list of substring that should be replaces as placeholders, e.g.

import re
from copy import copy 

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

The first goal is to first replace the substrings from phrases in the original_text with indexed placeholders, e.g.

text = copy(original_text)
backplacement = {}
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen

Then there'll be some functions to manipulate the text with the placeholders, e.g.

cleaned_text = func('Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen')
print(cleaned_text)

that outputs:

MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2

the last step is to do the replacement we did in a backwards manner and put back the original phrases, i.e.

' '.join([backplacement[tok] if tok in backplacement else tok for tok in clean_text.split()])

[out]:

"'s_morgen ik 's-Hertogenbosch depository_financial_institution"

The questions are:

  1. If the list of substrngs in phrases is huge, the time to do the 1st replacement and the last backplacement would take very long.

Is there a way to do the replacement/backplacement with a regex?

  1. using the re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text) regex substitution isn't very helpful esp. if there are substrings in the phrases that matches not the full word,

E.g.

phrases = ["org", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"{}".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

we get an awkward output:

Something, 's mMWEPHRASE0en, ik MWEPHRASE1 im das MWEPHRASE2 gehen

I've tried using '\b{}\b'.format(phrase) but that'll didn't work for the phrases with punctuations, i.e.

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"
backplacement = {}
text = copy(original_text)
for i, phrase in enumerate(phrases):
    backplacement["MWEPHRASE{}".format(i)] = phrase.replace(' ', '_')
    text = re.sub(r"\b{}\b".format(phrase), "MWEPHRASE{}".format(i), text)
print(text)

[out]:

Something, 's morgen, ik 's-Hertogenbosch im das MWEPHRASE2 gehen

Is there some where to denote the word boundary for the phrases in the re.sub regex pattern?

Sexpot answered 14/3, 2018 at 8:36 Comment(6)
In your desired output, all strings that do not occur in phrases are removed, except for ik. Why is that?Gibbous
You're doing this the hard way. Then there'll be some functions to manipulate the text with the placeholders. So, you have a function to work on the text after adding the placeholders. And that function must do a split on whitespace or something. So, now you have an array where you manipulate all the elements except the placeholders, then you want to join the array into a string, then substitute the placeholders back using the real words. Is that correct ?Uriah
Single pass, I would match all the words using a regex and put them into two dimension array ( or list). Dimension 0 is the string part, dimension 1 is a flag. When you match a non-phrase string part, the flag is 0, when it is a phrase word, the flag is 1. You can then iterate the array and ignore the ones where the flag is 1. Add, delete, re-arrange elements as needed. Then join them back together. The regex is simple ((?:(?!phrase1|phrase2|phrase3)[\S\s])+)|(phrase1|phrase2|phrase3). Where, capture group 1 is a non-phrase string part, capture group 2 is a phrase.Uriah
This seem to be one alternative: github.com/vi3k6i5/flashtextSexpot
As for the word boundary, you must be looking for r"(?<!\w){}(?!\w)".format(phrase). Since some of your keywords start with a non-word chars, you cannot use \b. Could you please provide some more logic that you need to implement? It looks like you might need to pass a callback/lambda as the second argument to re.sub to replace each match just once.Jeffreys
Have you tried my approach? Or do you want to switch to FlashText now?Jeffreys
T
2

Instead of using re.sub you can split it!

def do_something_with_str(string):
    # do something with string here.
    # for example let's wrap the string with "@" symbol if it's not empty
    return f"@{string}" if string else string


def get_replaced_list(string, words):
    result = [(string, True), ]

    # we take each word we want to replace
    for w in words:

        new_result = []

        # Getting each word in old result
        for r in result:

            # Now we split every string in results using our word.
            split_list = list((x, True) for x in r[0].split(w)) if r[1] else list([r, ])

            # If we replace successfully - add all the strings
            if len(split_list) > 1:

                # This one would be for [text, replaced, text, replaced...]
                sub_result = []
                ws = [(w, False), ] * (len(split_list) - 1)
                for x, replaced in zip(split_list, ws):
                    sub_result.append(x)
                    sub_result.append(replaced)
                sub_result.append(split_list[-1])

                # Add to new result
                new_result.extend(sub_result)

            # If not - just add it to results
            else:
                new_result.extend(split_list)
        result = new_result
    return result


if __name__ == '__main__':
    initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
    words_to_replace = ('a', 'c')
    replaced_list = get_replaced_list(initial_string, words_to_replace)
    modified_list = [(do_something_with_str(x[0]), True) if x[1] else x for x in replaced_list]
    final_string = ''.join([x[0] for x in modified_list])

Here's variables values of the example above:

initial_string = 'acbbcbbcacbbcbbcacbbcbbca'
words_to_replace = ('a', 'c')
replaced_list = [('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('bb', True), ('c', False), ('bb', True), ('c', False), ('', True), ('a', False), ('', True)]
modified_list = [('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True), ('c', False), ('@bb', True), ('c', False), ('@bb', True), ('c', False), ('', True), ('a', False), ('', True)]
final_string = 'ac@bbc@bbcac@bbc@bbcac@bbc@bbca'

As you can see the lists contain tuples. They contain two values - some string and boolean, representing whether it's a text or replaced value (True when text). After you get replaced list, you can modify it as in the example, checking if it's text value (if x[1] == True). Hope that helps!

P.S. String formatting like f"some string here {some_variable_here}" requires Python 3.6

Textual answered 21/3, 2018 at 8:28 Comment(0)
O
2

I think there are two keys to using regular expressions for this task:

  1. Use custom boundaries, capture them, and substitute them back, along with the phrase.

  2. Use a function to process the substitution matches, in both directions.

Below is an implementation that uses this approach. I tweaked your text slightly to repeat one of the phrases.

import re
from copy import copy 

original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen"
text = copy(original_text)

#
# The phrases of interest
#
phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]

#
# Create the mapping dictionaries
#
phrase_to_mwe = {}
mwe_to_phrase = {}

#
# Build the mappings
#
for i, phrase in enumerate(phrases):

    mwephrase                = "MWEPHRASE{}".format(i)
    mwe_to_phrase[mwephrase] = phrase.replace(' ', '_')
    phrase_to_mwe[phrase]    = mwephrase

#
# Regex match handlers
#
def handle_forward(match):

    b1     = match.group(1)
    phrase = match.group(2)
    b2     = match.group(3)

    return b1 + phrase_to_mwe[phrase] + b2


def handle_backward(match):

    return mwe_to_phrase[match.group(1)]

#
# The forward regex will look like:
#
#    (^|[ ])('s morgen|'s-Hertogenbosch|depository financial institution)([, ]|$)
# 
# which captures three components:
#
#    (1) Front boundary
#    (2) Phrase
#    (3) Back boundary
#
# Anchors allow matching at the beginning and end of the text. Addtional boundary characters can be
# added as necessary, e.g. to allow semicolons after a phrase, we could update the back boundary to:
#
#    ([,; ]|$)
#
regex_forward  = re.compile(r'(^|[ ])(' + '|'.join(phrases) + r')([, ]|$)')
regex_backward = re.compile(r'(MWEPHRASE\d+)')

#
# Pretend we cleaned the text in the middle
#
cleaned = 'MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2 MWEPHRASE0'

#
# Do the translations
#
text1 = regex_forward .sub(handle_forward,  text)
text2 = regex_backward.sub(handle_backward, cleaned)

print('original: {}'.format(original_text))
print('text1   : {}'.format(text1))
print('text2   : {}'.format(text2))

Running this generates:

original: Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen 's morgen
text1   : Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen MWEPHRASE0
text2   : 's_morgen ik 's-Hertogenbosch depository_financial_institution 's_morgen
Obediah answered 21/3, 2018 at 16:2 Comment(0)
R
1

Here's a strategy you could use:

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

# need this module for the reduce function
import functools as fn

#convert phrases into a dictionary of numbered placeholders (tokens)
tokens = { kw:"MWEPHRASE%s"%i for i,kw in enumerate(phrases) }

#replace embedded phrases with their respective token
tokenized = fn.reduce(lambda s,kw: tokens[kw].join(s.split(kw)), phrases, original_text)

#Apply text cleaning logic on the tokenized text 
#This assumes the placeholders are left untouched, 
#although it's ok to move them around)
cleaned_text = cleanUpfunction(tokenized)

#reverse the token dictionary (to map original phrases to numbered placeholders)
unTokens = {v:k for k,v in tokens.items() }

#rebuild phrases with original text associated to each token (placeholder)
final_text = fn.reduce(lambda s,kw: unTokens[kw].join(s.split(kw)), phrases, cleaned_text)
Revel answered 23/3, 2018 at 3:26 Comment(0)
S
1

What you are looking for is called "multi-string search" or "multi-pattern search". The more common solutions are Aho-Corasick and Rabin-Karp algorithms. If you want to impement it yourself, go with Rabin-Karp, since its easier to grasp. Otherwise, you'll find some libraries. Here's a solution with the library https://pypi.python.org/pypi/py_aho_corasick.

Let

phrases = ["'s morgen", "'s-Hertogenbosch", "depository financial institution"]
original_text = "Something, 's morgen, ik 's-Hertogenbosch im das depository financial institution gehen"

And, for testing purposes:

def clean(text):
    """A simple stub"""
    assert text == 'Something, MWEPHRASE0, ik MWEPHRASE1 im das MWEPHRASE2 gehen'
    return "MWEPHRASE0 ik MWEPHRASE1 MWEPHRASE2"

Now, you have to define two automatons, one for the outward journey, and the other for the return. An automaton is defined by a list of (key,value):

fore_automaton = py_aho_corasick.Automaton([(phrase,"MWEPHRASE{}".format(i)) for i, phrase in enumerate(phrases)])
back_automaton = py_aho_corasick.Automaton([("MWEPHRASE{}".format(i), phrase.replace(' ','_')) for i, phrase in enumerate(phrases)])

The automaton will scan the text and return a list of matches. A match is a triplet (position, key, value). With a little work on matches, you'll be able to replace keys by values:

def process(automaton, text):
    """Returns a new text, with keys of the automaton replaced by values"""
    matches = automaton.get_keywords_found(text.lower()) # text.lower() because auomaton of py_aho_corasick uses lowercase for keys
    bk_value_eks = [(i,v,i+len(k)) for i,k,v in matches] # (begin of key, value, end of key)
    chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)] if bk_value_ek1[2] <= bk_value_ek2[0]] # see below
    return "".join(chunks)

A brief explanation on chunks = [bk_value_ek1[1]+text[bk_value_ek1[2]:bk_value_ek2[0]] for bk_value_ek1,bk_value_ek2 in zip([(-1,"",0)]+bk_value_eks, bk_value_eks+[(len(text),"",-1)] if bk_value_ek1[2] <= bk_value_ek2[0]]. I zip matches with itself almost as usual : zip(arr, arr[1:]) will output (arr[0], arr[1)), (arr[1], arr[2]), ... to consider every match with its sucessor. Here I placed two sentinels to handle the begin and the end of matches.

  • For the normal case, I just output the value (=bk_value_ek1[1]) and the text between the end of the key and the begin of the next key (text[bk_value_ek1[2]:bk_value_ek2[0]).
  • The begin sentinel has an empty value, and its key ends at position 0, thus the first chunk will be "" + text[0:begin of key1], that is the text before the first key.
  • Similarly, the end sentinel has also an empty value and its key begins at the end of the text, thus the last chunk will be : value of the last match + text[end of the last key:len(text)].

What happens when keys overlap? Take an example: text="abcdef", phrases={"bcd":"1", "cde":"2"}. You have two matches: (1, "bcd", "1") and (2, "cde", "3"). Let's go: bk_value_eks = [(1, "1", 4), (2, "2", 5)]. Thus, without if bk_value_ek1[2] <= bk_value_ek2[0], the text will be replaced by text[:1]+"1"+text[4:2]+"2"+text[5:], that is "a"+"1"+""+"2"+"f" = "a12f" instead of "a1ef" (ignore the second match)...

Now, take a look a the result:

print(process(back_automaton, clean(process(fore_automaton, original_text))))
# "'s_morgen ik 's-Hertogenbosch depository_financial_institution"

You don't have to define a new process function for the return, just give it the back_automaton and it will do the job.

Subotica answered 23/3, 2018 at 17:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.