Recursively split strings that contains a defined set of prefixes - Python
Asked Answered
S

5

6

If i have a list of prefix that can be attached to a string, how do i split a string such into it's prefix and the other characters in the next substring. For example:

prefixes = ['over','under','re','un','co']

str1 = "overachieve"
output: ["over","achieve"]

str2 = "reundo"
output = ["re","un","do"]

Is there a better way to do the above task, maybe with regex or some string functions other than:

str1 = "reundo"
output = []

for x in [p for p in prefixes if p in str1]:
    output.append(x)    
    str1 =  str1.replace(x,"",1)
output.append(str1)
Summand answered 28/1, 2013 at 8:20 Comment(0)
P
5

Regular expressions are an efficient way to search for many alternative prefixes:

import re

def split_prefixes(word, prefixes):
    regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True)))
    result = []
    i = 0
    while True:
        mo = regex.match(word, i)
        if mo is None:
            result.append(word[i:])
            return result
        result.append(mo.group())
        i = mo.end()


>>> prefixes = ['over', 'under', 're', 'un', 'co']
>>> for word in ['overachieve', 'reundo', 'empire', 'coprocessor']:
        print word, '-->', split_prefixes(word, prefixes)

overachieve --> ['over', 'achieve']
reundo --> ['re', 'un', 'do']
empire --> ['empire']
coprocessor --> ['co', 'processor']
Pederast answered 28/1, 2013 at 9:5 Comment(3)
+1 for match(word, i), I never noticed that match also has pos and endpos.Heeley
but there is a problem with underachieve, i end up getting ["un","under","achieve"] =)Summand
Hmm, it works for me. split_prefixes('underachieve', ['over', 'under', 're', 'un', 'co']) --> ['under', 'achieve']Pederast
L
1

I would use the str.startswith method

for p in prefixes:
    if str1.startswith(p):
        output.append(p)
        str1 = str1.replace(p, '', 1)
output.append(str1)

The biggest flaw with your code is that strings like 'found' will output ['un', 'fod'].

However if you have a hypothetical string 'reuncoundo', then you'll need to iterate over the list more than once.

while True:
    if not any(str1.startswith(i) for i in prefixes):
        output.append(str1)
        break
    for p in prefixes:
        if str1.startswith(p):
            output.append(p)
            str1 = str1.replace(p, '', 1)

This outputs ['re', 'un', 'co', 'un', 'do']

Lifeline answered 28/1, 2013 at 8:30 Comment(2)
What about the hypothetical input "reuncoundo"? Your code would miss the second "un", surely?Dealing
@Dealing just addressed that issueLifeline
D
1
prefixes = ['over','under','re','un','co']

def test(string, prefixes, existing=None):
    prefixes.sort(key = lambda s: len(s))
    prefixes.reverse() # This and the previous line ensure that longer prefixes are searched first regardless of initial sorting.
    if existing is None:
        existing = [] # deals with the fact that placing [] as a default parameter and modifying it modifies it for the entire session
    for prefix in prefixes:
        if string.startswith(prefix):
            existing.append(prefix)
            return test(string[len(prefix):], prefixes, existing)
    existing.append(string)
    return existing

This code runs through a string recursively, removing known prefixes until it runs out, then returning the entire list. On longer strings, the generator is probably a better route, but on shorter strings the lack of need for the additional overhead of a generator might make this a better solution.

Dealing answered 28/1, 2013 at 8:32 Comment(4)
Can you confirm the "additional overhead of a generator"? I'd be more concerned with the additional overhead of recursion in this case. (minor remark)Heeley
I was half referring to the addition of the requirement to turn the generator output into a list (instead of an iterable) but also the (probably perceived) mental overhead of a generator. A lot of the people I work and study with consider them basically magic.Dealing
I encourage everyone to get acquainted with generators, there are generall (and even have leaked into C#). That said, if you insist on returning a list, the simplest way to rewrite a generator is to make a result = [] and replace yield foo with result.append(foo). Python doesn't fold tail recursion, but you may :). I don't recommend avoiding generators though; note how happily Python 3 uses iterables everywhere. It's just a nice concept to work with.Heeley
I keep putting off actually using generators more simple than [foo(x) for x in bar]. I think this might be the kick up the backside to start trying harder.Dealing
H
1

Having the "two problems" proverb in mind, I'd still say this is the job for a regular expression. Regexes compile to state machines which check all possible variants in parallel, not one-by-one.

Here's an implementation that leverages that:

import re

def split_string(string, prefixes):
    regex = re.compile('|'.join(map(re.escape, prefixes))) # (1)
    while True:
        match = regex.match(string)
        if not match:
            break
        end = match.end()
        yield string[:end]
        string = string[end:]
    if string:
        yield string # (2)

prefixes = ['over','under','re','un','co']
assert (list(split_string('recouncoundo',prefixes))
        == ['re','co','un','co','un','do'])

Note how the regular expression is constructed in (1):

  • the prefixes are escaped using re.escape so that special characters don't interfere
  • the escaped prefixes are joined using the | (or) regex operator
  • the whole thing gets compiled.

The line (2) yields the final word, if any is left over after splitting prefixes. You might want to remove the if string check if you want the function to return an empty string if nothing remains after prefix stripping.

Also note that re.match (contrary to re.search) only looks for the pattern at the beginning of the input string, so there's no need to append ^ to the regex.

Heeley answered 28/1, 2013 at 8:55 Comment(2)
Python regexes (like most other current regex engines) do check all possible variants in sequence, not in parallel. Therefore, order of the variants matters. For example (re|reun) would split reundo in re and undo, not reun and do as a DFA regex engine (leftmost-longest rule) would.Stannite
What I meant was that the regex would check the first letter first, and having discovered r it would proceed check re and reun but not over or under. Hence I'd expect it to work faster than a repeated for p in prefix-style solution.Heeley
L
1

If you are dealing with prefixes, you don't need regex, you only need startswith(). You can of course use regex, but it's harder to read and maintain, even for an easy one like this. startswith() is simpler, in my opinion.

And the other answers seems too complicated for such a simple problem. I'd suggest a recursive function like this one:

def split_prefixes (word, prefixes):
    split = [p for p in prefixes if word.startswith(p)]
    if split:
        return split + split_prefixes (word[len(split[0]):], prefixes)
    else:
        return [word]

This is the result:

"overachieve" -> ['over', 'achieve']
"reundo" -> ['re', 'un', 'do']
"reuncoundo" -> ['re', 'un', 'co', 'un', 'do']
"empire" -> ['empire']
Letreece answered 28/1, 2013 at 9:50 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.