Finding whether a string starts with one of a list's variable-length prefixes

Asked 24/9, 2011 at 15:29 Answered 4/12, 2018 at 14:7

Solved python string variable-length prefixes

I need to find out whether a name starts with any of a list's prefixes and then remove it, like:

if name[:2] in ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]:
    name = name[2:]

The above only works for list prefixes with a length of two. I need the same functionality for variable-length prefixes.

How is it done efficiently (little code and good performance)?

A for loop iterating over each prefix and then checking name.startswith(prefix) to finally slice the name according to the length of the prefix works, but it's a lot of code, probably inefficient, and "non-Pythonic".

Does anybody have a nice solution?

Marchant answered 24/9, 2011 at 15:29 Comment(6)

The solution you describe is pretty decent. – Melonie 24/9, 2011 at 15:30

It isn't a lot of code to do, just a lot of code to make clear. – Hoey 24/9, 2011 at 15:33

@Melonie the issue was that the prefixes could be multiple characters, so it wouldn't be sufficient to check name[:2] – Cetane 24/9, 2011 at 15:37

@FooBah No, the second solution of using startswith etc. – Melonie 24/9, 2011 at 15:40

A for loop iterating over each prefix and then checking name.startswith(prefix) to finally slice the name according to the length of the prefix works

That sounds pretty Pythonic to me. That shouldn't be more than 5 or 10 lines of code. "Pythonic" doesn't mean it has to be done in 1 line. – Approximation 24/9, 2011 at 17:11

I know this is a really old question but what would you want to have happen if the name starts with multiple prefixes in the list, where each of the prefixes were different lengths? ex. name = "amazing", list = ['am', 'ama', 'amaz']. Should it remove 2, 3, or 4 characters? – Impercipient 29/9, 2014 at 2:39
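
The question does not say which prefix should win when several match. As a minimal sketch, assuming the longest matching prefix should be removed, sorting the prefixes by length before testing them gives that behaviour. The helper name strip_longest_prefix is made up for illustration:

```python
def strip_longest_prefix(name, prefixes):
    """Remove the longest prefix in `prefixes` that `name` starts with."""
    # Trying longer prefixes first makes the longest match win.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name  # no prefix matched, leave the name alone


print(strip_longest_prefix("amazing", ["am", "ama", "amaz"]))  # ing
```

If the shortest match should win instead, dropping reverse=True is enough.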

A bit hard to read, but this works:

name = name[len(next(filter(name.startswith, prefixes + [''])):]

Serpentiform answered 24/9, 2011 at 16:1 Comment(2)

Very nice, this even ignores unprefixed names. Perfect. – Marchant 26/9, 2011 at 12:7

For those more used to list comprehensions, this is equivalent to: name=name[len([prefix for prefix in prefixes+[''] if name.startswith(prefix)][0]):] – Bargeboard 11/9, 2012 at 11:40
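
For readers who find the one-liner hard to follow, the same idea can probably be written as a small helper. This is only a sketch: the name strip_prefix is hypothetical, and it uses the default argument of next() in place of appending '' to the prefix list.

```python
def strip_prefix(name, prefixes):
    """Return `name` without the first prefix in `prefixes` that it starts with."""
    # next() stops at the first matching prefix; "" is the default when
    # nothing matches, so unprefixed names come back unchanged.
    matched = next((p for p in prefixes if name.startswith(p)), "")
    return name[len(matched):]


print(strip_prefix("i_count", ["i_", "c_", "also_longer_"]))  # count
print(strip_prefix("count", ["i_", "c_", "also_longer_"]))    # count
```

Because the default is passed to next(), this also works when prefixes is a tuple or any other iterable.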

str.startswith(prefix[, start[, end]])

Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.

$ ipython
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: prefixes = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_")

In [2]: 'test'.startswith(prefixes)
Out[2]: False

In [3]: 'i_'.startswith(prefixes)
Out[3]: True

In [4]: 'd_a'.startswith(prefixes)
Out[4]: True

Twiddle answered 24/9, 2011 at 16:5 Comment(2)

I also need to remove the found prefix from the name in case it starts with one of the prefixes. Maybe the question was a little inaccurate, however I still like the fact that str.startswith also accepts a tuple. (unchecked) – Marchant 24/9, 2011 at 16:18

yes, because it accepts tuples it might be the cleanest implementation. – Twiddle 24/9, 2011 at 23:0
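
To also remove the prefix, the tuple form can serve as a quick first check. str.startswith does not say which prefix matched, so a second pass is still needed when there is a match. A rough sketch, where PREFIXES and strip_prefix_tuple are names made up for this example:

```python
PREFIXES = ("i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_")


def strip_prefix_tuple(name, prefixes=PREFIXES):
    # A single startswith call tests every prefix; unprefixed names exit here.
    if not name.startswith(prefixes):
        return name
    # startswith does not report which prefix matched, so look it up.
    matched = next(p for p in prefixes if name.startswith(p))
    return name[len(matched):]


print(strip_prefix_tuple("d_a"))   # a
print(strip_prefix_tuple("test"))  # test
```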

for prefix in prefixes:
    if name.startswith(prefix):
        name=name[len(prefix):]
        break

Selfhypnosis answered 24/9, 2011 at 15:41 Comment(3)

Except genexes don't leak the iterator name. – Hoey 24/9, 2011 at 15:45

@unutbu: The list is about 10 prefixes long. Thanks – Marchant 24/9, 2011 at 15:55

The first solution won't work, since only the last value of the iterator name is leaked. – Hoey 24/9, 2011 at 15:59

Regexes will likely give you the best speed:

import re

prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
re_prefixes = "|".join(re.escape(p) for p in prefixes)

# re.match only matches at the start of the string, so m.start() is always 0
m = re.match(re_prefixes, my_string)
if m:
    my_string = my_string[m.end():]

Bekah answered 24/9, 2011 at 16:49 Comment(1)

@JohnMachin Couldn't he just have done `re_prefixes = '^' + "|^".join(re.escape(p) for p in prefixes)`? Thanks. – Githens 25/9, 2019 at 4:32
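
On the anchoring question: as far as I can tell the '^' would be harmless but redundant here, because re.match only tries the pattern at the start of the string. It would matter with re.search. A small sketch of the difference (the shortened prefix list and the name anchored are just for illustration):

```python
import re

prefixes = ["i_", "c_", "also_longer_"]
pattern = re.compile("|".join(re.escape(p) for p in prefixes))

# re.match only tries the pattern at position 0, so no "^" is required.
print(pattern.match("i_name"))       # a match object
print(pattern.match("my_i_name"))    # None: "i_" is not at the start

# re.search scans the whole string, so there the anchor is needed.
print(pattern.search("my_i_name"))   # a match object for the inner "i_"
anchored = re.compile("^(?:%s)" % pattern.pattern)
print(anchored.search("my_i_name"))  # None
```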

If you define prefix to be the characters before an underscore, then you can check for

if name.partition("_")[0] in ["i", "c", "m", "l", "d", "t", "e", "b", "foo"] and name.partition("_")[1] == "_":
    name = name.partition("_")[2]

Cetane answered 24/9, 2011 at 15:34 Comment(1)

I'd use "_" in name as your second clause to avoid partitioning the string twice, and in fact I'd put that clause first to avoid partitioning the string at all if there's no underscore in it. But good thinking. – Killie 24/9, 2011 at 16:7
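
The suggestion in this comment could look roughly like the following. This is a sketch only, with KNOWN and strip_known_prefix as made-up names: it tests for the underscore first and partitions the string once.

```python
KNOWN = {"i", "c", "m", "l", "d", "t", "e", "b", "foo"}


def strip_known_prefix(name):
    # Cheap membership test first: no underscore means nothing to strip.
    if "_" not in name:
        return name
    head, _, tail = name.partition("_")  # partition only once
    return tail if head in KNOWN else name


print(strip_known_prefix("foo_bar"))  # bar
print(strip_known_prefix("x_bar"))    # x_bar
```

Using a set for the known prefixes keeps the membership test cheap even if the list grows.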

What about using filter?

prefs = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]
name = list(filter(lambda item: not any(item.startswith(prefix) for prefix in prefs), name))

Note that the comparison of each list item against the prefixes halts on the first match. This behaviour is guaranteed by the any function, which returns as soon as it finds a True value, e.g.:

def gen():
    print("yielding False")
    yield False
    print("yielding True")
    yield True
    print("yielding False again")
    yield False

>>> any(gen()) # last two lines of gen() are not performed
yielding False
yielding True
True

Or, using re.match instead of startswith:

import re
patt = '|'.join(["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"])
name = list(filter(lambda item: not re.match(patt, item), name))

Backfill answered 24/9, 2011 at 16:46 Comment(0)

Regex, tested:

import re

def make_multi_prefix_matcher(prefixes):
    # escape each prefix so characters like "|" and "." are matched literally
    regex_text = "|".join(re.escape(p) for p in prefixes)
    print(repr(regex_text))
    return re.compile(regex_text).match

pfxs = "x ya foobar foo a|b z.".split()
names = "xenon yadda yeti food foob foobarre foo a|b a b z.yx zebra".split()

matcher = make_multi_prefix_matcher(pfxs)
for name in names:
    m = matcher(name)
    if not m:
        print(repr(name), "no match")
        continue
    n = m.end()  # length of the matched prefix
    print(repr(name), n, repr(name[n:]))

Output:

'x|ya|foobar|foo|a\\|b|z\\.'
'xenon' 1 'enon'
'yadda' 2 'dda'
'yeti' no match
'food' 3 'd'
'foob' 3 'b'
'foobarre' 6 're'
'foo' 3 ''
'a|b' 3 ''
'a' no match
'b' no match
'z.yx' 2 'yx'
'zebra' no match

Staging answered 24/9, 2011 at 22:40 Comment(1)

Nice complete solution and I appreciate the escaping and testing! I'm sure this regex based approach would run faster than list comprehensions etc for any sizeable amount of data, with a fairly long list of prefixes. – Skylar 8/4, 2013 at 16:49
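
Whether the regex really is faster probably depends on the data and the Python version, so it seems worth measuring. A hedged sketch of how that could be done with timeit follows. The sample names and the two helper functions are invented for this example, and no particular result is implied.

```python
import re
import timeit

prefixes = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_", "also_longer_"]
names = ["i_alpha", "beta", "also_longer_gamma", "t_delta", "epsilon"] * 1000

matcher = re.compile("|".join(re.escape(p) for p in prefixes)).match


def strip_with_regex(name):
    m = matcher(name)
    return name[m.end():] if m else name


def strip_with_loop(name):
    for p in prefixes:
        if name.startswith(p):
            return name[len(p):]
    return name


# Both approaches must agree before timing them is meaningful.
assert [strip_with_regex(n) for n in names] == [strip_with_loop(n) for n in names]

for func in (strip_with_regex, strip_with_loop):
    seconds = timeit.timeit(lambda: [func(n) for n in names], number=20)
    print(func.__name__, round(seconds, 4))
```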

When it comes to search and efficiency, always think of indexing techniques to improve your algorithms. If you have a long list of prefixes, you can build an in-memory index by simply indexing the prefixes by their first character in a dict.

This solution is only worthwhile if you have a long list of prefixes and performance becomes an issue.

pref = ["i_", "c_", "m_", "l_", "d_", "t_", "e_", "b_"]

# Index the prefixes in a dict keyed by first character. Do this only once.
d = {}
for x in pref:
    d.setdefault(x[0], []).append(x)

name = "c_abcdf"

# Look up in d so that only prefixes sharing the name's first character are checked.
# name[:1] (rather than name[0]) avoids an IndexError on an empty name.
result = [x for x in d.get(name[:1], []) if name.startswith(x)]
print(result)

Janessa answered 24/9, 2011 at 15:56 Comment(0)
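
The snippet above only finds the matching prefixes. A possible extension that also strips the prefix is sketched here, under the assumption that the first match in list order should win. The names build_index and strip_indexed are hypothetical.

```python
from collections import defaultdict


def build_index(prefixes):
    """Group prefixes by their first character (build once, reuse)."""
    index = defaultdict(list)
    for p in prefixes:
        if p:  # skip empty prefixes; they have no first character
            index[p[0]].append(p)
    return index


def strip_indexed(name, index):
    # name[:1] is "" for an empty name, which avoids an IndexError.
    for p in index.get(name[:1], ()):
        if name.startswith(p):
            return name[len(p):]
    return name


index = build_index(["i_", "c_", "cat_", "m_"])
print(strip_indexed("c_abcdf", index))  # abcdf
print(strip_indexed("cat_x", index))    # x
```

index.get is used instead of index[...] so that a lookup for an unknown first character does not add an empty list to the defaultdict.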

This edits the list on the fly, removing prefixes. The break skips the rest of the prefixes once one is found for a particular item.

items = ['this', 'that', 'i_blah', 'joe_cool', 'what_this']
prefixes = ['i_', 'c_', 'a_', 'joe_', 'mark_']

for i,item in enumerate(items):
    for p in prefixes:
        if item.startswith(p):
            items[i] = item[len(p):]
            break

print(items)

Output

['this', 'that', 'blah', 'cool', 'what_this']

Olmstead answered 24/9, 2011 at 17:23 Comment(0)

Could use a simple regex.

import re
prefixes = ("i_", "c_", "longer_")
name = re.sub(r'^(%s)' % '|'.join(map(re.escape, prefixes)), '', name)

Or if anything preceding an underscore is a valid prefix:

name.split('_', 1)[-1]

This removes everything up to and including the first underscore. Names without an underscore are left unchanged.

Lythraceous answered 4/12, 2018 at 14:7 Comment(0)
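
A quick way to see what the split variant does on a few inputs. The sample names and the helper after_first_underscore are made up for illustration. Note that it treats any text before the first underscore as a prefix, including ordinary names such as max_value.

```python
def after_first_underscore(name):
    # maxsplit=1 splits at the first underscore only; [-1] is the whole
    # name when there is no underscore at all.
    return name.split("_", 1)[-1]


for sample in ("i_count", "count", "longer_a_b", "max_value"):
    print(repr(sample), "->", repr(after_first_underscore(sample)))
```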

import re

def make_multi_prefix_replacer(prefixes):
    if isinstance(prefixes, str):
        prefixes = prefixes.split()
    # longest prefixes first, so "foobar" is tried before "foo";
    # sorted() leaves the caller's list untouched
    prefixes = sorted(prefixes, key=len, reverse=True)
    pat = r'\b(%s)' % "|".join(map(re.escape, prefixes))
    print('regex pattern :', repr(pat), '\n')
    def suber(x, reg=re.compile(pat)):
        return reg.sub('', x)
    return suber


pfxs = "x ya foobar yaku foo a|b z."
replacer = make_multi_prefix_replacer(pfxs)

names = "xenon yadda yeti yakute food foob foobarre foo a|b a b z.yx zebra".split()
for name in names:
    print(repr(name), '\n', repr(replacer(name)), '\n')

# \b also matches at every word boundary inside a longer string
ss = 'the yakute xenon is a|bcdf in the barfoobaratu foobarii'
print('\n', repr(ss), '\n', repr(replacer(ss)), '\n')

Rescind answered 25/9, 2011 at 1:50 Comment(0)
