Finding multiple substrings in a string without iterating over it multiple times

Asked 5/10, 2020 at 19:9 Answered 5/10, 2020 at 20:4

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...] 
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times- as many times as there are item in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() couldn't work, as far as I know. Edit: changed line so that it better represents the problems.

Tavel answered 5/10, 2020 at 19:9 Comment(13)

Do you have duplicate values in the list? – Extrados 5/10, 2020 at 19:14

are the items of equal length? – Unfruitful 5/10, 2020 at 19:16

@Extrados No, all values in the list _legal are different. – Tavel 5/10, 2020 at 19:16

probably a stretch, but a modified version of KMP might be able to achieve this in a single pass, especially if strings in _legal have common prefixes – Middlings 5/10, 2020 at 19:17

Great. Do you want to much entire words, or they are random text in the list? – Extrados 5/10, 2020 at 19:17

@Jean-FrançoisFabre No, their lengths vary. – Tavel 5/10, 2020 at 19:17

For example thing1 will be found in the line as thing1 or abcthing1abc etc? – Extrados 5/10, 2020 at 19:19

Is it a theoretical question (i.e. you're after big-O complexity) or is it a practical problem? – Transvalue 5/10, 2020 at 19:23

@Extrados yes. – Tavel 5/10, 2020 at 19:24

@Transvalue practical problem – Tavel 5/10, 2020 at 19:24

@Transvalue your comment implies that Big-O complexity is not important in practical problems, which is painfully wrong – Middlings 5/10, 2020 at 19:27

@Middlings best asymptotic complexity ignores constants and and considers only asymptotic case. In many practical situations, best big-O solution is suboptimal. Seeing people not understanding it is also painful sometimes. – Transvalue 5/10, 2020 at 19:30

You are looking for aho-corasick algorithm. en.m.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm – Rickirickie 5/10, 2020 at 20:49

One way I could think of to improve is:

Get all unique lengths of the words in _legal
Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
Now look for each thing in the dictionary in O(1).

Code:

line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
ul = {len(i) for i in _legal}
s=set()
for l in ul:
    s = s.union({line[i:i+l] for i in range(len(line)-l)})
print(s.intersection(set(_legal)))

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}

Chalky answered 5/10, 2020 at 19:33 Comment(2)

I added an answer based on your thinking. Feel free to copy it and let me know so that i will remove mine – Extrados 5/10, 2020 at 20:25

@Extrados Thanks. I thought I will add my own code and test things like words with spaces and added. – Chalky 5/10, 2020 at 20:55

One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.

import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
exp.findall(line)

>>> ['Thing1', 'Thing2']

Allembracing answered 5/10, 2020 at 19:32 Comment(0)

The following should work quite efficiently. It implements the suggestion from @SomeDude given above

lens=set([len(i) for i in _legal])
d={}
for k in lens:
    d[k]=[line[i:i+k] for i in range(len(line)-k)]
s=set(sum(d.values(), []))
result=list(s.intersection(set(_legal)))

for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]

Output is:

print(result)

['thing2', 'thing1']

Explanation: We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.

Extrados answered 5/10, 2020 at 20:4 Comment(6)

Quick observation, the line value in OP shows Thing1 and Thing2, in title case. – Allembracing 5/10, 2020 at 20:6

yes, i changed it a little, as it would return an empty list otherwise – Extrados 5/10, 2020 at 20:7

Not being funny, I believe hacking source data to match a desired output is slightly frowned upon ... – Allembracing 5/10, 2020 at 20:8

Are you really in the philosophy of the question? – Extrados 5/10, 2020 at 20:10

Implementing somebody else's answer as code without crediting them is also frowned upon... In fact, I'd go so far as to say it's pretty dishonest – Mathers 5/10, 2020 at 20:17

@Pranav Hosangadi you are right, i was inspired from SomeDude's answer. I added a credit. Please give any upvote to his answer instead of mine. Cheers – Extrados 5/10, 2020 at 20:22

Recommended topics

Hot tags