Finding multiple substrings in a string without iterating over it multiple times
Asked Answered
T

3

9

I need to find if items from a list appear in a string, and then add the items to a different list. This code works:

data =[]
line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...] 
for i in _legal:
    if i in line:
        data.append(i)

However, the code iterates over line (which could be long) multiple times- as many times as there are item in _legal (which could be a lot). That's too slow for me, and I'm searching for a way to do it faster. line doesn't have any specific format, so using .split() couldn't work, as far as I know. Edit: changed line so that it better represents the problems.

Tavel answered 5/10, 2020 at 19:9 Comment(13)
Do you have duplicate values in the list?Extrados
are the items of equal length?Unfruitful
@Extrados No, all values in the list _legal are different.Tavel
probably a stretch, but a modified version of KMP might be able to achieve this in a single pass, especially if strings in _legal have common prefixesMiddlings
Great. Do you want to much entire words, or they are random text in the list?Extrados
@Jean-FrançoisFabre No, their lengths vary.Tavel
For example thing1 will be found in the line as thing1 or abcthing1abc etc?Extrados
Is it a theoretical question (i.e. you're after big-O complexity) or is it a practical problem?Transvalue
@Extrados yes.Tavel
@Transvalue practical problemTavel
@Transvalue your comment implies that Big-O complexity is not important in practical problems, which is painfully wrongMiddlings
@Middlings best asymptotic complexity ignores constants and and considers only asymptotic case. In many practical situations, best big-O solution is suboptimal. Seeing people not understanding it is also painful sometimes.Transvalue
You are looking for aho-corasick algorithm. en.m.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithmRickirickie
C
4

One way I could think of to improve is:

  • Get all unique lengths of the words in _legal
  • Build a dictionary of words from line of those particular lengths using a sliding window technique. The complexity should be O( len(line)*num_of_unique_lengths ), this should be better than brute force.
  • Now look for each thing in the dictionary in O(1).

Code:

line = 'thing1 thing2 456 xxualt542l lthin. dfjladjfj lauthina '
_legal = ['thing1', 'thing2', 'thing3', 'thing4', 't5', '5', 'fj la']
ul = {len(i) for i in _legal}
s=set()
for l in ul:
    s = s.union({line[i:i+l] for i in range(len(line)-l)})
print(s.intersection(set(_legal)))

Output:

{'thing1', 'fj la', 'thing2', 't5', '5'}
Chalky answered 5/10, 2020 at 19:33 Comment(2)
I added an answer based on your thinking. Feel free to copy it and let me know so that i will remove mineExtrados
@Extrados Thanks. I thought I will add my own code and test things like words with spaces and added.Chalky
A
1

One approach is to build a very simple regex pattern, and use re.findall() to find/extract any matched words in the string.

import re

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**Thing1**aoufgyafkugafkjhafkjhflahfklh**Thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4']

exp = re.compile(r'|'.join(_legal), re.IGNORECASE)
exp.findall(line)

>>> ['Thing1', 'Thing2']
Allembracing answered 5/10, 2020 at 19:32 Comment(0)
E
0

The following should work quite efficiently. It implements the suggestion from @SomeDude given above

lens=set([len(i) for i in _legal])
d={}
for k in lens:
    d[k]=[line[i:i+k] for i in range(len(line)-k)]
s=set(sum(d.values(), []))
result=list(s.intersection(set(_legal)))

for the following data (i changed "line" a bit as it returned an empty list due to uppecase in Thing1 and Thing2)

line = 'akhgvfalfhda.dhgfa.lidhfalihflaih**thing1**aoufgyafkugafkjhafkjhflahfklh**thing2**dlfkhalfhafli...'
_legal = ['thing1', 'thing2', 'thing3', 'thing4',...]

Output is:

print(result)

['thing2', 'thing1']

Explanation: We saved all possible lengths of the words in substrings. For these lengths, we created all possible substrings in text (set s). Finally we found common items in s and substrings, which is the answer to the question.

Extrados answered 5/10, 2020 at 20:4 Comment(6)
Quick observation, the line value in OP shows Thing1 and Thing2, in title case.Allembracing
yes, i changed it a little, as it would return an empty list otherwiseExtrados
Not being funny, I believe hacking source data to match a desired output is slightly frowned upon ...Allembracing
Are you really in the philosophy of the question?Extrados
Implementing somebody else's answer as code without crediting them is also frowned upon... In fact, I'd go so far as to say it's pretty dishonestMathers
@Pranav Hosangadi you are right, i was inspired from SomeDude's answer. I added a credit. Please give any upvote to his answer instead of mine. CheersExtrados

© 2022 - 2024 — McMap. All rights reserved.