Can this code to find the neighborhood of a string be sped up?

Given an input string, I would like to find the set of all other strings that are within Levenshtein distance 2. For example, if the input string is "aaa" and the alphabet is ['a', 'b'] then we have:

{'baaa', 'aab', 'a', 'aaaaa', 'baab', 'abbaa', 'abaa', 'aaabb', 'abb', 'aaab', 'ababa', 'aa', 'aabb', 'baba', 'baaab', 'aabab', 'aaaab', 'abaaa', 'aabaa', 'bbaaa', 'abaab', 'aaaa', 'baaaa', 'bab', 'bba', 'aba', 'aaaba', 'ba', 'aabba', 'abab', 'baa', 'aaa', 'bbaa', 'baaba', 'aaba', 'abba', 'ab', 'babaa'}
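
(For a tiny alphabet like this, the expected set can be cross-checked by brute force. A sketch; the helper names are illustrative, not part of my real code:)

from itertools import product

def levenshtein(a, b):
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def brute_neighborhood(s, alphabet, dist=2):
    # Only feasible for tiny alphabets: enumerate every string up to length len(s) + dist.
    return {"".join(t)
            for n in range(len(s) + dist + 1)
            for t in product(alphabet, repeat=n)
            if levenshtein(s, "".join(t)) <= dist}

assert len(brute_neighborhood("aaa", "ab")) == 38  # matches the 38-element set above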

I have code to do this, but it is inefficient. Here it is, using all printable ASCII characters as the alphabet and the input string "aaaaaaaaaa".

import string

input_string = "a" * 10
f = (
    # dist * input_string[i - 1:] is truthy iff dist > 0 and i <= len(input_string)
    lambda input_string, dist=2, i=0: dist * input_string[i - 1 :]
    and {
        k[:i] + char + k[i + w:]  # w=0 keeps k[i] (insert char); w=1 drops it (replace, or delete when char is "")
        for k in f(input_string, dist - 1)
        for char in [""] + list(string.printable)
        for w in (0, 1)
    }
    | f(input_string, dist, i + 1)  # plus edits at later positions
    or {input_string}  # base case: dist == 0 or i past the end
)
f(input_string)

I need to run this many times with different input strings. The code currently takes 2.9 s on my desktop to produce the 1631129 distinct strings. Can anyone see how to make it much faster?


League table so far (using printable as the alphabet):

My code: 2.98 s ± 146 ms

Alain T.'s code: 1.58 s ± 60.7 ms. Current winner.

ddg's code: 1.85 s ± 28.4 ms
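
(These figures are mean ± standard deviation, presumably from IPython's %timeit. For reference, a plain-timeit harness along the following lines produces comparable numbers; the repeat count of 7 is my choice, matching %timeit's default:)

from statistics import mean, stdev
from timeit import repeat

# Time f on the 10-character input, one call per measurement, 7 measurements.
runs = repeat(lambda: f("a" * 10), repeat=7, number=1)
print(f"{mean(runs):.2f} s ± {stdev(runs) * 1000:.0f} ms per loop")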

Nidanidaros asked 16/12, 2020 at 20:13 Comment(3)
Maybe first you could write it as a normal function - not a lambda - and with normal for-loops instead of the comprehensions; then it would be more readable for us, so we could search for places to optimize. Also, at the moment I see [""] + list(string.printable), which you create many times but could create once, before the function.Broussard
You should write a proper function. I can't understand your formula unless I refresh my knowledge of Python operator precedence. _int * _str and _set1 | _set2 or _set3 means (_int * _str) and (_set1 | _set2) or _set3, i.e. (_set1 | _set2) if (_int and _str) else _set3 (see the unrolled sketch after these comments). Even then it's hard to read, and it's wrong: f('') gives {''}.Grocery
I think you might want to take a look at this answer. Even without going into cython, it seems to generate results very quickly. Hope it helps!Arvid
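
For reference, under that reading the lambda unrolls to the following named function (the constant name and the comments are mine; the behavior is unchanged, including the f('') quirk noted above):

import string

CHARS = [""] + list(string.printable)  # "" means "insert nothing"

def f(s, dist=2, i=0):
    # dist * s[i - 1:] is truthy iff dist > 0 and i <= len(s).
    if dist == 0 or not s[i - 1:]:
        return {s}
    edits = {
        k[:i] + char + k[i + w:]   # w=0 keeps k[i] (insertion); w=1 drops it (replacement, or deletion when char is "")
        for k in f(s, dist - 1)    # everything within dist - 1 edits
        for char in CHARS
        for w in (0, 1)
    }
    return edits | f(s, dist, i + 1)  # plus edits at later positions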

Using iterators (to avoid loading everything in memory) I can get down to 0.17 seconds. With more context about your use case (why do you need all of them? for a spellchecker?) there are likely alternative approaches that avoid having to enumerate all the possibilities.

from string import ascii_lowercase
from itertools import product
from typing import Iterator

def validate(start: str):
    assert set(start) <= set(ascii_lowercase)

def iter_deletion(string: str) -> Iterator[str]:
    # Every string obtained by deleting one character.
    for i in range(len(string)):
        yield string[:i] + string[i+1:]

def iter_insertion(string: str) -> Iterator[str]:
    # Every string obtained by inserting one alphabet character.
    for i, c in product(range(len(string) + 1), ascii_lowercase):
        yield string[:i] + c + string[i:]

def iter_replacement(string: str) -> Iterator[str]:
    # Every string obtained by replacing one character.
    for i, c in product(range(len(string)), ascii_lowercase):
        yield string[:i] + c + string[i+1:]

def iter_steps(string):
    yield from iter_deletion(string)
    yield from iter_insertion(string)
    yield from iter_replacement(string)

def view_all(string):
    validate(string)
    for d1 in iter_steps(string):
        yield from iter_steps(d1)

from timeit import timeit
timeit(lambda: set(view_all('aaaaaaaaaa')), number=1)
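
(Note that the two-step walk yields many duplicates, and it also reaches the distance-0 and distance-1 strings, since replacing a character with itself reproduces the input; that's why the timing line materializes the generator with set().)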
Neolith answered 16/12, 2020 at 22:5 Comment(4)
Once I add that in, my code is 6s total, so slower than the OP's.Neolith
Have you used a much smaller alphabet size? What timing do you get if you use the same one as the OP?Nidanidaros
Any idea how to speed it up?Nidanidaros
How are you using the results? What do you need them for?Neolith

I don't think you'll be able to get significant speed improvements generating the whole set of possible edits.

You could shave 35% to 50% off the execution time by avoiding the re-expansion of strings that produce the same result after the first edit. This would likely give a more significant difference at a higher edit distance. The level of improvement will also depend on the actual strings/words (which may not be composed of a single repeated letter).

In any case, here's an optimized version of the function:

import string
from collections import deque

def makeEdits(S, count=1, charSet=None):
    if charSet is None: charSet = string.printable
    result = {S}
    expand = deque(result)
    lastEdit = False

    def addEdit(s):
        if s in result: return
        result.add(s)
        if not lastEdit: expand.append(s)

    for edit in range(count):
        editing = expand.copy()
        expand.clear()
        lastEdit = edit == count - 1
        for eString in editing:
            for i, char in enumerate(eString):
                left, right = eString[:i], eString[i+1:]
                addEdit(left + right)                       # deletion
                for newChar in charSet:
                    addEdit(left + newChar + char + right)  # insertion before
                    addEdit(left + newChar + right)         # replacement
            for newChar in charSet:
                addEdit(eString + newChar)                  # insertion at end
    return result

and performance testing on my laptop:

from timeit import timeit
count = 1
input_string = "a" * 10

print('makeEdits()...')
print(len(makeEdits(input_string,2)))
t = timeit(lambda:makeEdits(input_string,2),number=count)
print(t)

print('f()...')
print(len(f(input_string)))
t = timeit(lambda:f(input_string),number=count)
print(t)

...

makeEdits()...
1631129
2.0302121050000004
f()...
1631129
3.145120027999999

Note that a smaller character set, such as string.ascii_letters, will improve performance considerably (by producing fewer strings):

timeit(lambda:makeEdits(input_string,2,string.ascii_letters),number=count)

# 0.48669491199999015

len(makeEdits(input_string,2,string.ascii_letters)) # 433913

timeit(lambda:makeEdits(input_string,2,string.ascii_lowercase),number=count)

# 0.10699477299999671

len(makeEdits(input_string,2,string.ascii_lowercase)) # 104805

If you're making a spell checker, it would be reasonable to convert everything to lowercase before processing and to treat all special characters as a single character that you include in your character set. This would let your program get the performance boost of a smaller character set (30 times faster in this case); you would only need some preprocessing of the strings, and perhaps some adjustment of the results afterward.
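
A minimal sketch of that preprocessing, assuming '#' as the stand-in character (both the placeholder and the function name are illustrative, not from the answer above):

import string

def preprocess(word, placeholder="#"):
    # Letters pass through lowercased; everything else collapses to one symbol,
    # keeping the edit alphabet at 27 characters instead of printable's 100.
    return "".join(c.lower() if c.isalpha() else placeholder for c in word)

edit_alphabet = string.ascii_lowercase + "#"
# e.g. makeEdits(preprocess("Can't"), 2, edit_alphabet)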

Sponge answered 19/12, 2020 at 18:16 Comment(8)
Maybe Cython is the answer?Nidanidaros
I doubt Cython would make it much faster. I think most of the time is spent doing string manipulations and set operations, which are already implemented in low-level code. For example, just creating a set with 1.6M entries takes a tenth of a second (timeit(lambda:set(range(1631129)),number=1) = 0.12 sec), so you won't be able to go much faster than that.Sponge
@AlainT. Good point although 0.12s is a small fraction of the current running time.Nidanidaros
I get 76.7 ms for %timeit set(range(1631129)) and 1.58s for your code (with printable).Nidanidaros
It would seem your PC is 30% to 50% faster than my laptop. What I was trying to say though is that it will be difficult to improve the performance of low level operations that make up the algorithm (i.e. set manipulations, string manipulations) and which I believe is where most of the execution time is spent.Sponge
I can speed it up to about 50ms in Julia but I know nothing of Cython. Would it be helpful to share the Julia code?Nidanidaros
@Anush I'm not at my computer, but there could be a benefit to Cython here, to at least save the iteration time on all of those nested for loops, or to change data structures. I'll churn out some Cython code in the next day or two.Arvid
@Arvid That would be awesome.Nidanidaros
