Given an input string, I would like to find the set of all other strings that are within Levenshtein distance 2. For example, if the input string is "aaa" and the alphabet is ['a', 'b'] then we have:
{'baaa', 'aab', 'a', 'aaaaa', 'baab', 'abbaa', 'abaa', 'aaabb', 'abb', 'aaab', 'ababa', 'aa', 'aabb', 'baba', 'baaab', 'aabab', 'aaaab', 'abaaa', 'aabaa', 'bbaaa', 'abaab', 'aaaa', 'baaaa', 'bab', 'bba', 'aba', 'aaaba', 'ba', 'aabba', 'abab', 'baa', 'aaa', 'bbaa', 'baaba', 'aaba', 'abba', 'ab', 'babaa'}
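For reference, that neighbourhood can be checked by brute force with a textbook dynamic-programming Levenshtein distance (an illustrative sketch, not part of the question's code; the `lev` helper and variable names are mine):

```python
from itertools import product

def lev(a, b):
    # Textbook dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Every string over ['a', 'b'] of length 1..5 within distance 2 of "aaa".
# (Length 0 and lengths > 5 are necessarily at distance >= 3.)
neighbourhood = {
    "".join(p)
    for n in range(1, 6)
    for p in product("ab", repeat=n)
    if lev("aaa", "".join(p)) <= 2
}
print(len(neighbourhood))  # 38, matching the set above
```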
I have code to do this, but it is inefficient. Here it is, using all printable ASCII characters as the alphabet and the input string "aaaaaaaaaa":
import string

input_string = "a" * 10

f = (
    lambda input_string, dist=2, i=0: dist * input_string[i - 1 :]
    and {
        k[:i] + char + k[i + w :]
        for k in f(input_string, dist - 1)
        for char in [""] + list(string.printable)
        for w in (0, 1)
    }
    | f(input_string, dist, i + 1)
    or {input_string}
)

f(input_string)
I need to run this many times with different input strings. The code currently takes 2.9 s on my desktop to make the 1631129 different strings. Can anyone see how to make it much faster?
League table so far (using printable as the alphabet):
My code: 2.98 s ± 146 ms
Alain T.'s code: 1.58 s ± 60.7 ms. Current winner.
ddg's code: 1.85 s ± 28.4 ms
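For comparison, the standard way to build an edit-distance-2 neighbourhood (the approach popularised by Norvig's spelling corrector; this is my own sketch, not Alain T.'s or ddg's actual code) generates deletions, substitutions and insertions directly and hoists the alphabet out of the function:

```python
import string

ALPHABET = string.printable  # built once, outside the function

def edits1(s, alphabet=ALPHABET):
    """All strings at Levenshtein distance <= 1 from s (including s)."""
    splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right
                     for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return deletes | substitutions | inserts | {s}

def edits2(s, alphabet=ALPHABET):
    """All strings at Levenshtein distance <= 2 from s (including s)."""
    return {e2 for e1 in edits1(s, alphabet) for e2 in edits1(e1, alphabet)}
```

On the opening example, `edits2("aaa", "ab")` should reproduce the 38-string set shown at the top, and unlike the lambda it handles the empty string (`edits2("", "ab")` gives all strings of length at most 2).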
Rewrite it with a normal `for` loop instead of the list/dictionary comprehensions – then it would be more readable for us, so we could search for places to optimize. At the moment I see only `[""] + list(string.printable)`, which you create many times but could create once, before the function. – Broussard

`_int * _str and _set1 | _set2 or _set3` means `((_int * _str) and (_set1 | _set2)) or _set3`, i.e. `(_set1 | _set2) if (_int * _str) else _set3`. Even then it's hard to read, and wrong (`f('')` gives `{''}`). – Grocery
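Grocery's precedence point can be checked directly (a minimal sketch; the placeholder names are mine):

```python
# Python precedence: `*` and `|` bind tighter than `and`, which binds
# tighter than `or`, so the lambda's body parses as
#   ((dist * input_string[i - 1:]) and ({...} | f(...))) or {input_string}
empty = ""                        # falsy, like dist * input_string[...] on ""
set1, set2, set3 = {1}, {2}, {3}

assert (empty and set1 | set2 or set3) == set3    # falsy guard -> the `or` arm
assert (5 and set1 | set2 or set3) == {1, 2}      # truthy guard -> set1 | set2
```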