Levenshtein distance with bound/limit
Asked Answered
S

1

6

I have found some Python implementations of the Levenshtein distance.

I am wondering though how these algorithms can be efficiently modified so that they break if the Levenshtein distance is greater than n (e.g. 3) instead of running until the end?

So essentially I do not want to let the algorithm run for too long to calculate the final distance if I simply want to know if the distance is greater than a threshold or not.

I have found some relevant posts here:

  1. Modifying Levenshtein Distance algorithm to not calculate all distances
  2. Levenstein distance limit
  3. Most efficient way to calculate Levenshtein distance
  4. Levenshtein Distance Algorithm better than O(n*m)?

but still, I do not see any Python code which does what I describe above (which is more or less what these posts describe too).

P.S.: The solution provided by @amirouche below is based on the fastest implementation that I have tested with some benchmarking (from here: https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python, https://mcmap.net/q/40678/-edit-distance-in-python) and its bounded version is the fastest one of its kind from my tests (without excluding that there may be even faster ones).

Slunk answered 10/1, 2020 at 18:15 Comment(2)
I think the problem with using Levenshtein Distance as an upper-bound is a chicken / egg problem where you have to do a certain amount of calculation before you know to bail. It's certainly possible but most of the algorithm implementations don't have a running tally of the distance because they use the divide and conquer approach. I haven't tried the matrix algorithms but maybe those have an easy way of keeping a running distance tally?Shanda
@AlexW, I certainly see your point about the chicken / egg problem but I guess that it depends on the implementation too. The issue may be also that exactly the fastest LD implementations may not be modifiable to accommodate the bound or even they are then they may be actually quite slower (than for example modifying slower implementations). But anyway I may be wrong.Slunk
N
10

As described in Levenstein distance limit, you can add a test over the row that is computed to return early:

def levenshtein(s1, s2, maximum):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        if all((x >= maximum for x in distances_)):
            return False
        distances = distances_
    return distances[-1]
Noxious answered 7/2, 2020 at 12:43 Comment(17)
Thank you for your answer. However, I receive this error raise Exception("Distance is too big") Exception: Distance is too big for these strings: str1 = 'njsd is jnj dfbd', str2 = 'It is cold wfw wf w efwe'. Am I doing something wrong?Slunk
@DanD., as I describe at my post (I hope clearly ;) ) I want to just check if two strings have distance less than an upper bound. For example, if their distance is less than 4 (so not more than 3 characters difference in terms of replacement, deletion, insertion). (P.S. The painting at your photo is one of my very favourites too).Slunk
@Penseur the levenshtein distance of the strings you submit is bigger than 2. That is what is expected by the default argument called at_most. Do you want to make the limitation optional?Noxious
@amirouche, exactly it is bigger than 2 so the algorithm can break when it knows than it is more than 2 (or any threshold number that I will specify) instead of running until the end (or raising any exception). I am not sure that I totally understand your code (and it is not primarily your fault - I did not study it thoroughly) - if we replace the exception with return False then it would work as I want?Slunk
yes, you can return False instead of raising an exception. The code is adapted from the Wikipedia page you linked, the currently 6th version at en.wikibooks.org/wiki/Algorithm_Implementation/Strings/…Noxious
ok I did and tested it a bit and it works pretty well. In comparison with some other implementations, It is at least two times quicker with quite long strings with big distance although a bit worse with short strings with smaller distance (or if I remove the threshold and leave to run until the end). Therefore, overall, it looks quite good. Have you tested it a bit more extensively that it actually returns the right response/distance (or it may confuse things in some more idionsyncratic cases where eg it has too do too many deletions etc)?Slunk
By the way, because I am testing it a bit more now. In terms of the accuracy it seems to return the right responses (although I have not tested obviously in that numerous cases). In terms of the speed, it is quite slower than other implementations if the distance is below the upper-bound/threshold and hence the algorithm has to run until the end.Slunk
In this sense, perhaps it would be even more optimal to put this break if possible at the faster Levenstein implementation which is this from my investigation thus far: https://mcmap.net/q/40678/-edit-distance-in-python.Slunk
You will need to benchmark it. Last time I tried, I found the above implementation to be fasteer than the others (in terms of memory and cpu)Noxious
Sure, this is what I did not to a certain extent with comparing with other 6 implementations. Your implementation is as fast as most of the (fast) others except for the one which I mention above (https://mcmap.net/q/40678/-edit-distance-in-python) which is quite faster especially for long strings which are similar to each other.Slunk
I changed the implementation to use the one you linked, also the argument for settings the limit is called maximum. Maybe limit will be better!Noxious
Hey, thank you, I saw that you changed it - I am going to check it today. :)Slunk
So, this implementation seems to be up to 4 times faster (and on average about 2-3 times faster) than your previous implementation of the bounded Levenstein Distance. The one question (before giving you the +50 :0 ) is how accurate this is. Specifically, does this line if all((x >= maximum for x in distances_)) break every time at the right point or it may break even though the LD may be higher than the limit provided etc? I saw your link too about Levenstein distance limit but I just want to be sure.Slunk
It will break once it is not possible that LD will be smaller than the maximum because all paths lead to a distance of at least maximum.Noxious
Ok, I see and I think that you are right. Apologies for my trivial questions the last minutes but I am not so familiar with the algorithm as you are. (To be honest, I am also quite surprised that nobody has posted another answer thus far - probably because not many people have spent much time on understanding the algorithm etc?)Slunk
If no-one pops-up the next hours and say that there is something much better then I am going to tick your answer as correct ;)Slunk
I did some testing and the following use cases with the same edit distance returned different results levenshtein("ab","",2) == False levenshtein("ab", "ba",2) == 2, perhaps the final judgment condition should not contain an equal sign: all((x > maximum for x in distances_)).Sr

© 2022 - 2024 — McMap. All rights reserved.