Levenshtein distance with bound/limit

Thank you for your answer. However, I receive this error: raise Exception("Distance is too big") Exception: Distance is too big, for these strings: str1 = 'njsd is jnj dfbd', str2 = 'It is cold wfw wf w efwe'. Am I doing something wrong? – Slunk 7/2, 2020 at 14:22

@DanD., as I describe in my post (I hope clearly ;) ) I just want to check whether two strings have a distance less than an upper bound. For example, whether their distance is less than 4 (so a difference of no more than 3 characters in terms of replacement, deletion, insertion). (P.S. The painting in your photo is one of my very favourites too). – Slunk 7/2, 2020 at 14:38

@Penseur the Levenshtein distance of the strings you submitted is bigger than 2, which is the limit set by the default value of the argument called at_most. Do you want to make the limit optional? – Noxious 7/2, 2020 at 14:41

@amirouche, exactly: it is bigger than 2, so the algorithm can break as soon as it knows that the distance is more than 2 (or any threshold that I specify) instead of running until the end (or raising an exception). I am not sure that I totally understand your code (and that is not primarily your fault - I have not studied it thoroughly). If we replace the exception with return False, would it then work as I want? – Slunk 7/2, 2020 at 14:44

Yes, you can return False instead of raising an exception. The code is adapted from the Wikipedia page you linked, currently the 6th version at en.wikibooks.org/wiki/Algorithm_Implementation/Strings/… – Noxious 7/2, 2020 at 14:47

OK, I did that and tested it a bit, and it works pretty well. In comparison with some other implementations, it is at least two times quicker on quite long strings with a big distance, although a bit worse on short strings with a smaller distance (or if I remove the threshold and let it run until the end). Therefore, overall, it looks quite good. Have you tested a bit more extensively that it actually returns the right response/distance (or might it get confused in some more idiosyncratic cases where, e.g., it has to do too many deletions etc.)? – Slunk 7/2, 2020 at 14:52

By the way, I am testing it a bit more now. In terms of accuracy, it seems to return the right responses (although I obviously have not tested that many cases). In terms of speed, it is considerably slower than other implementations if the distance is below the upper bound/threshold and hence the algorithm has to run until the end. – Slunk 7/2, 2020 at 15:4

In this sense, perhaps it would be even better to put this break, if possible, into the fastest Levenshtein implementation, which from my investigation thus far is this one: https://mcmap.net/q/40678/-edit-distance-in-python. – Slunk 7/2, 2020 at 15:28

You will need to benchmark it. Last time I tried, I found the above implementation to be faster than the others (in terms of memory and CPU). – Noxious 7/2, 2020 at 15:57
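
One way to run such a benchmark is with the standard timeit module. The sketch below is an illustration only: the answer's code is not reproduced in this thread, so the levenshtein function here is an assumed two-row implementation with an optional maximum argument, and random_text and time_call are hypothetical helper names.

```python
import random
import string
import timeit


def levenshtein(s1, s2, maximum=None):
    """Two-row edit distance; with `maximum` set, return False on early exit.

    Assumed stand-in for the answer's code, which is not shown in the thread.
    """
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(
                    1 + min(distances[i1], distances[i1 + 1], distances_[-1])
                )
        # Early exit: every entry of the row has already reached the bound.
        if maximum is not None and all(x >= maximum for x in distances_):
            return False
        distances = distances_
    return distances[-1]


def random_text(length, rng):
    """Hypothetical helper: random lowercase string of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def time_call(fn, *args, number=3):
    """Best-of-three average seconds per call, measured with timeit.repeat."""
    return min(timeit.repeat(lambda: fn(*args), number=number, repeat=3)) / number


rng = random.Random(0)  # fixed seed so that runs are comparable
a, b = random_text(200, rng), random_text(200, rng)

print("unbounded :", time_call(levenshtein, a, b))
print("maximum=4 :", time_call(levenshtein, a, b, 4))
```

The absolute numbers depend on the machine and on the inputs, so it is worth timing several string lengths and similarity levels, as the comments above suggest.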

Sure, this is what I did, to a certain extent, by comparing it with 6 other implementations. Your implementation is as fast as most of the (fast) others except for the one which I mention above (https://mcmap.net/q/40678/-edit-distance-in-python), which is considerably faster, especially for long strings which are similar to each other. – Slunk 7/2, 2020 at 16:7

I changed the implementation to use the one you linked; also, the argument for setting the limit is now called maximum. Maybe limit would be better! – Noxious 7/2, 2020 at 17:39

Hey, thank you, I saw that you changed it - I am going to check it today. :) – Slunk 10/2, 2020 at 10:43

So, this implementation seems to be up to 4 times faster (and on average about 2-3 times faster) than your previous implementation of the bounded Levenshtein distance. The one question (before giving you the +50 :0 ) is how accurate this is. Specifically, does this line, if all((x >= maximum for x in distances_)), break every time at the right point, or might it break even though the LD may be higher than the limit provided, etc.? I saw your link about the Levenshtein distance limit too, but I just want to be sure. – Slunk 10/2, 2020 at 10:58

It will break once it is not possible that LD will be smaller than the maximum because all paths lead to a distance of at least maximum. – Noxious 10/2, 2020 at 11:20
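
The reasoning above rests on the fact that the smallest entry of each row of the dynamic-programming table never decreases from one row to the next, and that the final distance is an entry of the last row. A minimal sketch that checks this on one example follows; levenshtein_rows is a hypothetical helper written for this illustration, not part of the answer's code.

```python
def levenshtein_rows(s1, s2):
    """Yield every row of the edit-distance table, one per prefix of s2."""
    row = list(range(len(s1) + 1))
    yield row
    for i2, c2 in enumerate(s2):
        new_row = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                # Matching characters: carry the diagonal value over.
                new_row.append(row[i1])
            else:
                # Substitution, deletion or insertion, whichever is cheapest.
                new_row.append(1 + min(row[i1], row[i1 + 1], new_row[-1]))
        row = new_row
        yield row


rows = list(levenshtein_rows("kitten", "sitting"))
minima = [min(r) for r in rows]
distance = rows[-1][-1]

# Row minima never decrease, so once a whole row is >= maximum,
# every later row (and therefore the final distance) is >= maximum too.
assert all(a <= b for a, b in zip(minima, minima[1:]))
assert distance >= max(minima)
print(distance)  # 3
```

Each new entry is either copied from the previous row or is 1 plus the minimum of three neighbouring entries, so it can never fall below the previous row's minimum.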

OK, I see, and I think that you are right. Apologies for my trivial questions over the last few minutes, but I am not as familiar with the algorithm as you are. (To be honest, I am also quite surprised that nobody has posted another answer thus far - probably because not many people have spent much time on understanding the algorithm etc.?) – Slunk 10/2, 2020 at 11:58

If no one pops up in the next few hours and says that there is something much better, then I am going to tick your answer as correct ;) – Slunk 10/2, 2020 at 12:21

I did some testing, and the following two cases, which have the same edit distance, returned different results: levenshtein("ab","",2) == False and levenshtein("ab", "ba",2) == 2. Perhaps the final judgment condition should not contain an equals sign: all((x > maximum for x in distances_)). – Sr 28/8, 2023 at 8:19
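
The two cases can be reproduced with a sketch like the following. It assumes the answer's function has the shape of the linked two-row implementation with the early-exit line quoted earlier in the thread; bounded_levenshtein and its give_up parameter are hypothetical names, introduced only so that both comparisons can be tried side by side.

```python
import operator


def bounded_levenshtein(s1, s2, maximum, give_up=operator.ge):
    """Return the edit distance, or False if the early-exit test fires.

    give_up is the comparison used in the early-exit test (hypothetical
    parameter): operator.ge mirrors `x >= maximum`, operator.gt mirrors
    the proposed `x > maximum`.
    """
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(
                    1 + min(distances[i1], distances[i1 + 1], distances_[-1])
                )
        if all(give_up(x, maximum) for x in distances_):
            return False
        distances = distances_
    return distances[-1]


# With >=, two pairs at distance 2 are treated differently.
print(bounded_levenshtein("ab", "", 2))    # False
print(bounded_levenshtein("ab", "ba", 2))  # 2

# With >, both pairs return 2, and larger distances can still exit early.
print(bounded_levenshtein("ab", "", 2, give_up=operator.gt))    # 2
print(bounded_levenshtein("ab", "ba", 2, give_up=operator.gt))  # 2
print(bounded_levenshtein("abc", "", 2, give_up=operator.gt))   # False
```

Note that in Python 0 == False is True, so a result of 0 for identical strings is indistinguishable from False under ==; testing with `is False` avoids that ambiguity.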
