Percentage rank of matches using Levenshtein Distance matching
Asked Answered
P

8

30

I am trying to match a single search term against a dictionary of possible matches using a Levenshtein distance algorithm. The algorithm returns a distance expressed as number of operations required to convert the search string into the matched string. I want to present the results in ranked percentage list of top "N" (say 10) matches.

Since the search string can be longer or shorter than the individual dictionary strings, what would be an appropriate logic for expressing the distance as a percentage, which would qualitatively refelct how close "as a percentage" is each result to the query string, with 100% indicating an exact match.

I considered the following options:

Q = query string
M = matched string
PM = Percentage Match
Option 1. PMi = (1 - Lev_distance(Q, Mi)/Strlen(Q)) * 100
Option 2. PMi = (1 - Lev_distance(Q, Mi)/max(Strlen(Q), strlen(Mi))) * 100

Option 1 has the possibility of negative percentages in case the distance is greater than the search string length, where the match string is long. For example query "ABC" matched with "ABC Corp." would result in a negative match percent.

Option 2 does not appear to give a consistent percentage across a set of Mi, as each calculation would possibly use a different denominator and hence the resulting percentage values would not be normalized.

Only other way I can think of is to ditch the comparison of the lev_distance to either string lenghts, but instead present the comparitive distances of the top "N" matches as an inverse percentile rank (100-percentile-rank).

Any thoughts? Are there better approaches? I must be missing something as Levenshtein distance is probably the most common algorithm for fuzzy matches and this must be a very common problem.

Ptolemaic answered 1/5, 2012 at 22:46 Comment(2)
What about your 1st option but when the result is negative then simply return 0? PS: I've posted problem also here math.stackexchange.com/questions/1776860/…Autolycus
I didn't understand whats the problem with Option2 as I've implemented exactly the same logic you describe on it and seem to work properly. Can you please explain it better?Stereography
I
40

I had a similar problem and this thread helped me to figure out a solution. Hope it can help others too.

int levDis = Lev_distance(Q, Mi)
int bigger = max(strlen(Q), strlen(Mi))
double pct = (bigger - levDis) / bigger

It should return 100% if both strings are exactly the same and 0% if they are totaly different.

(sorry if my english isn't that good)

Impropriate answered 11/2, 2014 at 14:49 Comment(2)
This is not correct because it gives different results for ("ABC Corp", "ABC") and ("ABC Corp", "ABC Corporati")Autolycus
@WakanTanka but that's the whole point of having a relative distance. The absolute distance of both your examples is 5. But the question is about a relative distance. So both having the same absolute distance, while having different string-lengths, makes us expect a different result. - A quick example: ("a", "b") and ("this is an example text", "this is a example txt") the first one has levenshtein distance 1 the 2nd: 2 - does that mean the 1st example is a closer match than the 1st one? - no and that's why there is a need for relative distance; So yeah your example produces desired resultCorinnecorinth
S
5

My approach to this problem was by calculating maximum allowed operations, which is what Levenshtein distance is. The formula I used is:

percent = 0.75; // at least 75% of string must match
maxOperationsFirst = s1.length() - s1.length() * percent;
maxOperationsSecond = s2.length() - s2.length() * percent;
maxOperations = round(min(maxOperationsFirst, maxOperationsSecond));

It calculates maximum operations for each string, I believe that the calculation is easy to understand. I use the minimum value of both results and round it to closest whole number. You can skip this part and use just value of max operations from either of strings, it really depends on your data.

Once you've got the number of maximum operations, you can compare it with levenshtein result and determine if the string is acceptable. This way you can use any extended levenshtein methods, for example Damerau–Levenshtein distance, which count misspelling, e.g. test -> tset, only as 1 operation, which is quite useful when checking user input where those misspellings occur very often.

I hope this helps you get an idea on how to solve this problem.

Strove answered 14/8, 2013 at 23:11 Comment(0)
A
2

This is essentially option 2 mentioned in my question. However let me demonstrate a problem with that approach.

Q = "ABC Corp" (len = 8)
M1 = "ABC"
M2 = "ABC Corporati"
M3 = "ABC Corp"

I have chosen M1 and M2 such that their Lev distances are same (5 each). Using option 2, the match percentages would be

M1 = (1 - 5/8)*100  = 37.5%
M2 = (1 - 5/13)*100 = 61.5%
M3 = 100%

As you can see if I present the matches in that order, there is a huge rank difference between M1 and M2, even though they have the exact same lev distance. You see the problem?

Abel answered 11/1, 2013 at 1:1 Comment(2)
After some time I guess this is correct approach. Suppose that you have very short strings whose LevDisstance is 5. Suppose that you have very long strings whose LevDist is also 5. Then It is correct to say that shortest strings are less similar than longer.Autolycus
Tbh, I see no problem there because as @Wakan Tanka said, same distance to a longer string means that more characters have matched between them. Hence, there is no issue and Option2 is a valid option.Stereography
A
2
Max = Lev_distance(Q,''); //max operations to transform query string to empty string
PM = (Max - Lev_distance(Q, Mi)) / Max * 100%;

I think this will suffice your needs. It is correct for extreme values (exactly suffices the same and completely different strings) and plausible

Affecting answered 29/5, 2021 at 21:9 Comment(1)
Isn't Lev_distance(Q,'') always Length(Q)?Irving
A
0
(1 - (levNum / Math.max(s.length,t.length) ) ) *100

should be correct

Aberdeen answered 9/1, 2013 at 14:51 Comment(1)
The initial question already has this solution as the "Option 2". He is looking for an alternative solution to the problem.Assessor
A
0

What about this one:

100 - ( ((2*Lev_distance(Q, Mi)) / (Q.length + Mi.length)) * 100 )

It gives same distance on (Q, M1) and (Q,M2)

Autolycus answered 8/5, 2016 at 9:6 Comment(0)
I
0

Maximum number of levenshtein distance is [l1, l2].max. I think it is true. But we shouldn't divide by it.

gem install levenshtein diff-lcs

Diff::LCS.lcs "abc", "qwer"
=> []
Levenshtein.distance("abc", "qwer").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "abc", "cdef"
=> ["c"]
Levenshtein.distance("abc", "cdef").to_f / [3, 4].max
=> 1.0

Diff::LCS.lcs "1234", "34567890"
=> ["3", "4"]
Levenshtein.distance("1234", "34567890").to_f / [4, 8].max
=> 1.0

Levenshtein doesn't look like reliable way to compare strings in percents. I don't want to treat similar strings as 100% different.

I can recommend just to analyze diff between each sequence and LCS.

def get_similarity(sequence_1, sequence_2)
  lcs_length = Diff::LCS::Internals.lcs(sequence_1, sequence_2).compact.length
  lcs_length.to_f * 2 / (sequence_1.length + sequence_2.length)
end
Improve answered 20/6, 2018 at 20:8 Comment(0)
B
0

I think a simpler approach could be:

from nltk import edit_distance

str1 = 'abc'
str2 = 'abd'
edit_dist  = edit_distance(str1,str2)
len_total = len(str1)+len(str2)
pct_edit_dist = ((len_total-edit_dist)/len_total)*100
print(pct_edit_dist)

pct_edit_dist will be 100 for complete match an 0 for no match.

Bydgoszcz answered 26/6, 2021 at 20:14 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.