difflib.get_close_matches GET SCORE

I am trying to get the score of the best match using difflib.get_close_matches:

import difflib

best_match = difflib.get_close_matches(my_str, str_list, 1)[0]

I know about the 'cutoff' parameter, but I couldn't find out how to get the actual score after setting the threshold. Am I missing something? Is there a better way to match Unicode strings?

Coleen answered 29/3, 2016 at 11:47 Comment(1)
I found a great library that scores the similarity between two strings quickly and accurately: fuzzywuzzy (pypi.python.org/pypi/fuzzywuzzy) - Coleen
B
17

I found that difflib.get_close_matches is the simplest way to fuzzy-match strings, although there are more advanced libraries such as fuzzywuzzy, as you mentioned in the comments.

If you want to stay with difflib, you can use difflib.SequenceMatcher to get the score as follows:

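import difflib

my_str = 'apple'
str_list = ['ape', 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
# Take the single best match (raises IndexError if nothing clears the default cutoff of 0.6)
best_match = difflib.get_close_matches(my_str, str_list, 1)[0]
# Score that match by comparing the two strings directly
score = difflib.SequenceMatcher(None, my_str, best_match).ratio()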

In this example, the best match between 'apple' and the list is 'ape' and the score is 0.75.
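
Indexing with [0] fails when no candidate clears the cutoff, so a small wrapper can make that case explicit. The following is only a sketch: best_match_with_score is a hypothetical helper name, not part of difflib, and it puts the candidate first in SequenceMatcher because that is the order get_close_matches uses internally.

```python
import difflib


def best_match_with_score(word, candidates, cutoff=0.6):
    """Return (best_match, score), or None if nothing reaches the cutoff.

    Hypothetical helper for illustration; not part of difflib.
    """
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    # Candidate first, word second: the same order get_close_matches uses.
    score = difflib.SequenceMatcher(None, best, word).ratio()
    return best, score


str_list = ['ape', 'fjsdf', 'aerewtg', 'dgyow', 'paepd']
print(best_match_with_score('apple', str_list))
print(best_match_with_score('apple', ['zzz']))
```

The second call returns None instead of raising IndexError, which is easier to handle when the candidate list comes from user data.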

You can also loop through the list and compute all the scores to check:

for word in str_list:
    score = difflib.SequenceMatcher(None, my_str, word).ratio()
    print(f"score for: {my_str} vs. {word} = {score}")

For this example, you get the following:

score for: apple vs. ape = 0.75
score for: apple vs. fjsdf = 0.0
score for: apple vs. aerewtg = 0.3333333333333333
score for: apple vs. dgyow = 0.0
score for: apple vs. paepd = 0.4

Documentation for difflib can be found here: https://docs.python.org/2/library/difflib.html

Brighten answered 15/6, 2016 at 9:7 Comment(1)
Can we produce this output in pandas if we create a column df['Score']? - Inverter
H
2

To answer the question: the usual route is to compute the score for each match returned by get_close_matches() individually, like this:

match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
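
One caveat when recomputing scores this way: the difflib documentation warns that the result of ratio() may depend on the order of the arguments. The short sketch below uses the pair of words from that documentation note; the variable names are only illustrative. Since get_close_matches passes each candidate as the first sequence and the search word as the second, using the same order is the safest way to reproduce the score it used for ranking.

```python
import difflib

# Same two strings, opposite argument order.
forward = difflib.SequenceMatcher(None, 'tide', 'diet').ratio()
backward = difflib.SequenceMatcher(None, 'diet', 'tide').ratio()

print(forward)   # 0.25
print(backward)  # 0.5
```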

Here's an approach that was about 10% faster in my case.

I'm using get_close_matches() for spellchecking. It runs SequenceMatcher under the hood but normally strips the scores and returns just a list of matching strings.

With a small change in Lib/difflib.py (currently around line 736), it can return a dictionary with the scores as values instead, so there is no need to run SequenceMatcher again on each item to obtain its ratio. In the examples I've shortened the float values for clarity (for example, 0.8888888888888888 to 0.889). Passing n=7 limits the result to at most the 7 highest-scoring items, which matters when there are many candidates.

Current mere list return

With the unmodified function, the result in this example would be a plain list such as ['apple', 'staple', 'able', 'lapel']

... when cutoff is omitted and the default of .6 applies (as in Ben's answer, no judgement).

The change

in difflib.py is simple (the comment shows the original line):

# Hack: return a dict of {match: score} instead of a list.
# Original line: return [x for score, x in result]
return {x: score for score, x in result}

New dictionary return

includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}

>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']

Increasing minimum score cutoff/threshold from .4 to .8:

>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}

>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}

>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}
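
If editing the installed difflib.py is not an option, roughly the same result can be had from a standalone function. This is a minimal sketch that assumes the scoring loop of get_close_matches can be mirrored outside the module; close_matches_with_scores is a hypothetical name, and it returns (match, score) pairs rather than a dict.

```python
import difflib
import heapq


def close_matches_with_scores(word, possibilities, n=3, cutoff=0.6):
    """Like difflib.get_close_matches, but keep the scores.

    Hypothetical helper for illustration; returns a list of
    (match, score) tuples, best score first.
    """
    if n <= 0:
        raise ValueError("n must be > 0")
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]")

    matcher = difflib.SequenceMatcher()
    matcher.set_seq2(word)  # the search word is the second sequence
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        # The cheap upper bounds filter out hopeless candidates first.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff):
            score = matcher.ratio()
            if score >= cutoff:
                scored.append((score, candidate))

    # Keep only the n highest scores, best first.
    return [(cand, score) for score, cand in heapq.nlargest(n, scored)]


candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
for match, score in close_matches_with_scores('aple', candidates, n=7, cutoff=.7):
    print(match, round(score, 3))
```

Each candidate is scored once, so this keeps the benefit of not running SequenceMatcher a second time, and it survives Python upgrades that would overwrite a patched difflib.py.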
Harlamert answered 9/2, 2022 at 18:48 Comment(0)
V
0

To get a list of the matching strings together with their scores, try this: in difflib.py, find the line return [x for score, x in result] and replace it with return [[x, score] for score, x in result]

Vicenta answered 6/10, 2023 at 6:5 Comment(0)
