To answer the question, the usual route would be to obtain the comparative score for a match returned by get_close_matches()
individually in this manner:
match_ratio = difflib.SequenceMatcher(None, 'aple', 'apple').ratio()
Here's a way that increases speed in my case by about 10% ...
I'm using get_close_matches()
for spellcheck, it runs SequenceMatcher()
under the hood but strips the scores returning just a list of matching strings. Normally.
But with a small change in Lib/difflib.py
currently around line 736 the return can be a dictionary with scores as values, thus no need to run SequenceMatcher
again on each list item to obtain their score ratios. In the examples I've shortened the output float values for clarity (like 0.8888888888888888 to 0.889). Input n=7 says to limit the return items to 7 if there are more than 7, i.e. the highest 7, and that could apply if candidates
are many.
Current mere list return
In this example result would normally be like ['apple', 'staple', 'able', 'lapel']
... at the default cutoff of .6 if omitted (as in Ben's answer, no judgement).
The change
in difflib.py
is simple (this line to the right shows the original):
return {v: k for (k, v) in result} # hack to return dict with scores instead of list, original was ... [x for score, x in result]
New dictionary return
includes scores like {'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667}
>>> to_match = 'aple'
>>> candidates = ['lapel', 'staple', 'zoo', 'able', 'apple', 'appealing']
Increasing minimum score cutoff/threshold from .4 to .8:
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.4)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75, 'lapel': 0.667, 'appealing': 0.461}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.7)
{'apple': 0.889, 'staple': 0.8, 'able': 0.75}
>>> difflib.get_close_matches(to_match, candidates, n=7, cutoff=.8)
{'apple': 0.889, 'staple': 0.8}