Better Approach than FuzzyWuzzy?

I'm getting results from fuzzywuzzy that aren't working as well as I hoped. When one string has an extra word in the middle, the Levenshtein-based score drops.

Example:

from fuzzywuzzy import fuzz

score = fuzz.ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.partial_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Results: 81, 85, 71, 81 (in the order of the calls above)

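Poking at the numbers: fuzz.ratio is a character-level similarity, roughly 2*M/T (matched characters over the combined length), so the eight extra characters of "WILLIAM " cost more than the letter changes between DANIEL and DAVID. A quick check with difflib.SequenceMatcher, which fuzz.ratio falls back to when python-Levenshtein isn't installed, reproduces the first two numbers:

import difflib

# ratio = 2*M / T: M = matched characters, T = total length of both strings,
# so every unmatched character drags the score down.
for a, b in [('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'),
             ('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')]:
    m = difflib.SequenceMatcher(None, a, b)
    print(a, '|', b, '->', round(m.ratio() * 100))  # 81 and 85, matching fuzz.ratio above
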
I'm looking for the first pair (Daniel vs. Daniel William) to be a better match than the second pair (Daniel vs. David).

Is there a better approach than fuzzywuzzy to use here?

Underwood answered 31/7, 2018 at 23:56

For your example, you could use token_set_ratio. The code doc says it tokenizes both strings, takes the intersection of the token sets, and compares strings built from the intersection and each string's remaining tokens.

from fuzzywuzzy import fuzz

score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT')
print(score)
score = fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT')
print(score)

Result:

100
85
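
For intuition, token_set_ratio roughly sorts the token-set intersection and each string's leftover tokens, then takes the best plain ratio among the strings built from them. Because every token of 'DANIEL CARTWRIGHT' also appears in 'DANIEL WILLIAM CARTWRIGHT', one of those comparisons is between identical strings and scores 100. A simplified sketch of the idea (not the library's exact implementation):

from fuzzywuzzy import fuzz

def rough_token_set_ratio(a, b):
    # Simplified illustration of the token_set_ratio idea.
    tokens_a, tokens_b = set(a.split()), set(b.split())
    intersection = " ".join(sorted(tokens_a & tokens_b))
    combined_a = (intersection + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    combined_b = (intersection + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    # The score is the best plain ratio among the constructed strings.
    return max(fuzz.ratio(intersection, combined_a),
               fuzz.ratio(intersection, combined_b),
               fuzz.ratio(combined_a, combined_b))

print(rough_token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 100
print(rough_token_set_ratio('DANIEL CARTWRIGHT', 'DAVID CARTWRIGHT'))           # 85
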
Sacramental answered 1/8, 2018 at 0:16

I had a similar challenge using FuzzyWuzzy to compare one list of names against another list of names to identify matches between the lists. The token_set_ratio scorer didn't work for me because, to use your example, comparing "DANIEL CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (a partial match of 2 of 3 words) and "DANIEL WILLIAM CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (an identity match of 3 of 3 words) both yield a score of 100. For me, a match of 3 of 3 words needed to score higher than a match of 2 of 3.

I ended up using nltk in a bag-of-words-like approach. The code below converts multi-word names into lists of distinct words (tokens), counts how many words in one list appear in the other, and normalizes that count by the length of each list. Because True == 1 and False == 0, a sum() over membership tests works nicely to count how many elements of one list appear in another.

An identity match of all words scores 1 (100%). Scoring for your comparisons works out as follows:

  • DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT = (2/2 + 2/3)/2 = (5/3)/2 = 0.83
  • DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT = (1/2 + 1/2)/2 = 1/2 = 0.5
Note that my method ignores word order, which wasn't needed in my case.

import nltk  # word_tokenize may require downloading the 'punkt' tokenizer data first

s1 = 'DANIEL CARTWRIGHT'
s2 = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

def myScore(lst1, lst2):
    # Count how many words of lst2 appear in lst1, then average that count
    # normalized by each list's length.
    c = sum(el in lst1 for el in lst2)
    if len(lst1) == 0 or len(lst2) == 0:
        retval = 0.0
    else:
        retval = 0.5 * (c / len(lst1) + c / len(lst2))
    return retval

tokens1 = nltk.word_tokenize(s1)

for s in s2:
    tokens2 = nltk.word_tokenize(s)
    score = myScore(tokens1, tokens2)
    print(' vs. '.join([s1, s]), ":", str(score))

Output:

DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT : 0.8333333333333333
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT : 0.5
    
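Since my actual use case was matching each name in one list against a second list, here is an illustrative extension that picks the best-scoring candidate for each name using myScore from above. This is just a sketch and not part of my original code: str.split() stands in for nltk.word_tokenize, and the input lists are made up for the example.

def best_match(name, candidates):
    # Score 'name' against every candidate with myScore (defined above)
    # and return the highest-scoring candidate with its score.
    tokens = name.split()  # simple whitespace tokenization for illustration
    scored = [(cand, myScore(tokens, cand.split())) for cand in candidates]
    return max(scored, key=lambda pair: pair[1])

names = ['DANIEL CARTWRIGHT']  # hypothetical query list
candidates = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

for name in names:
    match, score = best_match(name, candidates)
    print(name, '->', match, '(%.2f)' % score)

With these inputs it should pick DANIEL WILLIAM CARTWRIGHT with a score of about 0.83.
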
Incarcerate answered 30/5, 2021 at 2:17
