I had a similar challenge using FuzzyWuzzy to compare one list of names against another to identify matches between the lists. The FuzzyWuzzy token_set_ratio scorer didn't work for me because, to use your example, comparing "DANIEL CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (a partial match of 2 of 3 words) and "DANIEL WILLIAM CARTWRIGHT" to "DANIEL WILLIAM CARTWRIGHT" (an identity match of 3 of 3 words) both yield a score of 100. For me, a match on all 3 words needed to score higher than a match on 2 of 3.
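You can see that behavior directly (a minimal sketch, assuming fuzzywuzzy is installed):

from fuzzywuzzy import fuzz

# a subset match and an identity match both score 100 with token_set_ratio
print(fuzz.token_set_ratio('DANIEL CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))          # 100
print(fuzz.token_set_ratio('DANIEL WILLIAM CARTWRIGHT', 'DANIEL WILLIAM CARTWRIGHT'))  # 100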
I ended up using nltk in a Bag-of-Words-like approach. The algorithm in the code below tokenizes each multi-word name into a list of words, counts how many words in one list appear in the other, and normalizes that count by the length of each list. Because True == 1 and False == 0 in Python, a sum() over membership tests works nicely to count the elements of one list that appear in another.
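That counting trick in isolation looks like this (token lists written out by hand for illustration):

# True counts as 1 and False as 0, so sum() tallies the membership hits
words1 = ['DANIEL', 'CARTWRIGHT']
words2 = ['DANIEL', 'WILLIAM', 'CARTWRIGHT']
print(sum(el in words1 for el in words2))  # 2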
An identity match of all words scores 1 (100%). Scoring for your comparisons works out as follows:
DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT = (2/2 + 2/3)/2 = (5/3)/2 = 0.83
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT = (1/2 + 1/2)/2 = 1/2 = 0.5
Note that my method ignores word order; order sensitivity wasn't needed in my case (see the quick check after the output below).
import nltk

# nltk.word_tokenize() relies on the 'punkt' tokenizer models;
# run nltk.download('punkt') once if they aren't already installed.

s1 = 'DANIEL CARTWRIGHT'
s2 = ['DANIEL WILLIAM CARTWRIGHT', 'DAVID CARTWRIGHT']

def myScore(lst1, lst2):
    # calculate score for comparing lists of words
    c = sum(el in lst1 for el in lst2)  # count of lst2 words found in lst1
    if len(lst1) == 0 or len(lst2) == 0:
        retval = 0.0
    else:
        # average the match fraction relative to each list's length
        retval = 0.5 * (c/len(lst1) + c/len(lst2))
    return retval

tokens1 = nltk.word_tokenize(s1)
for s in s2:
    tokens2 = nltk.word_tokenize(s)
    score = myScore(tokens1, tokens2)
    print(' vs. '.join([s1, s]), ":", str(score))
Output:
DANIEL CARTWRIGHT vs. DANIEL WILLIAM CARTWRIGHT : 0.8333333333333333
DANIEL CARTWRIGHT vs. DAVID CARTWRIGHT : 0.5
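And the quick word-order check mentioned above (a minimal sketch reusing myScore from the code):

print(myScore(['DANIEL', 'CARTWRIGHT'], ['CARTWRIGHT', 'DANIEL']))  # 1.0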