I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might contain typos even though the rows are still duplicates. For example, suppose I had this dataframe:
index  name        zipcode
-----  ----------  -------
0      john doe    12345
1      jane smith  54321
2      john dooe   12345
3      jane smtih  54321
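For reference, a minimal way to build that example in pandas (zipcodes kept as strings so leading zeros and fuzzy comparisons behave sensibly):

import pandas as pd

# Toy version of the dataframe above
df = pd.DataFrame({
    'name': ['john doe', 'jane smith', 'john dooe', 'jane smtih'],
    'zipcode': ['12345', '54321', '12345', '54321'],
})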
The typos could occur in either name or zipcode, but let's just worry about the name field for this question. Obviously rows 0 and 2 are duplicates, as are rows 1 and 3. But what is the most computationally efficient way to figure this out?
I have been using the fuzzywuzzy package, which scores the similarity between two strings based on Levenshtein distance. That works great when the dataframe is small and I can iterate through it via:
from fuzzywuzzy import fuzz

duplicates = set()
for index, row in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 <= index:
            continue  # skip self-comparisons and pairs already checked
        ratio = fuzz.partial_ratio(row['name'], row2['name'])
        if ratio > 90:  # a good threshold for single-character typos on names
            duplicates.add(index2)  # declare a match and mark the later row as the duplicate
df = df.drop(duplicates)
Obviously this is not an approach that will scale well, and unfortunately I need to dedupe a dataframe that is about 7M rows long. It only gets worse if I also need to dedupe potential typos in the zipcode. Yes, I could switch to .itertuples() (sketched below), which would give me roughly a 100x speed improvement, but am I missing something more obvious than this clunky O(n^2) solution?
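For concreteness, here is roughly what I mean by the .itertuples() version; it is still O(n^2) over all pairs, just with cheaper row access, and it assumes the same >90 threshold as above:

from fuzzywuzzy import fuzz

# Same pairwise comparison, but over lightweight namedtuples instead of
# per-row Series objects, and over each unordered pair only once.
rows = list(df.itertuples(index=True))
duplicates = set()
for i, a in enumerate(rows):
    for b in rows[i + 1:]:
        if fuzz.partial_ratio(a.name, b.name) > 90:
            duplicates.add(b.Index)  # keep the first occurrence, drop the later one
deduped = df.drop(index=duplicates)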
Are there more efficient ways I could go about deduping this noisy data? I have looked into the dedupe package, but that requires labeled data for supervised learning, and I don't have any, nor am I under the impression that the package handles unsupervised learning. I could roll my own unsupervised text clustering algorithm, but I would rather not go that far if there is an existing, better approach.