Pandas fuzzy detect duplicates
How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?


How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts each row_i with toString() and then compares it to all the other rows?

Antilogarithm answered 14/9, 2016 at 12:13 Comment(10)
FuzzyWuzzy is an implementation of edit distance, which would be a good candidate for building a pairwise distance matrix in numpy or similar. To detect "duplicates" or near matches, you'll have to at least compare each row against the other rows, or you'll never know whether two are close to each other. See #24090473 for a solution using pdist in scipy. – Ballou
You could potentially approximate it -- see cs.stackexchange.com/questions/2093/… – Ballou
Or get fancy: en.wikipedia.org/wiki/BK-tree. Not sure if any of those are particularly helpful for your case. – Ballou
Thanks - I will need to look into that. Would you recommend performing the distance operation row-wise, or would you suggest "adding up" the distances of each field? – Antilogarithm
This seems to be interesting: gist.github.com/nibogd/94363e93f4e0256b4665eb743dbfa211 - they mention the indexing time is slow, but surely not as slow as n^2? – Antilogarithm
I updated the notebook and wonder why I cannot set an arbitrary string distance function, e.g. one from fuzzywuzzy, as a distance metric. – Antilogarithm
@mwormser which element would you consider for the root, or would you create a separate tree per row? – Antilogarithm
I found github.com/ekzhu/datasketch/blob/master/README.md and cran.r-project.org/web/packages/textreuse/vignettes/… - for now I will look a bit more into the Python variant. – Antilogarithm
You can use scikit-learn for that: they have an LSH feature hasher that works well with strings. I thought you wanted to use edit distance, but standard similarity search might work well for you. Good luck. – Ballou
Not necessarily. I just want to find the duplicates. Would you suggest using only LSH, or the combination with MinHash? – Antilogarithm
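The pairwise edit-distance approach discussed in these comments can be sketched with the standard library; `difflib.SequenceMatcher` stands in for a fuzzywuzzy scorer here, and the column name, data, and threshold are purely illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Toy frame; 'jon smith' / 'john smith' are the near-duplicates to catch.
df = pd.DataFrame({'name': ['jon smith', 'john smith', 'mark doe']})

def similarity(a, b):
    # difflib's ratio (0..1) stands in for an edit-distance scorer
    # such as fuzzywuzzy's fuzz.ratio (which scales to 0..100).
    return SequenceMatcher(None, a, b).ratio()

# Brute-force O(n^2) pairwise comparison -- the baseline the question
# wants to avoid; blocking, LSH, or a BK-tree would prune candidate pairs.
threshold = 0.8  # illustrative cutoff, not a recommended value
pairs = [
    (i, j)
    for i, j in combinations(df.index, 2)
    if similarity(df.at[i, 'name'], df.at[j, 'name']) >= threshold
]
print(pairs)  # [(0, 1)]
```

As the comments note, this quadratic scan is only feasible for small frames; an approximate index (MinHash/LSH via datasketch, or a BK-tree) is what makes it scale.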

Not pandas-specific, but within the Python ecosystem, the dedupe library would seem to do what you want. In particular, it lets you compare each column of a row separately and then combine that information into a single probability that two rows match.
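The "compare columns separately, then combine" idea can be illustrated with a simplified weighted-average sketch. This is not dedupe's API (dedupe learns a probabilistic model from labeled training pairs); the field names and weights below are hypothetical:

```python
from difflib import SequenceMatcher

def field_score(a, b):
    # Per-field string similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def record_score(rec_a, rec_b, weights):
    # Weighted average of per-field similarities. Fixed weights are only
    # a stand-in for dedupe's learned model, to show how per-column
    # evidence folds into one score per record pair.
    total = sum(weights.values())
    return sum(w * field_score(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total

a = {'name': 'jon', 'email': 'jon@x.com'}
b = {'name': 'john', 'email': 'jon@x.com'}
# Identical email outweighs the slightly different name.
score = record_score(a, b, {'name': 1.0, 'email': 2.0})
```

Weighting fields differently is the point: an exact email match is much stronger evidence of a duplicate than a similar first name, and dedupe estimates that kind of weighting for you instead of hand-tuning it.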

Alwin answered 18/9, 2016 at 2:52 Comment(0)

pandas-dedupe is your friend here. You can try the following:

import pandas as pd
from pandas_dedupe import dedupe_dataframe

df = pd.DataFrame.from_dict({
    'bank': ['bankA', 'bankA', 'bankB', 'bankX'],
    'email': ['email1', 'email1', 'email2', 'email3'],
    'name': ['jon', 'john', 'mark', 'pluto'],
})

# Starts an interactive labeling session on the first run.
dd = dedupe_dataframe(df, ['bank', 'name', 'email'], sample_size=1)

If you also want to assign a canonical name to rows identified as the same entity, set canonicalize=True.

[I'm one of the pandas-dedupe contributors]

Meador answered 12/7, 2020 at 16:26 Comment(0)

There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe.

(I am a developer of the original dedupe library, but not the pandas-dedupe package)

Alwin answered 6/3, 2020 at 15:0 Comment(0)
