Pandas fuzzy detect duplicates
How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?


How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts each row_i with toString() and then compares it to all the other rows?

Antilogarithm answered 14/9, 2016 at 12:13 Comment(10)
FuzzyWuzzy is an implementation of edit distance, which would be a good candidate for building a pairwise distance matrix in numpy or similar. To detect "duplicates" or near matches, you'll have to at least compare each row against the other rows, or you'll never know whether two are close to each other. See #24090473 for a solution using pdist in scipy. – Ballou
You could potentially approximate it -- see cs.stackexchange.com/questions/2093/… – Ballou
Or get fancy: en.wikipedia.org/wiki/BK-tree. Not sure if any of those are particularly helpful for your case. – Ballou
Thanks - I will need to look into that. Would you recommend performing the distance operation row-wise, or would you suggest "adding up" the distances of each field? – Antilogarithm
This seems to be interesting: gist.github.com/nibogd/94363e93f4e0256b4665eb743dbfa211 - they mention the indexing time is slow, but surely not as slow as n^2? – Antilogarithm
I updated the notebook and wonder why I cannot set an arbitrary string distance function, e.g. one from fuzzywuzzy, as a distance metric. – Antilogarithm
@mwormser which element would you consider for the root, or would you create a separate tree per row? – Antilogarithm
I found github.com/ekzhu/datasketch/blob/master/README.md and cran.r-project.org/web/packages/textreuse/vignettes/… - for now I will look a bit more into the Python variant. – Antilogarithm
You can use scikit-learn for that: they have an LSH feature hasher that works well with strings. I thought you wanted to use edit distance, but standard similarity search might work well for you. Good luck. – Ballou
Not necessarily. I just want to find the duplicates. Would you suggest using only LSH, or the combination with MinHash? – Antilogarithm
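The pairwise edit-distance approach discussed in these comments can be sketched with the standard library; `difflib.SequenceMatcher` stands in for a fuzzywuzzy scorer here, and the column name, data, and threshold are purely illustrative:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

# Toy frame; 'jon smith' / 'john smith' are the near-duplicates to catch.
df = pd.DataFrame({'name': ['jon smith', 'john smith', 'mark doe']})

def similarity(a, b):
    # difflib's ratio (0..1) stands in for an edit-distance scorer
    # such as fuzzywuzzy's fuzz.ratio (which scales to 0..100).
    return SequenceMatcher(None, a, b).ratio()

# Brute-force O(n^2) pairwise comparison -- the baseline the question
# wants to avoid; blocking, LSH, or a BK-tree would prune candidate pairs.
threshold = 0.8  # illustrative cutoff, not a recommended value
pairs = [
    (i, j)
    for i, j in combinations(df.index, 2)
    if similarity(df.at[i, 'name'], df.at[j, 'name']) >= threshold
]
print(pairs)  # [(0, 1)]
```

As the comments note, this quadratic scan is only feasible for small frames; an approximate index (MinHash/LSH via datasketch, or a BK-tree) is what makes it scale.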

Not pandas-specific, but within the Python ecosystem, the dedupe library would seem to do what you want. In particular, it lets you compare each column of a row separately and then combine that information into a single probability that two rows match.
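The "compare columns separately, then combine" idea can be illustrated with a simplified weighted-average sketch. This is not dedupe's API (dedupe learns a probabilistic model from labeled training pairs); the field names and weights below are hypothetical:

```python
from difflib import SequenceMatcher

def field_score(a, b):
    # Per-field string similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def record_score(rec_a, rec_b, weights):
    # Weighted average of per-field similarities. Fixed weights are only
    # a stand-in for dedupe's learned model, to show how per-column
    # evidence folds into one score per record pair.
    total = sum(weights.values())
    return sum(w * field_score(rec_a[f], rec_b[f])
               for f, w in weights.items()) / total

a = {'name': 'jon', 'email': 'jon@x.com'}
b = {'name': 'john', 'email': 'jon@x.com'}
# Identical email outweighs the slightly different name.
score = record_score(a, b, {'name': 1.0, 'email': 2.0})
```

Weighting fields differently is the point: an exact email match is much stronger evidence of a duplicate than a similar first name, and dedupe estimates that kind of weighting for you instead of hand-tuning it.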

Alwin answered 18/9, 2016 at 2:52 Comment(0)

pandas-dedupe is your friend here. You can try the following:

import pandas as pd
from pandas_dedupe import dedupe_dataframe

df = pd.DataFrame.from_dict({
    'bank': ['bankA', 'bankA', 'bankB', 'bankX'],
    'email': ['email1', 'email1', 'email2', 'email3'],
    'name': ['jon', 'john', 'mark', 'pluto'],
})

# Starts an interactive labeling session on the first run.
dd = dedupe_dataframe(df, ['bank', 'name', 'email'], sample_size=1)

If you also want to assign a canonical name to rows identified as the same entity, set canonicalize=True.

[I'm one of the pandas-dedupe contributors]

Meador answered 12/7, 2020 at 16:26 Comment(0)

There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe.

(I am a developer of the original dedupe library, but not the pandas-dedupe package)

Alwin answered 6/3, 2020 at 15:0 Comment(0)
