Using machine learning to de-duplicate data

I have the following problem and was thinking I could use machine learning, but I'm not completely certain it will work for my use case.

I have a data set of around a hundred million records containing customer data (names, addresses, emails, phones, etc.) and would like to find a way to clean this data and identify possible duplicates.

Most of the data has been manually entered using an external system with no validation, so a lot of our customers have ended up with more than one profile in our DB, sometimes with different data in each record.

For instance, we might have 5 different entries for a customer John Doe, each with different contact details.

We also have the opposite case, where multiple records that represent different customers match on key fields like email. For instance, when a customer doesn't have an email address but the data entry system requires one, our consultants will enter a random email address, resulting in many different customer profiles sharing the same email address; the same applies to phones, addresses, etc.

All of our data is indexed in Elasticsearch and stored in a SQL Server database. My first thought was to use Mahout as a machine learning platform (since this is a Java shop) and maybe HBase to store our data (just because it fits with the Hadoop ecosystem, not sure if it would be of any real value), but the more I read about it the more confused I am as to how it would work in my case. For starters, I'm not sure what kind of algorithm I could use, since I'm not sure what category this problem falls into: could I use a clustering algorithm or a classification algorithm? And of course certain rules will have to be defined as to what constitutes a profile's uniqueness, i.e. which fields.

The idea is to deploy this initially as a customer-profile de-duplication service of sorts that our data entry systems can use to validate and detect possible duplicates when entering a new customer profile, and in the future perhaps develop it into an analytics platform to gather insight about our customers.

Any feedback will be greatly appreciated :)

Thanks.

Baseline answered 5/5, 2013 at 3:36 Comment(4)
"sometimes with different data in each record" — so how should a machine learning algorithm find duplicates? Also, how do you know John Doe is the same person if he was added with nearly the same data? IMHO you are throwing buzzwords around, and all you need is a tight relational model in your customer database. – Seneca
@thomas It's true, I am indeed throwing buzzwords around. The truth is that I'm trying to get into big data and thought this would be a good opportunity to learn, which is why I said I didn't know if this would even work. The idea is that I would need to match on key fields, like email, that represent uniqueness as far as the business goes, though that's not always true. Thanks for your input though. – Baseline
Not sure what edition your SQL Server is, but you may be able to take advantage of the data cleansing transformations in SSIS (fuzzy grouping and fuzzy lookup): msdn.microsoft.com/en-us/magazine/cc163731.aspx – Yuma
Check this: chairnerd.seatgeek.com/… – Capreolate

There has actually been a lot of research on this, and people have used many different kinds of machine learning algorithms for it. I've personally tried genetic programming, which worked reasonably well, but I still prefer to tune matching manually.
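
To make "tune matching manually" concrete, here is a minimal sketch (in Python) of the kind of weighted field-by-field comparison I mean. The field names, weights, and threshold are made up for illustration; in a real setup you would tune them against known duplicates:

    from difflib import SequenceMatcher

    def similarity(a, b):
        # Crude string similarity in [0, 1]; in practice you would swap in
        # Jaro-Winkler, Levenshtein, or field-specific comparators.
        if not a or not b:
            return 0.0
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Hand-tuned weights per field -- these numbers are illustrative only.
    WEIGHTS = {"name": 0.4, "email": 0.3, "phone": 0.2, "address": 0.1}

    def match_score(rec1, rec2):
        # Weighted sum of per-field similarities between two customer records.
        return sum(w * similarity(rec1.get(f), rec2.get(f)) for f, w in WEIGHTS.items())

    def is_duplicate(rec1, rec2, threshold=0.85):
        # The threshold is also something you tune by hand.
        return match_score(rec1, rec2) >= threshold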

I have a few references for research papers on this subject. Stack Overflow doesn't want too many links, but here is bibliographic info that should be sufficient for finding them via Google:

  • Unsupervised Learning of Link Discovery Configuration, Andriy Nikolov, Mathieu d’Aquin, Enrico Motta
  • A Machine Learning Approach for Instance Matching Based on Similarity Metrics, Shu Rong, Xing Niu, Evan Wei Xiang, Haofen Wang, Qiang Yang, and Yong Yu
  • Learning Blocking Schemes for Record Linkage, Matthew Michelson and Craig A. Knoblock
  • Learning Linkage Rules using Genetic Programming, Robert Isele and Christian Bizer

That's all research, though. If you're looking for a practical solution to your problem, I've built an open-source engine for this type of deduplication, called Duke. It indexes the data with Lucene and then searches for candidate matches before doing a more detailed comparison. It requires manual setup, although there is a script that can use genetic programming (see the paper above) to create a setup for you. There's also someone who wants to make an Elasticsearch plugin for Duke (see thread), but nothing's been done so far.
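
This isn't Duke's actual API, just a rough Python sketch of the same overall pattern: build a cheap "blocking" index on a few keys, pull candidate pairs from it, and only run the detailed comparison (like the match_score above) on those candidates. The blocking keys here are hypothetical examples:

    from collections import defaultdict

    def blocking_keys(record):
        # Hypothetical blocking keys: name prefix, email domain, phone suffix.
        keys = []
        if record.get("name"):
            keys.append(("name3", record["name"][:3].lower()))
        if record.get("email") and "@" in record["email"]:
            keys.append(("domain", record["email"].split("@", 1)[1].lower()))
        if record.get("phone"):
            keys.append(("phone4", record["phone"][-4:]))
        return keys

    def find_candidate_pairs(records):
        # records: dict of record_id -> field dict.
        index = defaultdict(list)              # blocking key -> list of record ids
        for rid, rec in records.items():
            for key in blocking_keys(rec):
                index[key].append(rid)
        pairs = set()
        for ids in index.values():
            for i, a in enumerate(ids):
                for b in ids[i + 1:]:
                    # Only these candidate pairs get the expensive detailed comparison.
                    pairs.add((min(a, b), max(a, b)))
        return pairs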

Anyway, that's the approach I'd take in your case.

Heckelphone answered 14/5, 2013 at 11:5 Comment(0)

Just came across a similar problem, so I did a bit of Googling and found a library called the "Dedupe Python Library": https://dedupe.io/developers/library/en/latest/

The documentation for this library covers common problems and solutions when de-duplicating entries, as well as papers in the de-duplication field. So even if you end up not using it, the documentation is still worth reading.
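
I haven't verified this against the latest release, but based on the library's documented examples the workflow looks roughly like the sketch below. The field names and records are made up to match the question, and the exact calls may differ depending on the dedupe version you install, so treat it as orientation:

    import dedupe

    # Field definitions -- field names are illustrative.
    fields = [
        {"field": "name", "type": "String"},
        {"field": "address", "type": "String"},
        {"field": "email", "type": "Exact", "has missing": True},
    ]

    # Toy data just to show the shape; in practice this would be your full
    # customer table keyed by record id.
    records = {
        1: {"name": "John Doe", "address": "1 Main St", "email": "john@example.com"},
        2: {"name": "Jon Doe", "address": "1 Main Street", "email": None},
    }

    deduper = dedupe.Dedupe(fields)
    deduper.prepare_training(records)   # sample record pairs for labelling
    dedupe.console_label(deduper)       # interactively label pairs as duplicate / distinct
    deduper.train()
    clusters = deduper.partition(records, threshold=0.5)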

Whatley answered 25/6, 2014 at 6:0 Comment(3)
I completely agree: Dedupe looks really good, and the article written by the author is well worth a read if you want an introduction to the topic: cs.utexas.edu/~ml/papers/marlin-dissertation-06.pdf – Liberty
Dedupe is actually a terrible library. It's hard to install and get working, and it crashes or freezes depending on the data set. – Con
Yeah, it still crashes and is really hard to set up. – Resume
