Data Deduplication algorithm for large number of contacts

Asked 4/10, 2013 at 11:54 Answered 14/3, 2014 at 20:20

Solved algorithm duplicates fuzzy-logic record-linkage

I'm developing an application which must be able to find & merge duplicates in a Hundreds of thousands of contact information stored in sql server DB. I have to compare all the columns in the table, each column has a weight value. The comparison must work based on the weight value. Based on the comparison result & degree of equivalence i have to decide to merge the contacts automatically or request user attention. I know that there are number of fuzzy logic algorithms for deduplication.

Read about N-gram or Q-gram-based Algorithms in http://www.melissadata.com/. Is this algorithm feasible for a large set of data? If not can any one guide me with some algorithm or tel me where to start with?

An example of what i want to achieve,

Gonzales = Gonzalez (two different spelling of different name)
Smith = Smyth (Phonetic sound the same)
123 Main st = 123 Main street (abbrevation)
Bob Smith = Robert Smith (synonym)

Louise answered 4/10, 2013 at 11:54 Comment(4)

I need to perform a similarity compare. Like say if 2 persons have same e-mail address and have names like 'Pravin' & 'Praveen' i must be able to find these records. – Louise 4/10, 2013 at 12:4

Mr Smith & Smith are also a possible duplicate in my scenario. – Louise 4/10, 2013 at 12:7

Is the data structured in any way: Firstname: Bob, Surename: Smith, Address: 123 Main Street, E-Mail: [email protected] ? or is it just one single line: "Bob Smith, 123 Main Street, [email protected]" – Gormless 5/10, 2013 at 17:54

Yeah, as specified the data is a sql table in the form of (First Name, Last Name, Email, Phone, etc...) – Louise 7/10, 2013 at 4:26

Found a partial solution using simhash algorithm. Found a good example here http://simhash.codeplex.com/

Louise answered 8/10, 2013 at 10:43 Comment(0)

This whole area of research is generally known as record linkage (ironically, it has about a dozen duplicate names). There are quite a few tools out there that will let you configure matching for your specific data, churn through the data, and output the duplicates for you. Some tools will even create a matching for you, if you answer some questions about the correctness of potential matches.

Q/N-gram comparison (and indexing) is one possible way to solve this, but there is a whole raft of others. You'll quickly find that different types of comparators work well for different types of data. I haven't tried Q-gram indexing myself, but researchers in the field have described it to me as relatively slow.

As for comparison with phonetic key functions (like Soundex or Metaphone or whatever) this is only suitable when you have small text fields, like separate fields for given name, family name, middle name etc. Even then these functions tend to be rather coarse. And beware of Soundex. It not only is extremely coarse, turning very different names into the same key, but also misses a number of similar names that should be treated the same. Metaphone is much better.

The Wikipedia page for record linkage used to have a list of tools, but the editors removed it. I wrote an open source tool called Duke to solve this kind of thing. It has a genetic algorithm that can help you create a configuration, or you can write one manually. Other tools exist, too.

I would advise you to use one of the tools that already exist rather than attempting to solve this from scratch.

Ardussi answered 14/3, 2014 at 20:20 Comment(0)

I believe the best option for deduping names is by means of a phonetic encoder. A phonetic encoder will be able to dedup alternative spellings of the same name, here is an example with some common names:

Group: Kathryn names: [Kathryn, Katharine, Katherin, Katherynn, Kathrynn, Katherynne, Kathrynne, Catherine, Cathryn, Catharine, Catherin, Catherynn, Cathrynn, Catherynne, Cathrynne]
Group: Assaf names: [Assaf, Asaf]
Group: Megan names: [Megan, Meagan, Meghan, Meaghan]
Group: Allison names: [Allison, Alyson, Allyson, Alison, Allisyn]
==============================================================
Phonetic Encoder: Caverphone2
---- Names Group: Kathryn ----
Encoded names: {KTRN111111=16}
---- Names Group: Assaf ----
Encoded names: {ASF1111111=3}
---- Names Group: Megan ----
Encoded names: {MKN1111111=5}
---- Names Group: Allison ----
Encoded names: {ALSN111111=6}
==============================================================
Phonetic Encoder: DoubleMetaphone
---- Names Group: Kathryn ----
Encoded names: {K0RN=16}
---- Names Group: Assaf ----
Encoded names: {ASF=3}
---- Names Group: Megan ----
Encoded names: {MKN=5}
---- Names Group: Allison ----
Encoded names: {ALSN=6}
==============================================================
Phonetic Encoder: Nysiis
---- Names Group: Kathryn ----
Encoded names: {CATRYN=7, CATARA=6, CATARY=5}
---- Names Group: Assaf ----
Encoded names: {ASAF=3}
---- Names Group: Megan ----
Encoded names: {MAGAN=5}
---- Names Group: Allison ----
Encoded names: {ALASAN=3, ALYSAN=3, ALASYN=2}
==============================================================
Phonetic Encoder: Soundex
---- Names Group: Kathryn ----
Encoded names: {K365=8, C365=9}
---- Names Group: Assaf ----
Encoded names: {A210=3}
---- Names Group: Megan ----
Encoded names: {M250=5}
---- Names Group: Allison ----
Encoded names: {A425=6}
==============================================================
Phonetic Encoder: RefinedSoundex
---- Names Group: Kathryn ----
Encoded names: {C30609080=5, K3060908=5, K30609080=4, C3060908=5}
---- Names Group: Assaf ----
Encoded names: {A0302=3}
---- Names Group: Megan ----
Encoded names: {M80408=5}
---- Names Group: Allison ----
Encoded names: {A070308=6}
==============================================================

In the example you can see that for Caverphone and DoubleMetaphone all names were encoded to the same string. you should see what makes sense for your data, the encoder to use depends on the language and etymology of the names (i.e. english names, german names...)

Gormless answered 4/10, 2013 at 12:48 Comment(4)

what about 'Mr John' & 'John' will that count as same string? I think not when using Phonetic algorithm. And i'm not going to compare name alone. I have to compare all the columns in the table, each column has a weight value. Based on the comparison i have to decide to merge the contacts automatically or request user attention. – Louise 4/10, 2013 at 13:3

you don't really care about the title when comparing names. simply remove any title you find. there is no need for an algorithm to do that. Remove "Mr", "Mrs", "Ms", "Eng", "Dr", "Prf" or whatever else common title you know of from the start of the name – Gormless 4/10, 2013 at 13:14

phonetic similarity matching fails if you swap names. It is very common to write 'Mark Smith' as 'Smith, Mark'. You have to use a token based matching. – Sphericity 4/10, 2013 at 13:37

edited my question. does token based matching works on all those cases i specified? – Louise 5/10, 2013 at 10:16

Found a partial solution using simhash algorithm. Found a good example here http://simhash.codeplex.com/

Louise answered 8/10, 2013 at 10:43 Comment(0)

Recommended topics

Hot tags