Fuzzy matching deduplication in less than exponential time?

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street addresses, names, etc.).

I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.

The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).

I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, then grouping strings that had roughly equal similarity measures, and then running the brute-force method within those groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B if their similarities to string C (the generated check-string) happen to differ, even though A and B are very similar to each other.

Any ideas?

P.S. I realize I may have used the wrong terms for time complexity - it is a concept that I have a basic grasp of, but not well enough so I could drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.

Edit

Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.

Edva answered 25/8, 2011 at 19:25 Comment(2)
The second method would provide greater accuracy than the first... But you wouldn't need that many calculations. Assuming you gave each record a unique identifier to begin with and changed these as you matched something, the number of records would decrease over time.Intercontinental
But each record can have multiple duplicates, and I'm concerned with the run time of a single pass through. So I don't think changing row IDs as I go will save any time. Obviously, preventing future duplicates is easier, because it's just comparing the newly inserted record against all the others - O(n) time per insert. But the first pass is O(n^2) time, which, with 63 million records and severely limited hardware, stretches into months of computing time for a single pass. So I need a faster option ...Edva

Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. Thus STACKOVERFLOW becomes the set {STACK, TACKO, ACKOV, CKOVE, ..., RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
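
A minimal sketch of that blocking idea in Python (difflib's SequenceMatcher stands in for whatever fuzzy matcher you prefer; the n-gram length and similarity threshold are arbitrary choices for illustration):

from collections import defaultdict
from difflib import SequenceMatcher   # stand-in for your preferred fuzzy matcher

def ngrams(s, n=5):
    # Overlapping character n-grams of a lightly normalized string.
    s = s.upper().replace(" ", "")
    return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}

def candidate_pairs(records, n=5):
    # Bucket record ids by n-gram; only records sharing a bucket ever get compared.
    buckets = defaultdict(list)
    for rec_id, text in records.items():
        for g in ngrams(text, n):
            buckets[g].append(rec_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((min(ids[i], ids[j]), max(ids[i], ids[j])))
    return pairs

def fuzzy_duplicates(records, threshold=0.85, n=5):
    # Run the expensive comparison only on colliding pairs.
    return [(a, b) for a, b in candidate_pairs(records, n)
            if SequenceMatcher(None, records[a], records[b]).ratio() >= threshold]

records = {1: "STACKOVERFLOW", 2: "SXACKOVRVLOX", 3: "123 MAIN ST"}
print(candidate_pairs(records))   # {(1, 2)}: the only pair sharing an n-gram (e.g. ACKOV)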

A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key - and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
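
And a rough sketch of that MinHash variant, assuming HMAC-SHA1 keyed with an index plays the role of the family of random hash functions (the number of hash functions and the n-gram length are again illustrative):

import hashlib
import hmac

def minhash_signature(text, num_hashes=4, n=5):
    # For each keyed hash function, keep only the n-gram with the smallest hash value.
    s = text.upper().replace(" ", "")
    grams = [s[i:i + n] for i in range(len(s) - n + 1)] or [s]
    signature = []
    for k in range(num_hashes):
        key = ("key-%d" % k).encode()
        smallest = min(grams, key=lambda g: hmac.new(key, g.encode(), hashlib.sha1).digest())
        signature.append(smallest)
    return tuple(signature)

# Records whose signatures agree in one or more positions become candidate pairs;
# requiring agreement in several positions at once trades recall for precision.
a = minhash_signature("STACKOVERFLOW")
b = minhash_signature("SXACKOVRVLOX")
print(sum(x == y for x, y in zip(a, b)), "of", len(a), "positions agree")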

Abampere answered 25/8, 2011 at 20:17 Comment(4)
Yes! I think this could be what I am looking for. I did some reading up on LSH, and I'm a little lost in terms of implementing it. The idea seems to be to calculate the hash of the values in such a way that similar values have a high probability of colliding hashes. Then you could run more precise algorithms on values with colliding hashes. Basically, reduce the dimensionality to reduce computational intensity. What I don't understand is what hashing function to use - aren't they designed to try to minimize collisions? Are we talking about hash functions like SHA or MD5?Edva
Good hash functions - such as SHA - are designed to look like random functions, which do produce fewer collisions than some very bad hash functions. But the calculations that justify LSH assume perfectly random hash functions as building blocks, so that is fine. A bad hash function produces collisions by hashing similar inputs to the same output, but LSH doesn't rely on this at all. It just relies on the hash function being consistent, so that if h(ACKOV) = 13 in one place, h(ACKOV) = 13 in another place. If you find a good practical writeup of LSH, use the hash function in the writeup.Abampere
I think that makes sense :) ... any chance you would know of such a writeup, or at least a direction to look? My attempts so far have been largely fruitless. I've turned up several highly technical resources, but they are more complex than I can handle given my current background.Edva
I have never done this in practice - just tried to keep up with the literature when it caught my attention. However, a quick websearch turned up a document at nlp.stanford.edu/IR-book/html/htmledition/… which looks not too unreadable. As always, remember that you can cut up documents that you don't want to eat for bait - that is, a document that is otherwise useless to you may still provide useful keywords for more web-searching.Abampere

I think you may have miscalculated the complexity of all the combinations. If comparing one string with all other strings is linear, then, because the strings are short, each individual comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all bad. In simpler terms, you are comparing nC2 = n(n-1)/2 pairs of strings, so it's just O(n^2).

I couldn't think of a way to sort them in order, as you can't write an objective comparator. But even if you could, sorting would take O(n log n) with merge sort, and since you have so many records and would probably prefer to use no extra memory, you would use quicksort, which takes O(n^2) in the worst case - no improvement over the worst-case time of brute force.

Almatadema answered 25/8, 2011 at 19:38 Comment(5)
I guess my "check-string" was my hope of an objective comparator. Do you think this isn't possible? I'm not sure about my own doubt mentioned in the question - is it true that such a comparator method could miss a similar pair? Also, thanks for the complexity times, I wasn't sure I was using the right terms. But even though the complexity is no improvement, wouldn't actual run times be improved? Because brute force has to do intensive string similarity calculations each cycle, but the sort is just comparing a similarity rating to the objective comparator, a much faster computation.Edva
Well, I couldn't think of a comparator. Consider three strings abcd, efcd and abef: any two of them are exactly 2 characters apart, so in which order would you place them? All 6 permutations of them make sense. And yes, if you come up with a really smart comparator, then you can do O(n log n) in the average case. I couldn't think of one; maybe you need to try some comparators on a small dataset and observe. One comparator is edit distance: en.wikipedia.org/wiki/Edit_distanceAlmatadema
Well, given "abcd", "efcd", and "abef", I was thinking of comparing them against some kind of "neutral", arbitrary check-string that acts as a third party to the comparison. For example, compare each value to "abcdef", using pair distance as the comparison function (catalysoft.com/articles/StrikeAMatch.html). "abcd" shares three pairs with it, while "efcd" and "abef" share two. So then you grab all the values with two shared pairs and run a precise comparison between them. Maybe rinse and repeat with a different check-string for greater accuracy.Edva
Yes sir, that's definitely a way to do things - it's nothing but running the linear-time algorithm you mentioned in your question multiple times.Almatadema
So would you have any advice as to how to formulate this "check-string" most effectively? I was thinking of including every combination of letters in the alphabet and checking for similarity using the pair-distance comparison function. I think this method would be the most conservative (it wouldn't miss any possible duplicates, but would turn up lots of false positives). But that check-string would be over a thousand characters long - very inefficient, I would think.Edva
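
A small sketch of the check-string idea discussed in the comments above, using the bigram pair-distance (Dice) measure from the linked StrikeAMatch article; the bucket granularity is an arbitrary choice, and the false-negative caveat raised in the question still applies:

def bigrams(s):
    # Adjacent character pairs, e.g. 'abcd' -> {'ab', 'bc', 'cd'}.
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def pair_distance(a, b):
    # Dice coefficient over character bigrams (the "StrikeAMatch" measure).
    pa, pb = bigrams(a), bigrams(b)
    total = len(pa) + len(pb)
    return 2 * len(pa & pb) / total if total else 1.0

def group_by_check_string(values, check, bins=10):
    # Bucket values by their (rounded) similarity to one fixed check-string;
    # the precise comparison then only runs within each bucket.
    groups = {}
    for v in values:
        groups.setdefault(round(pair_distance(v, check) * bins), []).append(v)
    return groups

print(group_by_check_string(["abcd", "efcd", "abef"], check="abcdef"))
# "efcd" and "abef" land in the same bucket; note that two strings similar to each
# other can still fall into different buckets, which is the false-negative risk
# raised in the question.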

You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
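
Building the transducer proper takes a dedicated library, but the same query ("all dictionary terms within n edits of the input") can be sketched with a trie walked alongside a Levenshtein dynamic-programming row, pruning any branch whose entire row already exceeds the limit. This is a common stand-in, not the transducer construction itself; the dictionary and edit limit below are made up for illustration:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set on the node that ends a dictionary word

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def within_edits(root, query, max_edits):
    # Return every dictionary word whose Levenshtein distance to `query` is <= max_edits.
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Compute the next DP row for this trie edge (character `ch`).
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,            # insertion
                           prev_row[i] + 1,           # deletion
                           prev_row[i - 1] + cost))   # substitution / match
        if node.word is not None and row[-1] <= max_edits:
            results.append((node.word, row[-1]))
        if min(row) <= max_edits:                     # prune hopeless branches
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return results

trie = build_trie(["foobar", "ofobar", "fubar", "main st"])
print(within_edits(trie, "foobar", max_edits=2))   # the first three are within 2 edits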

Cuirassier answered 16/2, 2016 at 1:54 Comment(0)

Pairwise comparison of all the records is O(N^2), not exponential. There are basically two ways to cut down on that complexity.

The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as Locality-Sensitive Hashing. The dedupe python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
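
A bare-bones sketch of blocking with two cheap rules (the field names and the rules themselves are made up for illustration; the dedupe library's own predicates are more sophisticated):

from itertools import combinations

def block_keys(rec):
    # A pair of records is compared if it agrees on *any* rule.
    yield ("name3", rec["name"][:3].lower())
    yield ("zip", rec["zip"])

def blocked_pairs(records):
    blocks = {}
    for rec_id, rec in records.items():
        for key in block_keys(rec):
            blocks.setdefault(key, []).append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs   # only these pairs go to the expensive fuzzy matcher

records = {
    1: {"name": "Joe's Plumbing Inc.", "zip": "30301"},
    2: {"name": "Joes Plumbing Incorporated", "zip": "30301"},
    3: {"name": "Acme Corp", "zip": "10001"},
}
print(blocked_pairs(records))   # {(1, 2)}: candidates under both rules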

In the worst case, pairwise comparison with blocking is still O(N^2). In the best case it is O(N). Neither the best nor the worst case is really met in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.

There are some interesting alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worst-case complexity guarantees. See the work of Beka Steorts and Michael Wick.

Decurion answered 17/9, 2016 at 1:7 Comment(0)

I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons; it'll be deciding which comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.

It's true you have to do almost 500 billion brute-force compares for comparing a million records against themselves, but that's assuming you never skip any records previously declared a match (ie, never doing the "break" out of the j-loop in the pseudo-code below).

My pokey E-machines T6532 (2.2 GHz) manages to do 1.4 million seeks and reads per second of 100-byte text-file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:

for i = 1 to LASTREC-1
  seektorec(i)
  getrec(i) into a
  for j = i+1 to LASTREC
    getrec(j) into b
    if similarrecs(a, b) then [gotahit(); break]

Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.

But it seems likely that similarrecs() can independently compute and save the portions of a and b being compared, in which case the much more efficient approach is:

for i = 1 to LASTREC
  getrec(i) into a
  write fuzzykey(a) into scratchfile
sort scratchfile
for i = 1 to LASTREC-1
  if scratchfile(i) = scratchfile(i+1) then gotahit()

Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
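
A sketch of that second pattern in Python, with a deliberately crude fuzzykey() along the lines of the digits-plus-sorted-consonants normalization suggested further down in the comments; the key length and normalization here are arbitrary, and choosing a key that matches your definition of "duplicate" is the hard part:

import re

def fuzzykey(record):
    # Crude canonical key: digits kept, letters reduced to sorted consonants,
    # truncated to 8 characters (all arbitrary choices for illustration).
    s = record.upper()
    digits = "".join(re.findall(r"\d", s))
    consonants = "".join(sorted(c for c in s if c.isalpha() and c not in "AEIOU"))
    return (digits + consonants)[:8]

def flag_duplicates(records):
    # Compute one key per record, sort once, and flag runs of equal adjacent keys.
    keyed = sorted((fuzzykey(text), rec_id) for rec_id, text in records)
    return [(id1, id2)
            for (k1, id1), (k2, id2) in zip(keyed, keyed[1:])
            if k1 == k2]

records = [
    (1, "PO BOX 1, Springfield"),
    (2, "BOX 1 Springfield"),
    (3, "742 Evergreen Terrace"),
]
print(flag_duplicates(records))   # [(1, 2)]: the two box addresses collapse to the same key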

In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.

Oblation answered 27/8, 2011 at 2:3 Comment(7)
What language / environment are you using to achieve 1.4m seeks & reads / second? The first code sample you posted is my current approach right now. Running on my secondary laptop (an admittedly underpowered setup) I estimated a one month run time to calculate the 500 billion comparisons. Also, I don't understand your second code sample - what does the fuzzykey() function call write to the scratchfile? Why did you sort the scratchfile? A comparison of adjacent strings sorted alphabetically (your scratchfile) would miss duplicates off by only transposed characters (i.e. "foobar" vs. "ofobar")Edva
Also, it is unlikely (based on my knowledge of the data I'm working with) that the actual duplicate count is a large enough percentage of all the records in the DB that de-duplication on a rough first pass would significantly reduce the number of records to the point of making future brute force passes noticeably faster.Edva
I'm using Delphi, but the speed comes from using text files + avoiding database drivers. What to use for fuzzykey() is the $64 question. It's the criteria by which you decide two records are duplicates, and is the problem you'll end up spending most of your time on. Per the merge/purge strategy/tactics in the linked document, fuzzykey() might be SortedSoundex for names, CASS-scrubbed output for addresses, a digits-only filter for phone numbers, etc. Sorting clumps keys together. Eg, the SortedSoundex of "foobar" and "ofobar" are both the same, the CASS of PO BOX 1 and BOX 1 are the same, etc.Oblation
If your list is indeed names and addresses, I wouldn't touch the job unless I were allowed to CASS-certify the addresses. I'd (1) export RECORD ID, LASTNAME, ADDRESS, ZIP to a fixed-field text file, along with space for GROUP and MEMBER fields for use by DUPDTECT at semaphorecorp.com/mpdd/dupdtect.html, (2) CASS all the addresses, (3) let DUPDTECT assign group and member numbers of all dupes based on same name/address/ZIP, then (4) import the RECORD ID/GROUP/MEMBERS for review or deletions. If phone numbers are available, I'd include those to help the dupe detection quality.Oblation
CASS is not available for this job. Soundex and other pronunciation based fuzzy matching algos are out (i.e. a company name field where "Joe's Plumbing Inc." and "Joe's Plumbing Incorporated" are duplicates, and any situation in which words are out of order - pronunciation would miss this, wouldn't it?). Choosing a fuzzy matching algo is not the difficult part right now - either pair distance or Levenshtein probably. I just need a fast way to run the comparisons. Also, "I wouldn't touch the job" isn't an option.Edva
You can do quick + dirty address "standardization" instead of CASS by simply extracting only the digits + consonants from the address, optionally sorting them, then using only the 1st n chars. However, the more crude your technique, the more matches you'll miss. I'd rather spend $100 to acquire the appropriate proven tool (CASS) instead of spending forever attempting to reinvent that wheel (and probably not doing as good a job). Note that "Joe's Plumbing Inc." and "Joe's Plumbing Incorporated" ARE soundex dupes, but their edit distance is relatively big. 1 algo isn't going to work for all cases.Oblation
By definition, inter-record computations like edit distance between names are always going to force large numbers of record compares because 2 recs with small edit distances can still sort very far apart (eg, 2 identical names except one is prepended with the typo "A" and the other prepended with the typo "Z".) I'd still try to first throw out the easy-to-match records via one pass of finding equal fuzzy keys before comparing all records with each other for edit distance. To see typical approaches in the industry, census bureau, etc, google "record linking".Oblation

Equivalence relations are particularly nice kinds of matching; they satisfy three properties:

  • reflexivity: for any value A, A ~ A
  • symmetry: if A ~ B, then necessarily B ~ A
  • transitivity: if A ~ B and B ~ C, then necessarily A ~ C

What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set are related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
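
A small union-find sketch of that partitioning step, assuming the pairwise matches have already been computed by whatever relation ~ you settle on (the example words and matches below are just for illustration):

def dedupe_by_equivalence(ids, matched_pairs):
    # Union-find: merge ids connected by matches, then keep one representative per set.
    # Applied to a non-transitive fuzzy matcher, this computes the transitive closure
    # of "matches", which can chain together records that are not directly similar.
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in matched_pairs:
        union(a, b)

    representatives = {}
    for i in ids:
        representatives.setdefault(find(i), i)   # first member seen represents its set
    return list(representatives.values())

print(dedupe_by_equivalence(["goo", "foo", "food"], [("goo", "foo"), ("foo", "food")]))
# ['goo'] - the transitive closure lumps all three words into one set, even though
# "goo" and "food" are 2 edits apart: exactly the dilemma described in this answer.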

Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set are equivalent to values from another set, or that some values from within a single set are not equivalent.

So, what behavior do you want, exactly, in these situations?

Teacher answered 25/8, 2011 at 20:10 Comment(5)
I'm not sure I understood your answer, so sorry if this isn't what you are looking for! So what you're saying is: because fuzzy matching fails transitivity, when I divide up my data based on the results of some relationship between the records, there is no guarantee that the groups created will encompass all pairs that are in fact similar (satisfy that relationship); or, all pairs that satisfy it will be present, but some false positives will be as well. I prefer accuracy in this case, so if there is a choice, I'd go for the latter. But I probably misunderstood your answer :)Edva
Here's the simplest example illustrating the dilemma. Suppose ~ is "are edit-distance 1 apart". We have the words "goo", "foo", and "food". How should they be de-duplicated? There are no duplicates in the result {"goo", "food"} (and the remainder, "foo", is a duplicate of at least one of these); there are no duplicates in the result {"foo"} (and the remainders, "goo" and "food", are each duplicates of at least one of these); but these two results are significantly different. Which one do you prefer? Can you say carefully why you prefer that one?Teacher
Given the words "goo", "foo", and "food", and a "duplicate" threshold of 1 edit distance, it seems to me that "goo" and "foo" should match (aka be marked as possible duplicates) and "foo" and "food" should as well, but "goo" and "food" should not be marked as duplicates, because they are 2 edit-distance apart. But this seems relatively straightforward, so I think I might not be understanding your comment (once again!). Thanks for your patience explaining this, I really appreciate it!Edva
I agree with the matches you claim. At the end of the deduplication process, which ones would have been removed based on those matches?Teacher
I should have made this part more clear (I will edit the question to include this): I don't plan on deleting any records automatically. I only want to flag records that are above a certain similarity threshold (i.e. a Levenshtein edit distance < 3) as potential duplicates. I have a separate method of storing these flags - I just need a fast method of calculating the similarity metric between all the records (technically, all possible combinations of the records). The idea is to eventually flag these for human review, but the existing 63m records need to be reduced to a manageable size first.Edva
