How to find the closest pairs (Hamming distance) of binary strings in Ruby without O(n²) issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256-bit bin of 1s and 0s, like:

0110101010101010110101010101

Ideally, I'd like to query for near binary matches: that is, given two documents, I want to know in how many bit positions their strings differ. Yes, this is Hamming distance.

This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.

So, given this, I am trying to find a way to avoid doing individual Hamming distance comparisons between every pair of documents; that makes the time to do this basically impossible.

I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of tree structures that could reduce the number of comparisons I'd need to make, though I can't seem to make any of them work (yet).

Ideally, I'd like to make 1 million queries, find the near-duplicate strings, and be able to update them to reflect that.

Anyone's thoughts would be appreciated.

Ambiguous answered 4/1, 2012 at 21:6 Comment(2)
Does MongoDB have an xor and "bit counting" function? If yes, you could xor each number with what you want and count the number of bits set to 1; the lowest results would then be the lowest Hamming distances. – Orpington
Nope. It doesn't. Or, I'd have gone for that. It's on the roadmap though, as far as I know... – Ambiguous

I ended up retrieving all the documents into memory (a subset with just the id and the string).

Then, I used a BK-tree to compare the strings.
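
For illustration, here is a minimal BK-tree sketch in Ruby for Hamming distance over fixed-length bit strings. The class and method names are made up here; this shows the approach, not necessarily the exact code used.

    # BK-tree keyed by Hamming distance: each node stores its children in a
    # hash keyed by their distance from the node, and queries prune subtrees
    # with the triangle inequality.
    class BKTree
      Node = Struct.new(:word, :children)   # children: { distance => Node }

      def initialize
        @root = nil
      end

      def hamming(a, b)
        a.chars.zip(b.chars).count { |x, y| x != y }
      end

      def insert(word)
        if @root.nil?
          @root = Node.new(word, {})
          return
        end
        node = @root
        loop do
          d = hamming(word, node.word)
          return if d.zero?                 # duplicate string, skip
          child = node.children[d]
          if child
            node = child
          else
            node.children[d] = Node.new(word, {})
            return
          end
        end
      end

      # All stored words within `radius` of `word`, with their distances.
      def query(word, radius)
        return [] if @root.nil?
        results = []
        stack = [@root]
        until stack.empty?
          node = stack.pop
          d = hamming(word, node.word)
          results << [node.word, d] if d <= radius
          # Triangle inequality: only subtrees whose edge label lies in
          # [d - radius, d + radius] can contain matches.
          ((d - radius)..(d + radius)).each do |k|
            child = node.children[k]
            stack << child if child
          end
        end
        results
      end
    end

    # Usage: insert every string once, then query each document's string for
    # near duplicates, e.g. tree.query(doc_bits, 5).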

Ambiguous answered 5/1, 2012 at 19:13 Comment(0)

The Hamming distance defines a metric space, so you could use the O(n log n) algorithm for finding the closest pair of points, which follows the typical divide-and-conquer pattern.

You can then apply this repeatedly until you have "enough" pairs.

Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.

Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
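
Here is a minimal sketch of that give-up rule for the Hamming case, assuming fixed-length '0'/'1' strings: recursively split both groups on successive bit positions (effectively a bit trie) and abandon a pair of branches as soon as the number of split bits on which they differ exceeds the threshold. This only illustrates the pruning idea, not necessarily the exact algorithm described above, and the names are made up.

    # Report [index_a, index_b, distance] for every pair of strings whose
    # Hamming distance is at most max_dist, pruning whole branch pairs as soon
    # as they are known to differ in more than max_dist split bits.
    def pairs_within(strings, max_dist)
      entries = strings.each_with_index.to_a    # keep the original indexes
      results = []
      walk = lambda do |group_a, group_b, pos, dist|
        return if dist > max_dist || group_a.empty? || group_b.empty?
        if pos == strings.first.length
          # All strings in group_a are now identical, as are those in group_b,
          # and any cross pair differs in exactly `dist` bit positions.
          group_a.product(group_b) do |(_, ia), (_, ib)|
            results << [ia, ib, dist] if ia < ib   # report each pair once
          end
          return
        end
        a0, a1 = group_a.partition { |s, _| s[pos] == '0' }
        b0, b1 = group_b.partition { |s, _| s[pos] == '0' }
        walk.call(a0, b0, pos + 1, dist)        # same bit here: distance unchanged
        walk.call(a1, b1, pos + 1, dist)
        walk.call(a0, b1, pos + 1, dist + 1)    # bits differ here: one more mismatch
        walk.call(a1, b0, pos + 1, dist + 1)
      end
      walk.call(entries, entries, 0, 0)
      results
    end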

Vengeance answered 4/1, 2012 at 21:18 Comment(6)
That depends on what exactly the OP wants to do, which isn't completely clear. There are e.g. better ways to find the n closest pairs, etc., but yes, in general you're right that applying it repeatedly may be the wrong thing to do. – Vengeance
Hey, where did that other comment go? :) – Vengeance
What other comment? So, I'm trying to find near duplicates. The bits represent features in n-dimensional space (256 in this case). If you have a document of text, and you have a suggestion as to a better way to find near, but not exact, duplicates, I'm all ears. :-) – Ambiguous
There was another comment on my answer; I was referring to that. The O(n log n) algorithm is the best way to find the closest pair; that's a clear definition. You may want to describe in more detail what you mean by a "close duplicate" if the closest pair doesn't cut it for you, i.e. pairs closer than some number n, the 100 closest pairs, etc. – Vengeance
Basically, I'm looking for documents that have no more than 5% or so of the content duplicated. So, if that was 100% of the documents, I'd want all of them. But it's unlikely; most of the time there are no duplicates. – Ambiguous
@Vengeance I'm not sure how you can use the Hamming distance to sort the files. The distance only tells you by how many bits file A differs from file B. You can't use it to say whether A is before or after B. If you try to sort all the distances between all pairs of files, then you end up with the brute-force approach that Williamf is trying to avoid. – Wellrounded

As far as I could understand, you have an input string X and you want to query the database for documents containing a string field b such that the Hamming distance between X and document.b is less than some small number d.

You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
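
A minimal sketch of that early bail-out (the names here are just illustrative):

    # True when Hamming(a, b) < d, matching the convention above: stop as soon
    # as the d-th mismatch is found instead of always comparing all 256 chars.
    def within_distance?(a, b, d)
      mismatches = 0
      a.length.times do |i|
        if a[i] != b[i]
          mismatches += 1
          return false if mismatches >= d
        end
      end
      true
    end

    # Linear scan over the documents pulled from Mongo:
    # close = docs.select { |doc| within_distance?(x, doc['b'], d) }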

You can try to scan fewer than N documents, that is, to get better than linear time.

Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you only need to query documents whose number of ones is close enough to ones(X); specifically, ones(X) - d <= document.ones_count <= ones(X) + d. The Mongo index should kick in here.
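
A sketch of this with the Ruby mongo driver, assuming the bit string lives in the field b as above; the connection details, database, and collection names below are placeholders.

    require 'mongo'

    client = Mongo::Client.new(['127.0.0.1:27017'], database: 'bins')
    docs   = client[:documents]

    # One-time pass: store the number of 1s of each bit string and index it.
    docs.find.each do |doc|
      docs.update_one({ _id: doc['_id'] },
                      { '$set' => { ones_count: doc['b'].count('1') } })
    end
    docs.indexes.create_one(ones_count: 1)

    x = '0110' * 64           # stand-in for the 256-character query string X
    d = 5                     # maximum allowed Hamming distance
    ones_x = x.count('1')

    # Any string within Hamming distance d of x has a 1-count within d of
    # ones(x), so this indexed range query cheaply prunes most candidates.
    candidates = docs.find(ones_count: { '$gte' => ones_x - d,
                                         '$lte' => ones_x + d })

    # The survivors still need an exact Hamming check in the application layer.
    close = candidates.select do |doc|
      doc['b'].chars.zip(x.chars).count { |a, b| a != b } < d
    end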

If you want to find all close enough pairs in the set, see @Philippe's answer.

Start answered 4/1, 2012 at 22:27 Comment(3)
I'm doing something like this now, and it seems to work reasonably well, but it is very slow. – Ambiguous
I thought the problem was that you didn't know which X to compare all the other ones to, hence the n^2 complexity? – Vengeance
So, today, I do this NOT using simhash/minhash, but by using the words on the page themselves. I grab a word and go through, eliminating documents that don't contain it. This is the slow way. I am hoping to do it more quickly than this, and that's what I'm referring to. – Ambiguous

This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.

You could also try to work with smaller chunks. Instead of dealing with 256-bit sequences, could you treat each one as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences with a lookup table and use that as a sort of index.
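
For example, a sketch of the 16-bit-chunk variant with a precomputed popcount table (the chunk size and names are just one possible choice):

    # Popcounts for every 16-bit value, built once (65,536 entries).
    POPCOUNT16 = Array.new(1 << 16) { |v| v.to_s(2).count('1') }

    # Hamming distance between two 256-character '0'/'1' strings, 16 bits at a
    # time: xor each pair of chunks and look up how many bits differ.
    def hamming_chunked(a_bits, b_bits)
      (0...16).sum do |i|
        a = a_bits[i * 16, 16].to_i(2)
        b = b_bits[i * 16, 16].to_i(2)
        POPCOUNT16[a ^ b]
      end
    end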

Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.
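
For instance, if "different" means a Hamming distance of at most 1, a sketch of that keyed search would be to generate every single-bit flip of the query string and look up exact matches (the Mongo call is only illustrative):

    # The original string plus one variant per flipped bit position, i.e. all
    # strings at Hamming distance 0 or 1 from `bits`.
    def variants_within_one(bits)
      variants = [bits]
      bits.length.times do |i|
        v = bits.dup
        v[i] = (v[i] == '0' ? '1' : '0')
        variants << v
      end
      variants
    end

    # Exact keyed lookups instead of distance computations, e.g. with Mongo:
    # hits = docs.find(b: { '$in' => variants_within_one(x) }).to_a

The number of variants grows very quickly with the allowed distance, so this only pays off for very small radii.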

Origan answered 4/1, 2012 at 21:20 Comment(0)
