Efficiently find binary strings with low Hamming distance in large set

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

The actual data structure to hold the list is open; performance requirements dictate an in-memory solution. The cost to build the data structure is secondary; low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small)

And input: 
00001000100000000000000001111101

The values:
01001000100000000000000001111101 
00001000100000000010000001111101 

should match because there is only 1 position in which the bits are different.

11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.

How can I efficiently (without scanning the entire list) discover list members within a Hamming Distance > 1 of the input?
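
For illustration, a rough sketch of the distance-0/1 approach described above (purely illustrative; list and n stand for the sorted array and its length):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

static int contains(const uint32_t *list, size_t n, uint32_t v)
{
    return bsearch(&v, list, n, sizeof v, cmp_u32) != NULL;
}

/* Print every list member within Hamming distance 1 of input. */
static void query_dist1(const uint32_t *list, size_t n, uint32_t input)
{
    if (contains(list, n, input))                 /* distance 0 */
        printf("%08x\n", input);
    for (int bit = 0; bit < 32; bit++) {          /* distance 1: flip each bit */
        uint32_t candidate = input ^ (UINT32_C(1) << bit);
        if (contains(list, n, candidate))
            printf("%08x\n", candidate);
    }
}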

Fungosity answered 17/6, 2011 at 18:1 Comment(5)
How about mutating the criteria by the expected hamming distance? A recursive function can do that. The next step would be to take the union of the resulting lists.Fat
Here's a recent paper on this problem: Large scale Hamming distance query processing.Ramification
@Eric You said "For a maximum Hamming Distance of 1 (values typically will be quite small)". Can you tell what "quite small" meant?Unsparing
@Eric Also, were the ~100 million numbers all unique, or were there duplicates?Unsparing
@StefanPochmann: There are no duplicates. The largest distance of interest would be 4-5.Fungosity

Question: What do we know about the Hamming distance d(x,y)?

Answer:

  1. It is non-negative: d(x,y) ≥ 0
  2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
  3. It is symmetric: d(x,y) = d(y,x)
  4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric for a metric space. There are algorithms for indexing metric spaces.

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.

Footnote: If you are comparing the Hamming distance of fixed width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (manual) you do this:

static inline int distance(unsigned x, unsigned y)
{
    return __builtin_popcount(x^y);
}

If you then inform GCC that you are compiling for a computer with SSE4a, then I believe that should reduce to just a couple opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
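
For reference, the mask/shift/add routine referred to here looks roughly like this (the standard SWAR bit-counting trick; illustrative, not the exact code that was benchmarked):

#include <stdint.h>

static inline int popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* per-2-bit counts */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* per-nibble counts */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* per-byte counts */
    return (int)((x * 0x01010101u) >> 24);            /* sum the 4 bytes */
}

static inline int distance_swar(uint32_t x, uint32_t y)
{
    return popcount32(x ^ y);
}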

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
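
To make the BK-tree query concrete, here is a rough sketch of the recursion (illustrative only, not the benchmarked code; the node layout with one child pointer per possible distance is an assumption, and emit is a placeholder callback):

#include <stdint.h>
#include <stddef.h>

struct bk_node {
    uint32_t value;
    struct bk_node *child[33];   /* child[d] holds points at distance d from value */
};

static int hamming(uint32_t x, uint32_t y)
{
    return __builtin_popcount(x ^ y);   /* GCC/Clang builtin, as above */
}

/* Report every value in the subtree within limit of query. By the triangle
 * inequality, only children whose edge distance lies in [d - limit, d + limit]
 * can contain matches, so the rest of the shells are pruned. */
static void bk_query(const struct bk_node *node, uint32_t query, int limit,
                     void (*emit)(uint32_t))
{
    if (node == NULL)
        return;
    int d = hamming(node->value, query);
    if (d <= limit)
        emit(node->value);
    int lo = d - limit < 0 ? 0 : d - limit;
    int hi = d + limit > 32 ? 32 : d + limit;
    for (int k = lo; k <= hi; k++)
        bk_query(node->child[k], query, limit, emit);
}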

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

  • Database: 100M pseudorandom integers
  • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
  • Results: Average # of query hits (very approximate)
  • Speed: Number of queries per second
  • Coverage: Average percentage of database examined per query
                -- BK Tree --   -- VP Tree --   -- Linear --
Dist    Results Speed   Cov     Speed   Cov     Speed   Cov
1          0.90 3800     0.048% 4200     0.048%
2         11     300     0.68%   330     0.65%
3        130      56     3.8%     63     3.4%
4        970      18    12%       22    10%
5       5700       8.5  26%       10    22%
6       2.6e4      5.2  42%        6.0  37%
7       1.1e5      3.7  60%        4.1  54%
8       3.5e5      3.0  74%        3.2  70%
9       1.0e6      2.6  85%        2.7  82%
10      2.5e6      2.3  91%        2.4  90%
any                                             2.2     100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
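
For example, a leaf could be as simple as this (again, just a sketch):

#include <stdint.h>
#include <stddef.h>

struct leaf {
    size_t count;
    uint32_t points[];   /* flexible array member, filled at build time */
};

static void leaf_query(const struct leaf *leaf, uint32_t query, int limit,
                       void (*emit)(uint32_t))
{
    for (size_t i = 0; i < leaf->count; i++)          /* plain linear scan */
        if (__builtin_popcount(leaf->points[i] ^ query) <= limit)
            emit(leaf->points[i]);
}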

Hag answered 17/6, 2011 at 19:7 Comment(22)
Hooray! My 10k rep is here ;-)Hag
I considered the metric space, but I dismissed it when I realized how close everything is together. Clearly, BK-tree is just brute force, and so it won't be an optimization. M-tree and VP-tree won't be an optimization either because of how close everything is together. (A hamming distance of 4 corresponds to a distance of two, whereas a hamming distance of 2 corresponds to a distance of root two.)Xavier
I don't think cover tree will help either because of the 2^i distance requirement. It's a nice idea, but I don't think it helps with this problem.Xavier
I don't see what makes a BK tree "brute force". Also, I don't know where you get the "root two" from... if you use the Hamming distance as your metric, the Hamming distance is your distance.Hag
Oh, I see, I assumed you were embedding the points into Z^32. I assumed that the BK-tree was brute force because: "The k-th subtree is recursively built of all elements b such that d(a,b) = k". I need to read up on them. Do you know of a good reference?Xavier
E.g., for a BK-tree (IIRC), you search for an object x near object y such that d(x,y) ≤ k by recursively following edges from the current node to its children which match, because all matches will be children of matching nodes or of the root node. This is clearly not brute force.Hag
Unfortunately, I don't have good references. I've been meaning to buy a book or two on spatial indexing but I haven't gotten around to it. At least you know what the data structures are called now ;-)Hag
Just as a further hint, the Hamming distance for fixed-width strings is also called the L1 norm, which makes it a very extensively studied metric space both in Mathematics and Computer Science.Hag
Right, I had assumed without reading that you were using an L2 normXavier
Anyway, I just figured out how the BK-tree works. It looks like they work well when the distance you want is small, and when the chosen root is close to the element that you're querying. For this problem, I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.Xavier
@Neil: I think you can get something like that with a VP tree, but the performance difference is not amazing. See edited answer.Hag
Hamming distance for fixed-size integers is identical to the L1 norm, if you consider integers to be strings of bits. Otherwise, the "standard" L1 norm between two strings is the sum of positive distances between the elements.Exclude
Can you post the code you used for the benchmark on github or something?Hostility
@DietrichEpp This is one of the most amazing answers I've ever found on SO. I was about to ask how long it takes to build the index, but then I saw you posted the code. Answer: on a 3.5Ghz i7-3770K, a 1M item BK Tree is built in 0.034s, and a 100M item BK tree is built in 13s. VP trees take about 4x longer to build, and make my fans start spinning loudly.Superbomb
@DietrichEpp This looks like a terrible answer. The question is about efficiency and all three of your analyzed solutions look very inefficient. And BK and VP are very complicated. Take distance 5. For a given target number, there are only 242825 numbers at distance 5 or less. They can easily and efficiently be generated and checked (e.g., against a simple bitset of 0.5 GB). That's only 0.24%, much better than the 26%, 22% and 100% by your solutions.Unsparing
@StefanPochmann: You seem to have confused the "Add Another Answer" button with the "Add Comment" button. Look at the bottom of the page, you'll find the "Add Another Answer" button there.Hag
@DietrichEpp You seem to have missed the point of something yet again.Unsparing
@StefanPochmann: Honestly, if you have a better answer, just write it as an answer. "I have a better answer" is fantastic, that's what this site is designed to support. "Your answer is terrible" is a different thing entirely, and piggybacking on a top-rated answer because you have a better one is far from ideal because all it does is bury your answer (which sounds like a good one!) in the comments.Hag
@StefanPochmann: You may want to look at the stackoverflow.com/tour page, which explains the purpose of answers and comments. "Use comments to ask for more information or clarify a question or answer."Hag
@DietrichEpp No, clearly that tour sentence is aimed at new members and their use cases and kept short instead of complete. Check the help page about commenting which says "You should submit a comment if you want to: [...] Leave constructive criticism that guides the author in improving the post". That's what I've done.Unsparing
@StefanPochmann: That's not really an "improvement", it's a completely different answer, completely unrelated to the answer here. Incorporating the other solution with comparative benchmarks would take a non-trivial amount of time. While it may be interesting, it would take a couple hours at least and if you're that interested, why not do it yourself? As for your answer getting buried—in my personal experience, if you take a question with a high-voted answer and post a better one years later, the new answer will get voted to the top.Hag
Let us continue this discussion in chat.Hag

I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.

For example for maximum distance 5, this is 242,825 numbers (Σ_{d=0..5} C(32, d)). For comparison, Dietrich Epp's VP-tree solution for example goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
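
A minimal sketch of the idea (illustrative and simpler than the linked code; emit stands for whatever you do with a hit):

#include <stdint.h>
#include <stdlib.h>

static uint8_t *table;                      /* 2^32 bits = 2^29 bytes = 512 MB */

static void table_build(const uint32_t *input, size_t n)
{
    table = calloc((size_t)1 << 29, 1);
    for (size_t i = 0; i < n; i++)
        table[input[i] >> 3] |= (uint8_t)(1u << (input[i] & 7));
}

static int table_has(uint32_t v)
{
    return (table[v >> 3] >> (v & 7)) & 1;
}

/* Report every list member within limit bit flips of v.
 * Call as query_rec(query, max_distance, 0, emit). */
static void query_rec(uint32_t v, int limit, int from, void (*emit)(uint32_t))
{
    if (table_has(v))
        emit(v);
    if (limit == 0)
        return;
    for (int bit = from; bit < 32; bit++)   /* flip one more bit, left to right */
        query_rec(v ^ (UINT32_C(1) << bit), limit - 1, bit + 1, emit);
}

Flipping only bits at positions at or above from guarantees each candidate value is generated exactly once, which is also why the query results contain no duplicates (see the caveat below).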

I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:

Dist     BK Tree     VP Tree         Bitset   Linear

   1   10,133.83   15,773.69   1,905,202.76   4.73
   2      677.78    1,006.95     218,624.08   4.70
   3      113.14      173.15      27,022.32   4.76
   4       34.06       54.13       4,239.28   4.75
   5       15.21       23.81         932.18   4.79
   6        8.96       13.23         236.09   4.78
   7        6.52        8.37          69.18   4.77
   8        5.11        6.15          23.76   4.68
   9        4.39        4.83           9.01   4.47
  10        3.69        3.94           2.82   4.13

Prepare     4.1s       21.0s          1.52s  0.13s
times (for building the data structure before the queries)

For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.

I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries and as many queries as it can in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc. queries, until at least 10 seconds have passed in total). That's fairly stable; I even get similar numbers for just 1 second.

My CPU is an i7-6700. My code (based on Dietrich's) is here (ignore the documentation there at least for now, not sure what to do about that, but the tree.c contains all the code and my test.bat shows how I compiled and ran (I used the flags from Dietrich's Makefile)). Shortcut to my solution.

One caveat: My query results contain numbers only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution might be good for people who either have no duplicates in the input or don't want or need duplicates in the query results (I think it's likely that the pure query results are only a means to an end and then some other code turns the numbers into something else, for example a map mapping a number to a list of files whose hash is that number).

Unsparing answered 21/9, 2016 at 4:20 Comment(1)
The largest distance of interest would probably be 4-5, so this solution is very interesting. There are no duplicates in the actual domain that inspired the question.Fungosity

A common approach (at least common to me) is to divide your bit string into several chunks and query on these chunks for an exact match as a pre-filter step. If you work with files, you create as many files as you have chunks (e.g. 4 here), with each chunk permuted to the front, and then sort the files. You can use a binary search, and you can even expand your search above and below a matching chunk as a bonus.

You can then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.

So to recap: Say you have a bunch of 32-bit strings in a DB or in files and that you want to find every hash that is within a Hamming distance of 3 or less of your "query" bit string:

  1. create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4. Or, if you use files, create four files, each being a permutation of the slices with one "islice" at the front of each "row"

  2. slice your query bit string the same way in qslice 1 to 4.

  3. query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. Because any string within 3 bits of the query must agree exactly with it on at least one of the four slices (4 slices, at most 3 differing bits), this is guaranteed to return every true match, along with extra candidates that step 4 filters out. If using a file, do a binary search in each of the four permuted files for the same results.

  4. for each returned bit string, compute the exact hamming distance pair-wise with your query bit string (reconstructing the index-side bit strings from the four slices, either from the DB or from a permuted file)

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table, and it is very efficient in practice. Furthermore, it is easy to shard the files into smaller files as needed for more speed using parallelism.
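
Here is a rough in-memory sketch of steps 1-4 for the 32-bit case (illustrative only, not the Moz/simhash code; it keeps four copies of the data, each sorted by one byte slice, and verifies candidates with popcount):

#include <stdint.h>
#include <stdlib.h>

static uint32_t *by_slice[4];          /* four sorted copies of the dataset */
static size_t    n_values;
static int       sort_slice;           /* slice the comparator keys on */

static unsigned slice_of(uint32_t v, int s) { return (v >> (8 * s)) & 0xFFu; }

static int cmp_by_slice(const void *a, const void *b)
{
    unsigned x = slice_of(*(const uint32_t *)a, sort_slice);
    unsigned y = slice_of(*(const uint32_t *)b, sort_slice);
    return (int)x - (int)y;
}

static void build(const uint32_t *input, size_t n)
{
    n_values = n;
    for (int s = 0; s < 4; s++) {
        by_slice[s] = malloc(n * sizeof *input);
        for (size_t i = 0; i < n; i++)
            by_slice[s][i] = input[i];
        sort_slice = s;
        qsort(by_slice[s], n, sizeof *input, cmp_by_slice);
    }
}

/* First index in by_slice[s] whose slice-s byte is >= key. */
static size_t lower_bound(int s, unsigned key)
{
    size_t lo = 0, hi = n_values;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (slice_of(by_slice[s][mid], s) < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

static void query(uint32_t q, int limit, void (*emit)(uint32_t))
{
    for (int s = 0; s < 4; s++) {
        unsigned key = slice_of(q, s);
        for (size_t i = lower_bound(s, key);
             i < n_values && slice_of(by_slice[s][i], s) == key; i++) {
            uint32_t v = by_slice[s][i];
            /* Report each match only from the first slice it agrees on,
             * so it is not emitted again from a later copy. */
            int first = 0;
            while (slice_of(v, first) != slice_of(q, first))
                first++;
            if (first == s && __builtin_popcount(v ^ q) <= limit)
                emit(v);
        }
    }
}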

Now of course in your case, you are looking for a self-join of sorts, that is, all the values that are within some distance of each other. The same approach still works IMHO, though you will have to expand up and down from a starting point for permutations (using files or lists) that share the starting chunk, and compute the hamming distance for the resulting cluster.

If running in memory instead of from files, your 100M 32-bit strings data set would be in the range of 400 MB, so the four permuted lists may need about 1.6 GB+ of RAM. That said, I get excellent results with memory-mapped files instead, and much less RAM, for similar-size datasets.

There are open source implementations available. The best in the space is IMHO the one done for Simhash by Moz, written in C++ but designed for 64-bit strings rather than 32-bit.

This bounded Hamming distance approach was first described, AFAIK, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:

  1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...]

Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search in order of the length of the longest prefix that matches q.

Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

  1. THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint.)

[...]

Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few 2^d bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d′ least-significant bit-positions).

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d′ bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next Section.

Note: I posted a similar answer to a related DB-only question

Kulp answered 27/11, 2017 at 19:42 Comment(0)

You could pre-compute every possible variation of your original list within the specified hamming distance, and store it in a bloom filter. This gives you a fast "NO" but not necessarily a clear answer about "YES."

For YES, store a list of all the original values associated with each position in the bloom filter, and go through them one at a time. Optimize the size of your bloom filter for speed / memory trade-offs.
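
A sketch of the filter part only (illustrative: the filter size and the hash mixing are arbitrary choices here, and the per-position lists of original values needed for the YES path are omitted; the build step would insert every value within the chosen distance of every list member, using the same kind of bit-flip enumeration shown in the bitset answer above):

#include <stdint.h>
#include <stdlib.h>

#define FILTER_BITS (UINT64_C(1) << 33)     /* 1 GiB of filter bits, pick to taste */
#define NUM_HASHES  7

static uint8_t *filter;

static void bloom_init(void) { filter = calloc(FILTER_BITS / 8, 1); }

static uint64_t mix64(uint64_t x)           /* splitmix64-style finalizer */
{
    x ^= x >> 30; x *= UINT64_C(0xbf58476d1ce4e5b9);
    x ^= x >> 27; x *= UINT64_C(0x94d049bb133111eb);
    return x ^ (x >> 31);
}

static void bloom_add(uint32_t v)
{
    uint64_t h1 = mix64(v), h2 = mix64(h1) | 1;     /* double hashing */
    for (int i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % FILTER_BITS;
        filter[bit >> 3] |= (uint8_t)(1u << (bit & 7));
    }
}

static int bloom_maybe_contains(uint32_t v)
{
    uint64_t h1 = mix64(v), h2 = mix64(h1) | 1;
    for (int i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = (h1 + (uint64_t)i * h2) % FILTER_BITS;
        if (!((filter[bit >> 3] >> (bit & 7)) & 1))
            return 0;                       /* definite NO */
    }
    return 1;                               /* maybe YES: check the stored originals */
}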

Not sure if it all works exactly, but seems like a good approach if you've got runtime RAM to burn and are willing to spend a very long time in pre-computation.

Platelet answered 17/6, 2011 at 18:52 Comment(1)
Isn't "NO" going to be very unlikely? About 2 percent of all possible values are present.Xavier

How about sorting the list and then doing a binary search in that sorted list for each of the possible values within your Hamming Distance?

Eyetooth answered 17/6, 2011 at 18:39 Comment(4)
For a hamming distance of 1, that is reasonable since there are 32 permutations of the original input (flip each bit in the original input once). For a hamming distance of 2, there are many more permuted input values that would have to be searched for.Fungosity
1024+32+1 searches is not a terribly large number of binary searches. Even 32^3 searches is not that many.Tse
@EricJ - There are, however, 100m pieces of data. It's still reasonable - given that the poster states "cost to build data structure is secondary" - for a reasonable hamming distance.Eyetooth
See bit-string-nearest-neighbour-searching, which uses various sorts, then binary search.Lymphangial

One possible approach to solving this problem is using a Disjoint-set data structure. The idea is to merge list members with Hamming distance <= k into the same set. Here is the outline of the algorithm:

  • For each list member calculate every possible value with Hamming distance <= k. For k=1, there are 32 values (for 32-bit values). For k=2, 32 + 32*31/2 values.

    • For each calculated value, test if it is in the original input. You can use an array with size 2^32 or a hash map to do this check.

    • If the value is in the original input, do a "union" operation with the list member.

    • Keep the number of union operations executed in a variable.

You start the algorithm with N disjoint sets (where N is the number of elements in the input). Each time you execute a union operation, you decrease the number of disjoint sets by 1. When the algorithm terminates, the disjoint-set data structure will have all the values with Hamming distance <= k grouped in disjoint sets. This disjoint-set data structure can be calculated in almost linear time.
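
For reference, a minimal sketch of the disjoint-set structure itself (path halving plus union by size; indices refer to positions in the input list; illustrative only):

#include <stdlib.h>

static int *parent, *set_size;

static void dsu_init(int n)
{
    parent   = malloc(n * sizeof *parent);
    set_size = malloc(n * sizeof *set_size);
    for (int i = 0; i < n; i++) { parent[i] = i; set_size[i] = 1; }
}

static int dsu_find(int x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];   /* path halving */
        x = parent[x];
    }
    return x;
}

static void dsu_union(int a, int b)
{
    a = dsu_find(a); b = dsu_find(b);
    if (a == b) return;
    if (set_size[a] < set_size[b]) { int t = a; a = b; b = t; }
    parent[b]    = a;                    /* attach smaller set under larger */
    set_size[a] += set_size[b];
}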

Tervalent answered 6/4, 2015 at 15:25 Comment(3)
I don't understand. If your input set is {11000000, 0110000, 00110000, 00011000, 00001100, 00000110, 00000011} and k=2, I think your algorithm will unify each element with its next neighbor (they have Hamming distance 2), thus unifying all of them. But 11000000 and 00000011 do not have Hamming distance 2; their Hamming distance is 4. The fundamental problem with using disjoint set forests (union-find) is that nearness is not an equivalence relation.Subarctic
Good point! But you have to consider that each element is processed sequentially and once a match is found, the matched element is removed from the list. So, in your example, after the union operation between 11000000 and 01100000, the latter would not be available for union with 00110000. You would end up with 5 sets and you would compare the input only with one representative element of each set.Tervalent
I don't understand your suggestion. Maybe you could code it up (for some small value of n)? Here's a thing to test for: if you have four list members x, y, z, w, each with hamming distance 3 to the next, and your query hamming distance is 5, will x and y belong to the same equivalence class (i.e. union-find tree)? Will y and z? Will z and w? How do you use the equivalence classes to decide what to output? As far as I can tell, if you're using the union-find for anything you're using it to de-duplicate your output, which I think a hash-set can do just a well. But I'm not sure I understood?Subarctic

Here's a simple idea: do a byte-wise radix sort of the 100m input integers, most significant byte first, keeping track of bucket boundaries on the first three levels in some external structure.

To query, start with a distance budget of d and your input word w. For each bucket in the top level with byte value b, calculate the Hamming distance d_0 between b and the high byte of w. Recursively search that bucket with a budget of d - d_0: that is, for each byte value b', let d_1 be the Hamming distance between b' and the second byte of w. Recursively search into the third layer with a budget of d - d_0 - d_1, and so on.

Note that the buckets form a tree. Whenever your budget becomes negative, stop searching that subtree. If you recursively descend into a leaf without blowing your distance budget, that leaf value should be part of the output.

Here's one way to represent the external bucket boundary structure: have an array of length 16_777_216 (= (2**8)**3 = 2**24), where the element at index i is the starting index of the bucket containing values in range [256*i, 256*i + 255]. To find the index one beyond the end of that bucket, look up at index i+1 (or use the end of the array for i + 1 = 2**24).
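
A sketch of the budgeted descent using that boundary array (illustrative only; sorted is the radix-sorted input of n_total words, and the nested loops are the recursion written out one byte level at a time):

#include <stdint.h>
#include <stddef.h>

#define PREFIXES (1u << 24)

static const uint32_t *sorted;     /* radix-sorted input words */
static const uint32_t *start;      /* PREFIXES bucket start indices */
static size_t n_total;

static uint32_t bucket_end(uint32_t p)
{
    return p + 1 < PREFIXES ? start[p + 1] : (uint32_t)n_total;
}

static int byte_dist(uint32_t a, uint32_t b)
{
    return __builtin_popcount((a ^ b) & 0xFFu);   /* distance of the low bytes */
}

/* Prune a whole subtree as soon as the accumulated distance exceeds the budget. */
static void query(uint32_t w, int budget, void (*emit)(uint32_t))
{
    for (uint32_t b0 = 0; b0 < 256; b0++) {
        int d0 = byte_dist(b0, w >> 24);
        if (d0 > budget) continue;
        for (uint32_t b1 = 0; b1 < 256; b1++) {
            int d1 = d0 + byte_dist(b1, w >> 16);
            if (d1 > budget) continue;
            for (uint32_t b2 = 0; b2 < 256; b2++) {
                int d2 = d1 + byte_dist(b2, w >> 8);
                if (d2 > budget) continue;
                uint32_t p = (b0 << 16) | (b1 << 8) | b2;   /* 24-bit prefix */
                for (uint32_t i = start[p]; i < bucket_end(p); i++)
                    if (d2 + byte_dist(sorted[i], w) <= budget)
                        emit(sorted[i]);
            }
        }
    }
}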

Memory budget is 100m * 4 bytes per word = 400 MB for the inputs, and 2**24 * 4 bytes per address = 64 MiB for the indexing structure, or just shy of half a gig in total. The indexing structure is roughly a 16% overhead on the raw data. Of course, once you've constructed the indexing structure you only need to store the lowest byte of each input word, since the other three are implicit in the index into the indexing structure, for a total of ~(64 + 100) MB.

If your input is not uniformly distributed, you could permute the bits of your input words with a (single, universally shared) permutation which puts all the entropy towards the top of the tree. That way, the first level of pruning will eliminate larger chunks of the search space.

I tried some experiments, and this performs about as well as linear search, sometimes even worse. So much for this fancy idea. Oh well, at least it's memory efficient.

Board answered 21/5, 2020 at 12:59 Comment(1)
Thanks for sharing this alternative. "Memory is cheap" in my environment, but a memory-efficient solution may benefit someone else.Fungosity
