Which data structure to store binary strings and query with Hamming distance
I'm looking for a data structure to handle billions of binary strings, each containing 512 bits.

My goal is to send queries to the structure and get a result set containing all entries within a given Hamming distance of the query.

My first idea was to use a k-d tree, but those trees are very slow in high dimensions. My second idea was an LSH approach (MinHash/SuperBit LSH), but for that I still need a structure that supports an efficient search.

Any ideas how to handle this much data?

**Update:** Some details:

  • for the Hamming distance there is only an upper limit, maybe 128; at the moment I don't know the exact limit
  • insertions and deletions would be nice, but I can also rebuild the structure (the database is only updated once a week)
  • the result set must contain all relevant nodes (I'm not looking for k-NN)
Lahdidah answered 24/2, 2016 at 13:23 Comment(7)
Please clarify the needs. Does this result set have to contain all of the nodes within a given distance? Are there upper and lower limits on the distance? Can you sustain a large overhead to index and organize the data? Will there be any insertions or deletions?Ramiform
Hello Prune, yes, the result set should contain all the nodes which are under an upper distance limit. A lower limit does not exist. Insertions and deletions would be nice, but I can also rebuild the graph.Lahdidah
What is the ratio between queries and changes? The data base design might well depend on the relative frequency of changes (additions and deletions).Ramiform
Also, what is the likely distance value? Is it worth optimizing on that basis?Parodic
The database is updated weekly, so insertions/deletions are not so important. At the moment I don't know the maximum Hamming distance; maybe 128 of all 512 bits is about right.Lahdidah
What kind of data is this? Also, is it proprietary? If you've got a github project, I'd be interested.Parodic
Hey, it's been a year! What ever became of this? If you can answer, I wonder what you were doing, and what solution you picked, and how it worked out?Parodic
Without knowing your intended search parameters, it's hard to be too optimized. That said, I think a good approach would be to build a B-tree or T-tree and then optimize that structure for the binary nature of the data.

Specifically, you have 64 bytes of data as a 512-element bit string. Your estimate is that you will have "billions" of records. That's on the order of 2^32 values, so 1/16th of the space will be full? (Does this agree with your expectations?)

Anyway, try breaking the data into bytes, and let each byte be a key level. You can probably compress the level records if the probability of set bits is uniform. (If not, say if set bits are more likely at the front of the key, then you might want to just allocate 256 next-level pointers and leave some null. It's not always worth it.)

All of your levels will be the same: each represents 8 more bits of the string. So compute a table that maps each byte value to all the byte values within distance S of it, for 0 <= S <= 8. Also compute a table that maps any two bytes a and b to the Hamming distance hamming(a, b) between them.
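A minimal sketch of those two tables in Python (the table names are mine, not part of the scheme):

    # Hamming distance between any two byte values: a 256 x 256 table.
    BYTE_DIST = [[bin(a ^ b).count("1") for b in range(256)] for a in range(256)]

    # For each byte b and radius S (0..8): every byte value within distance S of b.
    WITHIN = [[[c for c in range(256) if BYTE_DIST[b][c] <= S] for S in range(9)]
              for b in range(256)]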

To traverse the tree, let your search distance be SD. Set D = SD. Read the top-level block. Find all 8-bit values in the block within distance min(8, D) of the corresponding query byte. For each such value, compute the exact distance hamming(query byte, value) and recurse into that sub-tree with D = D - hamming(query byte, value).
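For concreteness, here is a minimal sketch of that traversal (my Python, building on the BYTE_DIST/WITHIN tables above; it uses an in-memory trie of nested dicts, one level per key byte, instead of a real disk-based B-tree):

    def insert(root, key_bytes):
        """Store a 64-byte key in the nested-dict trie."""
        node = root
        for b in key_bytes[:-1]:
            node = node.setdefault(b, {})
        node.setdefault(key_bytes[-1], []).append(bytes(key_bytes))

    def search(node, key_bytes, level, D, out):
        """Collect every stored key within remaining distance D of the query."""
        q = key_bytes[level]
        last = level == len(key_bytes) - 1
        for b in WITHIN[q][min(8, D)]:           # candidate bytes at this level
            child = node.get(b)
            if child is None:
                continue
            if last:
                out.extend(child)                # leaf bucket: these keys match
            else:
                search(child, key_bytes, level + 1, D - BYTE_DIST[q][b], out)

    # usage: results = []; search(root, query_bytes, 0, 128, results)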

Parodic answered 24/2, 2016 at 23:11 Comment(3)
Hello, and thank you for this idea. But I don't understand how to interpret this tree. A normal B-tree only has 2 children. But how do I split a byte into left and right nodes?Lahdidah
A binary tree has left and right nodes. Despite the names, B-trees and T-trees are not binary. They are n-ary, typically depending on a practical (storage-based) relationship between storage structure and input data to determine n, which may vary from node to node. My suggestion in this case is to have the nodes be 256-ary, representing 8 bits of the string.Parodic
ah of course...my mistake. Great idea! I will test it.Lahdidah
The biggest design problem I see here is the closure requirement: we need to return all items within distance N of a given vector, for arbitrary N. The data space is sparse: "billions" is on the order of 2^33, but we have 512 bits of information, so there is only 1 entry per 2^(512-33) possibilities. For randomly distributed keys, the expected distance between any two nodes is 256; the expected nearest-neighbour distance is somewhere around 180.
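A rough Gaussian-tail estimate (my own back-of-envelope, not part of the answer) reproduces those figures:

    import math

    # Distances between random 512-bit keys follow Binomial(512, 1/2); the
    # nearest neighbour among 2^33 keys lies roughly where the lower-tail
    # probability reaches 2^-33.
    mean, sd = 512 / 2, math.sqrt(512) / 2    # 256 and ~11.3
    z = math.sqrt(2 * 33 * math.log(2))       # exp(-z^2/2) = 2^-33, so z ~ 6.8
    print(mean - z * sd)                      # ~179.5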

This leads me to expect that your search will hinge on non-random clusters of data, and that your search will be facilitated by recognition of that clustering. This will be a somewhat painful pre-processing step on the initial data, but should be well worthwhile.

My general approach to this is to first identify those clusters in some generally fast way. Start with a hashing function that returns a very general distance metric. For instance, for any vector, compute the distances to each of a set of orthogonal reference vectors. For 16 bits, you might take the following set (listed in hex): 0000, 00FF, 0F0F, 3333, 5555, each a successively finer "grain" of alternating bits. Return this hash as a simple tuple of the 4-bit distances, a total of 20 bits (there are real savings for long vectors: a 2^N-bit vector needs only N+1 distances of N bits each).
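In code, the 16-bit version of that hash might look like this (a minimal sketch; names are illustrative):

    REFS = (0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555)   # the 16-bit reference set

    def hash_tuple(v):
        """Hamming distances from v to each reference vector."""
        return tuple(bin(v ^ r).count("1") for r in REFS)

    # e.g. hash_tuple(0xFFFF) == (16, 8, 8, 8, 8)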

Now, this hash tuple gives you a rough estimate of the Hamming distance, such that you can cluster the vectors more easily: by the triangle inequality, vectors that are similar must have similar distances to every reference vector, and therefore similar hash values.

Within each cluster, find a central element, and then characterize each element of the cluster by its distance from that center. For more speed, give each node a list of its closest neighbours within the cluster, together with the distances. This gives you a graph for each cluster.

Similarly, connect all the cluster centers, giving direct edges to the nearer cluster centers. If your data are reasonably amenable to search, then we'll be able to guarantee that, for any two nodes A and B in different clusters, with cluster centers Ac and Bc, we will have d(A, Ac) + d(B, Bc) < d(A, B). Each cluster is then a topological neighbourhood.


The query process is now somewhat faster. For a target vector V, find its hash value. Find the cluster centers close enough to that value that something in their neighbourhoods might match, i.e. [distance from V to center] - [query range] - [cluster radius] <= 0. This allows you to eliminate whole clusters right away, and may give you an entire cluster of "hits". For each marginal cluster (some, but not all, nodes qualify), you'll need to find a node that works by something close to brute force (start in the middle of the range of viable distances from the cluster center), and then do a breadth-first search of each node's neighbors.
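A hedged sketch of that elimination step (the (center, radius, members) cluster layout and all names here are my assumptions):

    def candidate_clusters(v, query_range, clusters, dist):
        """Split clusters into certain hits, marginal ones, and eliminated ones."""
        whole, marginal = [], []
        for center, radius, members in clusters:
            d = dist(v, center)              # e.g. Hamming distance to the center
            if d + radius <= query_range:
                whole.append(members)        # every member must be within range
            elif d - radius <= query_range:
                marginal.append(members)     # some members may be within range
            # otherwise: eliminated, no member can be within range
        return whole, marginal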

I expect that this will give you something comparable to optimal performance. It also adapts decently to additions and deletions, so long as those are not frequent enough to change cluster membership for other nodes.


The set of reference vectors is straightforward. Write out the bit patterns for the 16-bit case:

0000 0000 0000 0000   16 0s
0000 0000 1111 1111    8 0s, 8 1s 
0000 1111 0000 1111    4 0s, 4 1s, repeat
0011 0011 0011 0011    2 0s, 2 1s, repeat
0101 0101 0101 0101    1 0s, 1 1s, repeat
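A sketch of how those patterns could be generated for any power-of-two width n (my Python; for n = 512 this yields the 10 vectors mentioned below):

    def reference_vectors(n):
        """Alternating-bit reference vectors for an n-bit key (n a power of 2)."""
        vecs = [0]                           # the all-zeros vector
        block = n // 2
        while block >= 1:
            v = 0
            for i in range(n):               # MSB-first bit positions
                if i % (2 * block) >= block:
                    v |= 1 << (n - 1 - i)
            vecs.append(v)
            block //= 2
        return vecs

    # reference_vectors(16) == [0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555]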
Ramiform answered 24/2, 2016 at 23:11 Comment(4)
Obvious mental stupidity; thanks. Edit coming Real Soon Now.Ramiform
Hi Prune, your approach sounds great! But why do you use only 16 bits for the first clustering? And how can I describe such a hash metric?Lahdidah
I used 16 only to illustrate the principle. Expand the case from 16 to 512 bits, using a set of 10 vectors, with distances taking 9 bits, for a total of 90 bits in the hash key. In what sense do you need to "describe" the hash key?Ramiform
Ah okay, thank you. Then I don't need another hash function to describe the key. I will test this approach. Currently I don't understand how to generate the orthogonal vectors. Is there an algorithm, or must I do it by brute force and test with the dot product?Lahdidah
