Bit string nearest neighbour searching

I have hundreds of thousands of sparse bit strings of length 32 bits.

I'd like to do a nearest-neighbour search on them, and look-up performance is critical. I've been reading up on various algorithms, but they seem to target text strings rather than binary strings. I think either locality-sensitive hashing or spectral hashing could be a good candidate, or I could look into compression. Will any of these work well for my bit-string problem? Any direction or guidance would be greatly appreciated.

Mima answered 31/3, 2012 at 21:13 Comment(1)
Could you provide a little more information about the problem? Specifically, what do you mean by "nearest"? Also, are there interesting statistical properties of the bit strings, like number of zero or one bits, or their position within the 32-bit word? @Denis makes a good guess, but I'd like some confirmation before attempting an answer. p.s. Welcome to Stack Overflow!Ory

Here's a fast and easy method, then a variant with better performance at the cost of more memory.

In: array Uint X[], e.g. 1M 32-bit words
Wanted: a function near( Uint q ) --> j with small hammingdist( q, X[j] )
Method: binary search q in sorted X, then linear search a block around that.
In Python (a runnable version of the sketch):

    import bisect

    def near(q, X, blocksize=100):
        # preprocess once, outside this function: X.sort()
        i = bisect.bisect_left(X, q)            # match q in leading bits
        lo = max(0, i - blocksize // 2)         # linear-search blocksize
        hi = min(len(X), i + blocksize // 2)    #   words around i
        # return the index of the hamming-nearest of these
        return min(range(lo, hi), key=lambda j: bin(q ^ X[j]).count("1"))

This is fast -- Binary search 1M words + nearest hammingdist in a block of size 100 takes < 10 us on my Mac ppc. (This is highly cache-dependent — your mileage will vary.)

How close does this come to finding the true nearest X[j]? I can only experiment, not do the math:
for 1M random queries in 1M random words, the nearest match found is on average 4-5 bits away, vs. 3 bits away for the true nearest (a linear scan of all 1M):

near32  N 1048576  Nquery 1048576  Blocksize 100 
binary search, then nearest +- 50
7 usec
distance distribution: 0 4481 38137 185212  443211 337321 39979 235  0

near32  N 1048576  Nquery 100  Blocksize 1048576 
linear scan all 1048576
38701 usec
distance distribution: 0 0 7 58  35 0

Run your data with blocksizes of, say, 50 and 100 to see how the match distances drop.
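The two runs above can be reproduced in miniature with a self-contained harness (Python here rather than the author's original program; `near`, `hamming`, and the sizes are restated so the snippet stands alone, and N is scaled down from 1M):

```python
import bisect, random

def hamming(a, b):
    return bin(a ^ b).count("1")

def near(q, X, blocksize=100):
    # X sorted ascending: binary search, then scan a block around the hit
    i = bisect.bisect_left(X, q)
    lo, hi = max(0, i - blocksize // 2), min(len(X), i + blocksize // 2)
    return min(range(lo, hi), key=lambda j: hamming(q, X[j]))

random.seed(0)
X = sorted(random.getrandbits(32) for _ in range(100_000))
q = random.getrandbits(32)
approx = hamming(q, X[near(q, X)])        # block search: fast, inexact
exact = min(hamming(q, x) for x in X)     # linear scan: slow, exact
assert exact <= approx <= 32              # block search never beats exact
```

Comparing `approx` against `exact` over many random queries gives the distance distributions shown above.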


To get even nearer, at the cost of twice the memory,
make a copy Xswap of X with upper / lower halfwords swapped, and return the better of
near( q, X, Blocksize )
near( swap q, Xswap, Blocksize )

With lots of memory, one can use many more bit-shuffled copies of X, e.g. 32 rotations.
I have no idea how performance varies with Nshuffle and Blocksize — a question for LSH theorists.
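A minimal sketch of the halfword-swap variant, assuming the same block search as above (`swap16`, `near_word`, and `near2` are names introduced here, not the author's). Hamming distance is invariant under the swap, so the two candidates are directly comparable:

```python
import bisect

def hamming(a, b):
    return bin(a ^ b).count("1")

def swap16(w):
    # exchange the upper and lower 16-bit halves of a 32-bit word
    return ((w >> 16) | (w << 16)) & 0xFFFFFFFF

def near_word(q, X, blocksize=100):
    # block-nearest word in sorted X, as in the main method
    i = bisect.bisect_left(X, q)
    lo, hi = max(0, i - blocksize // 2), min(len(X), i + blocksize // 2)
    return min(X[lo:hi], key=lambda x: hamming(q, x))

def near2(q, X, Xswap, blocksize=100):
    # best of the plain search and the halfword-swapped search;
    # hamming distance is unchanged by the swap, so the two are comparable
    a = near_word(q, X, blocksize)
    b = swap16(near_word(swap16(q), Xswap, blocksize))   # un-swap the hit
    return min((a, b), key=lambda x: hamming(q, x))
```

Build `Xswap = sorted(swap16(x) for x in X)` once; that is the "twice the memory" cost.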


(Added): To near-match bit strings of say 320 bits, 10 words, make 10 arrays of pointers, sorted on word 0, word 1 ... and search blocks with binsearch as above:
nearest( query word 0, Sortedarray0, 100 ) -> min Hammingdist e.g. 42 of 320
nearest( query word 1, Sortedarray1, 100 ) -> min Hammingdist 37
nearest( query word 2, Sortedarray2, 100 ) -> min Hammingdist 50
...
-> e.g. the 37.

This will of course miss near-matches where no single word is close, but it's very simple, and sort and binsearch are blazingly fast. The pointer arrays take exactly as much space as the data bits. 100 words, 3200 bits would work in exactly the same way.
But: this works only if there are roughly equal numbers of 0 bits and 1 bits, not 99 % 0 bits.
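The multi-word scheme can be sketched under the same assumptions (Python index arrays standing in for the sorted pointer arrays; `near_multi` and friends are illustrative names, and the word count is a parameter rather than fixed at 10):

```python
import bisect

def hamming_words(a, b):
    # a, b: equal-length tuples of 32-bit words
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def build_index(data, k):
    # index array sorted on word k (stands in for a sorted pointer array)
    return sorted(range(len(data)), key=lambda i: data[i][k])

def near_multi(q, data, indexes, blocksize=100):
    # for each word k: binsearch q[k] in the k-sorted index, then score the
    # full multi-word hamming distance of the blocksize candidates around it
    best = None
    for k, idx in enumerate(indexes):
        keys = [data[i][k] for i in idx]          # could be precomputed
        p = bisect.bisect_left(keys, q[k])
        lo, hi = max(0, p - blocksize // 2), min(len(idx), p + blocksize // 2)
        for j in idx[lo:hi]:
            d = hamming_words(q, data[j])
            if best is None or d < best[0]:
                best = (d, j)
    return best                                   # (distance, index into data)
```

As noted above, this misses near-matches where no single word is close; each candidate is scored on the full bit string, so the returned distance is exact for the candidates examined.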

Plasterwork answered 15/4, 2012 at 16:6 Comment(6)
"With lots of memory, one can use many more bit-shuffled copies of X, e.g. 32 rotations." You don't need extra memory for this. Just write a modified Quicksort that looks at the bits in a different order.Golliner
Yup, you're right. The problem for 1M x 1M bits at density .001 (your question on stats.stackexchange; re-ask here?) is, I think, how to find slices or images with density ~ .5. Wouldn't many pairs of rows with 100 bits in common imply many irrelevant features, i.e. columns that are all 0 or have a single 1?Plasterwork
I've added a useful example application to my question: the features may be location-time pairs that describe the movement of people. An all-zero column would be a place that is empty at a given time. We can certainly drop these.Golliner
I came across a relevant article that suggests almost the same algorithm: dl.acm.org/citation.cfm?id=1219917. (I've added it as an answer too.) It's more interesting for my case than for OP's because it includes a transformation from a high dimensional space to a low dimensional space.Golliner
Cross-link to Daniel's question: stats.stackexchange.com/questions/171947/…Plasterwork
I don't think binary search is the right approach here. Bit strings don't live in a 1-dimensional space: 1000 and 0001 seem similar enough, but if you binary-search for 0001 you may find 0011 as the nearest item, which is wrong.Dedans

I just came across a paper that addresses this problem.

Randomized algorithms and NLP: using locality sensitive hash function for high speed noun clustering (Ravichandran et al., 2005)

The basic idea is similar to Denis's answer (sort lexicographically by different permutations of the bits) but it includes a number of additional ideas and further references for articles on the topic.

It is actually implemented in https://github.com/soundcloud/cosine-lsh-join-spark which is where I found it.
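The paper's core trick, sorting under several random bit permutations and comparing only window-neighbours, can be sketched like this (an illustration of the idea, not code from the linked repository; all names here are mine):

```python
import random

def permute_bits(w, perm):
    # move bit perm[i] of w to position i
    out = 0
    for i, p in enumerate(perm):
        out |= ((w >> p) & 1) << i
    return out

def lsh_candidate_pairs(X, nperm=8, window=2, seed=0):
    # for each random bit permutation: permute every word, sort, and pair up
    # words landing within `window` positions of each other; words at small
    # hamming distance share long prefixes under many permutations
    rng = random.Random(seed)
    pairs = set()
    for _ in range(nperm):
        perm = list(range(32))
        rng.shuffle(perm)
        order = sorted(range(len(X)), key=lambda i: permute_bits(X[i], perm))
        for a in range(len(order)):
            for b in range(a + 1, min(a + 1 + window, len(order))):
                pairs.add((min(order[a], order[b]), max(order[a], order[b])))
    return pairs    # candidate pairs; verify each with an exact hamming check
```

Increasing `nperm` and `window` trades time and memory for recall, much like the Nshuffle and Blocksize knobs in the answer above.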

Golliner answered 19/2, 2016 at 11:45 Comment(1)
Test data, anyone? Sparsity patterns vary hugely, so method A may be better here, method B there. Also, real data motivates experimenters.Plasterwork

© 2022 - 2024 — McMap. All rights reserved.