Locality Sensitive Hashing of sparse numpy arrays
Asked Answered
N

1

13

I have a large sparse numpy/scipy matrix where each row corresponds to a point in high-dimensional space. I want make queries of the following kind:

Given a point P (a row in the matrix) and a distance epsilon, find all points with distance at most epsilon from P.

The distance metric I am using is Jaccard-similarity, so it should be possible to use Locality Sensitive Hashing tricks such as MinHash.

Is there an implementation of MinHash for sparse numpy arrays somewhere (I can't seem to find one) or is there an easy way to do this?

The reason I am not just pulling something built for non-sparse arrays off of Github is that the sparse data structures in scipy might cause explosions in time complexity.

Nucleoprotein answered 17/7, 2014 at 11:19 Comment(1)
So far I have made an implementation that uses github.com/go2starr/lshhdcNucleoprotein
S
7

If you have very large sparse datasets that are too large to be held in memory in a non-sparse format, I'd try out this LSH implementation that is built around the assumption of Scipy's CSR Sparse Matrices:

https://github.com/brandonrobertz/SparseLSH

It also hash support for disk-based key-value stores like LevelDB if you can't fit the tables in memory. From the docs:

from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix( [
    [ 3, 0, 0, 0, 0, 0, -1],
    [ 0, 1, 0, 0, 0, 0,  1],
    [ 1, 1, 1, 1, 1, 1,  1] ])

# One class number for each input point
y = [ 0, 3, 10]

X_sim = csr_matrix( [ [ 1, 1, 1, 1, 1, 1, 0]])

lsh = LSH( 4,
           X.shape[1],
           num_hashtables=1,
           storage_config={"dict":None})

for ix in xrange(X.shape[0]):
    x = X.getrow(ix)
    c = y[ix]
    lsh.index( x, extra_data=c)

# find points similar to X_sim
lsh.query(X_sim, num_results=1)

If you definitely only want to use MinHash, you could try out https://github.com/go2starr/lshhdc, but I haven't personally tested that one out for compatibility with sparse matrices.

Sard answered 26/7, 2014 at 22:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.