Efficient nearest neighbour search for sparse matrices
I have a large corpus of data (text) that I have converted to a sparse term-document matrix (stored as a scipy.sparse.csr_matrix). I want to find, for every document, the top n nearest-neighbour matches. I was hoping that the nearest-neighbour routine in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures, such as k-d trees or ball trees, do not work with sparse matrices. Only the brute-force algorithm works with sparse matrices, which is infeasible in my case, as I am dealing with a large corpus.

Is there any efficient implementation of nearest neighbour search for sparse matrices (in Python or in any other language)?
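To make the limitation concrete, here is a small sketch (the matrix is a random stand-in for a real term-document matrix): brute force accepts sparse input, while the tree-based algorithms reject it.

```python
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy sparse term-document matrix: 50 documents x 300 terms (illustrative only).
X = sp.random(50, 300, density=0.05, format="csr", random_state=0)

# Brute force accepts sparse input, but compares every pair of documents.
nn = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine")
nn.fit(X)
distances, indices = nn.kneighbors(X)  # each has shape (50, 5)

# Tree-based algorithms refuse sparse input.
try:
    NearestNeighbors(algorithm="kd_tree").fit(X)
except (TypeError, ValueError):
    print("kd_tree rejected sparse input")
```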

Thanks.

Ahrens answered 10/8, 2013 at 17:7 Comment(0)
Late answer: have a look at locality-sensitive hashing (LSH).

Support in scikit-learn has been proposed here and here.
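A minimal sketch of the core idea for cosine similarity, random hyperplane hashing, which works directly on a sparse matrix (the data here is a made-up stand-in; a real pipeline would use several hash tables and verify candidates exactly):

```python
import numpy as np
import scipy.sparse as sp
from collections import defaultdict

rng = np.random.default_rng(0)

# Stand-in sparse term-document matrix: 100 docs x 500 terms.
X = sp.random(100, 500, density=0.05, format="csr", random_state=0)

# Random hyperplanes; the sign pattern of each document's projections
# is its hash signature.
n_planes = 16
planes = rng.standard_normal((X.shape[1], n_planes))
bits = (X @ planes) > 0                    # (100, 16) boolean signatures
keys = bits @ (1 << np.arange(n_planes))   # pack bits into one integer key

# Documents sharing a key are candidate neighbours; similar documents
# (small cosine angle) collide with high probability.
buckets = defaultdict(list)
for doc, key in enumerate(keys):
    buckets[int(key)].append(doc)
```

At query time you hash the query document the same way and only compare it exactly against the documents in its bucket, instead of against the whole corpus.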

Gale answered 14/10, 2014 at 9:11 Comment(1)
I can confirm LSHForest works, and sparse-matrix input is supported too as of 2016, which is great. The downside is the surprising slowness of this implementation, which may need a version 2, although the output of version 1 at least appears correct. For better speed I am trying BallTree, using SVD first to reduce dimensionality, as Mathieu suggested, to satisfy BallTree's dense-matrix requirement. SVD might also help me find latent features to aid clustering with DBSCAN. Finally, cosine distance is the only metric LSHForest supports, which is good for many datasets but not all.Blumenthal
You can try transforming your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, then build a ball tree on the result.
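A sketch of that pipeline (the matrix sizes and parameters here are illustrative, not tuned):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Illustrative sparse term-document matrix: 200 docs x 1000 terms.
X = sp.random(200, 1000, density=0.02, format="csr", random_state=0)

# TruncatedSVD works directly on sparse input and returns a dense array.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (200, 50), dense

# The ball tree can now index the dense low-dimensional vectors.
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree")
nn.fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced)
```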

Confiture answered 13/8, 2013 at 5:45 Comment(1)
Are you sure a ball tree will behave nicely with SVD output? Typically for text data, you'd want SVD to keep some 100-200 dimensions...Miniskirt

© 2022 - 2024 — McMap. All rights reserved.