Efficient nearest neighbour search for sparse matrices
I have a large corpus of data (text) that I have converted to a sparse term-document matrix (stored as a scipy.sparse.csr_matrix). I want to find, for every document, the top n nearest-neighbour matches. I was hoping that the nearest-neighbour routine in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures, such as k-d trees or ball trees, do not work with sparse matrices. Only the brute-force algorithm works with sparse matrices, which is infeasible in my case, as I am dealing with a large corpus.

Is there any efficient implementation of nearest neighbour search for sparse matrices (in Python or in any other language)?
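To make the limitation concrete, here is a small sketch (the matrix is a random stand-in for a real term-document matrix): brute force accepts sparse input, while the tree-based algorithms reject it.

```python
import scipy.sparse as sp
from sklearn.neighbors import NearestNeighbors

# Toy sparse term-document matrix: 50 documents x 300 terms (illustrative only).
X = sp.random(50, 300, density=0.05, format="csr", random_state=0)

# Brute force accepts sparse input, but compares every pair of documents.
nn = NearestNeighbors(n_neighbors=5, algorithm="brute", metric="cosine")
nn.fit(X)
distances, indices = nn.kneighbors(X)  # each has shape (50, 5)

# Tree-based algorithms refuse sparse input.
try:
    NearestNeighbors(algorithm="kd_tree").fit(X)
except (TypeError, ValueError):
    print("kd_tree rejected sparse input")
```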

Thanks.

Ahrens answered 10/8, 2013 at 17:7 Comment(0)
Late answer: have a look at locality-sensitive hashing (LSH).

Support in scikit-learn has been proposed here and here.
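A minimal sketch of the core idea for cosine similarity, random hyperplane hashing, which works directly on a sparse matrix (the data here is a made-up stand-in; a real pipeline would use several hash tables and verify candidates exactly):

```python
import numpy as np
import scipy.sparse as sp
from collections import defaultdict

rng = np.random.default_rng(0)

# Stand-in sparse term-document matrix: 100 docs x 500 terms.
X = sp.random(100, 500, density=0.05, format="csr", random_state=0)

# Random hyperplanes; the sign pattern of each document's projections
# is its hash signature.
n_planes = 16
planes = rng.standard_normal((X.shape[1], n_planes))
bits = (X @ planes) > 0                    # (100, 16) boolean signatures
keys = bits @ (1 << np.arange(n_planes))   # pack bits into one integer key

# Documents sharing a key are candidate neighbours; similar documents
# (small cosine angle) collide with high probability.
buckets = defaultdict(list)
for doc, key in enumerate(keys):
    buckets[int(key)].append(doc)
```

At query time you hash the query document the same way and only compare it exactly against the documents in its bucket, instead of against the whole corpus.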

Gale answered 14/10, 2014 at 9:11 Comment(1)
I can confirm LSHForest works, and sparse-matrix input is supported too as of 2016, which is great. The downside is the surprising slowness of this implementation, which may need a version 2, although the output of version 1 at least appears correct. For better speed I am trying BallTree, using SVD first to reduce dimensionality, as Mathieu suggested, to satisfy BallTree's dense-matrix requirement. SVD might also help me find latent features to aid clustering with DBSCAN. Finally, cosine distance is the only metric LSHForest supports, which is good for many datasets but not all.Blumenthal
You can try transforming your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, then build a ball tree on the result.
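A sketch of that pipeline (the matrix sizes and parameters here are illustrative, not tuned):

```python
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# Illustrative sparse term-document matrix: 200 docs x 1000 terms.
X = sp.random(200, 1000, density=0.02, format="csr", random_state=0)

# TruncatedSVD works directly on sparse input and returns a dense array.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)  # shape (200, 50), dense

# The ball tree can now index the dense low-dimensional vectors.
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree")
nn.fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced)
```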

Confiture answered 13/8, 2013 at 5:45 Comment(1)
Are you sure a ball tree will behave nicely with SVD output? Typically for text data, you'd want SVD to keep some 100-200 dimensions...Miniskirt

© 2022 - 2024 — McMap. All rights reserved.