I have a large corpus of data (text) that I have converted to a sparse term-document matrix (I am using scipy.sparse.csr.csr_matrix
to store sparse matrix). I want to find, for every document, top n nearest neighbour matches. I was hoping that NearestNeighbor
routine in Python scikit-learn
library (sklearn.neighbors.NearestNeighbor
to be precise) would solve my problem, but efficient algorithms that use space partitioning data structures such as KD trees
or Ball trees
do not work with sparse matrices. Only brute-force algorithm works with sparse matrices (which is infeasible in my case as I am dealing with large corpus).
Is there any efficient implementation of nearest neighbour search for sparse matrices (in Python or in any other language)?
Thanks.