kNN with big sparse matrices in Python

I have two large sparse matrices:

In [3]: trainX
Out[3]: 
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 286674296 stored elements in Compressed Sparse Row format>

In [4]: testX
Out[4]: 
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
        with 95423596 stored elements in Compressed Sparse Row format>

They take about 5 GB of RAM in total to load. Note that these matrices are HIGHLY sparse (only 0.0062% of entries are non-zero).

For each row in testX, I want to find the Nearest Neighbor in trainX and return its corresponding label, found in trainY. trainY is a list with the same length as trainX and contains many, many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)

I am using sklearn's KNN algorithm to do this:

from sklearn import neighbors

clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])

Even predicting for 1 item of testX takes a while (something like 30-60 secs; multiply that by 2 million and it becomes pretty much impossible). My laptop with 16 GB of RAM starts to swap a bit, but does manage to complete the prediction for 1 item of testX.

My question is: how can I do this so it will finish in a reasonable time, say one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough? (My guess is no.) Maybe I can somehow make use of the sparsity to speed up the calculation?

Thank you.

Tusche answered 2/12, 2013 at 16:31 Comment(1)
The docs indicate that when fitting on a sparse matrix, the implementation uses "brute force", which calculates the distances between all pairs.Hydrolysate

Classic kNN data structures such as the KD tree used in sklearn become very slow when the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn seems to be lacking, unfortunately. See the links below for papers on the algorithms and the theory behind why approximate nearest neighbour search is so much faster in these cases.

  • A well-known ANN library in the C++ world, widely used in computer vision for nearest-neighbour search in feature descriptor spaces, is FLANN. The homepage says it contains Python bindings (I have never worked with them).

  • Another popular alternative is the ANN library, with a Python wrapper here, although the newer FLANN seems to be more popular at the moment.

  • See also this answer (but some links are dead).

One caveat: your data seems to be very high-dimensional - I don't know how these libraries will perform for you. They should still beat sklearn.
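
If you end up trying FLANN, a minimal sketch with its pyflann Python bindings could look like the following. It assumes the data has already been reduced to dense float arrays (FLANN does not accept scipy sparse matrices); trainX_lsa, testX_lsa and the index parameters are illustrative assumptions, not something from the original post:

import numpy as np
from pyflann import FLANN  # FLANN's Python bindings

# Hypothetical dense matrices after dimensionality reduction (e.g. 100-dim LSA).
train_dense = np.asarray(trainX_lsa, dtype=np.float32)
test_dense = np.asarray(testX_lsa, dtype=np.float32)

flann = FLANN()
# Build a randomized kd-tree forest; these parameters are just a starting point.
flann.build_index(train_dense, algorithm="kdtree", trees=4)
# Approximate 1-NN for every test row; higher `checks` = more accurate but slower.
idx, dists = flann.nn_index(test_dense, num_neighbors=1, checks=64)
predicted = [trainY[i] for i in idx.ravel()]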

Koziarz answered 2/12, 2013 at 19:45 Comment(4)
in OpenCV: FLANN based matching and source code in C++Stench
Thanks for that. My data is text data - I've performed LSA on it now to get it down to 100 dimensions. sklearn takes about 0.5 secs per test case now but that is still too slow for 2mil records. Am exploring FLANN now and will revert back on how it works.Tusche
100 dimensions should be perfect for FLANN. The descriptor spaces typically used in vision have 128 dim. Please upvote/accept if my answer solves your problem (many people seem to forget this lately).Koziarz
@Koziarz Haven't forgotten, was still trying it out. It works nicely now, thank you.Tusche
B
7

Even predicting for 1 item of testX takes a while (something like 30-60 secs; multiply that by 2 million and it becomes pretty much impossible).

That's exactly why all scikit-learn estimators take batches of samples in their predict method. If you pass multiple samples in one call, the cost of input validation and of Python's slow loops is amortized, so the time per sample becomes a lot less than the cost of one sample times the number of samples.

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.neighbors import KNeighborsClassifier
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data['data'], data['target']
>>> X = TruncatedSVD(n_components=100).fit_transform(X)
>>> clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
>>> %timeit clf.predict(X[0])
1000 loops, best of 3: 766 us per loop
>>> %timeit clf.predict(X[0:10])
100 loops, best of 3: 2.44 ms per loop
>>> %timeit clf.predict(X[0:100])
100 loops, best of 3: 14.2 ms per loop
>>> %timeit clf.predict(X[0:1000])
10 loops, best of 3: 117 ms per loop
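
Applied to the problem in the question, a rough sketch of the same batching idea is to predict testX in sizeable slices rather than one row at a time (the chunk size below is an arbitrary assumption):

import numpy as np

chunk = 10000
preds = []
for start in range(0, testX.shape[0], chunk):
    # each call amortizes validation and Python overhead over `chunk` rows
    preds.append(clf.predict(testX[start:start + chunk]))
preds = np.concatenate(preds)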

Maybe I can somehow make use of the sparsity to speed up the calculation?

You can sample from the training set instead of using it all. k-NN performance depends on the size of the training set, which is why the vanilla k-NN algorithm isn't a very good choice for text classification.
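
A minimal sketch of that subsampling idea, reusing the trainX/trainY names from the question (the subsample size is an arbitrary assumption):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
n_sample = 500000                          # assumed subsample size; tune to taste
idx = rng.choice(trainX.shape[0], size=n_sample, replace=False)
trainX_small = trainX[idx]                 # CSR row indexing keeps the matrix sparse
trainY_small = [trainY[i] for i in idx]    # trainY is a plain list in the question
clf = KNeighborsClassifier(n_neighbors=1).fit(trainX_small, trainY_small)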

(A favorite trick in the text processing field is to use an on-disk index to build a k-NN classifier, e.g. Lucene: use an entire document as a query, retrieve the top k documents, determine the label from those.)

Berretta answered 3/12, 2013 at 13:51 Comment(5)
I don't think batch prediction will help a lot in this case, according to the documentation for algorithm: fitting on sparse input will override the setting of this parameter, using brute force.Osbert
@Matt: it does, because it performs input validation once per batch instead of once per sample, and it avoids Python's function call overhead -- it has nothing to do with sparse vs. dense. Also, the OP says they have done LSA on the data, and the output of LSA (TruncatedSVD) is always a dense array.Berretta
Sorry, missed that one. I only looked at trainX in the OP's question.Osbert
@larsmans Thanks for the Lucene tip, I've never heard of that before. If you have good examples/docs for using it with Python, I'd love to see it.Tusche
@mchangun: there's pylucene, and there's my Lucene+Jython tutorial.Berretta

As far as I know, neither FLANN nor ANN handles sparse data very well. I've just released a new C++ library for k-NN search over generic data types and generic similarity measures at www.kgraph.org. All you have to do is plug in a function that computes the similarity between object i and object j, and the library does the rest of the magic. The downside is that you probably won't be able to gain much by using Python. Since the similarity-computation code will be invoked extremely frequently, it doesn't make much sense to add a Python API for a user-provided metric.

Maice answered 15/3, 2014 at 20:2 Comment(0)

If you're looking for scalable ANN algorithms, another avenue is the family of locality-sensitive hashing (LSH) approaches, such as ITQ (http://www.cs.unc.edu/~lazebnik/publications/cvpr11_small_code.pdf). The paper comes with MATLAB code, which I have previously translated into Python. See: https://github.com/Kitware/SMQTK/blob/master/python/smqtk/algorithms/nn_index/lsh/functors/itq.py
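
ITQ itself needs the iterative rotation described in the paper, but as a toy illustration of the general LSH idea, here is a sketch of the simpler random-hyperplane variant (explicitly not ITQ): hash each reduced vector to a short binary code and compare a query only against training rows that land in the same bucket. X_train and X_test are hypothetical dense arrays after dimensionality reduction.

import numpy as np
from collections import defaultdict

rng = np.random.RandomState(0)
n_bits = 16
planes = rng.randn(X_train.shape[1], n_bits)       # random hyperplanes

def lsh_codes(X):
    # one n_bits-wide binary code per row, packed into an integer
    bits = X.dot(planes) > 0
    return bits.astype(np.int64).dot(1 << np.arange(n_bits))

buckets = defaultdict(list)
for i, code in enumerate(lsh_codes(X_train)):
    buckets[code].append(i)

query_code = lsh_codes(X_test[:1])[0]
# in practice you would probe nearby codes or fall back to brute force when empty
candidates = buckets.get(query_code, [])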

Belfort answered 23/6, 2016 at 15:59 Comment(0)
