I have two large sparse matrices:
In [3]: trainX
Out[3]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
with 286674296 stored elements in Compressed Sparse Row format>
In [4]: testX
Out[4]:
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
with 95423596 stored elements in Compressed Sparse Row format>
Together they take about 5 GB of RAM to load. Note that these matrices are HIGHLY sparse (about 0.0062% occupied).
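For what it's worth, the occupancy figure is just the stored-element count divided by the full shape:

density = trainX.nnz / float(trainX.shape[0] * trainX.shape[1])
# ≈ 6.29e-05, i.e. the ~0.0062% quoted above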
For each row in testX, I want to find the nearest neighbor in trainX and return its corresponding label, found in trainY. trainY is a list with the same length as trainX and has many, many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)
I am using sklearn's KNN algorithm to do this:
from sklearn import neighbors

# n_neighbors=1: plain 1-NN, so predict() returns the label of the single nearest training row
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])  # predict for the first test row only
Even predicting for a single item of testX takes a while (something like 30-60 seconds, and multiplied by 2 million rows that becomes pretty much impossible). My laptop with 16 GB of RAM starts to swap a bit, but it does manage to complete for one item of testX.
My question is: how can I do this so that it finishes in a reasonable time, say one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough? (My guess is no.) Maybe I can somehow make use of the sparsity to speed up the calculation?
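To make that last idea concrete, here is a minimal sketch of the kind of thing I have in mind, with the caveats that cosine distance and the batch size of 1000 are assumptions on my part: after L2-normalizing the rows, cosine similarity is just a sparse-sparse dot product, and the nearest neighbor is the row-wise argmax.

import numpy as np
from sklearn.preprocessing import normalize

# L2-normalize rows once, so a plain dot product equals cosine similarity
trainX_n = normalize(trainX, norm='l2', axis=1)
testX_n = normalize(testX, norm='l2', axis=1)

predictions = []
batch = 1000  # tune to available RAM
for start in range(0, testX_n.shape[0], batch):
    block = testX_n[start:start + batch]
    # sparse x sparse product -> (batch, n_train) similarity matrix, still sparse
    sims = block.dot(trainX_n.T)
    # row-wise argmax = index of the nearest training row
    # (caveat: a test row with no feature overlap at all argmaxes to 0)
    nn = np.asarray(sims.argmax(axis=1)).ravel()
    predictions.extend(trainY[i] for i in nn)

If I understand the docs correctly, sklearn's NearestNeighbors(n_neighbors=1, algorithm='brute', metric='cosine') does essentially this internally, so that might be the shorter route if it turns out to be fast enough.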
Thank you.