The code below causes my system to run out of memory before it completes.
Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?
I would like to have the cosine similarity computed for each of the 65000 rows in my original matrix (mat) relative to all of the others, so that the result is a 65000 x 65000 matrix where each element is the cosine similarity between two rows in the original matrix.
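import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
# 65000 rows x 10 columns of random values (every entry is nonzero)
mat = np.random.rand(65000, 10)
sparse_mat = sparse.csr_matrix(mat)
# Tries to build the full 65000 x 65000 result in memory
similarities = cosine_similarity(sparse_mat)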
After running that last line I always run out of memory, and the program either freezes or crashes with a MemoryError. This occurs whether I run on my local machine with 8 GB of RAM or on a 64 GB EC2 instance.
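For scale, a dense 65000 x 65000 result in float64 needs 65000 * 65000 * 8 bytes, roughly 31 GiB, before any intermediate copies, so it cannot fit in 8 GB and leaves little headroom even on a 64 GB machine. Below is a minimal sketch of one possible workaround, assuming the full result does not have to be held in memory at once: normalize the rows once, then compute the similarities one block of rows at a time. The names dense_result_gib, cosine_similarity_chunks and chunk_size are hypothetical helpers for illustration, not part of scikit-learn, and the sketch assumes a dense NumPy input.

```python
import numpy as np


def dense_result_gib(n_rows, dtype=np.float64):
    """Size in GiB of a dense n_rows x n_rows similarity matrix."""
    return n_rows * n_rows * np.dtype(dtype).itemsize / 2**30


def cosine_similarity_chunks(mat, chunk_size=1000):
    """Yield (start, block) pairs.

    block[i, j] is the cosine similarity between row start + i and row j
    of mat, so only chunk_size x n_rows values are held at any one time.
    """
    mat = np.asarray(mat, dtype=np.float64)
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = mat / norms       # rows now have unit length
    for start in range(0, unit.shape[0], chunk_size):
        # Dot products of unit-length rows are cosine similarities.
        yield start, unit[start:start + chunk_size] @ unit.T


if __name__ == "__main__":
    print(dense_result_gib(65000))  # memory needed for the full dense result

    demo = np.random.rand(2000, 10)  # small stand-in for the 65000-row matrix
    for start, block in cosine_similarity_chunks(demo, chunk_size=500):
        # Reduce or store each block here, e.g. keep only the top matches per row.
        print(start, block.shape)
```

Each block can be reduced (for example to the top few matches per row) or written to disk before the next one is computed, which keeps peak memory near chunk_size * 65000 * 8 bytes rather than the size of the full matrix.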
sparse has its own random function, which can create a matrix with lots of zeros. – Gunshot
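As a rough illustration of the comment above, scipy.sparse.random can generate a test matrix that is actually sparse, unlike np.random.rand, whose output has essentially no zeros. The density of 0.05 below is an arbitrary assumption chosen for the example, not a value from the question.

```python
from scipy import sparse

# 65000 x 10 matrix in CSR format with about 5% of entries stored as nonzero.
# density=0.05 is an illustrative assumption, not taken from the question.
sparse_mat = sparse.random(65000, 10, density=0.05, format="csr")

print(sparse_mat.shape)
print(sparse_mat.nnz)  # number of stored entries
```

Note that a sparse input does not by itself guarantee a small result: the similarity matrix is only sparse if most pairs of rows share no nonzero columns.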