cosine similarity on large sparse matrix with numpy

The code below causes my system to run out of memory before it completes.

Can you suggest a more efficient means of computing the cosine similarity on a large matrix, such as the one below?

I would like the cosine similarity computed for each of the 65000 rows in my original matrix (mat) against all of the others, so that the result is a 65000 x 65000 matrix in which each element is the cosine similarity between two rows of the original matrix.

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

mat = np.random.rand(65000, 10)

sparse_mat = sparse.csr_matrix(mat)

similarities = cosine_similarity(sparse_mat)

After running that last line I always run out of memory, and the program either freezes or crashes with a MemoryError. This happens whether I run on my local machine with 8 GB of RAM or on a 64 GB EC2 instance.
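
For scale, the desired result alone is huge before any intermediate copies are counted; plain arithmetic on the output shape shows why:

print(65000 * 65000 * 8 / 2**30)  # bytes of a 65000 x 65000 float64 array, in GiB: ~31.5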

Unrighteous answered 1/12, 2016 at 0:22 Comment(1)
scipy.sparse has its own random function that can create a matrix with lots of zeros. – Gunshot

Same problem here. I've got a big, non-sparse matrix. It fits in memory just fine, but cosine_similarity crashes anyway, probably because somewhere along the way the matrix gets copied one time too many. So I made it compare small batches of rows "on the left" instead of the entire matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_n_space(m1, m2, batch_size=100):
    assert m1.shape[1] == m2.shape[1]
    ret = np.empty((m1.shape[0], m2.shape[0]))
    for start in range(0, m1.shape[0], batch_size):
        end = min(start + batch_size, m1.shape[0])
        rows = m1[start:end]  # one small batch of rows "on the left"
        # only a (batch, m2.shape[0]) block is materialized per call
        ret[start:end] = cosine_similarity(rows, m2)
    return ret

No crashes for me; YMMV. Try different batch sizes to make it faster. I used to compare only one row at a time, and that took about 30x as long on my machine.
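
For the 65000 x 10 matrix from the question, usage would look something like this (batch_size=1000 is just an illustrative starting point; note the full result still needs roughly 32 GiB of RAM to hold, whatever the batch size):

mat = np.random.rand(65000, 10)
similarities = cosine_similarity_n_space(mat, mat, batch_size=1000)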

Stupid yet effective sanity check:

import random

# Runs until an assertion fails, i.e. forever if the function is correct;
# interrupt it manually once you're convinced.
while True:
    m = np.random.rand(random.randint(1, 100), random.randint(1, 100))
    n = np.random.rand(random.randint(1, 100), m.shape[1])
    assert np.allclose(cosine_similarity(m, n), cosine_similarity_n_space(m, n))
Monoplane answered 20/7, 2017 at 0:3 Comment(1)
This is a solution that worked very well for me. Thanks. – Pd

You're running out of memory because you're trying to store a 65000x65000 matrix. Note that the matrix you're constructing is not sparse at all. np.random.rand generates a random number between 0 and 1. So there aren't enough zeros for csr_matrix to actually compress your data. In fact, there are almost surely no zeros at all.
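
A quick check on a smaller matrix illustrates the point; every entry gets stored explicitly, so nothing is compressed:

>>> m = np.random.rand(1000, 10)
>>> sparse.csr_matrix(m).nnz  # explicitly stored entries: all of them
10000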

If you look closely at your MemoryError traceback, you can see that cosine_similarity tries to use the sparse dot product if possible:

MemoryError                  Traceback (most recent call last)
    887         Y_normalized = normalize(Y, copy=True)
    888 
--> 889     K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
    890 
    891     return K

So the problem isn't with cosine_similarity, it's with your matrix. Try initializing an actual sparse matrix (with 1% density, i.e. 99% zeros, for example) like this:

>>> a = np.zeros((65000, 10))
>>> i = np.random.rand(a.size)
>>> a.flat[i < 0.01] = 1        # Select 1% of indices and set to 1
>>> a = sparse.csr_matrix(a)
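
Equivalently, as the comment on the question notes, scipy.sparse can generate such a matrix directly via sparse.random (here with the same 1% density):

>>> a = sparse.random(65000, 10, density=0.01, format='csr')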

Then, on a machine with 32 GB of RAM (8 GB was not enough for me), the following runs with no memory error:

>>> b = cosine_similarity(a)
>>> b
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
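
If your data really is sparse, it can also help to keep the result sparse: cosine_similarity takes a dense_output flag, and with sparse input and dense_output=False the result comes back as a sparse matrix instead of a ~31.5 GiB dense array:

>>> b = cosine_similarity(a, dense_output=False)  # result stays a sparse matrix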
Miocene answered 1/12, 2016 at 3:25 Comment(2)
My mistake: I didn't intend for it to be a sparse matrix. I intended to take my original matrix (mat) and use it to compute a 65000 x 65000 matrix where each element represents the cosine similarity between two rows. I've updated my question to reflect this change. Do you know how I can compute the cosine similarities of all the rows with respect to each other on the original matrix? – Unrighteous
@Unrighteous Well, a 65000x65000 matrix has a size of about 31.5 GiB, so on a 64 GB machine you should be fine, but you wouldn't be able to store two such matrices. So if Python tries to copy the matrix during its construction, you'll be in trouble. Those instances where the computer seems to freeze are probably cases where it's swapping (or paging). Maybe you should just let it run for a while, say overnight, as far as computing the matrix is concerned. But to actually do anything with the matrix, you'll have a hard time. – Miocene

I would run it in chunks like this:

from sklearn.metrics.pairwise import cosine_similarity

# Change chunk_size to control resource consumption and speed.
# A higher chunk_size needs more memory/RAM but is also faster.
chunk_size = 500
matrix_len = your_matrix.shape[0]  # your_matrix is a dense numpy.ndarray

def similarity_cosine_by_chunk(start, end):
    if end > matrix_len:
        end = matrix_len
    # scikit-learn computes one (chunk, matrix_len) block per call
    return cosine_similarity(X=your_matrix[start:end], Y=your_matrix)

for chunk_start in xrange(0, matrix_len, chunk_size):
    cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size)
    # Handle cosine_similarity_chunk here, e.g. write it to a timestamped file
    # and close it. Don't accumulate the chunks in memory, or you'll run out
    # of memory again after a few chunks. See the memmap sketch below.
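
A minimal sketch of the "write each chunk to disk" idea, assuming Python 3 and a dense float64 result backed by np.memmap (the file name similarities.dat is illustrative, and your_matrix is the same dense array as above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

chunk_size = 500
matrix_len = your_matrix.shape[0]

# Memory-mapped output: each chunk lands on disk instead of in RAM
out = np.memmap('similarities.dat', dtype='float64', mode='w+',
                shape=(matrix_len, matrix_len))

for chunk_start in range(0, matrix_len, chunk_size):
    chunk_end = min(chunk_start + chunk_size, matrix_len)
    out[chunk_start:chunk_end] = cosine_similarity(
        your_matrix[chunk_start:chunk_end], your_matrix)

out.flush()  # make sure everything is written to disk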
Ahlers answered 2/8, 2017 at 18:47 Comment(3)
For Python 3, replace xrange with range. – Stooge
I'm confused by the line cosine_similarity_chunk = similarity_cosine_by_chunk(chunk_start, chunk_start + chunk_size). Why are we overwriting the cosine_similarity_chunk variable again and again? – Currish
@Currish That line is calculating the cosine similarity in chunks. On every pass you only have the similarities for that chunk, which is why each chunk must be handled (e.g. written out) before the next iteration overwrites it. – Ahlers
