Scipy cosine similarity vs sklearn cosine similarity

setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"

import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"

import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"

import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy: ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))

As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise similarity between all samples in two 2-D input arrays, whereas scipy.spatial.distance.cosine is designed to compute the cosine distance between two 1-D arrays.
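To make the difference in calling conventions concrete, here is a small sketch. The array names, sizes, and the seeded generator are my own choices for illustration, not the question's exact setup. It calls the scipy function once per pair of 1-D vectors and the sklearn function once on the stacked 2-D arrays, then reads the matching pairs off the diagonal:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data: 60 vectors of length 400 on each side (sizes are assumptions).
rng = np.random.default_rng(0)
a = rng.random((60, 400))
b = rng.random((60, 400))

# scipy: one call per pair of 1-D vectors; each call returns a scalar distance.
per_pair = np.array([1.0 - cosine(u, v) for u, v in zip(a, b)])

# sklearn: one call on 2-D arrays; returns the full 60 x 60 similarity matrix.
full = cosine_similarity(a, b)

# The similarity of a[i] with b[i] sits on the diagonal of that matrix.
matched = np.diag(full)
```

Note that the single sklearn call computes all 60 x 60 similarities, so it does more work than the question actually needs, but it avoids 60 separate Python-level calls with their per-call input checking.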

A fairer comparison is scipy.spatial.distance.cdist vs. sklearn.metrics.pairwise.cosine_similarity, since both compute pairwise distances (or similarities) between the samples in the given arrays. To my surprise, though, this shows the sklearn implementation to be much faster than the scipy one (I don't currently have an explanation for this). Here is the experiment:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist

x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)

def sklearn_cosine():
    return cosine_similarity(x, y)

def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')

# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())

And here is the timing result:

%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop

%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
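I have not verified this against either library's source, so treat it as a guess: pairwise cosine similarity can be written as a single matrix product of row-normalized arrays, which NumPy hands off to an optimized matrix-multiplication routine, whereas evaluating the formula pair by pair cannot benefit from that. The sketch below uses only NumPy, and both function names are hypothetical helpers of mine, not part of scipy or sklearn. It shows that the two formulations give the same result:

```python
import numpy as np

def cosine_similarity_matmul(x, y):
    """Hypothetical helper: normalize each row, then use one matrix product."""
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    return x_unit @ y_unit.T

def cosine_similarity_loop(x, y):
    """Hypothetical helper: apply the cosine formula to one pair at a time."""
    out = np.empty((x.shape[0], y.shape[0]))
    for i, u in enumerate(x):
        for j, v in enumerate(y):
            out[i, j] = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return out

# Small illustrative inputs so the pair-by-pair version finishes quickly.
rng = np.random.default_rng(0)
x_small = rng.random((50, 20))
y_small = rng.random((40, 20))

sim_fast = cosine_similarity_matmul(x_small, y_small)
sim_slow = cosine_similarity_loop(x_small, y_small)
```

Timing these two helpers says nothing definitive about cdist itself, which runs compiled code rather than a Python loop. It only illustrates why a formulation built on one matrix product might come out ahead.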
