Scipy cosine similarity vs sklearn cosine similarity

I noticed that both scipy and sklearn provide cosine similarity/cosine distance functions. I wanted to compare their speed on pairs of vectors:

setup1 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"
setup2 = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)];arrs2 = [np.random.rand(400) for _ in range(60)]"

import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1,-1), arr2.reshape(1,-1))) for arr1, arr2 in zip(arrs1, arrs2)]"

import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"

import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup1, number=1000))
print("scipy:   ", timeit.timeit(stmt2, setup=import2 + ";" + setup2, number=1000))
sklearn:  11.072769448000145
scipy:    1.9755544730005568

In this run sklearn is roughly 5 to 6 times slower than scipy, and it stays slower even if you remove the array reshape from the sklearn example and generate data that is already in the right shape. Why is one significantly slower than the other?

Roxannaroxanne answered 28/4, 2020 at 21:34 Comment(2)
I am not familiar with the inner workings of sklearn or scipy. However, besides the fact that you reshape the arrays in one experiment and not in the other, I don't think the comparison is fair: cosine_similarity computes the pairwise cosine similarity of all samples in the two input arrays (even though you call it on arrays holding a single sample), whereas the cosine function in scipy works only on 1-D arrays and may therefore have a much more efficient implementation. - Furtive
@Furtive Even if you get rid of the reshaping (by creating the arrays with np.random.rand(1, 400) instead of np.random.rand(400)), sklearn is still slower. I suspect that sklearn being designed for 2-D arrays has something to do with it, but the performance difference is still large. - Roxannaroxanne
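For the paired case in the question (vector i of one list against vector i of the other), one option that avoids a function call per pair is to stack the vectors into 2-D arrays and compute all pairs in a few NumPy calls. This is only a minimal sketch, assuming no vector is all zeros; paired_cosine_similarity is a hypothetical helper name, not part of scipy or sklearn.

```python
import numpy as np

def paired_cosine_similarity(X, Y):
    """Cosine similarity between row i of X and row i of Y.

    Hypothetical helper for illustration; assumes no row is all zeros.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Row-wise dot products: sum over the feature axis for each pair.
    dots = np.einsum("ij,ij->i", X, Y)
    # Product of the Euclidean norms of each pair of rows.
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return dots / norms

rng = np.random.default_rng(0)
arrs1 = [rng.random(400) for _ in range(60)]
arrs2 = [rng.random(400) for _ in range(60)]

# Stack the lists of 1-D vectors into (60, 400) arrays, then compare row by row.
sims = paired_cosine_similarity(np.stack(arrs1), np.stack(arrs2))
print(sims.shape)
```

The result is a 1-D array with one similarity per pair, which matches the shape of the list comprehensions in the question rather than the full matrix that a pairwise function returns.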

As mentioned in the comments, I don't think the comparison is fair, mainly because sklearn.metrics.pairwise.cosine_similarity is designed to compute the pairwise similarity between all samples of two 2-D input arrays, whereas scipy.spatial.distance.cosine is designed to compute the cosine distance between two 1-D arrays.
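To make the difference in input and output shapes concrete, here is a small sketch (the variable names are only illustrative) that calls both functions on the same pair of vectors and checks that they agree:

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
a = rng.random(400)
b = rng.random(400)

# scipy: two 1-D vectors in, one scalar *distance* out.
scipy_sim = 1.0 - cosine(a, b)

# sklearn: two 2-D arrays of shape (n_samples, n_features) in,
# a full (n_samples_X, n_samples_Y) *similarity* matrix out.
sk_matrix = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))
sk_sim = sk_matrix[0, 0]

print(sk_matrix.shape)
print(np.isclose(scipy_sim, sk_sim))
```

Even for a single pair, sklearn goes through its 2-D pairwise code path and returns a 1x1 matrix, so the per-call cost is paid once per pair in the question's loop.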

A fairer comparison is scipy.spatial.distance.cdist vs. sklearn.metrics.pairwise.cosine_similarity, since both compute pairwise values for all samples in the given arrays. To my surprise, this shows the sklearn implementation to be much faster than the scipy one, and I don't currently have an explanation for that. Here is the experiment:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cdist

x = np.random.rand(1000,1000)
y = np.random.rand(1000,1000)

def sklearn_cosine():
    return cosine_similarity(x, y)

def scipy_cosine():
    return 1. - cdist(x, y, 'cosine')

# Make sure their result is the same.
assert np.allclose(sklearn_cosine(), scipy_cosine())

And here is the timing result:

%timeit sklearn_cosine()
10 loops, best of 3: 74 ms per loop

%timeit scipy_cosine()
1 loop, best of 3: 752 ms per loop
Furtive answered 29/4, 2020 at 1:23 Comment(1)
I'm doing some work with cosine similarity at the moment. Scipy appears to run the job in a couple of Python loops, whereas sklearn appears to use vectorized functions on the entire matrix. If you're doing a really small job, it will actually be quicker to use scipy, but if both X and Y are large, you'll want sklearn. - Gesner
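The contrast described in this comment can be illustrated with plain NumPy. The sketch below is not the actual source of either library, and I have not checked it against their implementations; both function names are hypothetical. It only shows the two styles (an explicit loop over pairs versus normalizing the rows once and doing a single matrix product) and that they give the same result, assuming no row is all zeros.

```python
import numpy as np

def cosine_similarity_loop(X, Y):
    """Pairwise cosine similarity computed one pair at a time (illustrative)."""
    out = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            out[i, j] = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return out

def cosine_similarity_vectorized(X, Y):
    """Same result via row normalization and a single matrix product (illustrative)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

rng = np.random.default_rng(0)
X = rng.random((50, 20))
Y = rng.random((30, 20))

loop_result = cosine_similarity_loop(X, Y)
vec_result = cosine_similarity_vectorized(X, Y)
print(np.allclose(loop_result, vec_result))
```

The vectorized version hands nearly all of the work to one matrix multiplication, which is typically backed by an optimized BLAS routine, so it tends to pull ahead as the number of rows grows.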
