I noticed that both scipy and sklearn have a cosine similarity / cosine distance function. I wanted to compare their speed on pairs of vectors:
setup = "import numpy as np; arrs1 = [np.random.rand(400) for _ in range(60)]; arrs2 = [np.random.rand(400) for _ in range(60)]"

import1 = "from sklearn.metrics.pairwise import cosine_similarity"
stmt1 = "[float(cosine_similarity(arr1.reshape(1, -1), arr2.reshape(1, -1))) for arr1, arr2 in zip(arrs1, arrs2)]"

import2 = "from scipy.spatial.distance import cosine"
stmt2 = "[float(1 - cosine(arr1, arr2)) for arr1, arr2 in zip(arrs1, arrs2)]"

import timeit
print("sklearn: ", timeit.timeit(stmt1, setup=import1 + ";" + setup, number=1000))
print("scipy: ", timeit.timeit(stmt2, setup=import2 + ";" + setup, number=1000))
sklearn: 11.072769448000145
scipy: 1.9755544730005568
sklearn runs more than 5 times slower than scipy (even if you remove the array reshape for the sklearn example and generate data that's already in the right shape). Why is one significantly slower than the other?
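For comparison, the same 60 pairwise similarities can be computed in a single vectorized pass with plain NumPy, independent of either library (a sketch; the variable names here are illustrative, not from the benchmark above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((60, 400))
B = rng.random((60, 400))

# Row-wise cosine similarity: dot product of matching rows
# divided by the product of their norms.
num = np.einsum('ij,ij->i', A, B)
den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
sims = num / den  # shape (60,), one similarity per row pair
```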
I don't know the internals of sklearn or scipy; however, besides the fact that you are reshaping the arrays in one experiment and not in the other, I don't think it's a fair comparison, because cosine_similarity computes the pairwise cosine similarity of all the samples in the two input arrays (although you are invoking it on arrays of one sample), but the cosine function in scipy works only on 1-D arrays and therefore might have a much more efficient implementation. – Furtive

Even if you generate the data with np.random.rand(1, 400) instead of np.random.rand(400) (to prevent the reshape), sklearn is still slower. I suspect the fact that sklearn is designed for 2-D arrays might have something to do with it, but still, the performance difference is quite a lot. – Roxannaroxanne