Cosine Similarity between 2 Number Lists
250

I want to calculate the cosine similarity between two lists, say list 1 (dataSetI) and list 2 (dataSetII).

Let's say dataSetI is [3, 45, 7, 2] and dataSetII is [2, 54, 13, 15]. The lists are always of equal length. I want to report the cosine similarity as a number between 0 and 1.

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]

def cosine_similarity(list1, list2):
  # How to?
  pass

print(cosine_similarity(dataSetI, dataSetII))
Yama answered 24/8, 2013 at 23:37 Comment(0)
323

Another version, based on NumPy only:

from numpy import dot
from numpy.linalg import norm

a = [3, 45, 7, 2]
b = [2, 54, 13, 15]
cos_sim = dot(a, b) / (norm(a) * norm(b))
Gnash answered 27/3, 2017 at 9:50 Comment(4)
FYI this solution is significantly faster on my system than using scipy.spatial.distance.cosine.Cogan
Even shorter: cos_sim = (a @ b.T) / (norm(a)*norm(b))Garrick
On my system, SciPy is roughly the same speed as this, for tiny/dummy use-cases. SciPy is optimised for efficiently comparing large lists of vectors. For large lists of vectors, SciPy is orders of magnitude fasterCuthbert
Worth noting that this is 1 - scipy.spatial.distance.cosine(a, b)Meteoritics
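A sketch of the bulk case the comment about large lists refers to: scipy.spatial.distance.cdist compares many vectors at once (the shapes here are made up for illustration):

import numpy as np
from scipy.spatial.distance import cdist

queries = np.random.rand(100, 64)
corpus = np.random.rand(10000, 64)

# (100, 10000) matrix of cosine distances; similarities are 1 - distances
similarities = 1 - cdist(queries, corpus, metric='cosine')
print(similarities.shape)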
273

You should try SciPy. It has a bunch of useful scientific routines, for example, "routines for computing integrals numerically, solving differential equations, optimization, and sparse matrices." It uses the highly optimized NumPy for its number crunching. See the SciPy documentation for installation instructions.

Note that spatial.distance.cosine computes the distance, and not the similarity. So, you must subtract the value from 1 to get the similarity.

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
result = 1 - spatial.distance.cosine(dataSetI, dataSetII)
Cowhide answered 25/8, 2013 at 1:56 Comment(0)
119

You can use the cosine_similarity function from sklearn.metrics.pairwise (docs):

In [23]: from sklearn.metrics.pairwise import cosine_similarity

In [24]: cosine_similarity([[1, 0, -1]], [[-1,-1, 0]])
Out[24]: array([[-0.5]])
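Applied to the question's data (the inputs must be 2-D, hence the nested lists), this gives roughly:

cosine_similarity([[3, 45, 7, 2]], [[2, 54, 13, 15]])
# array([[0.97228425]])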
Orin answered 20/11, 2014 at 17:40 Comment(0)
54

I don't suppose performance matters much here, but I can't resist. The zip() function completely recopies both vectors (more of a matrix transpose, actually) just to get the data in "Pythonic" order. It would be interesting to time the nuts-and-bolts implementation:

import math
def cosine_similarity(v1,v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/{||v1||*||v2||)"
    sumxx, sumxy, sumyy = 0, 0, 0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumyy)

v1,v2 = [3, 45, 7, 2], [2, 54, 13, 15]
print(v1, v2, cosine_similarity(v1,v2))

Output: [3, 45, 7, 2] [2, 54, 13, 15] 0.972284251712

That goes through the C-like noise of extracting elements one at a time, but it does no bulk array copying, gets everything important done in a single for loop, and uses only a single square root.

ETA: Updated the print call to be a function. (The original was Python 2.7, not 3.3. The current version runs under Python 2.7 with a from __future__ import print_function statement.) The output is the same either way.

CPython 2.7.3 on a 3.0 GHz Core 2 Duo:

>>> timeit.timeit("cosine_similarity(v1,v2)",setup="from __main__ import cosine_similarity, v1, v2")
2.4261788514654654
>>> timeit.timeit("cosine_measure(v1,v2)",setup="from __main__ import cosine_measure, v1, v2")
8.794677709375264

So, the unpythonic way is about 3.6 times faster in this case.

Guyette answered 25/8, 2013 at 2:3 Comment(0)
29

Without using any imports,

math.sqrt(x)

can be replaced with

x ** 0.5

Without numpy.dot(), you have to create your own dot function using a generator expression:

def dot(A, B):
    return sum(a * b for a, b in zip(A, B))

and then it's just a simple matter of applying the cosine similarity formula:

def cosine_similarity(a, b):
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))
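For example, on the question's data:

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))  # ≈ 0.9723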
Nickienicklaus answered 5/3, 2018 at 16:30 Comment(0)
17

I did a benchmark based on several answers in the question, and the following snippet turned out to be the best choice:

import math
import operator


def dot_product2(v1, v2):
    return sum(map(operator.mul, v1, v2))


def vector_cos5(v1, v2):
    prod = dot_product2(v1, v2)
    len1 = math.sqrt(dot_product2(v1, v1))
    len2 = math.sqrt(dot_product2(v2, v2))
    return prod / (len1 * len2)

The result surprised me: the implementation based on SciPy is not the fastest one. I profiled it and found that cosine in SciPy takes a lot of time to cast a vector from a Python list to a NumPy array.

[Screenshot: per-line profiling visualization of the benchmark]
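As a minimal sketch of that point, converting the lists to NumPy arrays once, up front, makes the per-call cast a no-op:

import numpy as np
from scipy import spatial

v1 = np.asarray([3, 45, 7, 2], dtype=float)
v2 = np.asarray([2, 54, 13, 15], dtype=float)

# With ndarray inputs, the list-to-array conversion inside
# spatial.distance.cosine no longer dominates the runtime
print(1 - spatial.distance.cosine(v1, v2))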

Pander answered 17/11, 2015 at 10:30 Comment(2)
Please advise on the tool you are using to get a visualized representation of time consumed per line of code.Havre
@Havre It has been too long. Probably github.com/uber-archive/pyflamePander
17

Python code to calculate:

  • Cosine Distance
  • Cosine Similarity
  • Angular Distance
  • Angular Similarity

import math

from scipy import spatial


def calculate_cosine_distance(a, b):
    cosine_distance = float(spatial.distance.cosine(a, b))
    return cosine_distance


def calculate_cosine_similarity(a, b):
    cosine_similarity = 1 - calculate_cosine_distance(a, b)
    return cosine_similarity


def calculate_angular_distance(a, b):
    cosine_similarity = calculate_cosine_similarity(a, b)
    angular_distance = math.acos(cosine_similarity) / math.pi
    return angular_distance


def calculate_angular_similarity(a, b):
    angular_similarity = 1 - calculate_angular_distance(a, b)
    return angular_similarity
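For example, applied to the question's data (a quick sketch; the values are approximate):

a = [3, 45, 7, 2]
b = [2, 54, 13, 15]
print(calculate_cosine_similarity(a, b))   # ≈ 0.9723
print(calculate_cosine_distance(a, b))     # ≈ 0.0277
print(calculate_angular_distance(a, b))    # ≈ 0.0751
print(calculate_angular_similarity(a, b))  # ≈ 0.9249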

Similarity Search:

If you want to find the closest cosine similarity in an array of embeddings, you can use TensorFlow, as in the following code.

In my testing, the closest value to an embedding of shape 1x512 was found among 1M embeddings (1,000,000 x 512) in less than a second (using a GPU).

import time

import numpy as np  # np.__version__ == '1.23.5'
import tensorflow as tf  # tf.__version__ == '2.11.0'

EMBEDDINGS_LENGTH = 512
NUMBER_OF_EMBEDDINGS = 1000 * 1000


def calculate_cosine_similarities(x, embeddings):
    cosine_similarities = -1 * tf.keras.losses.cosine_similarity(x, embeddings)
    return cosine_similarities.numpy()


def find_closest_embeddings(x, embeddings, top_k=1):
    cosine_similarities = calculate_cosine_similarities(x, embeddings)
    values, indices = tf.math.top_k(cosine_similarities, k=top_k)
    return values.numpy(), indices.numpy()


def main():
    # x shape: (512)
    # Embeddings shape: (1000000, 512)
    x = np.random.rand(EMBEDDINGS_LENGTH).astype(np.float32)
    embeddings = np.random.rand(NUMBER_OF_EMBEDDINGS, EMBEDDINGS_LENGTH).astype(np.float32)

    print('Embeddings shape: ', embeddings.shape)

    n = 100
    sum_duration = 0
    for i in range(n):
        start = time.time()
        best_values, best_indices = find_closest_embeddings(x, embeddings, top_k=1)
        end = time.time()

        duration = end - start
        sum_duration += duration

        print('Duration (seconds): {}, Best value: {}, Best index: {}'.format(duration, best_values[0], best_indices[0]))

    # Average duration (seconds): 1.707 for Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
    # Average duration (seconds): 0.961 for NVIDIA 1080 ti
    print('Average duration (seconds): ', sum_duration / n)


if __name__ == '__main__':
    main()

For more advanced similarity search, you can use Milvus, Weaviate or Faiss.
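For instance, a minimal Faiss sketch (assuming the faiss-cpu package is installed; cosine search is done as an inner product over L2-normalized vectors):

import faiss
import numpy as np

d = 512
embeddings = np.random.rand(1000, d).astype(np.float32)
query = np.random.rand(1, d).astype(np.float32)

# Cosine similarity == inner product on unit-length vectors
faiss.normalize_L2(embeddings)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)  # exact (brute-force) inner-product index
index.add(embeddings)
scores, indices = index.search(query, 5)  # top-5 most similar embeddings
print(scores[0], indices[0])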


Flimsy answered 16/5, 2021 at 18:13 Comment(0)
11
import math

def dot_product(v1, v2):
    return sum(map(lambda x: x[0] * x[1], zip(v1, v2)))

def cosine_measure(v1, v2):
    prod = dot_product(v1, v2)
    len1 = math.sqrt(dot_product(v1, v1))
    len2 = math.sqrt(dot_product(v2, v2))
    return prod / (len1 * len2)

You can round it after computing:

cosine = round(cosine_measure(v1, v2), 3)

If you want it really short, you can use this one-liner:

from math import sqrt
from functools import reduce

def cosine_measure(v1, v2):
    return (lambda t: t[0] / sqrt(t[1] * t[2]))(
        reduce(lambda acc, xy: (acc[0] + xy[0] * xy[1], acc[1] + xy[0] ** 2, acc[2] + xy[1] ** 2),
               zip(v1, v2), (0, 0, 0)))
Stound answered 24/8, 2013 at 23:46 Comment(0)
8

You can use this simple function to calculate the cosine similarity:

import math

def cosine_similarity(a, b):
    return sum(i * j for i, j in zip(a, b)) / (
        math.sqrt(sum(i * i for i in a)) * math.sqrt(sum(i * i for i in b)))
Vallejo answered 18/4, 2016 at 11:59 Comment(0)
4

You can do this in Python with a simple function. Note that it treats vectors as dictionaries mapping keys to values (as is common for text), so the lists are converted to index-keyed dicts first:

import math

def get_cosine(vec1, vec2):
    # Only keys present in both vectors contribute to the dot product
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum(vec1[x] * vec2[x] for x in intersection)
    sum1 = sum(v ** 2 for v in vec1.values())
    sum2 = sum(v ** 2 for v in vec2.values())
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    return round(float(numerator) / denominator, 3)

dataSet1 = dict(enumerate([3, 45, 7, 2]))
dataSet2 = dict(enumerate([2, 54, 13, 15]))
print(get_cosine(dataSet1, dataSet2))  # 0.972
Brianbriana answered 14/10, 2015 at 15:37 Comment(3)
This is a text implementation of cosine. It will give the wrong output for numerical input.Tole
Can you explain why you used set in the line "intersection = set(vec1.keys()) & set(vec2.keys())".Flanker
Also your function seems to be expecting maps but you are sending it lists of integers.Flanker
3

Using NumPy, compare one list of numbers to multiple lists (a matrix):

import numpy as np

def cosine_similarity(vector, matrix):
    return np.sum(vector * matrix, axis=1) / (
        np.sqrt(np.sum(matrix ** 2, axis=1)) * np.sqrt(np.sum(vector ** 2)))
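A quick usage sketch with made-up data (one score per row of the matrix, in row order):

import numpy as np

matrix = np.array([[3, 45, 7, 2], [1, 23, 3, 4]], dtype=float)
vector = np.array([2, 54, 13, 15], dtype=float)
print(cosine_similarity(vector, matrix))  # ≈ [0.9723 0.9903]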
Weig answered 28/2, 2017 at 22:14 Comment(0)
3

Another version: if you have a list of vectors and a query vector, and you want to compute the cosine similarity of the query vector with all of them, you can do it in one go as follows:

>>> import numpy as np

>>> A      # list of vectors, shape -> m x n
array([[ 3, 45,  7,  2],
       [ 1, 23,  3,  4]])

>>> B      # query vector, shape -> 1 x n
array([ 2, 54, 13, 15])

>>> similarity_scores = A.dot(B)/ (np.linalg.norm(A, axis=1) * np.linalg.norm(B))

>>> similarity_scores
array([0.97228425, 0.99026919])
Desdemona answered 22/9, 2020 at 18:26 Comment(0)
2

If you happen to be using PyTorch already, you should go with their CosineSimilarity implementation.

Suppose you have two n-dimensional numpy.ndarrays, v1 and v2, i.e. their shapes are both (n,). Here's how you get their cosine similarity:

import torch
import torch.nn as nn

cos = nn.CosineSimilarity()
cos(torch.tensor([v1]), torch.tensor([v2])).item()

Or suppose you have two numpy.ndarrays w1 and w2, whose shapes are both (m, n). The following gets you a list of cosine similarities, each being the cosine similarity between a row in w1 and the corresponding row in w2:

cos(torch.tensor(w1), torch.tensor(w2)).tolist()
Hear answered 13/9, 2019 at 22:33 Comment(1)
I suggest using the functional implementation of the cosine similarity directly (torch.nn.functional.cosine_similarity), instead of instantiating the module implementation and applying the instance of your tensor.Simpson
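For illustration, a minimal sketch of that functional form, assuming two 1-D float tensors:

import torch
import torch.nn.functional as F

v1 = torch.tensor([3.0, 45.0, 7.0, 2.0])
v2 = torch.tensor([2.0, 54.0, 13.0, 15.0])
print(F.cosine_similarity(v1, v2, dim=0).item())  # ≈ 0.9723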
2

You can use SciPy (easiest way):

from scipy import spatial

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(1 - spatial.distance.cosine(dataSetI, dataSetII))

Note that spatial.distance.cosine() gives you a dissimilarity (distance) value, and thus to get the similarity, you need to subtract that value from 1.

Another way to get to the solution is to write the function yourself, one that even handles lists of different lengths:

import math

def cosineSimilarity(v1, v2):
  scalarProduct = moduloV1 = moduloV2 = 0

  # Pad the shorter list with zeros so both have the same length
  if len(v1) > len(v2):
    v2.extend(0 for _ in range(len(v1) - len(v2)))
  else:
    v1.extend(0 for _ in range(len(v2) - len(v1)))

  for i in range(len(v1)):
    scalarProduct += v1[i] * v2[i]
    moduloV1 += v1[i] * v1[i]
    moduloV2 += v2[i] * v2[i]

  return round(scalarProduct / (math.sqrt(moduloV1) * math.sqrt(moduloV2)), 3)

dataSetI = [3, 45, 7, 2]
dataSetII = [2, 54, 13, 15]
print(cosineSimilarity(dataSetI, dataSetII))
Wean answered 3/8, 2022 at 11:7 Comment(0)
1

We can easily calculate cosine similarity with simple mathematics equations. cosine_similarity = (dot product of the vectors) / (product of the norms of the vectors); subtracting this value from 1 gives the cosine distance. We can define two functions, one each for calculating the dot product and the norm.

def dprod(a, b):
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def norm(a):
    total = 0
    for i in range(len(a)):
        total += a[i] ** 2
    return total ** 0.5


a = [3, 45, 7, 2]
b = [2, 54, 13, 15]
cosine_a_b = dprod(a, b) / (norm(a) * norm(b))
print(cosine_a_b)  # ≈ 0.9723
Kirovabad answered 23/3, 2021 at 16:22 Comment(0)
1

This should be the most efficient way to find cosine similarities between two matrices of numbers: no for loops, all in NumPy.

import numpy as np
from numpy.linalg import norm


def cos_sim(a: np.ndarray, b: np.ndarray):
    out_dim = 2
    if len(b.shape) == 1:
        b = b.reshape([1, -1])
        out_dim -= 1
    if len(a.shape) == 1:
        a = a.reshape([1, -1])
        out_dim -= 1

    norm1 = norm(a.astype(float), axis=1, keepdims=True)
    norm2 = norm(b.astype(float), axis=1, keepdims=True)
    similarity = a.dot(b.T) / norm1.dot(norm2.T)

    # For (m, n) and (k, n) inputs this outputs an (m, k) matrix of
    # pairwise similarities; 1-D inputs collapse the matching dimension
    if out_dim == 0:
        return similarity[0, 0]
    elif out_dim == 1:
        return similarity[:, 0]

    return similarity
Virgule answered 24/7, 2023 at 19:37 Comment(0)
0

Here is an implementation that works for matrices as well. Its behaviour is exactly like sklearn's cosine_similarity:

import numpy as np


def cosine_similarity(a, b):
    return np.divide(
        np.dot(a, b.T),
        np.linalg.norm(
            a,
            axis=1,
            keepdims=True
        )
        @  # matrix multiplication
        np.linalg.norm(
            b,
            axis=1,
            keepdims=True
        ).T
    )

The @ symbol stands for matrix multiplication. See What does the "at" (@) symbol do in Python?
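A quick usage sketch, assuming the function above is in scope (the inputs must be 2-D):

import numpy as np

a = np.array([[3, 45, 7, 2], [1, 23, 3, 4]], dtype=float)
b = np.array([[2, 54, 13, 15]], dtype=float)
print(cosine_similarity(a, b))  # shape (2, 1): ≈ [[0.9723], [0.9903]]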

Spencerianism answered 27/5, 2022 at 5:55 Comment(0)
-3

All the answers are great for situations where you cannot use NumPy. If you can, here is another approach:

import numpy as np

EPSILON = 1e-07  # guards against division by zero

def cosine(x, y):
    dot_products = np.dot(x, y.T)
    norm_products = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_products / (norm_products + EPSILON)

Also note the EPSILON = 1e-07 term, which guards the division against a zero denominator.
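For example, on the question's data:

import numpy as np

x = np.array([3, 45, 7, 2], dtype=float)
y = np.array([2, 54, 13, 15], dtype=float)
print(cosine(x, y))  # ≈ 0.9723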

Tittle answered 4/3, 2020 at 13:12 Comment(0)
