Cosine similarity built-in function in MATLAB

B

2

9

I want to calculate the cosine similarity between different rows of a matrix in MATLAB. I wrote the following code:

for i = 1:n_row
    for j = i:n_row
        S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
        S2(j,i) = S2(i,j);
    end
end

Matrix S1 is 11000-by-11000, and the code takes a very long time to run. Is there a function in MATLAB that calculates the cosine similarity between matrix rows faster than the code above?

Brewmaster answered 4/1, 2018 at 18:36 Comment(4)

I think you are looking for S2 = 1 - pdist(S1, 'cosine'), see ch.mathworks.com/help/stats/pdist.html – Saponify 4/1, 2018 at 18:57

S2 = 1 - pdist(S1, 'cosine') returns one number, while I need an n-by-n matrix in which each element holds S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j)) – Brewmaster 4/1, 2018 at 19:27

What is norm_r? – Horologium 4/1, 2018 at 19:33

@Cris Luengo norm_r = sqrt(sum(abs(S1).^2,2)); – Brewmaster 4/1, 2018 at 19:34

H

6

Your code loops over all rows, and for each row loops over (about) half the rows, computing the dot product for each unique combination of rows:

n_row = size(S1,1);
norm_r = sqrt(sum(abs(S1).^2,2)); % 2-norm of each row; same as vecnorm(S1,2,2) in R2017b or later
S2 = zeros(n_row,n_row);
for i = 1:n_row
  for j = i:n_row
    S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
    S2(j,i) = S2(i,j);
  end
end

(I've taken the liberty of completing your code so it actually runs. Note the initialization of S2 before the loop: this saves a lot of time!)

If you note that the dot product is a matrix product of a row vector with a column vector, you can see that the above, without the normalization step, is identical to

S2 = S1 * S1.';

This runs much faster than the explicit loop, even if it is (maybe?) not able to use the symmetry. The normalization simply divides row i by norm_r(i) and column j by norm_r(j). Here I multiply norm_r by its own transpose to produce a square matrix to normalize with:

S2 = (S1 * S1.') ./ (norm_r * norm_r.');
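As a sanity check, here is a minimal sketch that compares the loop with the vectorized expression on a small random matrix. The test matrix `A`, the variable names, and the row-normalized variant `S_alt` are assumptions added for illustration, and the data is assumed to be real-valued. The variant relies on implicit expansion (R2016b or later) and avoids building the n-by-n normalization matrix.

```matlab
% Small random test matrix (hypothetical data, stands in for S1)
rng(0);
A = rand(5, 3);
n = size(A, 1);
norm_r = sqrt(sum(abs(A).^2, 2));      % n-by-1 vector of row norms

% Reference: the explicit double loop
S_loop = zeros(n, n);
for i = 1:n
  for j = i:n
    S_loop(i,j) = dot(A(i,:), A(j,:)) / (norm_r(i) * norm_r(j));
    S_loop(j,i) = S_loop(i,j);
  end
end

% Vectorized: matrix of dot products divided by the outer product of the norms
S_vec = (A * A.') ./ (norm_r * norm_r.');

% Variant: normalize each row first, then take a single matrix product
An = A ./ norm_r;                      % implicit expansion, R2016b or later
S_alt = An * An.';

% Largest element-wise deviation from the loop result
max_diff_vec = max(abs(S_loop(:) - S_vec(:)));
max_diff_alt = max(abs(S_loop(:) - S_alt(:)));
fprintf('max difference: %g (vectorized), %g (row-normalized)\n', ...
        max_diff_vec, max_diff_alt);
```

Both differences should be at the level of floating-point rounding. For a matrix as large as the one in the question, the row-normalized variant may be preferable because it needs one fewer n-by-n temporary.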

Horologium answered 4/1, 2018 at 19:56 Comment(0)

S

8

Short version by calculating the similarity with pdist:

S2 = squareform(1-pdist(S1,'cosine')) + eye(size(S1,1));

Explanation:

pdist(S1,'cosine') calculates the cosine distance between all pairs of rows in S1 and returns the distances as a vector. Therefore the similarity for each pair is 1 - pdist(S1,'cosine').

We can turn that into a square matrix where element (i,j) corresponds to the similarity between rows i and j with squareform(1-pdist(S1,'cosine')).

Finally, we have to set the main diagonal to 1, because the similarity of a row with itself is 1 but pdist does not calculate it (squareform leaves zeros on the diagonal).
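The steps can be traced on a small example. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is installed (it provides pdist and squareform); the test matrix `A` and the cross-check against the direct matrix formula are additions for illustration.

```matlab
% Requires the Statistics and Machine Learning Toolbox (pdist, squareform)
rng(0);
A = rand(5, 3);                          % hypothetical data, stands in for S1
n = size(A, 1);

d = pdist(A, 'cosine');                  % row vector with n*(n-1)/2 distances
S_pdist = squareform(1 - d) + eye(n);    % square matrix, diagonal set to 1

% Cross-check against the direct matrix formula (real-valued data assumed)
norm_r = sqrt(sum(A.^2, 2));
S_direct = (A * A.') ./ (norm_r * norm_r.');
max_diff = max(abs(S_pdist(:) - S_direct(:)));
fprintf('pairs: %d, max difference: %g\n', numel(d), max_diff);
```

For n rows, pdist returns n*(n-1)/2 values, one per unordered pair, which is why squareform is needed to get back to an n-by-n matrix.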

Saponify answered 4/1, 2018 at 20:3 Comment(1)

There's a pdist2 function that returns exactly this square matrix. – Christogram 26/7, 2019 at 16:31
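A minimal sketch of that suggestion, assuming it means calling pdist2 with the same matrix as both arguments (pdist2 is also part of the Statistics and Machine Learning Toolbox; the test matrix `A` is hypothetical):

```matlab
% Requires the Statistics and Machine Learning Toolbox (pdist2)
rng(0);
A = rand(5, 3);                          % hypothetical data, stands in for S1
n = size(A, 1);

% pdist2 returns the full n-by-n distance matrix, diagonal included,
% so neither squareform nor eye is needed
S_pdist2 = 1 - pdist2(A, A, 'cosine');

% Cross-check against the direct matrix formula (real-valued data assumed)
norm_r = sqrt(sum(A.^2, 2));
S_direct = (A * A.') ./ (norm_r * norm_r.');
max_diff = max(abs(S_pdist2(:) - S_direct(:)));
```

Unlike pdist, pdist2 computes every (i,j) pair rather than only the unique ones, so it does not exploit the symmetry.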
