Find cosine similarity between two arrays
I'm wondering if there is a built in function in R that can find the cosine similarity (or cosine distance) between two arrays?

Currently, I implemented my own function, but I can't help but think that R should already come with one.

Jacqualinejacquard answered 29/3, 2010 at 0:44 Comment(3)
Does R really need a new function just for x %*% y / sqrt(x%*%x * y%*%y)?Recover
This post shows how to create a cooccurrence matrix and then calculate the cosine similarity - https://mcmap.net/q/386511/-creating-co-occurrence-matrixMariselamarish
Also check out #8159367Scary
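The one-liner from the first comment can be wrapped in a small helper. This is only a sketch in base R: `cosine_sim` is a hypothetical name (not a built-in), and the vectors are made-up data.

```r
# Cosine similarity between two numeric vectors of equal length.
# cosine_sim is an illustrative helper name, not a base R function.
cosine_sim <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

a <- c(1, 2, 3)
b <- c(2, 4, 6)    # points in the same direction as a
d <- c(3, 0, -1)   # orthogonal to a: 1*3 + 2*0 + 3*(-1) = 0

cosine_sim(a, b)       # similarity of parallel vectors
1 - cosine_sim(a, b)   # the corresponding cosine distance
cosine_sim(a, d)       # similarity of orthogonal vectors
```

Subtracting the similarity from 1 gives the cosine distance the question also asks about.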
P
70

These sorts of questions come up all the time (for me, and, judging by the r-tagged SO question list, for others as well):

Is there a function, either in R core or in any R package, that does x? And if so,

where can I find it among the 2000+ R packages on CRAN?

Short answer: give the sos package a try when these sorts of questions come up.

One of the earlier answers gave cosine along with a link to its help page. This is probably exactly what the OP wants. When you look at the linked-to page you see that this function is in the lsa package.

But how would you find this function if you didn't already know which Package to look for it in?

You can always try the standard R help functions (">" below just marks the R command prompt):

> ?<some_name>

> ??<some_name>

> apropos("<some_name>")
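For the question here, the standard tools might be tried roughly like this. It is a sketch that assumes a plain R session with only the default packages attached; the search pattern is just an example.

```r
# apropos() lists objects on the search path whose names match a pattern.
hits <- apropos("^cos")
print(hits)   # base trigonometric functions show up here

# exists() reports whether any attached package defines an object called
# "cosine"; in a default session this is usually not the case, which is
# why a wider search tool is needed.
print(exists("cosine"))

# ??cosine would additionally search the help pages of installed
# (not just attached) packages.
```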

If these fail, install and load the sos package, then use

findFn

findFn is also aliased to "???", though I don't often use that because I don't think you can pass in arguments other than the function name.

for the question here, try this:

> library(sos)

> findFn("cosine", maxPages=2, sortby="MaxScore")

The additional arguments passed in (maxPages=2 and sortby="MaxScore") limit the number of results returned and specify how the results are ranked, respectively. That is: "find a function named 'cosine' or that has the term 'cosine' in the function description, return only two pages of results, and order them by descending relevance score".

The findFn call above returns a data frame with nine columns and the results as rows--rendered as HTML.

Scanning the last column, Description and Link, at item (row) 21 you find:

Cosine Measures (Matrices)

This text is also a link; clicking on it takes you to the help page for that function in the package that contains it. In other words,

using findFn, you can pretty quickly find the function you want even though you have no idea which package it's in.

Prehuman answered 29/3, 2010 at 18:5 Comment(0)
S
22

It looks like a few options are already available, but I just stumbled across an idiomatic solution I like so I thought I'd add it to the list.

install.packages('proxy') # Let's be honest, you've never heard of this before.
library('proxy') # Library of similarity/dissimilarity measures for 'dist()'
dist(m, method="cosine") # m: a numeric matrix with one observation per row
Stillhunt answered 9/1, 2014 at 5:33 Comment(1)
Yes, I did not know about the proxy package before, but I do not think it is a necessary package...Milliken
C
16

Taking the comment from Jonathan Chang I wrote this function to mimic dist. No extra packages to load.

cosineDist <- function(x){
  as.dist(1 - x%*%t(x)/(sqrt(rowSums(x^2) %*% t(rowSums(x^2))))) 
}
Cartesian answered 23/10, 2013 at 19:32 Comment(2)
why did you do 1- x*t(x)/(...) ? is that value of dissimilarity rather than similarity?Moorer
@Moorer the cosine formula gives a similarity. It is 1 if the vectors are pointing in the same direction. Distance measures need the value to be 0 when vectors are the same so 1 - similarity = distance. Many uses need distance rather than similarity (hclust for instance). Adding the as.dist formats the matrix as a nice R distance (basically a triangular matrix). Hope that helps.Cartesian
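The point about distance versus similarity can be seen on a tiny example. This sketch reuses the cosineDist function from the answer and feeds the result to hclust; the matrix `m` is made-up data with one observation per row.

```r
# cosineDist as given in the answer: 1 - cosine similarity, as a dist object.
cosineDist <- function(x) {
  as.dist(1 - x %*% t(x) / sqrt(rowSums(x^2) %*% t(rowSums(x^2))))
}

# Made-up data: rows a and b are parallel, row c is orthogonal to both.
m <- rbind(a = c(1, 2, 3),
           b = c(2, 4, 6),
           c = c(3, 0, -1))

d <- cosineDist(m)
print(d)                       # lower-triangular distance object

# Distance objects plug directly into clustering functions.
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)    # cluster label for each row
print(groups)
```

Parallel rows end up at distance 0 and are grouped together, which is the behaviour a clustering routine expects from a distance.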
K
9

Check these functions: lsa::cosine(), clv::dot_product(), and arules::dissimilarity().

Killebrew answered 29/3, 2010 at 6:36 Comment(0)
B
5

You can also check the vegan package: http://cran.r-project.org/web/packages/vegan//index.html

The function vegdist in this package has a variety of dissimilarity (distance) functions, such as manhattan, euclidean, canberra, bray, kulczynski, jaccard, gower, altGower, morisita, horn, mountford, raup, binomial, chao or cao. Please check the .pdf in the package for definitions, or consult the references at https://stats.stackexchange.com/a/33001/12733.

Blabber answered 25/7, 2012 at 15:51 Comment(0)
E
0

If you have a dot product matrix, you can use this function to compute the cosine similarity matrix:

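get_cos = function(S){
  # S[i, i] is the squared norm of vector i, so the norms come from the diagonal
  doc_norm = sqrt(diag(S))
  divide_one_norm = S/doc_norm            # divide row i by the norm of vector i
  cosine = t(divide_one_norm)/doc_norm    # transpose, then divide by the other norm
  return (cosine)
}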

Input S is the matrix of dot products: S = dt %*% t(dt), where dt is your dataset with one vector per row.

This function simply divides each dot product by the norms of the two vectors involved.
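A quick check of the idea on made-up data. This sketch is a compact variant rather than the function above: it reads the norms off the diagonal of S (S[i, i] is the squared norm of row i) and divides by their outer product.

```r
# Made-up dataset: one vector per row.
dt <- rbind(c(1, 2, 3),
            c(2, 4, 6),
            c(3, 0, -1))

S <- dt %*% t(dt)            # matrix of pairwise dot products
norms <- sqrt(diag(S))       # Euclidean norm of each row

# Entry [i, j] becomes S[i, j] / (norm_i * norm_j).
cos_mat <- S / outer(norms, norms)
print(round(cos_mat, 3))
```

The result is symmetric with ones on the diagonal, as a cosine similarity matrix should be.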

Eelworm answered 31/3, 2016 at 14:54 Comment(0)
R
-1

Cosine similarity is not invariant to shifts: adding a constant to every component of the vectors changes the result. Correlation similarity may be a better choice because it fixes this problem, and it is also connected to squared Euclidean distances (if the data are standardized).

If you have two objects described by p-dimensional feature vectors x1 and x2, you can compute the correlation similarity with cor(x1, x2).
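The shift behaviour is easy to verify numerically. In this sketch, `cosine_sim` is a hypothetical helper and the vectors and the shift are made-up values; it also illustrates that the Pearson correlation equals the cosine similarity of the mean-centered vectors.

```r
# Illustrative helper, not a base R function.
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

x1 <- c(1, 2, 3, 4)
x2 <- c(2, 1, 4, 3)
shift <- 10

cos_raw     <- cosine_sim(x1, x2)
cos_shifted <- cosine_sim(x1 + shift, x2 + shift)   # changes with the shift

cor_raw     <- cor(x1, x2)
cor_shifted <- cor(x1 + shift, x2 + shift)          # unaffected by the shift

# Correlation is the cosine similarity of the centered vectors.
cos_centered <- cosine_sim(x1 - mean(x1), x2 - mean(x2))

print(c(cos_raw = cos_raw, cos_shifted = cos_shifted))
print(c(cor_raw = cor_raw, cor_shifted = cor_shifted, cos_centered = cos_centered))
```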

Note that in statistics correlation is a scaled moment, so it is naturally thought of as a correlation between random variables. The cor(dataset) function therefore computes correlations between the columns of the data matrix.

In a typical situation with an (n x p) data matrix X, with units (or objects) in its rows and variables (or features) in its columns, you can compute the correlation similarity matrix simply by calling cor on the transpose of X and giving the result the dist class:

as.dist(cor(t(X)))

By the way, you can compute a correlation dissimilarity matrix the same way. The following distinguishes both the size of the angle and the orientation between the objects' vectors:

1 - cor(t(X))

This one doesn't care about the orientation, only the size of the angle:

1 - abs(cor(t(X)))
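The difference between the two dissimilarities shows up when two objects are perfectly anti-correlated. A minimal sketch on made-up data, where row w is an exact decreasing linear function of row u:

```r
# Made-up (n x p) data matrix: objects in rows, features in columns.
X <- rbind(u = c(1, 2, 3, 4),
           v = c(2, 4, 6, 9),
           w = c(4, 3, 2, 1))   # w = 5 - u, so cor(u, w) is -1

d_signed   <- as.dist(1 - cor(t(X)))        # orientation matters: range 0 to 2
d_unsigned <- as.dist(1 - abs(cor(t(X))))   # orientation ignored: range 0 to 1

print(d_signed)
print(d_unsigned)

# Either object can be passed on to functions that expect a dist, e.g. hclust.
hc <- hclust(d_signed, method = "average")
```

With the signed version u and w are as far apart as possible, while the unsigned version treats them as identical.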
Retractor answered 3/12, 2020 at 17:53 Comment(0)
