How to efficiently retrieve top K-similar vectors by cosine similarity using R?
Asked Answered
M

1

7

I'm working on a high-dimensional problem (~4k terms) and would like to retrieve top k-similar (by cosine similarity) and can't afford to do a pair-wise calculation.

My training set is 6million x 4k matrix and I would like to make predictions for 600k x 4k matrix.

What is the most efficient way to retrieve the k-similar items for each item in my 600k x 4k matrix?

Ideally, I would like to get a matrix which is 600k x 10 (i.e., top 10-similar items for each of the 600k items).

ps: I've researched the SO website and found almost all "cosine similarity in R" questions refer to cosine_sim(vector1, vector2). But this question refers to cosine_sim(matrix1, matrix2).

Update The following code uses a naive method to find the cosine similarity between each row in the testset and every row in the training set.

set.seed(123)
train<-matrix(round(runif(30),0),nrow=6,ncol=5)
set.seed(987)
test<-matrix(round(runif(20),0),nrow=4,ncol=5)
train

[1,]    0    1    1    0    1    
[2,]    1    1    1    1    1    
[3,]    0    1    0    1    1    
[4,]    1    0    1    1    1    
[5,]    1    1    0    1    0    
[6,]    0    0    0    1    0

test

[1,]    0    1    1    0    0
[2,]    1    0    1    0    1
[3,]    1    0    0    0    0
[4,]    1    0    0    1    1

coSim<-function(mat1, mat2, topK){
require(plyr)
#mat2: is the testset
#mat1: is the training set. We will find cosine similarity between each row in testset and every row in trainingset.
#topK: user-input. for each row in testset we will return 'topk' similar rows(index) from the testset

#set up an empty result matrix. nrow(result) will be the same as the cartesian product between mat1 & mat2.
result<-matrix(rep(NA, nrow(mat1)*nrow(mat2)), nrow=nrow(mat1)*nrow(mat2), ncol=3)
k=1
for(i in 1:nrow(mat2)){
    for(j in 1:nrow(mat1)){
    result[k,1]<-i
    result[k,2]<-j
    result[k,3]<-crossprod(mat1[j,], mat2[i,])/sqrt(crossprod(mat1[j,]) * crossprod(mat2[i,]))
    k<-k+1
        }
    }
#sort the result matrix by cosine similarity found for each row in testset. not sure how to keep topK from each group so convert to df
result<-as.data.frame(result)
colnames(result)<-c("testRowId", "trainRowId","CosineSimilarity")
result<-ddply(result, "testRowId", function(x) head(x[order(x$CosineSimilarity, decreasing = TRUE) , ], topK))
resultMat<-matrix(result$trainRowId, nrow=nrow(mat2), ncol=topK,byrow=T)
finalResult<-list(similarity=result, index=resultMat)
}

system.time(cosineSim<-coSim(train, test, topK=2)) #0.12 secs
cosineSim
$similarity
  testRowId trainRowId CosineSimilarity
1         1          1        0.8164966
2         1          2        0.6324555
3         2          4        0.8660254
4         2          2        0.7745967
5         3          5        0.5773503
6         3          4        0.5000000
7         4          4        0.8660254
8         4          2        0.7745967

$index
     [,1] [,2]
[1,]    1    2
[2,]    4    2
[3,]    5    4
[4,]    4    2


set.seed(123)
train<-matrix(round(runif(1000000),0),nrow=5000,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)
system.time(cosineSim<-coSim(train, test, topK=50)) #380secs

When I run the same function with 5000x200 matrix for training and 2000x200 matrix for testing, it took over 380secs.

Ideally, I would like to see some ideas where I do not have to calculate the similarity between each and every row. If that is not possible, some pointers on how to vectorise the above code will be helpful.

Moonier answered 22/9, 2013 at 17:54 Comment(3)
@to the person who down voted this question: If you are going to down-vote, perhaps it would be helpful if you could add a comment as to why you are doing so.Moonier
I didn't downvote. But, you should read this before posting the question.Carmarthenshire
@Metrics: thanks, I've updated my question with some code. Hopefully the question is clear now.Moonier
F
4

No need to compute the similarity for every row. You can use this instead:

coSim2<-function(mat1, mat2, topK){
    #similarity computation:

    xy <- tcrossprod(mat1, mat2)
    xx <- rowSums(mat1^2)
    yy <- rowSums(mat2^2)
    result <- xy/sqrt(outer(xx,yy))

    #top similar rows from train (per row in test):

    top <- apply(result, 2, order, decreasing=TRUE)[1:topK,]
    result_df <- data.frame(testRowId=c(col(top)), trainRowId=c(top))
    result_df$CosineSimilarity <- result[as.matrix(result_df[,2:1])]
    list(similarity=result_df, index=t(top))
}

Test data (I've reduced your train matrix)

set.seed(123)
train<-matrix(round(runif(100000),0),nrow=500,ncol=200)
set.seed(987)
test<-matrix(round(runif(400000),0),nrow=2000,ncol=200)

Result:

> system.time(cosineSim<-coSim(train, test, topK=50)) #380secs
   user  system elapsed 
  41.71    1.59   43.72 

> system.time(cosineSim2<-coSim2(train, test, topK=50)) #380secs
   user  system elapsed 
   0.46    0.02    0.49 

Using your full 5000 x 200 train matrix, coSim2 runs in 7.8 sec.

Also note:

> any(cosineSim$similarity != cosineSim2$similarity)
[1] FALSE
> any(cosineSim$index != cosineSim2$index)
[1] FALSE

You can't use identical because my function returns integers instead of doubles for row IDs.

Fancher answered 22/9, 2013 at 22:57 Comment(3)
This looks excellent. I will have a closer look after work and report back (most likely, accept your answer!).Moonier
My apologies. I up-voted instead of clicking the tick mark. All set now.Moonier
Hi, I get the same problem, i use python rather than R, so can you explain your code? thanks.Calculator

© 2022 - 2024 — McMap. All rights reserved.