How to efficiently compute the cosine similarity between millions of strings [closed]
Asked Answered
M

2

8

I need to compute the cosine similarity between strings in a list. For example, I have a list of over 10 million strings, each string has to determine similarity between itself and every other string in the list. What is the best algorithm I can use to efficiently and quickly do such task? Is the divide and conquer algorithm applicable?

EDIT

I want to determine which strings are most similar to a given string and be able to have a measure/score associated with the similarity. I think what I want to do falls in line with clustering where the number of clusters are not initially known.

Madai answered 23/2, 2013 at 14:34 Comment(7)
By definition of your problem, you will have a complexity of O(n²) executions of the cosine similarity computation.Durrett
@Durrett Yes, is this acceptable for such a large data? I do not think it isMadai
You have to employ dynamic programming for that. See this linkDuplessismornay
The question is: why would you need to know all of the cosine similarities in the first place? If you have large scale text matching tasks then the standard book that I would recommend is Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology by Dan Gusfield. Don't be fooled by its name. Computational Biology is mostly about (DNA) string matching. The queries in this field are much more refined than the famous google search.Charitycharivari
@ExtremeCoders Thanks, I will study itMadai
10 million times 10 million = 1 thousand billions. If each application of the cosine similarity requires 50 microseconds then you get a minimum time of about 50 million seconds, which is about one and a half years. If the cosine similarity takes just 1 nanosecond you'd still need 1000 seconds. So even with dynamic programming you can't do much. You should find a way to avoid all these computations of the cosine similarity.Capriccio
@Durrett well, you can however save computing 0*something, if you are smart. If you exploit sparsity, it's only quadratic on the non-sparse parts. Which is much better.Dania
D
0

Work with the transposed matrix. That is what Mahout does on Hadoop to do this kind of task fast (or just use Mahout).

Essentially, computing cosine similarity the naive way is bad. Because you end up computing a lot of 0 * something. Instead, you better work in columns, and leave away all 0s there.

Dania answered 23/2, 2013 at 15:50 Comment(0)
F
0

You could try SimString.

It is a C++ library (with Python or Ruby bindings) for approximate string matching.

It claims to find strings with high cosine similarity in under 1 millisecond for a database of 13 million strings.

The algorithm used is described here based on pruning of inverted lists.

Fibril answered 23/2, 2013 at 19:14 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.