Finding the best cosine similarity in a set of vectors
Asked Answered
W

2

11

I have n vectors, each with m elements (real number). I want to find the pair where there cosine similarity is maximum among all pairs.

The straightforward solution would require O(n2m) time.

Is there any better solution?

update

Cosine similarity / distance and triangle equation Inspires me that I could replace "cosine similarity" with "chord length" which loses precision but increases speed a lot. ( there are many existing solutions solving Nearest Neighbor in metric space, like ANN )

Wean answered 1/12, 2012 at 16:39 Comment(3)
@Wean Are there any restrictions on the elements of your vectors? E.g. are they always binary (0 or 1)?Furbish
@robmayoff No, elements are real ( float )Wean
@robmayoff If elements are binary, this problem is equivalent to find a pair of 01 strings that have the most same bits .Wean
P
18

Cosine similarity sim(a,b) is related to Euclidean distance |a - b| by

|a - b|² = 2(1 - sim(a,b))

for unit vectors a and b.

That means cosine similarity is greatest when Euclidean distance is smallest after normalizing by the L2 norm, and the problem reduces to the closest pair of points problem, which can be solved in O(n lg n) time.

Pyroxylin answered 1/12, 2012 at 17:58 Comment(2)
Great answer! Giving a clear relationship between Cosine similarity and Euclidean distance.Wean
beautiful answer!Stern
E
0

You can check with the project simbase https://github.com/guokr/simbase , it is a vector similarity nosql database.

Simbase use below concepts:

  • Vector set: a set of vectors
  • Basis: the basis for vectors, vectors in one vector set have same basis
  • Recommendation: a one-direction binary relationship between two vector sets which have the same basis

You can use redis-cli directly for administration tasks, or you can use redis client bindings in different language directly in a programming way. Here is a Python example

    import redis

    dest = redis.Redis(host='localhost', port=7654)
    schema = ['a', 'b', 'c']
    dest.execute_command('bmk', 'ba', *schema)
    dest.execute_command('vmk', 'ba', 'va')
    dest.execute_command('rmk', 'va', 'va', 'cosinesq')
Erbium answered 12/6, 2014 at 15:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.