Maximal optimization for cosine similarity search

I have a pre-made database of 512-dimensional vectors and want to implement an efficient search algorithm over them.


Research

Cosine similarity:

The core of any algorithm in this case is the cosine similarity measure, which is basically a normalized dot product. In Python:

import numpy

def cossim(a, b):
    # normalized dot product; ranges from -1 (opposite) to 1 (same direction)
    return numpy.inner(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

Linear search:

The most obvious and simplest search in this case is a linear scan, O(n), which iterates over the whole database and picks the most similar result:

def linear_search(query_vec, db):  # db is a collection of 512-dimensional vectors
    most_similar = (None, -1.0)  # cosine similarity is never below -1, so any vector beats this
    for vector in db:
        current_sim = cossim(query_vec, vector)  # cossim function defined above
        if current_sim > most_similar[1]:
            most_similar = (vector, current_sim)
    return most_similar[0]

As you can see, the whole database has to be scanned, which becomes quite inefficient once the database contains hundreds of thousands of vectors.
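
Since the database is fixed, one practical improvement to the linear scan is to pre-normalize all vectors once, so every query reduces to a single matrix-vector product. A rough NumPy sketch (the function names are my own, not from any library):

import numpy

def normalize_db(db):  # db: (n, 512) array of vectors
    # normalize every row once, so cosine similarity becomes a plain dot product
    db = numpy.asarray(db, dtype=numpy.float32)
    return db / numpy.linalg.norm(db, axis=1, keepdims=True)

def vectorized_search(query_vec, normalized_db):
    q = query_vec / numpy.linalg.norm(query_vec)
    sims = normalized_db @ q            # all cosine similarities in one matrix-vector product
    best = int(numpy.argmax(sims))
    return best, sims[best]             # index of the most similar vector and its similarity

This is still O(n), but the constant factor drops dramatically compared to a Python-level loop.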

Quasilinear search: (partially resolved)

There is a fundamental relation between cosine similarity and Euclidean distance (explained very well in this answer): for unit-length vectors a and b,

|a - b|² = 2(1 - cossim(a, b))

As mentioned in the answer, the Euclidean distance gets smaller as the cosine similarity between two vectors gets larger, so we can turn this into the closest pair of points problem, which can be solved in quasilinear O(n log n) time with a recursive divide-and-conquer algorithm.
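
A quick sanity check of that identity (just an illustrative snippet; both vectors are normalized to unit length, which is what the equation assumes):

import numpy

rng = numpy.random.default_rng(0)
a = rng.standard_normal(512)
a = a / numpy.linalg.norm(a)
b = rng.standard_normal(512)
b = b / numpy.linalg.norm(b)

lhs = numpy.linalg.norm(a - b) ** 2
rhs = 2 * (1 - cossim(a, b))            # cossim defined above
print(numpy.isclose(lhs, rhs))          # True: distance shrinks exactly as cosine similarity grows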

Thus I would have to implement my own divide-and-conquer algorithm that finds the closest pair of 512-dimensional vectors.

But unfortunately, this problem can't be solved directly because of the high dimensionality of the vectors: the classical divide-and-conquer algorithm is specialized for two dimensions.
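
The same identity at least means that any exact Euclidean nearest-neighbour structure can be reused on unit-normalized vectors, for example SciPy's k-d tree. The caveat is that k-d trees degrade towards a linear scan in hundreds of dimensions, so this is only a sketch, not a recommendation:

import numpy
from scipy.spatial import cKDTree

def build_tree(db):
    # unit-normalize the rows so that the Euclidean nearest neighbour
    # is also the most cosine-similar vector
    unit_db = db / numpy.linalg.norm(db, axis=1, keepdims=True)
    return cKDTree(unit_db)

def nearest_by_cosine(tree, query_vec):
    q = query_vec / numpy.linalg.norm(query_vec)
    dist, idx = tree.query(q, k=1)      # smallest distance corresponds to largest cosine similarity
    return idx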

Indexing for binary search (unresolved):

The best way to optimize cosine similarity search in terms of speed, as far as I know, would be to index the vectors and then perform binary search.

The main problem here is that indexing 512-dimensional vectors is quite difficult, and so far I am not aware of anything other than locality-sensitive hashing that may or may not be useful for indexing my database (my main concern is dimensionality reduction, which could cause a corresponding drop in accuracy).
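
For reference, here is a minimal sketch of hyperplane (sign random projection) LSH for cosine similarity, which hashes each vector by the sign pattern of its projections onto random hyperplanes, so nearby directions tend to land in the same bucket. Function names and the number of hash bits are illustrative choices:

import numpy

def build_lsh_index(normalized_db, n_bits=16, seed=0):
    # n_bits random hyperplanes; each vector is keyed by which side of each plane it falls on
    rng = numpy.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, normalized_db.shape[1]))
    codes = normalized_db @ planes.T > 0
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(bytes(numpy.packbits(code)), []).append(i)
    return planes, buckets

def lsh_search(query_vec, normalized_db, planes, buckets):
    q = query_vec / numpy.linalg.norm(query_vec)
    key = bytes(numpy.packbits(q @ planes.T > 0))
    candidates = buckets.get(key, range(len(normalized_db)))    # fall back to a full scan if the bucket is empty
    best = max(candidates, key=lambda i: normalized_db[i] @ q)  # exact re-ranking inside the bucket
    return best

In practice several hash tables with independent hyperplanes are used to raise recall, since a single table can miss near neighbours that fall just across a plane.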

There is a newer Angular Multi-index Hashing method, which unfortunately only works for binary vectors, and dimension-independent similarity computation, which requires the vectors to be sparse, but mine are not.

Finally, there is also An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, which at first glance might be the best solution, but the paper states:

Unfortunately, exponential factors in query time do imply that our algorithm is not practical for large values of d. However, our empirical evidence in Section 6 shows that the constant factors are much smaller than the bound given in Theorem 1 for the many distributions that we have tested. Our algorithm can provide significant improvements over brute-force search in dimensions as high as 20, with a relatively small average error.

We are trying to query 512-dimensional vectors, 25.6 times the roughly 20 dimensions the authors consider practical, which makes the algorithm above highly inefficient for this case.

There is a similar question raising the same concerns, but unfortunately no solution for the indexing was found there.


Problem

Is there any way to optimize cosine similarity search over such vectors, other than the quasilinear search? Perhaps there is some other way of indexing high-dimensional vectors? I believe something like this has been done before.

Closest solution

I believe I have found something that might potentially be a solution: it uses randomized partition trees to index databases of vectors with a few hundred dimensions, which I believe is exactly what I need (see here).
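
For completeness, forests of randomized projection trees of this kind are available in off-the-shelf libraries; for example, the sketch below uses the Annoy library with its angular metric, which is a monotone transform of cosine similarity (the parameter values are illustrative):

from annoy import AnnoyIndex

dim = 512
index = AnnoyIndex(dim, 'angular')      # angular distance is sqrt(2 * (1 - cosine similarity))
for i, vec in enumerate(db):            # db: iterable of 512-dimensional vectors
    index.add_item(i, vec)
index.build(50)                         # number of trees; more trees give better recall but a larger index

neighbour_ids = index.get_nns_by_vector(query_vec, 10)   # 10 approximate nearest neighbours by angle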

Thank you!

Bilow answered 28/10, 2018 at 10:56
Hi, do you have an update on this topic? I'm looking for a solution to this problem as well, and everything I've stumbled upon is an approximate method, which is not applicable to my case. My case is actually more relaxed, as I only need to find the most similar vectors within a certain threshold. – Diacid
@AntonioSimunovic A month after asking this question I ended up going for approximation methods as well, and you are probably already aware of the locality-sensitive hashing algorithm, which currently seems optimal in terms of simplicity and efficiency. You might also look into random projection for dimensionality reduction and k-d trees for high-dimensional indexing (the latter is especially good for your case). I am almost certain, however, that some kind of numerical analysis must be involved to achieve efficiency in this specific task. – Bilow
I'm curious: how does your problem not match those solved by the ANN libraries used for, say, image search? There are benchmarks on vectors of dimension 960. – Metencephalon
