I have a pre-made database full of 512-dimensional vectors and want to implement an efficient search algorithm over them.
Research
Cosine similarity:
The best measure in this case is cosine similarity, which is basically a normalized dot product. In Python:

import numpy

def cossim(a, b):
    # cosine similarity = normalized dot product
    return numpy.inner(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))
Linear search:
The most obvious and simple search in this case is a linear O(n) scan, which iterates over the whole database and picks the most similar result:
def linear_search(query_vec, db):  # db is a collection of 512-dimensional vectors
    most_similar = (None, -1.0)    # (vector, similarity); cosine similarity is never below -1
    for candidate in db:
        current_sim = cossim(query_vec, candidate)  # cossim function defined above
        if current_sim > most_similar[1]:
            most_similar = (candidate, current_sim)
    return most_similar[0]
As you can see, the whole database has to be scanned, which becomes quite inefficient once the database contains hundreds of thousands of vectors.
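As a side note, a minimal sketch of my own (db_matrix is an assumed (n, 512) NumPy array holding the database): the same O(n) scan can be expressed as one matrix-vector product, which is much faster in practice than a Python loop even though the complexity is unchanged:

import numpy as np

def linear_search_vectorized(query_vec, db_matrix):
    # db_matrix: assumed (n, 512) array, one row per database vector
    # compute all n cosine similarities with a single matrix-vector product
    sims = db_matrix @ query_vec
    sims /= np.linalg.norm(db_matrix, axis=1) * np.linalg.norm(query_vec)
    return db_matrix[int(np.argmax(sims))]  # row with the highest cosine similarity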
Quasilinear search: (partially resolved)
There is a fundamental relation between cosine similarity and Euclidean distance (explained very well in this answer): assuming the vectors are normalized to unit length, the Euclidean distance follows from
|a - b|² = 2(1 - cossim(a, b))
As mentioned in the answer, the Euclidean distance gets smaller as the cosine between the two vectors gets larger, so we can turn this into the closest pair of points problem, which can be solved in quasilinear O(n log n) time using a recursive divide-and-conquer algorithm.
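A quick way to check this relation in code (a tiny sketch with made-up data; normalize is my own helper): after normalizing everything to unit length, ranking by Euclidean distance and ranking by cosine similarity select the same nearest neighbour:

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
db = [normalize(v) for v in rng.standard_normal((1000, 512))]   # fake 512-D database
query = normalize(rng.standard_normal(512))

by_cosine = max(db, key=lambda v: np.inner(query, v))           # largest cosine similarity
by_euclid = min(db, key=lambda v: np.linalg.norm(query - v))    # smallest Euclidean distance

assert np.allclose(by_cosine, by_euclid)  # both pick the same vector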
Thus I would have to implement my own divide-and-conquer algorithm that finds the closest pair of 512-dimensional vectors.
Unfortunately, this problem can't be solved directly that way because of the high dimensionality of the vectors: the classical divide-and-conquer algorithm is specialized for two dimensions.
Indexing for binary search (unresolved):
To my knowledge, the best way to optimize cosine similarity search in terms of speed would be indexing and then performing binary search.
The main problem here is that indexing 512-dimensional vectors is quite difficult. I'm not yet aware of anything other than locality-sensitive hashing, which may or may not be useful for indexing my database (my main concern is the dimensionality reduction, which could cause a consequential drop in accuracy).
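For reference, here is a minimal sketch of the classic random-hyperplane LSH scheme for cosine similarity (all function names and the choice of 64 bits are my own illustration, not taken from any of the papers above):

import numpy as np

def make_hyperplanes(dim=512, n_bits=64, seed=0):
    # each row is the normal of a random hyperplane; the sign of the projection gives one hash bit
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_bits, dim))

def signature(v, hyperplanes):
    # binary signature of v: bit i is 1 if v lies on the positive side of hyperplane i
    return hyperplanes @ v > 0

def hamming_distance(sig_a, sig_b):
    # in expectation, a smaller Hamming distance means a smaller angle, i.e. a larger cosine similarity
    return int(np.count_nonzero(sig_a != sig_b))

Vectors would then be bucketed by (parts of) their signatures, and only candidates that share a bucket with the query get re-ranked with exact cosine similarity; the number of bits and hash tables trades accuracy against speed.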
There is a newer Angular Multi-Index Hashing method, which unfortunately only works for binary vectors, and dimension-independent similarity computation, which requires the vectors to be sparse, but mine are not.
Finally, there is also An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions, which at first glance might be the best solution, but the paper states:
Unfortunately, exponential factors in query time do imply that our algorithm is not practical for large values of d. However, our empirical evidence in Section 6 shows that the constant factors are much smaller than the bound given in Theorem 1 for the many distributions that we have tested. Our algorithm can provide significant improvements over brute-force search in dimensions as high as 20, with a relatively small average error.
We are trying to query 512-dimensional vectors, which is 25.6 times the 20 dimensions mentioned (20 * 25.6 = 512), so the algorithm above becomes highly inefficient for our case.
There was a similar question that raises the same concerns, but unfortunately no solution for the indexing was found there.
Problem
Is there any way to optimize cosine similarity search for such vectors other than quasilinear search? Perhaps there is some other way of indexing high-dimensional vectors? I believe something like this has already been done before.
Closest solution
I believe I have found an approach that might work: randomized partition trees for indexing databases of vectors with a couple hundred dimensions, which seems to be exactly what I need (see here). A rough sketch of the idea follows below.
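This is my own minimal, single-tree sketch of a randomized partition tree (split the points at the median of their projections onto a random direction, then scan one leaf exactly); real implementations use several trees and/or backtracking to keep the accuracy high:

import numpy as np

def build_rp_tree(vectors, indices=None, leaf_size=50, rng=None):
    # vectors: (n, 512) array; recursively split on a random direction at the median projection
    if rng is None:
        rng = np.random.default_rng(0)
    if indices is None:
        indices = np.arange(len(vectors))
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.standard_normal(vectors.shape[1])
    projections = vectors[indices] @ direction
    median = np.median(projections)
    left, right = indices[projections <= median], indices[projections > median]
    if len(left) == 0 or len(right) == 0:   # degenerate split, stop here
        return {"leaf": indices}
    return {"direction": direction, "median": median,
            "left": build_rp_tree(vectors, left, leaf_size, rng),
            "right": build_rp_tree(vectors, right, leaf_size, rng)}

def query_rp_tree(tree, vectors, query_vec):
    # descend to a single leaf, then re-rank that leaf exactly with cosine similarity
    node = tree
    while "leaf" not in node:
        node = node["left"] if query_vec @ node["direction"] <= node["median"] else node["right"]
    leaf = node["leaf"]
    sims = vectors[leaf] @ query_vec / (np.linalg.norm(vectors[leaf], axis=1) * np.linalg.norm(query_vec))
    return leaf[int(np.argmax(sims))]   # index of the (approximately) most similar vector

Each query only touches roughly leaf_size vectors instead of the whole database, at the cost of sometimes missing the true nearest neighbour.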
Thank you!