Database supporting fast approximate nearest neighbor queries [closed]

Asked 15/9, 2013 at 22:13 Answered 14/4, 2020 at 6:52

Is there a database that supports fast approximate nearest neighbor queries in high-dimensional vector spaces?

I'm looking for a database that would fit the following use case:

Works for millions of points
Works for hundreds-thousands of dimensions
Potentially uses cover trees or locality sensitive hashing for indexing

Does a robust implementation of this exist?

Nunatak answered 15/9, 2013 at 22:13 Comment(3)

I know this over 6 years old, but thought this might be useful for anyone arriving recently: github.com/netrasys/pgANN/blob/master/README.md – Dicho 3/7, 2019 at 15:15

Although an old post, Weaviate solves this problem, you can learn more about it in the docs or on Github: semi.technology/developers/weaviate/current – Muller 20/1, 2021 at 10:57

I'm not sure why this question has been closed. Meanwhile, Elasticsearch supports vectro-based search through the Dense vector field type: elastic.co/guide/en/elasticsearch/reference/master/…. See also this: elastic.co/blog/…, and this: elastiknn.com. – Woodyard 26/3, 2021 at 9:39

There's the ANN library which works pretty well for high-dimensional large datasets, but it's not a full "database" and is not a distributed solution.

There's a startup called SpaceCurve (no relation to me) working on a commercial spatial database, so depending on your needs and budget they might be worth looking into.

As a piece of advice: you should think deeply about what "nearest neighbor" really means when you're talking about "hundreds-thousands of dimensions". If you take a million random points in a 20-dimensional cube, the average distance between any two nearest neighbors is already about half the length of an edge of the cube.

This only gets worse exponentially as you add dimensions. Once you're talking about hundreds dimensions you really need impossibly large amounts of points (like > 10³⁰) if they're somewhat uniformly distributed; and if they're differently distributed you're better off with other classification approaches.

Steep answered 16/9, 2013 at 0:51 Comment(1)

why "the average distance between any two nearest neighbors is already about half the length of an edge of the cube"? @Steep – Copilot 25/9, 2017 at 13:41

Take a look at AnnDB.

It's a distributed approximate nearest neighbors database that allows you to horizontally scale to millions of high-dimensional vectors.

Disclaimer: I'm the author of AnnDB.

Hyperaesthesia answered 14/4, 2020 at 6:52 Comment(0)

You might want to look at Facebook's Faiss.

From the documentation:

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM

Notice, it is only applicable for L2 (Euclidean) distances, and dot products.

Link to the project - Faiss

Cermet answered 2/5, 2019 at 11:54 Comment(0)

Recommended topics

Hot tags