How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance, currently k=2 if this makes it easier).

My first thought was to use a kd-tree for this, but as it turns out, they become rather inefficient as the number of dimensions grows. In my sample implementation it's only slightly faster than exhaustive search.

My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
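For concreteness, here is a minimal sketch of the comparison I'm describing (SciPy's cKDTree stands in for my kd-tree implementation and random data stands in for the real points, so this is an illustration rather than my actual code):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((16000, 75))        # placeholder data with the same shape as mine

# kd-tree: ask for k+1 neighbours because each point's nearest neighbour is itself
tree = cKDTree(points)
dist, idx = tree.query(points, k=3)     # k=2 real neighbours + the point itself
knn_tree = idx[:, 1:]

# exhaustive search for a single query point, for comparison
q = points[0]
d = np.linalg.norm(points - q, axis=1)  # Euclidean distance to every point
knn_brute = np.argsort(d)[1:3]          # skip position 0, which is the point itself
```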

Junie answered 18/10, 2010 at 19:46 Comment(6)
I think this is similar to one of these: facebook.com/careers/puzzles.php – Enchiridion
As far as exact solutions go, I suspect the answer is no, but I wanted to suggest that random projections and the Johnson-Lindenstrauss theorem might help. – Trudytrue
+1 for the previous comment, which is along the same lines as LSH and ANN. – Tatman
I would recommend Hanan Samet's book, Foundations of Multidimensional and Metric Data Structures. It has some chapters specifically on structures suited to high-dimensional data. – Modulation
My library actually has this, so I'll definitely have a look. Most of the answers seem to involve a good bit of reading, so I guess it'll take some time until I can accept one :) – Junie
I've removed my answer, as thinking about it made me realize it was both misleading and incorrect. – Indigestion

The Wikipedia article for kd-trees has a link to the ANN library:

ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions.

Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)

As far as algorithms/data structures are concerned:

The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.

I'd try it directly first, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
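As a rough sketch of the "reduce first, then search" route (using scikit-learn here purely as an illustrative stand-in, since ANN itself is a C++ library, and with an arbitrary choice of 20 components):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((16000, 75))                      # illustrative data

# reduce 75 dimensions to something a kd-tree copes with better
X_red = PCA(n_components=20).fit_transform(X)

# k=2 neighbours per point; ask for 3 because each point matches itself first
nn = NearestNeighbors(n_neighbors=3, algorithm="kd_tree").fit(X_red)
dist, idx = nn.kneighbors(X_red)
knn = idx[:, 1:]
```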

Amphictyon answered 18/10, 2010 at 21:30 Comment(5)
+1 for ANN, a pretty hot topic in information retrieval lately. Search "FLANN" for one implementation by Lowe, and "ANN" by David Mount. Also search for the related "locality sensitive hashing". – Tatman
@Steve: Thanks for the FLANN & locality sensitive hashing tips, very interesting stuff! – Amphictyon
That's what I ended up using, and (along with a rewrite of the program in C++) it's pretty fast (<1 second). – Junie
@Benno: Glad it worked for you. As a curiosity, did you apply it to the original data (75 dimensions) or to what resulted after doing PCA/ICA? – Amphictyon
I did apply PCA, but only to change the basis vectors, not to reduce the dimension. It has its limits, though: for another dataset with 240,000 points it took about 15 minutes. – Junie

use a kd-tree

Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.

reduce the number of dimensions

Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.

By accuracy I mean finding the exact Nearest Neighbor (NN).

Principal Component Analysis (PCA) is a good idea when you want to reduce the dimensionality of the space your data live in.

Is there some clever algorithm or data structure to solve this exactly in reasonable time?

Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (for example, the 4th NN to your query, while you are looking for the 1st NN).

That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.

You could read more about ANNS in the introduction of our kd-GeRaF paper.

A good idea is to combine ANNS with dimensionality reduction.

Locality Sensitive Hashing (LSH) is a modern approach to solving the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
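To make the bucketing idea concrete, here is a toy sketch of random-hyperplane LSH (the hash key is the sign pattern of a few random projections; the sizes and the single hash table are arbitrary simplifications, a real scheme would probe neighbouring buckets and use several tables):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.random((16000, 75))             # illustrative data
X = X - X.mean(axis=0)                  # centre, so hyperplanes through the origin split the data evenly

n_bits = 12                             # arbitrary: at most 2**12 buckets
planes = rng.normal(size=(75, n_bits))  # random hyperplanes

def bucket(v):
    # hash key = sign pattern of the projections onto the random hyperplanes
    return tuple((v @ planes > 0).astype(int))

table = defaultdict(list)
for i, x in enumerate(X):
    table[bucket(x)].append(i)

def query(q, k=2):
    # candidates come from the query's own bucket only in this toy version
    cand = table[bucket(q)]
    return sorted(cand, key=lambda i: np.linalg.norm(X[i] - q))[:k]
```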

FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.

Warfeld answered 3/9, 2017 at 7:25 Comment(0)

You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
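A toy sketch of how the bit interleaving works (real implementations quantise coordinates to integers first and use bit tricks or lookup tables; the function here is just for illustration):

```python
def morton_code(coords, bits=8):
    """Interleave `bits` bits from each non-negative integer coordinate."""
    code = 0
    for bit in range(bits):                    # least significant bit first
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * len(coords) + d)
    return code

# 3 dimensions with 4 bits each gives a 12-bit code; 75 dimensions with
# 32-bit coordinates would give a 2400-bit code, which is the "huge" above.
print(bin(morton_code([5, 3, 7], bits=4)))
```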

Deuterium answered 18/10, 2010 at 20:7 Comment(2)
That's 16,000 comparisons for each point, so 16,000^2 comparisons overall. It's not taking forever, but at least hours to days. I'll read up on Morton codes, though. – Junie
@Benno: Ah, I assumed that you were looking at nearest neighbors for arriving points. If you're looking to order by locality, then Morton codes are probably the way to go, large or not. You should Google how to search Z-order curves, as there's some weirdness to it. And you might want to look at the Wikipedia article on GeoHash, which is essentially a text translation of a Morton code. (And finally, note that Morton codes as described are for 2 dimensions, but there's no real limit to the number of dimensions you can interleave.) – Deuterium

No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation, n log(n) for the sort, but you have to do the sort n times (a different sort for every one of the n observations). Messy, but still polynomial time to get your answers.
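A sketch of that recipe with NumPy (random data, and a smaller n than the question's 16,000 so the n x n matrix stays modest; a later answer points out that a partial selection is cheaper than the full sort shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 75))     # illustrative data
k = 2

# n x n squared Euclidean distances via ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq = (X ** 2).sum(axis=1)
D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

# sort every row: n sorts of n elements each
order = np.argsort(D, axis=1)
knn = order[:, 1:k + 1]        # column 0 is the point itself (distance 0)
```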

Redundant answered 18/10, 2010 at 21:20 Comment(0)

A BK-tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know if they've been generalized for large dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.

Nevil answered 18/10, 2010 at 21:28 Comment(0)

One very common implementation would be to sort the nearest-neighbours array that you have computed for each data point. As sorting the entire array can be very expensive, you can use methods like indirect partial sorting, for example numpy.argpartition in the Python NumPy library, to select only the closest K values you are interested in. There is no need to sort the entire array.

The cost of @Grembo's answer above can be reduced significantly, as you only need the K nearest values and there is no need to sort all the distances from each point.

If you just need K neighbours, this method works very well, reducing your computational cost and time complexity.

If you need the K neighbours sorted, sort that (much smaller) output afterwards.

See the documentation for argpartition.
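A sketch of this idea (the distance matrix is built by brute force just so there is something to select from; only the K+1 smallest entries per row are picked out and sorted):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2000, 75))      # illustrative data
k = 2

# squared Euclidean distances, brute force
sq = (X ** 2).sum(axis=1)
D = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)

# argpartition moves the k+1 smallest entries of each row to the front
# (unordered), in O(n) per row instead of O(n log n) for a full sort
cand = np.argpartition(D, k, axis=1)[:, :k + 1]

# if sorted neighbours are needed, sort only those k+1 candidates
cand_d = np.take_along_axis(D, cand, axis=1)
cand = np.take_along_axis(cand, np.argsort(cand_d, axis=1), axis=1)
knn = cand[:, 1:]               # drop column 0, the point itself (assuming no duplicates)
```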

Margarettmargaretta answered 31/5, 2015 at 15:6 Comment(0)
