Comparison of the runtime of Nearest Neighbor queries on different data structures

Given n points in d-dimensional space, there are several data structures, such as Kd-Trees, Quadtrees, etc., to index the points. On these data structures it is possible to implement a straightforward algorithm for nearest neighbor queries around a given input point. Is there a book, paper, survey, ... that compares the theoretical (mostly expected) runtime of the nearest neighbor query on different data structures? The data I am looking at is composed of fairly small point clouds, so it can all be processed in main memory. For the sake of simplicity, I assume the data to be uniformly distributed. That is, I am not interested in real-world performance, but rather in theoretical results.
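
(For concreteness, here is a minimal sketch of such a straightforward query on a small uniform point cloud, once by brute force and once via a kd-tree. It uses NumPy and SciPy; the sizes are illustrative and not part of the question.)

```
# Minimal sketch of a straightforward nearest-neighbour query on a small,
# uniformly distributed point cloud (sizes and names are illustrative).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n, d = 1000, 3                     # n points in d-dimensional space
points = rng.random((n, d))       # uniformly distributed cloud
query = rng.random(d)

# Brute force: O(n*d) per query -- scan every point.
dists = np.linalg.norm(points - query, axis=1)
brute_idx = int(np.argmin(dists))

# Kd-tree: expected O(log n) per query for low-dimensional, uniform data.
tree = cKDTree(points)
dist, tree_idx = tree.query(query, k=1)

assert brute_idx == tree_idx      # both return the exact nearest neighbour
```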

Simonetta answered 7/10, 2014 at 13:48 Comment(2)
There are a couple of interesting articles, one in scikit-learn, scikit-learn.org/stable/modules/… and see the references in the en.wikipedia.org/wiki/Ball_tree article. Kd-trees can behave badly for higher-dimensional data, though they are generally excellent for 2D.Undercut
I have implemented kd-trees in 2D, and one observation is that if you have points on a regular grid, the algorithm fails when you search, because you can't decide on left/right or up/down when coordinates are identical. I solved it by adding a small random number to each point first. I also suspect that while the uniform-distribution assumption might be a good starting point, you won't be able to cleanly extrapolate from it to real-world point clouds, which will most likely be very irregular, and this is where different data structures and indexing methods really come into their own.Undercut
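
(A minimal sketch of the tie-breaking idea from the comment above, using SciPy's kd-tree just for illustration; the noise scale is an arbitrary choice, not a recommendation.)

```
# Sketch of the tie-breaking trick: perturb points that lie on a regular grid
# by a tiny random offset before building the kd-tree, so identical
# coordinates no longer collide during the left/right split decision.
import numpy as np
from scipy.spatial import cKDTree

grid = np.stack(np.meshgrid(np.arange(10), np.arange(10)), axis=-1).reshape(-1, 2).astype(float)
jitter = np.random.default_rng(1).normal(scale=1e-9, size=grid.shape)  # illustrative scale
tree = cKDTree(grid + jitter)

dist, idx = tree.query([3.4, 7.1], k=1)
```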

You leave the dimension of the points undefined, and you give only an approximation of the number of points. What does small mean? What counts as small is relative and differs from one person to another.

What you are searching for, of course, doesn't exist. Your question pretty much boils down to this:


Question:

For a small (whatever small means to you) dataset of any dimension, with data that follow a uniform distribution, what is the optimal data structure to use?

Answer:

There is no such data structure.


Wouldn't it be strange if there were a single answer to that? It would be a slightly unfair analogy to equate this question with the "Which is the optimal programming language?" question that most first-year undergrads ask; your question is not that naive, but it is walking down the same track.


Why is there no such data structure?

Because the dimension of the dataset is variable. You might have a dataset in 2 dimensions, but it could also be a dataset in 1000 dimensions, or, even trickier, a dataset in 1000 dimensions whose intrinsic dimension is much less than 1000. Think about it: could one propose a data structure that would behave equally well for the three datasets I mentioned? I doubt it.

In fact, there are data structures that behave really nicely in low dimensions (quadtrees and KD-trees, for example), while others do much better in higher dimensions (an RKD-tree forest, for instance).

Moreover, the algorithms and the expected costs of nearest neighbour search depend heavily on the dimension of the dataset, as well as on its size and the nature of the queries (for example, a query that is too far from the dataset, or equidistant from many of its points, will probably result in slow search performance).

In lower dimensions, one would perform a k-nearest neighbour (k-NN) search. In higher dimensions, it is wiser to perform an approximate k-NN search. In that case, we accept the following trade-off:

Speed vs. accuracy

What happens is that we achieve faster execution by sacrificing the exactness of our result. In other words, the search routine will be relatively fast, but it may (depending on many parameters, such as the nature of your problem and the library you are using) not return the true NN, but an approximation of it. For example, it might return the third-nearest neighbour to the query point instead of the first. You could also check the approximate-nn-searching wiki tag.
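
As a concrete illustration of this trade-off, here is a minimal sketch using SciPy's kd-tree, whose query accepts an eps parameter that returns a (1+eps)-approximate nearest neighbour; the dataset size, dimension, and eps value are illustrative choices.

```
# Speed-vs-accuracy sketch: an (1+eps)-approximate query may return a
# neighbour slightly farther than the true NN, but it prunes more of the tree.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((5000, 20))            # higher-dimensional uniform data
tree = cKDTree(points)
query = rng.random(20)

exact_dist, exact_idx = tree.query(query, k=1)              # true nearest neighbour
approx_dist, approx_idx = tree.query(query, k=1, eps=0.5)   # within (1+eps) of the true NN

print(exact_dist, approx_dist)             # approx_dist <= (1 + 0.5) * exact_dist
```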

Why not always search for the exact NN? Because of the curse of dimensionality, which causes the solutions that work well in lower dimensions to behave no better than brute force (searching all the points in the dataset for every query).
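
If you want to observe this effect empirically (which is, admittedly, a real-world rather than theoretical observation), a rough sketch with scikit-learn, which was mentioned in the comments, could look like the following; the sizes and dimensions are arbitrary.

```
# Rough empirical illustration of the curse of dimensionality: as d grows,
# a kd-tree's query time approaches that of brute force. Sizes are arbitrary.
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_points, n_queries = 20_000, 500

for d in (2, 8, 32, 128):
    data = rng.random((n_points, d))
    queries = rng.random((n_queries, d))
    for algo in ("kd_tree", "brute"):
        nn = NearestNeighbors(n_neighbors=1, algorithm=algo).fit(data)
        start = time.perf_counter()
        nn.kneighbors(queries)
        print(f"d={d:4d}  {algo:7s}  {time.perf_counter() - start:.3f}s")
```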


You see, my answer has already gotten long, so I should stop here. Your question is too broad, but interesting, I must admit. :)


In conclusion, which data structure (and algorithm) is optimal depends on your problem. The size of the dataset you are handling, the dimension, and the intrinsic dimension of the points play a key role. The number and the nature of the queries also play an important role.

Inoculum answered 8/10, 2014 at 20:58 Comment(4)
Nice answer. Sadly, the OP is interested in theoretical, not real-world, results, using an evenly distributed point cloud, on an unspecified number of dimensions, which is somewhat hard to answer.Undercut
Couldn't agree more @JohnBarça! Thanks :)Inoculum
Thanks for this answer. Guess I was not really aware of the role the dimension plays in this. For this (I realize now way too general question) I will leave it at this and rather post a more narrowed down question. Thanks heaps for bringing the dimension issue up!Simonetta
There was a time when I didn't know that either, @Skrodde, so you are welcome. :)Inoculum

For nearest neighbor searches of potentially non-uniform point data, I think a kd-tree will give you the best performance in general. As far as broad overviews and theoretical cost analysis go, I think Wikipedia is an OK place to start, but keep in mind it does not cover much real-world optimization:

http://en.wikipedia.org/wiki/Nearest_neighbor_search

http://en.wikipedia.org/wiki/Space_partitioning

Theoretical performance is one thing but real world performance is something else entirely. Real world performance depends as much on the details of the data structure implementation as it does on the type of data structure. For example, a pointer-less (compact array) implementation can be many times faster than a pointer-based implementation because of improved cache coherence and faster data allocation. Wider branching may be slower in theory but faster in practice if you leverage SIMD to test several branches simultaneously.
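
To illustrate what a pointer-less layout means, here is a minimal sketch in Python of a kd-tree whose structure is implicit in the ordering of a flat point array: the median of each segment acts as the node, and the two half-segments act as its children. This only shows the layout idea; a cache-friendly production version would be written in C/C++ (ideally with SIMD), and the partitioning here uses a full sort where a real implementation would use a linear-time selection.

```
# Sketch of a pointer-free ("compact array") kd-tree: the points array itself
# is recursively partitioned so the tree is implicit in the ordering -- no
# node objects, no child pointers, only index arithmetic on segments.
import numpy as np


def build_implicit_kdtree(points):
    """Reorder `points` so the midpoint of every segment is that segment's kd-tree node."""
    pts = np.array(points, dtype=float)

    def build(lo, hi, depth):
        if hi - lo <= 1:
            return
        axis = depth % pts.shape[1]
        mid = (lo + hi) // 2
        # Partition the segment so its median on `axis` ends up at index `mid`.
        order = np.argsort(pts[lo:hi, axis], kind="stable")
        pts[lo:hi] = pts[lo:hi][order]
        build(lo, mid, depth + 1)
        build(mid + 1, hi, depth + 1)

    build(0, len(pts), 0)
    return pts


def nearest(pts, query, lo=0, hi=None, depth=0, best=None):
    """Exact nearest-neighbour query on the implicitly stored tree."""
    if hi is None:
        hi = len(pts)
    if best is None:
        best = [np.inf, -1]                      # [best squared distance, index]
    if hi <= lo:
        return best
    axis = depth % pts.shape[1]
    mid = (lo + hi) // 2
    d2 = float(np.sum((pts[mid] - query) ** 2))
    if d2 < best[0]:
        best[0], best[1] = d2, mid
    # Descend into the half-segment containing the query first.
    near, far = ((lo, mid), (mid + 1, hi)) if query[axis] < pts[mid, axis] else ((mid + 1, hi), (lo, mid))
    nearest(pts, query, near[0], near[1], depth + 1, best)
    # Only visit the far half if the splitting plane is closer than the best so far.
    if (query[axis] - pts[mid, axis]) ** 2 < best[0]:
        nearest(pts, query, far[0], far[1], depth + 1, best)
    return best


pts = build_implicit_kdtree(np.random.default_rng(0).random((1000, 2)))
print(nearest(pts, np.array([0.5, 0.5])))       # [squared distance, index into pts]
```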

Also the exact nature of your point data can have a big impact on performance. Uniform distributions are less demanding and can be handled quickly with simpler data structures. Non-uniform distributions require more care. (Kd-trees work well for both uniform and non-uniform data.) Also, if your data is too large to process in-core then you will need to take an entirely different approach compared to smaller data sets.

Telegonus answered 7/10, 2014 at 14:13 Comment(3)
Thanks for the answer atb. But Wikipedia is not a source I can cite. I was rather looking for a scientific publication to use. For this, I am not interested in real-world runtime, but really in the theoretically expected runtime on a uniform distribution that is small enough to fit into main memory.Simonetta
Ah, OK, I haven't seen anything like that. It isn't that hard to do the complexity analysis yourself if necessary.Telegonus
Just wondered whether there is any table or something I could quote. Since it is only remotely related to the work I do, I'd rather quote it than do it myself, but if there's nothing out there, I guess I have no choice.Simonetta
