Efficient method for finding KNN of all nodes in a KD-Tree

I'm currently attempting to find K Nearest Neighbor of all nodes of a balanced KD-Tree (with K=2).

My implementation is a variation of the code from the Wikipedia article, and it is decently fast: finding the KNN of any single node takes about O(log N).

The problem is that I need to find the KNN of every node, which comes out to about O(N log N) if I iterate over each node and run a separate search.
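For illustration, here is a minimal sketch of that repeated-query baseline. It uses SciPy's cKDTree as a stand-in for my own tree (my actual code follows the Wikipedia pseudocode, so the library is just an assumption for the example):

```python
import numpy as np
from scipy.spatial import cKDTree  # stand-in for the hand-rolled kd-tree

points = np.random.rand(10_000, 2)   # example data
tree = cKDTree(points)

# One O(log N) query per point -> O(N log N) overall.
# k=3 because each point is its own nearest neighbor (distance 0), which we drop.
dists, idxs = tree.query(points, k=3)
knn_dists, knn_idxs = dists[:, 1:], idxs[:, 1:]   # the 2 nearest neighbors of every point
```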

Is there a more efficient way to do this?

Ought answered 26/3, 2010 at 14:3 Comment(3)
Do you want to store the results in some list or iterate over the tuples (t, knn1, knn2)?Sitarski
Just iterating. Though I'm curious, what would be the difference in approach?Ought
The main difference between the KNN-search and your search is that all your search values are already in the tree. So your search starts in a node that is not the root node. Starting from every node you can traverse the tree, find 2 candidates, and traverse until there cannot be another nearer candidate. This may save some node traversals but is still O(n log n) if the tree is balanced. Maybe there is a way to reuse computations (which will still be O(n log n)).Sitarski

Depending on your needs, you may want to experiment with approximate techniques. For details, check out Arya and Mount's work on the subject. A key paper is here. Big-O complexity details are in their '98 paper.

A graphical illustration of the work is shown below:


Source: http://www.cs.umd.edu/~mount/ANN/Images/annspeckle.gif

I have used their library on very high-dimensional datasets with hundreds of thousands of elements. It's faster than anything else I found. The library handles both exact and approximate searches. The package contains some CLI utilities that you can use to easily experiment with your dataset, and even visualize the kd-tree (see above).

FWIW: I used the R Bindings.

From ANN's manual:

...it has been shown by Arya and Mount [AM93b] and Arya, et al. [AMN+98] that if the user is willing to tolerate a small amount of error in the search (returning a point that may not be the nearest neighbor, but is not significantly further away from the query point than the true nearest neighbor) then it is possible to achieve significant improvements in running time. ANN is a system for answering nearest neighbor queries both exactly and approximately.
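As a rough illustration of the same (1+ε) trade-off (this uses SciPy's cKDTree rather than the ANN package itself; the eps parameter bounds how much further than the true k-th neighbor a returned point may be):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(100_000, 8)   # higher-dimensional example data
tree = cKDTree(points)

# Exact 2-NN for every point (k=3 so we can drop each point itself).
exact_d, exact_i = tree.query(points, k=3)

# (1+eps)-approximate 2-NN: every returned distance is at most (1 + eps) times
# the true k-th nearest distance, which lets the search prune far more aggressively.
approx_d, approx_i = tree.query(points, k=3, eps=0.5)
```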

Campestral answered 26/3, 2010 at 17:46 Comment(3)
Wow, thanks for the research, Ryan. Sadly I'm looking for accurate results. If KNN using a KD-Tree is limited at this speed, maybe I'm going about this search with the wrong data structures. Any alternate suggestions?Ought
As the last sentence of that quote from their manual points out, you can do exact searches as well with this library. "ANN is a system for answering nearest neighbor queries both exactly and approximately"Campestral
Approximate search is sometimes helpful. Try searching the likely path first and using a distance calculation that knows about hyperplanes and points along the path. If the final point isn't that close to any hyperplane then it's usually the nearest neighbour.Benefaction

I used a cover tree for this problem. Here is the link: http://hunch.net/~jl/projects/cover_tree/cover_tree.html

On a data set of 50M points (all-kNN query, k=100), the cover tree took 5.5 s to build and 120 s to query. The ANN library took 3.3 s to build its tree and 138 s to query.

Updated: the nearest-neighbor relation is not symmetric. Consider A(0,0), B(1,0), C(3,0): B is the nearest neighbor of C, but C is not the nearest neighbor of B (A is).
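A quick check with those same three points (SciPy's cKDTree here, purely for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

pts = np.array([[0.0, 0.0],   # A
                [1.0, 0.0],   # B
                [3.0, 0.0]])  # C

_, nn = cKDTree(pts).query(pts, k=2)   # nearest neighbor of each point, excluding itself
print(nn[:, 1])                        # -> [1 0 1]: B is C's nearest, but B's nearest is A
```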

Jacobjacoba answered 8/11, 2011 at 3:23 Comment(1)
Does all the data need to fit in RAM, or only the tree?Wittgenstein

If the nodes themselves are the query points, then the search time might be lower. You can start directly with the backtracking stage, and since the first nodes tested are already near the query point, large areas of the tree can be pruned early.

The nearest neighbor is a symmetric relation (if n1 is a nearest neighbor of n2, the same applies to n2), so you only need to search half the nodes, skipping all the nodes already marked as nearest neighbors. Just an idea.

You can also try KD-Tree BBF (Best-Bin First) search, which will help you search the nearest nodes (bins) sooner. I have implemented this in C#, so write me if you are interested in the source code.
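In case it helps before then, here is a rough sketch of the best-bin-first idea in Python (an illustrative assumption, not my C# code): pending subtrees are kept in a priority queue ordered by a lower bound on their distance to the query, and the search stops after a fixed node budget, which is what makes BBF approximate.

```python
import heapq
import numpy as np

class Node:
    __slots__ = ("idx", "axis", "left", "right")
    def __init__(self, idx, axis, left, right):
        self.idx, self.axis, self.left, self.right = idx, axis, left, right

def build(points, indices, depth=0):
    """Plain kd-tree build: split at the median along a cycling axis."""
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return Node(order[mid], axis,
                build(points, order[:mid], depth + 1),
                build(points, order[mid + 1:], depth + 1))

def bbf_nearest(root, points, query, max_visits=200):
    """Best-bin-first 1-NN: explore subtrees closest to the query first,
    prune with a lower bound, and stop after max_visits nodes."""
    best_d, best_i = float("inf"), -1
    visited, counter = 0, 0
    heap = [(0.0, counter, root)]            # (lower bound, tie-breaker, subtree root)
    while heap and visited < max_visits:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_d:                  # nothing in this subtree can be closer
            continue
        while node is not None and visited < max_visits:
            visited += 1
            d = float(np.linalg.norm(points[node.idx] - query))
            if d < best_d:
                best_d, best_i = d, node.idx
            diff = query[node.axis] - points[node.idx, node.axis]
            near, far = (node.left, node.right) if diff <= 0 else (node.right, node.left)
            if far is not None:              # queue the far side; best bins get popped first
                counter += 1
                heapq.heappush(heap, (abs(diff), counter, far))
            node = near
    return best_i, best_d

# Example: approximate nearest neighbor of a single query point.
pts = np.random.rand(10_000, 2)
root = build(pts, np.arange(len(pts)))
idx, dist = bbf_nearest(root, pts, np.array([0.5, 0.5]))
```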

Of course, the actual running time depends on dimensionality, KD-Tree structure and distribution of points in your dataset.

Clustering of the points might also be appropriate.

Hydrodynamics answered 3/12, 2010 at 15:18 Comment(0)

The term to search for is knn join. More precisely, you probably want to do a self-join.

Maybe these search results help:

I have only seen knn join algorithms for the R*-tree. However, in my own experiments, they were not able to outperform repeated queries. I might be missing some implementation ideas. But in general, organizing the data appropriately for a tree join is much trickier than running a single knn query.

Rectilinear answered 18/12, 2012 at 9:4 Comment(0)
