Best Performance-Critical Algorithm for Solving Nearest Neighbor

Asked 28/10, 2009 at 20:3 Answered 31/10, 2009 at 0:41

Solved algorithm machine-learning spatial spatial-index spatial-query

We have a list of x,y pairs. Every pair represents a point on a 2D space. I want to find the closest point from this list, to a specific point xq,yq. What is the best performance-critical algorithm for this problem? Lisp of points is not going to change; which means I do not need to perform insertion and deletion. I want just to find the nearest neighbor of a target xq,yq point in this set.

Edit 1: Thanks to all! As Stephan202 has guessed correctly, I want to do this repeatedly; like a function. An the list is not necessarily sorted (In fact I do not understand how can it be sorted? Like a table with a primary key of 2 columns a and y? If that helps then I will sort it).

I will construct the data structure based on the list one time, and then I will use this generated data structure in the function (if this process itself is relevant).

Thank you Jacob; It seems that KD-Tree data structure is a good candidate for being the answer (And I feel it is. I will update when I get some relevant results).

Edit 2: I have found that, this problem is named "nearest neighbor"!

Edit 3: First title was "In Search of an Algorithm (for Spatial-Querying and Spatial-Indexing) (Nearest Neighbor)"; I have chose a new title: "Best Performance-Critical Algorithm for Solving Nearest Neighbor". Since I do not want to perform insertion and deletion operation on my initial data and I want just the nearest one from them to a new point (which is not going to be inserted), I chose to (currently) working on KD-Trees. Thanks to all!

Shirleneshirley answered 28/10, 2009 at 20:3 Comment(3)

Is there some structure in the list (is it e.g. sorted)? Do you want to repeat this operation, or will it be performed once? This is relevant information which people will need to answer your question. – Vindicable 28/10, 2009 at 20:6

Do you have access to a spatial database? – Chiromancy 28/10, 2009 at 20:8

If the list is unsorted and the operation will be performed only once, you'll have to perform a linear search over the list and can thus not do better than O(n). If you're going to repeat the operation, you'll have to create a suitable (tree) representation of the list, based on the element's x and y values. – Vindicable 28/10, 2009 at 20:9

As Stephan202 noted, if you plan to find the closest-match for more than one point, you should use a tree.

I would recommend a KD-tree, whose implementation can be easily found in several packages like OpenCV 2.0. Or you could implement one yourself!

EDIT: I had asked a question about kd-tree implementations here - might be useful.

EDIT: KD-trees have been widely used successfully for NN searches :) - Also, if you're willing to accept approximate matches, you can use Fast Library for Approximate Nearest Neigbor (FLANN). The FLANN implementation is present in OpenCV 2.0.

If you don't want approximate answers you can tweak the FLANN parameters to search the entire tree.

Loutitia answered 28/10, 2009 at 20:22 Comment(4)

I was thinking of suggesting them as well, glad I took the time to look at the answers already suggested :) – Jabe 29/10, 2009 at 8:31

KD trees aren't built for this in the same way as some data structures are. If you find out that the query point is in the cell for point P, you still need to check all of the neighboring KD-tree cells, since any of those could also be the closest point. – Glavin 29/10, 2009 at 15:12

Thanks for responding jprete! I have described my problem in more details which is in fact a "Nearest Neighbor" problem. What algorithm/data structure are you pointing to? (Which you think are better than KD-Trees for solving this). – Shirleneshirley 29/10, 2009 at 20:58

I have implemented my KD Tree in C# and for uniform GIS position data it is pretty fast (.002 ms on my 2.66 6MB 4GB Studio XPS 16). But for real data, if the query position is far from real areas it goes slow (uo to 500 ms). But since I needed answers in a specific area, I have tuned the KD Tree to apply a max distance constraint; so it remained pretty fast! Of course my problem has it's own attributes: in-memory, no change in set of GIS data (no insert, update, delete node); yet I'v got a more than acceptable result! – Shirleneshirley 4/11, 2009 at 12:46

If the query point (xq, yq) varies and the list doesn't, you need to calculate the Voronoi diagram of the list of points. This will give you a set of polygons or "cells" (some of which are infinite); each polygon corresponds to a point from the original list, called the "site" of that cell. Any point which lies entirely inside of one polygon is closer to the site of that polygon than it is to the other sites on the original list. Any point on a boundary between two polygons lies equally distant from each site.

Once you've gotten that far, you need an easy way to figure out which polygon you're in. This is known as the point location problem.

A really, really good book for this kind of thing is Computational Geometry: Algorithms and Applications. They discuss both the Voronoi diagram calculation and the trapezoidal slab method of point location in detail.

If you don't want to do the code yourself, and you shouldn't, then try to get a library like CGAL that will do most of the work for you. This probably applies to the KD-tree answer as well, but I don't specifically know.

Glavin answered 28/10, 2009 at 20:29 Comment(0)

You need a spatial index.

If you roll your own, you can do a lot worse than picking the R-Tree or Quad-tree algorithms.

Militia answered 28/10, 2009 at 20:42 Comment(2)

I had not time to read about quadtree much but as far as I studied R-Tree; It is for indexing multi-dimensional data which 1) will be persisted (like in a database, not in-memory) 2) and set of data change (insert, update and delete); neither of them was properties of my problem (KD-Trees are hard to balance too; so they are not proper instead of R-Trees and vice versa). Thanks – Shirleneshirley 7/11, 2009 at 11:57

I think you should take more time to read about the R-Tree, and then look at the quadtree. If you can't roll your own, just use someone else's. Lots of databases offer GIS functionality. – Militia 7/11, 2009 at 17:45

I would go with a quadtree. It is the most simple spatial structure. In 2 dimensions i would generally recommend quadtree instead of kd-tree, because it is simpler, faster. Its downside is more memory consumption if the number of dimensions is high, but in case of 2 dimensions the difference is not significant.

There is a nice optimization trick if your coordinates are floating point typed: In a query you first will have to find the leaf-node that contains the point to which the most near point is asked. To do this you will have to go in the tree from the root to the leaf - in each iteration deciding which child-node to step on. Store the identifiers/addresses of the child-nodes in a 4-sized array in the Node structure. Digitize the point coordinates in the query algorithm. Then you will be able to find the proper sub-node just by indexing the array by 2 proper bits of the digitized point coordinates. Digitizing is fast: implement it with a simple static_cast.

But first implement the quadtree without optimization because it is easy to make a bug with the bit-operations. Even without this optimization, it still will be the fastest solution.

Gabardine answered 31/10, 2009 at 0:41 Comment(0)

Iterate through every other point using the distance formula to find the minimum distance from Q (xq,yq).

However, you haven't given enough information for a performance-critical answer.

For example, if Q is a VERY common point, you might want to calculate the distance to Q and store it with each point.

Second example, if you have a huge number of points, you might organize the points into sections and start with points only in the same section and adjacent sections to the section containing Q.

Microhenry answered 28/10, 2009 at 20:10 Comment(0)

Recommended topics

Hot tags