Remove points to maximize shortest nearest neighbor distance

Asked 12/11, 2013 at 17:27 Answered 12/11, 2013 at 18:40

If I have a set of N points in 2D space, defined by vectors X and Y of their locations. What is an efficient algorithm that will

Select a fixed number (M) points to remove so as to maximize the shortest nearest-neighbor distance among the remaining points.
Remove a minimum number of points so that the shortest nearest-neighbor distance among the remaining points is greater than a fixed distance (D).

Sorting by points by their shortest nearest neighbor distance and removing the points with the smallest values does not give the correct answer, as then you remove both points of close pairs, while you may only need to remove one of the points in those pairs.

For my case, I am usually dealing with 1,000-10,000 points, and I may remove 50-90% of points.

Katharinakatharine answered 12/11, 2013 at 17:27 Comment(6)

One probably imperfect algorithm: Find the closest two points, remove the point that has the smallest next NND, repeat until you reach M or D. – Katharinakatharine 12/11, 2013 at 17:53

Another, suggested by @berkeleymalagon in twitter: calculate all the pairwise distances once, sort all pairs by distance, and remove points from shortest pairs that are in fewest number of pairs. – Katharinakatharine 12/11, 2013 at 17:55

It occurs to me that for case (2), if I calculate the full distance matrix, each point will be a member of a number of pairs with distances <D. Then I need to select the points that are in the greatest number of pairs, in a way that is complementary so as to minimize the number of points selected. – Katharinakatharine 13/11, 2013 at 15:4

OK, so that means case (2) is a classic set cover problem. But what about case (1)? – Katharinakatharine 13/11, 2013 at 15:49

Case (1) strikes me as being a kind of knapsack problem played with non-integral weights. – Wylen 24/5, 2016 at 16:10

Do you know anything else about your points? Are they distributed uniform-random, in a grid-like fashion, do you expect clustering, &c? – Wylen 24/5, 2016 at 16:12

Noam

One method is to break your 2D space into N partitions. Within each partition, determine an average position for each X,Y. Then perform the nearest neighbor algorightm on the averaged points. Then repeat the nearest neighbor test on the full point set of the partitions that matched.

Here's the catch. The larger the partitions, the fewer points you will have but the less accurate. The smaller the partitions, it will be more accurate but with more points to process.

Eloyelreath answered 12/11, 2013 at 17:37 Comment(0)

You shouldn't need to store (or compute) the entire distance matrix: a Delaunay triangulation should efficiently (O(n log n) worst case) give you the closest neighbors of your point set. You should also be able to update it efficiently as you delete points.

For most cases of close pairs, you should be able to check to see which of the pair would be farthest from its neighbors if the other is removed. This is not an exact solution; especially if you remove a large proportion of points, removing a locally optimum point may exclude the globally optimum solution. Also, you should be able to deal with clusters of 3 or more locally close points. However, if you are only removing a small proportion of points from a randomly distributed set, both these cases may be relatively rare.

There may or may not be a better way (i.e., an exact and efficient algorithm) to solve your problem, but the above suggestions should lead to an approximate and/or combinatorial approach which works best when the points that need deleting are sparsely distributed.

Quarry answered 12/11, 2013 at 18:1 Comment(1)

I will try this. However, in my case I may remove 50-90% of the points. I'll see how the method holds up in this case. – Katharinakatharine 12/11, 2013 at 21:10

I can't think of anything other than a brute force approach. But you can probably shorten the data set you are looking at significantly before any analysis.

So, what I would do is. First work out the nearest neighbour distance for each point. Let's call that P_in. Then work out the maximum distance of each point to its M nearest neighbours, call it P_iM. If P_in is greater than P_iM for any point then it can be excluded from the analysis. Basically if you have one point that is a distance of 10 from any other point, and you have another point that is a distance of 9 from the nearest M points then you should remove the first point.

Depending on the level of clustering or how big M is, this might reduce your data set quite a bit.

Mastitis answered 12/11, 2013 at 18:40 Comment(0)

Recommended topics

Hot tags