Nearest Neighbor Search: Python

I have a 2 dimensional array:

MyArray = array([[6588252.24, 1933573.3, 212.79, 0, 0],
                 [6588253.79, 1933602.89, 212.66, 0, 0],
                 etc...])

The first two elements of each row, MyArray[i][0] and MyArray[i][1], are the X and Y coordinates of the point.

For every element in the array, I would like to find the quickest way to return its single nearest neighbor in a radius of X units. We are assuming this is in 2D space.

Let's say for this example X = 6.

I have solved the problem by comparing every element to every other element, but this takes 15 minutes or so when the list is 22k points long. We hope to eventually run this on lists of about 30 million points.

I have read about K-d trees and understand the basic concept, but have had trouble understanding how to script them.

Arbalest answered 16/10, 2012 at 21:20 Comment(5)
What's a "Kt tree"? You mean "k-d tree"? For two-dimensional points you only need a quadtree. There was an earlier question looking for quadtree implementations in Python: #6060802 – Secret
There's a k-d tree implementation in the scipy.spatial module – Commissariat
Note the cKDTree; it's much faster. – Coggins
@Dlinet: Your solution won't give the closest result, but rather the point itself, since the distance to itself is 0! You should instead use k=2 and take the second-closest result. – Zestful
Use scikit-learn NearestNeighbors; then you can choose among kd_tree, ball_tree, and brute force. The default is auto, which attempts to decide the most appropriate algorithm based on the values passed to the fit method. See my example below. – Koestler

Thanks to John Vinyard for suggesting scipy. After some good research and testing, here is the solution to this question:

Prerequisites: install NumPy and SciPy

  1. Import the SciPy and NumPy modules

  2. Make a two-column copy of the array containing just the X and Y values (the original has 5 columns).

  3. Create an instance of a cKDTree as such:

    YourTreeName = scipy.spatial.cKDTree(YourArray, leafsize=100)
    #Play with the leafsize to get the fastest result for your dataset
    
  4. Query the cKDTree for the nearest neighbor within 6 units as such:

    # YourArray here is the two-column (X, Y) copy built in step 2
    for item in YourArray:
        TheResult = YourTreeName.query(item, k=1, distance_upper_bound=6)

    For each item in YourArray, TheResult will be a tuple containing the distance between the two points and the index of that neighbor's location in YourArray.
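
Putting the steps together, here is a minimal end-to-end sketch (the array values are the two rows from the question; variable names beyond those in the steps are illustrative). Since every point's nearest neighbor in its own tree is itself at distance 0, it asks for k=2 and keeps the second match, as noted in the comments:

    import numpy as np
    from scipy.spatial import cKDTree

    # The array from the question: X and Y are the first two columns
    MyArray = np.array([[6588252.24, 1933573.3, 212.79, 0, 0],
                        [6588253.79, 1933602.89, 212.66, 0, 0]])

    xy = MyArray[:, :2]            # just the X and Y columns for the spatial index
    tree = cKDTree(xy, leafsize=100)

    # Query all points at once; k=2 because each point's closest match is itself
    dist, idx = tree.query(xy, k=2, distance_upper_bound=6)

    nearest_dist = dist[:, 1]      # distance to the actual nearest neighbor
    nearest_idx = idx[:, 1]        # row of that neighbor in MyArray
    # Rows with no neighbor within 6 units come back with dist == inf and idx == len(xy)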

Arbalest answered 26/10, 2012 at 0:7 Comment(2)
How about just the nearest to one particular point, rather than a collection? – Lanita
@SteveYeago query_ball_point seems to be available for this. – Loon
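
For just one particular point, as asked in the comment above, a small sketch assuming the tree built in the sketch earlier: tree.query returns the single nearest point, while query_ball_point returns every index within the radius:

    point = (6588252.24, 1933573.3)        # one query location

    # Single nearest neighbor to that point (may be the point itself if it is in the tree)
    dist, idx = tree.query(point, k=1)

    # Indices of all points within 6 units of that location
    nearby = tree.query_ball_point(point, r=6)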

Use sklearn.neighbors

import numpy as np
from sklearn.neighbors import NearestNeighbors

# example dataset: 10 points on the unit circle
coords_vect = np.vstack([np.sin(range(10)), np.cos(range(10))]).T

knn = NearestNeighbors(n_neighbors=3)
knn.fit(coords_vect)
distance_mat, neighbours_mat = knn.kneighbors(coords_vect)

The code above finds the nearest neighbors in a simple example dataset of 10 points located on the unit circle. The results for this dataset are explained below.

Results explained:

neighbours_mat = array([[0, 6, 7],
                        [1, 7, 8],
                        [2, 8, 9],
                        [3, 9, 4],
                        [4, 3, 5],
                        [5, 6, 4],
                        [6, 0, 5],
                        [7, 1, 0],
                        [8, 2, 1],
                        [9, 3, 2]], dtype=int64)

The values of the result matrix, neighbours_mat, are indices of the elements (rows) in the input vector, coords_vect. In our example, reading the first row of neighbours_mat: the point at index 0 of coords_vect is closest to itself (index 0), then to the point at index 6 of coords_vect, and then to the point at index 7; this can be verified in the plot of coords_vect below. The second row of neighbours_mat indicates that the point at index 1 is closest to itself, then to the point at index 7, then to the point at index 8, and so on.
Notes: the first column of neighbours_mat is the node we measure distances from, the second column is its nearest neighbor, and the third column is its second-nearest neighbor. You can get more neighbors by increasing n_neighbors in the NearestNeighbors(n_neighbors=3) initialization. distance_mat holds the distances of each node from its neighbors; since every node is at distance 0 from itself, the first column is always zeros:

distance_mat = array([[0.        , 0.28224002, 0.70156646],
                      [0.        , 0.28224002, 0.70156646],
                      [0.        , 0.28224002, 0.70156646],
                      [0.        , 0.28224002, 0.95885108],
                      [0.        , 0.95885108, 0.95885108],
                      [0.        , 0.95885108, 0.95885108],
                      [0.        , 0.28224002, 0.95885108],
                      [0.        , 0.28224002, 0.70156646],
                      [0.        , 0.28224002, 0.70156646],
                      [0.        , 0.28224002, 0.70156646]])
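
To also apply the question's 6-unit radius and skip the trivial self-match, a possible sketch with the same class is to request two neighbors and keep the second column, or to use radius_neighbors:

knn = NearestNeighbors(n_neighbors=2, radius=6.0)
knn.fit(coords_vect)

# Nearest non-self neighbor of every point: column 1 (column 0 is the point itself)
dist, idx = knn.kneighbors(coords_vect)
nearest_idx, nearest_dist = idx[:, 1], dist[:, 1]

# All neighbors within the radius for each point (arrays of varying length)
radius_dist, radius_idx = knn.radius_neighbors(coords_vect)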

Plotting the points:

import matplotlib.pyplot as plt

x_vect, y_vect = np.sin(range(10)), np.cos(range(10))
plt.figure()
plt.scatter(x_vect, y_vect)
for x, y, label in zip(x_vect, y_vect, range(len(x_vect))):
    plt.text(x, y, str(label))

The coordinates of the input vector coords_vect:

[Plot: the points' locations on the unit circle, labeled 0–9]

Koestler answered 28/3, 2022 at 9:14 Comment(0)
