Nearest neighbor 1 dimensional data with a specified range

Asked 15/9, 2013 at 20:12 Answered 16/9, 2013 at 0:56

Solved python algorithm numpy scipy nearest-neighbor

I have two nested lists A and B:

A = [[50,140],[51,180],[54,500],......]

B = [[50.1, 170], [51,200],[55,510].....]

The 1st element in each inner list runs from 0 to around 1e5, the 0th element runs from around 50 up to around 700, these elements are unsorted. What i want to do, is run through each element in A[n][1] and find the closest element in B[n][1], but when searching for the nearest neighbor i want to only search within an interval defined by A[n][0] plus or minus 0.5.

I have been using the function:

def find_nearest_vector(array, value): 
   idx = np.array([np.linalg.norm(x+y) for (x,y) in array-value]).argmin()
   return array[idx]

Which finds the nearest neighbor between the coordinates A[0][:]and B[0][:], for example. However, this I need to confine the search range to a rectangle around some small shift in the value A[0][0]. Also, I do not want to reuse elements - I want a unique bijection between each value A[n][1] to B[n][1] within the interval A[n][0] +/- 0.5.

I have been trying to use Scipy's KDTree, but this reuses elements and I don't know how to confine the search range. Effectively, I want to do a one dimensional NNN search on a two dimensional nested list along a specific axis where the neighborhood in which the NNN search is within a hyper-rectangle defined by the 0th element in each inner list plus or minus some small shift.

Rabelaisian answered 15/9, 2013 at 20:12 Comment(3)

Let's see if I got your question straight: you want the find_nearest_vector() function to be confined to some interval (2D interval?). Why not just use [np.linalg.norm(x+y) if array[0][0]-0.5<np.linalg.norm(x+y)<array[0][0]+0.5 else 999999 for (x,y) in array-value] – Typehigh 15/9, 2013 at 20:31

(Maybe I have some syntax issues, but I hope I conveyed my thoughts) – Typehigh 15/9, 2013 at 20:32

Not quite, each window is defined by the 0th element in the inner list. So for each iteration, a window is defined about A[n][0] +/- 0.5 and the element A[n][1] is then matched to the closest element in B[n]. I hope this makes more sense..? – Rabelaisian 15/9, 2013 at 21:21

I use numpy.argsort(), numpy.searchsorted(), numpy.argmin() to do the search.

%pylab inline
import numpy as np
np.random.seed(0)
A = np.random.rand(5, 2)
B = np.random.rand(100, 2)
xaxis_range = 0.02
order = np.argsort(B[:, 0])
bx = B[order, 0]
sidx = np.searchsorted(bx, A[:, 0] - xaxis_range, side="right")
eidx = np.searchsorted(bx, A[:, 0] + xaxis_range, side="left")
result = []
for s, e, ay in zip(sidx, eidx, A[:, 1]):
    section = order[s:e]
    by = B[section, 1]
    idx = np.argmin(np.abs(ay-by))
    result.append(B[section[idx]])
result = np.array(result)

I plot the result as following:

plot(A[:, 0], A[:, 1], "o")
plot(B[:, 0], B[:, 1], ".")
plot(result[:, 0], result[:, 1], "x")

the output:

enter image description here

Cleaves answered 16/9, 2013 at 0:56 Comment(2)

I have a problem - when the list length of A and B gets larger i keep getting the error - ValueError: attempt to get argmax/argmin of an empty sequence - do you know what the cause of this is? – Rabelaisian 16/9, 2013 at 15:50

This means that there is no B between A[:,0] - 0.5 ~ A[:,0] + 0.5, this will cause s == e, and section and by are empty. – Cleaves 16/9, 2013 at 22:1

My understanding of your problem is that you are trying to find the closest elements for each A[n][1] in another set of points (B[i][1] restricted to points where if A[n][0] is within +/- 0.5 of B[i][0]).

(I'm not familiar with numpy or scipy, and I'm sure that there's a better way to do this with their algorithms.)

That being said, here's my naive implementation in O(a*b*log(a*b)) time.

def main(a,b):
    for a_bound,a_val in a:
        dist_to_valid_b_points = {abs(a_val-b_val):(b_bound,b_val) for b_bound,b_val in b if are_within_bounds(a_bound,b_bound)}
        print get_closest_point((a_bound, a_val),dist_to_valid_b_points)

def are_within_bounds(a_bound, b_bound):
    return abs(b_bound-a_bound) < 0.5

def get_closest_point(a_point, point_dict):
    return (a_point, None if not point_dict else point_dict[min(point_dict, key=point_dict.get)])

main([[50,140],[51,180],[54,500]],[[50.1, 170], [51,200],[55,510]]) yields the following output:

((50, 140), (50.1, 170))
((51, 180), (51, 200))
((54, 500), None)

Azpurua answered 15/9, 2013 at 21:4 Comment(2)

This seems to work, thank you very much - I will have to spend some time just checking things before i accept the answer. I have, however, one question - how come in the output the third element is ((54, 500), None) and not ((54, 500), (55,510))..? – Rabelaisian 15/9, 2013 at 21:35

The distance between 54 and 55 is greater than 0.5, so (55,510) is excluded by are_within_bounds. – Azpurua 15/9, 2013 at 21:52

Recommended topics

Hot tags