Nearest neighbor 1 dimensional data with a specified range
Asked Answered
R

2

6

I have two nested lists A and B:

A = [[50,140],[51,180],[54,500],......]

B = [[50.1, 170], [51,200],[55,510].....]

The 1st element in each inner list runs from 0 to around 1e5, the 0th element runs from around 50 up to around 700, these elements are unsorted. What i want to do, is run through each element in A[n][1] and find the closest element in B[n][1], but when searching for the nearest neighbor i want to only search within an interval defined by A[n][0] plus or minus 0.5.

I have been using the function:

def find_nearest_vector(array, value): 
   idx = np.array([np.linalg.norm(x+y) for (x,y) in array-value]).argmin()
   return array[idx]

Which finds the nearest neighbor between the coordinates A[0][:]and B[0][:], for example. However, this I need to confine the search range to a rectangle around some small shift in the value A[0][0]. Also, I do not want to reuse elements - I want a unique bijection between each value A[n][1] to B[n][1] within the interval A[n][0] +/- 0.5.

I have been trying to use Scipy's KDTree, but this reuses elements and I don't know how to confine the search range. Effectively, I want to do a one dimensional NNN search on a two dimensional nested list along a specific axis where the neighborhood in which the NNN search is within a hyper-rectangle defined by the 0th element in each inner list plus or minus some small shift.

Rabelaisian answered 15/9, 2013 at 20:12 Comment(3)
Let's see if I got your question straight: you want the find_nearest_vector() function to be confined to some interval (2D interval?). Why not just use [np.linalg.norm(x+y) if array[0][0]-0.5<np.linalg.norm(x+y)<array[0][0]+0.5 else 999999 for (x,y) in array-value]Typehigh
(Maybe I have some syntax issues, but I hope I conveyed my thoughts)Typehigh
Not quite, each window is defined by the 0th element in the inner list. So for each iteration, a window is defined about A[n][0] +/- 0.5 and the element A[n][1] is then matched to the closest element in B[n]. I hope this makes more sense..?Rabelaisian
C
2

I use numpy.argsort(), numpy.searchsorted(), numpy.argmin() to do the search.

%pylab inline
import numpy as np
np.random.seed(0)
A = np.random.rand(5, 2)
B = np.random.rand(100, 2)
xaxis_range = 0.02
order = np.argsort(B[:, 0])
bx = B[order, 0]
sidx = np.searchsorted(bx, A[:, 0] - xaxis_range, side="right")
eidx = np.searchsorted(bx, A[:, 0] + xaxis_range, side="left")
result = []
for s, e, ay in zip(sidx, eidx, A[:, 1]):
    section = order[s:e]
    by = B[section, 1]
    idx = np.argmin(np.abs(ay-by))
    result.append(B[section[idx]])
result = np.array(result)

I plot the result as following:

plot(A[:, 0], A[:, 1], "o")
plot(B[:, 0], B[:, 1], ".")
plot(result[:, 0], result[:, 1], "x")

the output:

enter image description here

Cleaves answered 16/9, 2013 at 0:56 Comment(2)
I have a problem - when the list length of A and B gets larger i keep getting the error - ValueError: attempt to get argmax/argmin of an empty sequence - do you know what the cause of this is?Rabelaisian
This means that there is no B between A[:,0] - 0.5 ~ A[:,0] + 0.5, this will cause s == e, and section and by are empty.Cleaves
A
0

My understanding of your problem is that you are trying to find the closest elements for each A[n][1] in another set of points (B[i][1] restricted to points where if A[n][0] is within +/- 0.5 of B[i][0]).

(I'm not familiar with numpy or scipy, and I'm sure that there's a better way to do this with their algorithms.)

That being said, here's my naive implementation in O(a*b*log(a*b)) time.

def main(a,b):
    for a_bound,a_val in a:
        dist_to_valid_b_points = {abs(a_val-b_val):(b_bound,b_val) for b_bound,b_val in b if are_within_bounds(a_bound,b_bound)}
        print get_closest_point((a_bound, a_val),dist_to_valid_b_points)

def are_within_bounds(a_bound, b_bound):
    return abs(b_bound-a_bound) < 0.5

def get_closest_point(a_point, point_dict):
    return (a_point, None if not point_dict else point_dict[min(point_dict, key=point_dict.get)])

main([[50,140],[51,180],[54,500]],[[50.1, 170], [51,200],[55,510]]) yields the following output:

((50, 140), (50.1, 170))
((51, 180), (51, 200))
((54, 500), None)
Azpurua answered 15/9, 2013 at 21:4 Comment(2)
This seems to work, thank you very much - I will have to spend some time just checking things before i accept the answer. I have, however, one question - how come in the output the third element is ((54, 500), None) and not ((54, 500), (55,510))..?Rabelaisian
The distance between 54 and 55 is greater than 0.5, so (55,510) is excluded by are_within_bounds.Azpurua

© 2022 - 2024 — McMap. All rights reserved.