How can I speed up nearest neighbor search with python?

I have code that, for each unassigned voxel, finds the nearest assigned voxel. That is, I have an array of voxels: a few voxels already have a scalar value assigned (1, 2, 3, 4, ... etc.) and the rest are empty (say, a value of '0'). The code below finds the nearest assigned voxel to each unassigned voxel and gives the empty voxel the same scalar, so a voxel with scalar '0' gets a value (1 or 2 or 3, ...) based on its nearest non-empty neighbour. The code works, but it takes too much time. Is there an alternative, or do you have any feedback on how to improve it further?

""" #self.voxels is a 3D numpy array"""

def fill_empty_voxel1(self,argx, argy, argz):
""" where # argx, argy, argz are the voxel location where the voxel is zero"""
    argx1, argy1, argz1 = np.where(self.voxels!=0)   # find the non zero voxels
    a = np.column_stack((argx1, argy1, argz1)) 
    b = np.column_stack((argx, argy, argz))
    tree = cKDTree(a, leafsize=a.shape[0]+1)
    distances, ndx = tree.query(b, k=1, distance_upper_bound= self.mean) # self.mean is a mean radius search value
    argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
    self.voxels[argx,argy,argz] = self.voxels[argx2,argy2,argz2] # update the voxel array

Example

Here is a small example with a small dataset:

import numpy as np
from scipy.spatial import cKDTree
import timeit

voxels = np.zeros((10, 10, 5), dtype=np.uint8)
voxels[1:2, :, :] = 5
voxels[5:6, :, :] = 2
voxels[:, 3:4, :] = 1
voxels[:, 8:9, :] = 4
argx, argy, argz = np.where(voxels == 0)

tic = timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels != 0)   # non-zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0] + 1)
distances, ndx = tree.query(b, k=1, distance_upper_bound=5.)
argx2, argy2, argz2 = a[ndx][:, 0], a[ndx][:, 1], a[ndx][:, 2]
voxels[argx, argy, argz] = voxels[argx2, argy2, argz2]
toc = timeit.default_timer()
timetaken = toc - tic  # elapsed time in seconds
print('\nTime to fill empty voxels', timetaken)

For visualization:

from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)  
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())  
mlab.outline()    
mlab.show()    

Now, if the voxels array has dimensions like (500, 500, 500), the nearest-neighbour search takes far too long. How can I overcome this? Could parallel computation reduce the time? (I have no idea whether this code can be parallelised; if you do, please let me know.)

A potential fix:

I was able to substantially improve the computation time by adding the n_jobs=-1 parameter to the cKDTree query:

distances, ndx = tree.query(b, k=1, distance_upper_bound=5., n_jobs=-1)

I was able to compute the distances in less than an hour for an array of (400,100,100) on a 13-core CPU. With a single process the same array took around 18 hours. Thanks to @gsamaras for the answer!
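For reference, here is a minimal standalone sketch of the parallel fill, reusing the voxels, a and b arrays built as in the example above (on newer SciPy releases the n_jobs keyword appears to have been renamed to workers). Queries that find no assigned voxel within the search radius return the index len(a), so those are masked out before indexing:

import numpy as np
from scipy.spatial import cKDTree

def fill_empty_voxels(voxels, radius=5.0):
    """Assign each zero voxel the value of its nearest non-zero voxel (sketch)."""
    argx, argy, argz = np.where(voxels == 0)        # empty voxels
    argx1, argy1, argz1 = np.where(voxels != 0)     # assigned voxels
    a = np.column_stack((argx1, argy1, argz1))
    b = np.column_stack((argx, argy, argz))

    tree = cKDTree(a)
    # n_jobs=-1 uses all cores; on newer SciPy use workers=-1 instead
    distances, ndx = tree.query(b, k=1, distance_upper_bound=radius, n_jobs=-1)

    # queries with no neighbour inside the radius return index len(a); skip them
    valid = ndx < len(a)
    src = a[ndx[valid]]
    voxels[argx[valid], argy[valid], argz[valid]] = voxels[src[:, 0], src[:, 1], src[:, 2]]
    return voxels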

Subzero answered 23/9, 2016 at 12:38 Comment(3)
As a suggestion, you could try the methods from scikit-learn.org/stable/modules/generated/…, but I think the problem is memory capacity; (500,500,500) is a really giant object. – Encephalitis
Hello @gsamaras, thank you for the answer. I did try the sklearn neighbors approach, but the computation time didn't improve much. I was waiting to see whether anyone else would come up with another answer before accepting yours as the final one. Since your answer is closest to what I asked for, I will accept it. Thank you again! – Subzero
Hmm, I see @RavirajpurohitPurushottamr, thanks for your input, good luck! – Becky

It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers the n_jobs parameter:

The number of parallel jobs to run for neighbors search.

This package also provides the Ball Tree algorithm, which you can test against the k-d tree; my hunch is that the k-d tree will be better, but that of course depends on your data, so experiment!
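A minimal sketch of that approach, reusing the a (assigned-voxel coordinates), b (empty-voxel coordinates) and voxels arrays built in the question; the ball_tree choice and n_jobs=-1 are just illustrative settings:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# a: (N, 3) coordinates of assigned voxels, b: (M, 3) coordinates of empty voxels,
# built exactly as in the question.
nn = NearestNeighbors(n_neighbors=1, algorithm='ball_tree', n_jobs=-1)
nn.fit(a)
distances, ndx = nn.kneighbors(b)          # ndx has shape (M, 1)
nearest = a[ndx[:, 0]]                     # coordinates of the nearest assigned voxel
voxels[b[:, 0], b[:, 1], b[:, 2]] = voxels[nearest[:, 0], nearest[:, 1], nearest[:, 2]]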


You might also want to try dimensionality reduction, which is easy to apply. The idea is that with fewer dimensions your data carry less information, so the Nearest Neighbour problem can be solved much faster. Of course, there is a trade-off here: accuracy!

You might (or will) lose some accuracy with dimensionality reduction, but it may be worth a try. However, this usually pays off in high-dimensional spaces, and you are only in 3D, so I don't know whether sklearn.decomposition.PCA makes sense for your specific case.
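If you want to experiment with it anyway, a minimal sketch (reusing the a and b coordinate arrays from the question; with only 3 spatial dimensions this mostly illustrates the API rather than buying real speed):

from sklearn.decomposition import PCA

# Illustration only: project the 3D voxel coordinates onto 2 principal components
# before building the search structure; this discards real spatial information.
pca = PCA(n_components=2)
a_reduced = pca.fit_transform(a)   # assigned-voxel coordinates, reduced
b_reduced = pca.transform(b)       # empty-voxel coordinates, reduced
# a_reduced / b_reduced can then be fed to cKDTree or NearestNeighbors as before.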


A remark:

If you really want high performance though, you won't get it with Python; you could switch to C++ and use CGAL, for example.

Becky answered 24/9, 2016 at 19:54 Comment(0)

You can switch to approximate nearest neighbour (ANN) algorithms, which usually take advantage of sophisticated hashing or proximity-graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot showing the precision-performance trade-off of various ANN algorithms published in recent years. The top-performing algorithm at the time this answer was posted, hnsw, has a Python implementation under the Non-Metric Space Library (NMSLIB).
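For instance, a minimal Annoy sketch on the voxel coordinates from the question (the a, b and voxels arrays; the euclidean metric and the tree count of 10 are arbitrary choices here, and the per-point query loop is only meant to illustrate the API):

import numpy as np
from annoy import AnnoyIndex

# a: (N, 3) coordinates of assigned voxels, b: (M, 3) coordinates of empty voxels,
# built exactly as in the question.
index = AnnoyIndex(3, 'euclidean')            # 3 spatial dimensions
for i, point in enumerate(a):
    index.add_item(i, point.tolist())
index.build(10)                               # more trees -> better accuracy, slower build

# Annoy queries one vector at a time, so loop over the empty voxels.
ndx = np.array([index.get_nns_by_vector(p.tolist(), 1)[0] for p in b])
nearest = a[ndx]
voxels[b[:, 0], b[:, 1], b[:, 2]] = voxels[nearest[:, 0], nearest[:, 1], nearest[:, 2]]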

Bamberger answered 5/12, 2017 at 16:46 Comment(0)