MemoryError in Python while using cKDTree().query_ball_tree
Asked Answered
F

1

6

I have large 2D arrays with unsorted (X,Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with succes for arrays with around 500,000 (X,Y) points. However, when I try the same algorithm for datasets of more than 1,000,000 points, query_ball_tree results in a MemoryError.

I use 64-bit Windows with 16Gb of internal memory, and have tried both 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).

def Construct_SearchTree(AllXyPoints):
    KDsearch = cKDTree(AllXyPoints)  
    return KDsearch.query_ball_tree(KDsearch, Maxdist)

My questions:

1) does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case that memory usage.

2) I hoped that switching from 32-bit to 64-bit python & extensions would solve the MemoryError. What could be the reason that it didn't?

Thanks for your incoming help and advice.

Flatways answered 6/8, 2013 at 11:27 Comment(0)
L
6

I experienced a MemoryError with SciPy's cKDTree during construction and scikit-learn's KDTree when calling .query_radius(). I found that Scikit-learn's BallTree was more memory efficient, and using a BallTree solved the issue for me. I've tested BallTree with 1 million data points on my 64-bit system. It still consumes all my available memory (12GB) and some swap space, but I don't get a MemoryError.

Queries on a BallTree will not be as fast as a KDTree since your data is 2D, and BallTrees are slower than KDTrees when d <= 3 (see explanation here). However, given that cKDtree and scikit-learn's KDTree both raise MemorErrors (on my system, anyway), the simplest solution is to use a BallTree.

from sklearn.neighbors import BallTree
import numpy as np

max_dist = .1
points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points
ball_tree = BallTree(points)

neighbors = ball_tree.query_radius(points, max_dist)

Depending on your Maxdist, the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius() returns an np.array of np.arrays rather than a list of lists so it should save you some memory there (see this answer for an explanation).

Larocca answered 6/8, 2013 at 14:50 Comment(4)
A Ball-tree is also not a KD-tree (and the effects are unmentioned; despite saying it's working with some given dimensions). But i'm also recommending sklearn for these data-stuctures.Roberge
@sascha, I've updated my answer. Apart from performance, did you have other effects in mind?Larocca
@Larocca Rather than querying the nearest neighbors within a single list Is it possible to use the ball tree method to compute the nearest distance neighbours for coordinates between two nested lists: a and b, where a dimension = 1000,2, b dimension = 2000,2: e.g. a=[ [1.2, 1.3], [6.5, 10] ... ] and b= [ [11,1.2] , [9.6, 8]... ]. I've done this using scipy.spatial.kdtree by doing: tree = spatial.KDTree(a) and closest_distance, closest_index = tree.query(b, k=1, p=2). But I'm not sure is possible with ball tree. Is?Kamakura
@Kamakura Yes. From the docs query() takes an array-like (read list, numpy.array or anything that quacks like an array) as the first argument. In my example, the points I passed to query_radius() did not have to be the same points used in construction--the same goes for query()Larocca

© 2022 - 2024 — McMap. All rights reserved.