How to find the closest 2 points in a 100 dimensional space with 500,000 points?
I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it?

Update: The space is Euclidean, sorry. And thanks for all the answers. BTW this is not homework.

Ingrate answered 10/10, 2010 at 5:6 Comment(11)
Out of interest, where did you get a 100-dimensional space?Hartmann
the question lacks clarity. is this a mathematical question?Sonja
@Sonja The question may lack many things, but it does have clarity: after reading 1 sentence I understand the problem completely. (Although the type of space isn't mentioned, Euclidean is usually assumed by default.)Grube
related: https://mcmap.net/q/102710/-millions-of-3d-points-how-to-find-the-10-of-them-closest-to-a-given-point Beware that KDTree's approach in your case requires ~5 days of computations.Steelman
what kind of performance are you looking for?Mcnutt
@J.F.Sebastian: I am going for an approximate k-NN solution. Using the ANN library suggested by @dalle below. But I plan to reduce the dimensions to 20 using PCA first as per @Hasan Khan's suggestions.Ingrate
@Unreason: The performance I am looking for is something that can run within a week or so.Ingrate
@louzer, well if you brute force it you would need 500,000 * 499,999 / 2 ≈ 1.25e11 comparisons, each taking ~400 floating point operations, which brings it to 5e13 operations total. One day has around 86,400 seconds, so you would need 5.8e8 FLOPS, and I think current processors are approximately there. With some cycles spent on other things, I guess you could brute force it in a week.Mcnutt
@louzer: here's brute force approach using KDTree and multiprocessing ideone.com/Z7uSc (you could test it against your solution for small number of points)Steelman
@J.F.Sebastian, you should put it as an answer...Mcnutt
@Will, I got a 104 dimensional space when I plotted all proteins according to certain functional & structural parameters. I was able to derive a very small dataset with just the closest evolutionary relatives by solving the k-NN problem. I ran a classifier on those relatives to derive high precision & high recall signatures common to all proteins of certain properties.Ingrate
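The chunked brute-force idea discussed in the comments above can be sketched in a few lines of NumPy (NumPy is my assumption, and the function name is hypothetical). It computes exact pairwise distances block by block, so memory stays at O(n·chunk) instead of O(n²):

```python
import numpy as np

def brute_force_closest_pair(X, chunk=512):
    """Exact closest pair via chunked pairwise distances.

    O(n^2) time, O(n * chunk) memory -- a sketch only; at
    500,000 x 100 this is the multi-day computation estimated
    in the comments above.
    """
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)          # squared norms of all rows
    best = (np.inf, -1, -1)
    for s in range(0, n, chunk):
        e = min(s + chunk, n)
        # squared distances between rows s..e-1 and all rows >= s
        d2 = sq[s:e, None] + sq[None, s:] - 2.0 * X[s:e] @ X[s:].T
        # mask self-distances and the lower triangle so each pair counts once
        rows = np.arange(e - s)[:, None]
        cols = np.arange(n - s)[None, :]
        d2[cols <= rows] = np.inf
        i, j = np.unravel_index(np.argmin(d2), d2.shape)
        if d2[i, j] < best[0]:
            best = (float(d2[i, j]), s + int(i), s + int(j))
    d2min, i, j = best
    return i, j, float(np.sqrt(max(d2min, 0.0)))
```

Parallelizing the outer loop over chunks is what brings the week-long estimate within reach.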

You could try the ANN library, but that only gives reliable results up to 20 dimensions.
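ANN itself is a C++ library; as a rough Python illustration of the same approximate-nearest-neighbour trade-off (SciPy is my stand-in here, not part of the answer), `cKDTree.query` takes an `eps` parameter that allows the search to prune branches early:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.random((2000, 20))  # an ANN-friendly workload: <= ~20 dimensions

tree = cKDTree(pts)
# With eps > 0 the query is approximate: each reported neighbour is
# guaranteed to be within (1 + eps) of the true nearest distance.
dist, idx = tree.query(pts, k=2, eps=0.5)  # k=2: index 0 is the point itself
i = int(np.argmin(dist[:, 1]))             # point with the closest other point
j = int(idx[i, 1])
```

As the comment below notes, the approximate answer may differ from the exact closest pair; lowering `eps` tightens the guarantee at the cost of speed.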

Visser answered 10/10, 2010 at 5:50 Comment(2)
Thanks. ANN is just what I was looking for. Hopefully it can hold everything in RAM.Ingrate
ANN is easy to use, but it should be noted that it is an approximate nearest neighbor implementation, so isn't guaranteed to be correct.Frankfort

There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n*logn) time. You can check it out on Google Books. In fact, I suggest it for everyone, as the way they apply the divide-and-conquer technique to this problem is very simple, elegant and impressive.

Although it can't be extended directly to your problem (as constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n*logn*m) complexity where n is the number of points and m is the number of dimensions.

Edit:
That's all assuming you have a Euclidean space, i.e., the length of vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
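The divide-and-conquer idea can be sketched in Python for the 2-D case. This is an illustrative sketch, not the book's exact code; re-sorting the strip on every call makes it O(n*log^2 n) rather than the optimal O(n*log n):

```python
import math

def closest_pair(points):
    """Divide-and-conquer closest pair for a list of 2-D point tuples."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def rec(pts):  # pts is sorted by x
        n = len(pts)
        if n <= 3:  # base case: brute force the few remaining pairs
            return min((dist(p, q), p, q)
                       for a, p in enumerate(pts) for q in pts[a + 1:])
        mid = n // 2
        mid_x = pts[mid][0]
        best = min(rec(pts[:mid]), rec(pts[mid:]))
        # only points within best-distance of the dividing line can beat it;
        # scan them in y-order, checking at most the next 7 neighbours each
        strip = sorted((p for p in pts if abs(p[0] - mid_x) < best[0]),
                       key=lambda p: p[1])
        for a, p in enumerate(strip):
            for q in strip[a + 1:a + 8]:
                d = dist(p, q)
                if d < best[0]:
                    best = (d, p, q)
        return best

    return rec(sorted(points))
```

The "next 7 neighbours" bound is the constant that blows up in higher dimensions, which is exactly the caveat in the paragraph above.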

Grube answered 10/10, 2010 at 5:39 Comment(1)
Divide and conquer only works well when you have many more points than dimensions. The main saving comes from limiting the cross-search between the two halves to the shortest distance found within each half. But here is the problem: the shortest distance found may be longer than the distance from the dividing border to the farthest points, so in the end you save nearly nothing and still go n^2, with even more overhead.Headmistress

Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.

http://en.wikipedia.org/wiki/Kd-tree
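For example, with SciPy's kd-tree (an assumed implementation choice, mirroring the comment thread on the question) the closest pair falls out of a k=2 self-query:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((500, 10))   # small stand-in for 500,000 x 100

tree = cKDTree(points)
# k=2 because each point's own nearest neighbour is itself, at distance 0
dists, idxs = tree.query(points, k=2)
i = int(np.argmin(dists[:, 1]))  # point with the closest "other" neighbour
j = int(idxs[i, 1])
closest_pair_distance = float(dists[i, 1])
```

Be aware that in very high dimensions kd-tree queries degrade toward brute force, so at 100 dimensions this is mainly useful after dimensionality reduction.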

P.S. Fun problem!

Zamir answered 10/10, 2010 at 6:5 Comment(0)

Run PCA on your data to convert the vectors from 100 dimensions down to, say, 20 dimensions. Then build a kd-tree on the reduced points and find the closest 2 neighbors based on Euclidean distance.

Generally, if the number of dimensions is very large, then you have to fall back on either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
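A sketch of this pipeline, assuming scikit-learn for the PCA step and SciPy for the kd-tree (the library choices are mine, not the answer's):

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((2000, 100))      # stand-in for the 500,000 x 100 dataset

pca = PCA(n_components=20)
Xr = pca.fit_transform(X)
kept = pca.explained_variance_ratio_.sum()  # fraction of variance retained

tree = cKDTree(Xr)
d, idx = tree.query(Xr, k=2)     # k=2: each point's nearest "other" point
i = int(np.argmin(d[:, 1]))
j = int(idx[i, 1])
# (i, j) is the closest pair in the reduced space; re-check the top few
# candidates in the original 100-d space if an exact answer is needed
```

Since PCA discards variance, the closest pair in the reduced space is only approximately the closest pair in the original space, which is why checking `kept` (as the comment below suggests) matters.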

Measures answered 10/10, 2010 at 6:0 Comment(2)
Thanks. I am reducing the dimensions as per your suggestions.Ingrate
If you do run PCA 100 -> 20 dimensions, be sure to check the fraction of variance retained: sum(first 20 eigenvalues) / sum(all eigenvalues).Veneration

Use the data structure known as a kd-tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree.

My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.

Here's his published paper: http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

Haakon answered 10/10, 2010 at 6:19 Comment(3)
kd-trees make it easy to find the nearest neighbor of a given point in O(log n) time, as I remember. Is there an optimization to find the closest pair of points in less than O(n log n)?Jingo
-1; also, according to Wikipedia, kd-trees are only efficient if N >> 2^k (where k is the number of dimensions and N the number of points). In this case 2^100 >> 5e5, so the answer is completely misleading.Mcnutt
10d is not 100d. Even if the data points lie roughly in a 10-d plane within the 100-d space, a kd-tree can't work (imho): think of a kd-tree 100 levels deep.Veneration

© 2022 - 2024 — McMap. All rights reserved.