Millions of 3D points: How to find the 10 of them closest to a given point?

A point in 3-D is defined by (x,y,z). The distance d between any two points (X,Y,Z) and (x,y,z) is d = sqrt((X-x)^2 + (Y-y)^2 + (Z-z)^2). Now there are a million entries in a file, each entry some point in space, in no specific order. Given any point (a,b,c), find the nearest 10 points to it. How would you store the million points, and how would you retrieve the 10 nearest from that data structure?

Evenfall answered 21/3, 2010 at 5:47 Comment(3)
Do you create and fill the data structure before or after you're told what the point (a,b,c) is? David's answer, for example, doesn't work if you create the data structure first, and then a user types in the (a,b,c) and wants an answer instantly.Afterlife
Good point (no pun intended!) Of course, if (a,b,c) is not known in advance, it's more a problem of optimizing the existing list of points for searching by 3D location, rather than actually doing the search.Bevy
It really should be clarified whether the cost of preparing the data structure and storing the million points in it needs to be taken into account, or whether only retrieval performance counts. If that cost does not matter, then regardless of how many times you retrieve points the kd-tree will win. If that cost does matter, then you should also specify how many times you expect to run the search (for a small number of searches brute force will win; for larger numbers the kd-tree will).Senzer

A million points is a small number, and the most straightforward approach works here (the KDTree-based code below is slower when querying only a single point).

Brute-force approach (time ~1 second)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point = numpy.random.uniform(0, 100, NDIM) # choose random point
print 'point:', point
d = ((a-point)**2).sum(axis=1)  # squared distances (enough for ranking)
ndx = d.argsort() # indirect sort 

# print 10 nearest points to the chosen one
import pprint
pprint.pprint(zip(a[ndx[:10]], d[ndx[:10]]))

Run it:

$ time python nearest.py 
point: [ 69.06310224   2.23409409  50.41979143]
[(array([ 69.,   2.,  50.]), 0.23500677815852947),
 (array([ 69.,   2.,  51.]), 0.39542392750839772),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  50.]), 0.76681859086988302),
 (array([ 69.,   3.,  51.]), 0.9272357402197513),
 (array([ 70.,   2.,  50.]), 1.1088022980015722),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   2.,  51.]), 1.2692194473514404),
 (array([ 70.,   3.,  51.]), 1.801031260062794),
 (array([ 69.,   1.,  51.]), 1.8636121147970444)]

real    0m1.122s
user    0m1.010s
sys 0m0.120s

Here's the script that generates a million 3D points:

#!/usr/bin/env python
import random
for _ in xrange(10**6):
    print ' '.join(str(random.randrange(100)) for _ in range(3))

Output:

$ head million_3D_points.txt

18 56 26
19 35 74
47 43 71
82 63 28
43 82 0
34 40 16
75 85 69
88 58 3
0 63 90
81 78 98

You could use this code to test more complex data structures and algorithms (for example, whether they actually consume less memory or run faster than the simplest approach above). It is worth noting that at the moment this is the only answer that contains working code.
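
If you only need the 10 nearest points and don't care about their order among themselves, numpy.argpartition (suggested by ali_m in the comments below) avoids the full sort. A sketch, in Python 3 syntax unlike the Python 2 snippets here, assuming a NumPy recent enough to provide argpartition:

#!/usr/bin/env python3
import numpy

NDIM = 3  # number of dimensions

a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size // NDIM, NDIM

point = numpy.random.uniform(0, 100, NDIM)  # choose random point
d = ((a - point)**2).sum(axis=1)            # squared distances

ndx = numpy.argpartition(d, 10)[:10]  # 10 smallest, arbitrary order; avg O(N)
ndx = ndx[d[ndx].argsort()]           # order just those 10 by distance
print(list(zip(a[ndx], d[ndx])))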

Solution based on KDTree (time ~1.4 seconds)

#!/usr/bin/env python
import numpy

NDIM = 3 # number of dimensions

# read points into array
a = numpy.fromfile('million_3D_points.txt', sep=' ')
a.shape = a.size / NDIM, NDIM

point =  [ 69.06310224,   2.23409409,  50.41979143] # use the same point as above
print 'point:', point


from scipy.spatial import KDTree

# find 10 nearest points
tree = KDTree(a, leafsize=a.shape[0]+1)  # leafsize > number of points: the tree is a single leaf (see comments)
distances, ndx = tree.query([point], k=10)

# print 10 nearest points to the chosen one
print a[ndx]

Run it:

$ time python nearest_kdtree.py  

point: [69.063102240000006, 2.2340940900000001, 50.419791429999997]
[[[ 69.   2.  50.]
  [ 69.   2.  51.]
  [ 69.   3.  50.]
  [ 69.   3.  50.]
  [ 69.   3.  51.]
  [ 70.   2.  50.]
  [ 70.   2.  51.]
  [ 70.   2.  51.]
  [ 70.   3.  51.]
  [ 69.   1.  51.]]]

real    0m1.359s
user    0m1.280s
sys 0m0.080s

Partial sort in C++ (time ~1.1 seconds)

// $ g++ nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>
#include <iostream>
#include <vector>

#include <boost/lambda/lambda.hpp>  // _1
#include <boost/lambda/bind.hpp>    // bind()
#include <boost/tuple/tuple_io.hpp>

namespace {
  typedef double coord_t;
  typedef boost::tuple<coord_t,coord_t,coord_t> point_t;

  coord_t distance_sq(const point_t& a, const point_t& b) { // or boost::geometry::distance
    coord_t x = a.get<0>() - b.get<0>();
    coord_t y = a.get<1>() - b.get<1>();
    coord_t z = a.get<2>() - b.get<2>();
    return x*x + y*y + z*z;
  }
}

int main() {
  using namespace std;
  using namespace boost::lambda; // _1, _2, bind()

  // read array from stdin
  vector<point_t> points;
  cin.exceptions(ios::badbit); // throw exception on bad input
  coord_t x,y,z;
  while (cin >> x >> y >> z)   // stops cleanly once input is exhausted
    points.push_back(boost::make_tuple(x,y,z));

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;  // 1.14s

  // find 10 nearest points using partial_sort() 
  // Complexity: O(N*log(m)) comparisons (O(N*log(N)) worst case for this implementation)
  const size_t m = 10;
  partial_sort(points.begin(), points.begin() + m, points.end(), 
               bind(less<coord_t>(), // compare by distance to the point
                    bind(distance_sq, _1, point), 
                    bind(distance_sq, _2, point)));
  for_each(points.begin(), points.begin() + m, cout << _1 << "\n"); // 1.16s
}

Run it:

$ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
point: (69.0631 2.23409 50.4198)
(69 2 50)
(69 2 51)
(69 3 50)
(69 3 50)
(69 3 51)
(70 2 50)
(70 2 51)
(70 2 51)
(70 3 51)
(69 1 51)

real    0m1.152s
user    0m1.140s
sys 0m0.010s

Priority Queue in C++ (time ~1.2 seconds)

#include <algorithm>           // make_heap
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/range.hpp>     // boost::begin(), boost::end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  template<class T>
  class less_distance : public std::binary_function<T, T, bool> {
    const T& point;
  public:
    less_distance(const T& reference_point) : point(reference_point) {}

    bool operator () (const T& a, const T& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours+1];

  // populate `points` with the first `nneighbours` input points;
  // testing `i` first ensures exactly nneighbours points are read here
  for (size_t i = 0; i < nneighbours && getpoint(cin, points[i]); ++i)
    ;

  less_distance<point_t> less_distance_point(point);
  // heapify the first nneighbours entries; points[nneighbours] is scratch
  make_heap  (boost::begin(points), boost::begin(points) + nneighbours,
              less_distance_point);

  // Complexity: O(N*log(m))
  while(getpoint(cin, points[nneighbours])) {
    // add points[-1] to the heap; O(log(m))
    push_heap(boost::begin(points), boost::end(points), less_distance_point); 
    // remove (move to last position) the most distant from the
    // `point` point; O(log(m))
    pop_heap (boost::begin(points), boost::end(points), less_distance_point);
  }

  // print results
  push_heap  (boost::begin(points), boost::end(points), less_distance_point);
  //   O(m*log(m))
  sort_heap  (boost::begin(points), boost::end(points), less_distance_point);
  for (size_t i = 0; i < nneighbours; ++i) {
    cout << points[i] << ' ' << distance_sq(points[i], point) << '\n';  
  }
}

Run it:

$ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )

point: (69.0631 2.23409 50.4198)
(69 2 50) 0.235007
(69 2 51) 0.395424
(69 3 50) 0.766819
(69 3 50) 0.766819
(69 3 51) 0.927236
(70 2 50) 1.1088
(70 2 51) 1.26922
(70 2 51) 1.26922
(70 3 51) 1.80103
(69 1 51) 1.86361

real    0m1.174s
user    0m1.180s
sys 0m0.000s

Linear search-based approach (time ~1.15 seconds)

// $ g++ -O3 nearest.cc && (time ./a.out < million_3D_points.txt )
#include <algorithm>           // sort
#include <functional>          // binary_function<>
#include <iostream>

#include <boost/foreach.hpp>
#include <boost/range.hpp>     // begin(), end()
#include <boost/tr1/tuple.hpp> // get<>, tuple<>, cout <<

#define foreach BOOST_FOREACH

namespace {
  typedef double coord_t;
  typedef std::tr1::tuple<coord_t,coord_t,coord_t> point_t;

  // calculate distance (squared) between points `a` & `b`
  coord_t distance_sq(const point_t& a, const point_t& b);

  // read from input stream `in` to the point `point_out`
  std::istream& getpoint(std::istream& in, point_t& point_out);    

  // Adaptable binary predicate that defines whether the first
  // argument is nearer than the second one to given reference point
  class less_distance : public std::binary_function<point_t, point_t, bool> {
    const point_t& point;
  public:
    explicit less_distance(const point_t& reference_point) 
        : point(reference_point) {}
    bool operator () (const point_t& a, const point_t& b) const {
      return distance_sq(a, point) < distance_sq(b, point);
    } 
  };
}

int main() {
  using namespace std;

  // use point value from previous examples
  point_t point(69.06310224, 2.23409409, 50.41979143);
  cout << "point: " << point << endl;
  less_distance nearer(point);

  const size_t nneighbours = 10; // number of nearest neighbours to find
  point_t points[nneighbours];

  // populate `points`
  foreach (point_t& p, points)
    if (! getpoint(cin, p))
      break;

  // Complexity: O(N*m)
  point_t current_point;
  while(cin) {
    getpoint(cin, current_point); //NOTE: `cin` fails after the last
                                  //point, so one can't lift it up to
                                  //the while condition

    // move to the last position the most distant from the
    // `point` point; O(m)
    foreach (point_t& p, points)
      if (nearer(current_point, p)) 
        // found point that is nearer to the `point` 

        //NOTE: could use insert (on sorted sequence) & break instead
        //of swap but in that case it might be better to use
        //heap-based algorithm altogether
        std::swap(current_point, p);
  }

  // print results;  O(m*log(m))
  sort(boost::begin(points), boost::end(points), nearer);
  foreach (point_t p, points)
    cout << p << ' ' << distance_sq(p, point) << '\n';  
}

namespace {
  coord_t distance_sq(const point_t& a, const point_t& b) { 
    // boost::geometry::distance() squared
    using std::tr1::get;
    coord_t x = get<0>(a) - get<0>(b);
    coord_t y = get<1>(a) - get<1>(b);
    coord_t z = get<2>(a) - get<2>(b);
    return x*x + y*y + z*z;
  }

  std::istream& getpoint(std::istream& in, point_t& point_out) {    
    using std::tr1::get;
    return (in >> get<0>(point_out) >> get<1>(point_out) >> get<2>(point_out));
  }
}

Measurements show that most of the time is spent reading the array from the file; the actual computations take an order of magnitude less time.

Ahearn answered 21/3, 2010 at 7:42 Comment(11)
Nice write-up. To offset the file read, I've run your Python implementations with a loop around the search that executes 100 times (each time looking around a different point, and building the kd-tree only once). Brute force still won. That made me scratch my head. But then I examined your leafsize and you have an error there - you are setting the leafsize to 1000001, and that will not perform well. After setting leafsize to 10, kd won (35s vs. 70s for 100 points, with most of the 35s spent building the tree; 100 retrievals of 10 points took a second).Senzer
So as a conclusion, if you can precompute the kd-tree it will win over brute force by orders of magnitude (not to mention that for really large data sets, if you have a tree you will not have to read all the data into memory).Senzer
@goran: if I set leafsize to 10 then it takes ~10 seconds (instead of 1 second) to query one point. I agree if the task is to query multiple (>10) points then kd-tree should win.Ahearn
@Unreason: the priority queue- and linear search-based implementations above do NOT read all the data into memory.Ahearn
@J.F.Sebastian, sorry, I was imprecise - they have to scan all the data (read it into memory). I agree that they don't have to keep it in memory. The kd-tree approach (once it has a kd index) will have to scan only the index, reading just part of it. Of course, to build the index it will have to read all the data, but I was commenting on the situation where you can precompute (which in the case of linear search is no gain).Senzer
from scipy.spatial import cKDTree is Cython; it does lookups >50 times faster than the pure-Python KDTree (in 16-D, on my old Mac PPC).Imbrue
Thanks for the great work. To make this answer more helpful, it would be best to report the time taken to set up the data structure separately from the time it takes to find the 10 points. I recognize the question only asked for 10 points, but generalizing it to n points and separating the setup and query times would make the answer more useful to the world.Mhd
Wish you had added some explanation for every solution. This answer is not useful for people who are not familiar with Python and C++, like me!Alcaraz
If you only want the 10 nearest points and you don't care about their order, you could probably save some time by using np.argpartition rather than np.argsort.Mixtec
@ali_m: notice that there is "brute force" in the title for the code that uses np.argsort().Ahearn
I realise that - I would also consider np.argpartition to be a "brute force" solution. I think it would at least be worth mentioning it, given that you've also shown a hand-coded partial sort in C++.Mixtec

If the million entries are already in a file, there's no need to load them all into a data structure in memory. Just keep an array with the top-ten points found so far, and scan over the million points, updating your top-ten list as you go.

This is O(n) in the number of points.
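
A minimal sketch of that scan in Python 3, assuming the same one-"x y z"-triple-per-line file used in the accepted answer:

def ten_nearest(filename, target, k=10):
    """Single pass over the file, keeping the k nearest points seen so far."""
    tx, ty, tz = target
    best = []  # up to k entries of (squared_distance, (x, y, z)), kept sorted
    with open(filename) as f:
        for line in f:
            x, y, z = map(float, line.split())
            d = (x - tx)**2 + (y - ty)**2 + (z - tz)**2
            if len(best) < k:
                best.append((d, (x, y, z)))
                best.sort()
            elif d < best[-1][0]:   # nearer than the current k-th nearest
                best[-1] = (d, (x, y, z))
                best.sort()         # k is a constant, so this stays O(1)
    return best

print(ten_nearest('million_3D_points.txt', (69.06310224, 2.23409409, 50.41979143)))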

Merv answered 21/3, 2010 at 5:53 Comment(7)
This will work well, but the array is not the most efficient data store, because you have to check over it each step, or keep it sorted, which can be a hassle. David's answer about a min-heap does that stuff for you, but is otherwise the same solution. When the user only wants 10 points, these concerns are negligible, but when the user suddenly wants the nearest 100 pts, you will be in trouble.Seventeenth
@ Karl: The question specifies 10 points. I think including this detail is deliberate on the part of the asker. So, Will describes a very good solution for what was asked.Brilliancy
@Karl, it is often surprising how good the compiler and the CPU is at optimizing the dumbest of loops to beat the cleverest of algorithms. Never underestimate the speedup to be gained when a loop can run in on-chip ram.Kettledrum
The million entries are not already in the file - you can choose how to store them in the file. :) This choice implies that you can also precompute any accompanying index structure. The kd-tree will win, since it will not have to read the whole file at all: fewer than O(n) reads.Senzer
I've posted an implementation of your answer at #2486593 (though I use a heap instead of linear search, and it is completely unnecessary for the task)Ahearn
@FogleBird: For 10 points, linear search is better than a heap (1.11-1.12 vs. 1.15-1.18 seconds for the input I've tested). For N=10 the constant factor in an algorithm does matter, so an O(N) algorithm can be faster than an O(log(N)) one.Ahearn
1.11 vs 1.15 is pretty slim. I'd code it with a heap anyway in case I ever wanted to do more than 10. Of course I don't know what the OP wants to do with it.Gruff

You could store the points in a k-dimensional tree (kd-tree). Kd-trees are optimized for nearest-neighbor searches (finding the n points closest to a given point).
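
For a quick sense of the API, a sketch with SciPy's cKDTree (the faster Cython implementation mentioned in the comments on the accepted answer), in Python 3 syntax and assuming the same text file as the accepted answer:

import numpy
from scipy.spatial import cKDTree

points = numpy.loadtxt('million_3D_points.txt')  # N-by-3 array
tree = cKDTree(points)                           # build once: O(N log N)
# each query is then roughly O(log N):
distances, indices = tree.query((69.06310224, 2.23409409, 50.41979143), k=10)
print(points[indices])
print(distances)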

Dogeatdog answered 21/3, 2010 at 6:4 Comment(6)
I think an octree is called for here.Biggs
The complexity required to build a k-d tree is going to be higher than the complexity required to do a linear search for the 10 closest points. The real power of a k-d tree comes when you are going to do many queries on a point set.Brilliancy
@gabe: One big advantage of a kd-tree is that while it'll take some memory overhead to build, once you have it a left-balanced kd-tree takes no more memory than the original unstructured list of points. An octree will definitely need at least some memory overhead.Radioman
A kd-tree can be slower in real life than the brute-force approach #2486593Ahearn
This is the answer I would give in an interview. It's not rare for interviewers to use less-than-precise language, and reading between the lines this seems to be most likely what they're looking for. In fact if I were the interviewer, and someone gave the answer "I would store the points in any old order, and do a linear scan to find the 10 points" and justified that answer based on my imprecise wording, I would be pretty unimpressed.Knighthood
@Jason Orendorff: I'd definitely discuss using a kd-tree for such a problem in a technical interview; however, I would also explain why, for the specific problem given, the simpler linear search method will not only be asymptotically faster but will run faster too. This shows a deeper understanding of the complexity of algorithms, knowledge of data structures, and the ability to consider different solutions to a problem.Brilliancy

I think this is a tricky question that tests whether you resist the urge to overdo things.

Consider the simplest algorithm people have already given above: keep a table of the ten best candidates so far and go through all the points one by one. If you find a point closer than any of the ten best so far, replace it. What's the complexity? Well, we have to look at each point from the file once, calculate its distance (or the square of the distance, actually) and compare it with the 10th-closest point. If it's better, insert it in the appropriate place in the 10-best-so-far table.

So what's the complexity? We look at each point once, so it's n computations of the distance and n comparisons. If the point is better, we need to insert it in the right position; this requires some more comparisons, but it's a constant factor, since the table of best candidates is of constant size 10.

We end up with an algorithm that runs in linear time, O(n) in the number of points.

But now consider: what is the lower bound for such an algorithm? If there is no order in the input data, we have to look at each point to see whether it's one of the closest ones. So as far as I can see, the lower bound is Omega(n), and thus the above algorithm is optimal.
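
As a sketch of that constant-cost table update in Python 3: keeping the 10-best table sorted (here with the bisect module) makes the "is it better than the current 10th?" check a single comparison (the helper name consider is made up for illustration):

import bisect

best = []  # sorted list of (squared_distance, point) pairs, at most k long

def consider(d, p, k=10):
    """Offer one point to the 10-best table; O(1) overall since k is fixed."""
    if len(best) < k:
        bisect.insort(best, (d, p))
    elif d < best[-1][0]:            # one comparison against the 10th best
        best.pop()                   # drop the current 10th
        bisect.insort(best, (d, p))  # O(log k) search + O(k) shift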

Emulsoid answered 21/3, 2010 at 11:34 Comment(1)
Excellent point! Since you have to read the file point by point in order to build any data structure, the lowest possible is Omega(n), just as you say. Only if the question asks about finding the closest 10 points repeatedly does anything else matter! And you explained it well, I think.Fort

No need to calculate the actual distance: the square of the distance serves the same purpose and is faster to compute. Since sqrt is monotonic, comparing squared distances gives exactly the same ordering as comparing distances; in other words, you can skip the sqrt bit.

Contrast answered 21/3, 2010 at 8:4 Comment(0)

This isn't a homework question, is it? ;-)

My thought: iterate over all points and put them into a min heap or bounded priority queue, keyed by distance from the target.
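
A sketch in Python 3: heapq only provides a min-heap, so the usual trick is to store negated distances, which turns it into a bounded max-heap of the 10 best candidates. The root is then the farthest candidate kept so far, and each new point costs one comparison against it (the function name nearest_k is made up for illustration):

import heapq

def nearest_k(points, target, k=10):
    tx, ty, tz = target
    heap = []                        # max-heap via negated squared distances
    for (x, y, z) in points:
        d = (x - tx)**2 + (y - ty)**2 + (z - tz)**2
        if len(heap) < k:
            heapq.heappush(heap, (-d, (x, y, z)))
        elif d < -heap[0][0]:        # nearer than the farthest kept so far
            heapq.heapreplace(heap, (-d, (x, y, z)))
    return sorted((-nd, p) for nd, p in heap)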

Bevy answered 21/3, 2010 at 5:51 Comment(1)
sure, but it is not known what the target is. :)Senzer

This question is essentially testing your knowledge and/or intuition of space partitioning algorithms. I'd say that storing the data in an octree is your best bet. It's commonly used in 3d engines that handle just this kind of problem (storing millions of vertices, ray tracing, finding collisions, etc.). The lookup time will be on the order of log(n) in the worst case scenario (I believe).

Marybelle answered 21/3, 2010 at 6:39 Comment(0)

Straightforward algorithm:

Store the points as a list of tuples, scan over the points computing each distance, and keep a 'closest' list.

More creative:

Group points into regions (such as the cube described by "0,0,0" to "50,50,50", or "0,0,0" to "-20,-20,-20"), so you can "index" into them from the target point. Check which cube the target lies in, and only search through the points in that cube. If there are fewer than 10 points in that cube, check the "neighboring" cubes, and so on.

On further thought, this isn't a very good algorithm: if your target point is closer to the wall of a cube than 10 points, then you'll have to search into the neighboring cube as well.
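
Still, a rough sketch of the cube-bucketing idea in Python 3, for the record (the cell edge CELL is an assumed tuning parameter, and per the caveat above a correct version must examine one extra ring of cells after collecting 10 candidates):

from collections import defaultdict
from itertools import product

CELL = 10.0  # cube edge length -- an assumed tuning parameter

def build_grid(points):
    """Hash each (x, y, z) into the integer-indexed cube containing it."""
    grid = defaultdict(list)
    for p in points:
        grid[tuple(int(c // CELL) for c in p)].append(p)
    return grid

def candidates(grid, target, k=10):
    """Widen the ring of cubes around the target until >= k points found."""
    cx, cy, cz = (int(c // CELL) for c in target)
    r = 0
    while True:
        cells = product(range(cx - r, cx + r + 1),
                        range(cy - r, cy + r + 1),
                        range(cz - r, cz + r + 1))
        found = [p for cell in cells for p in grid.get(cell, [])]
        if len(found) >= k:
            return found  # then brute-force the k nearest among these
        r += 1            # assumes the grid holds at least k points in total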

I'd go with the kd-tree approach and find the closest, then remove (or mark) that closest node, and re-search for the new closest node. Rinse and repeat.

Turnery answered 21/3, 2010 at 6:1 Comment(0)

For any two points P1 (x1, y1, z1) and P2 (x2, y2, z2), if the distance between the points is d then all of the following must be true:

|x1 - x2| <= d 
|y1 - y2| <= d
|z1 - z2| <= d

Hold the 10 closest as you iterate over your entire set, but also keep the distance to the 10th closest. Use these three conditions as a cheap rejection test before calculating the full distance for every point you look at; that saves a lot of computation.
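
A sketch of that rejection test in Python 3 (d here is the plain distance to the current 10th-nearest point, not its square, to match the inequalities above):

def outside_box(p, target, d):
    """Cheap rejection: if any single coordinate already differs by more
    than d, the Euclidean distance must exceed d too."""
    return (abs(p[0] - target[0]) > d or
            abs(p[1] - target[1]) > d or
            abs(p[2] - target[2]) > d)

# inside the scan over all points:
#     if outside_box(p, target, d10):
#         continue   # three cheap comparisons, no multiplications
#     # otherwise compute the full distance and update the 10-nearest table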

Aragonite answered 22/3, 2010 at 23:18 Comment(0)

Basically a combination of the first two answers above me: since the points are in a file, there's no need to keep them in memory. Instead of an array or a min-heap, I would use a max-heap, since you only want to check for distances less than the 10th-closest point. For an array, you would need to compare each newly calculated distance with all 10 of the distances you kept. For a min-heap, you have to perform 3 comparisons with every newly calculated distance. With a max-heap, you perform only 1 comparison, against the root, when the newly calculated distance is greater than the root node.

Crinkly answered 21/3, 2010 at 6:3 Comment(0)

Calculate the distance for each of them, and do Select(1..10, n) in O(n) time. That would be the naive algorithm, I guess.
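
Select(1..10, n) is the classic selection step (quickselect, or C++'s nth_element); a sketch in Python 3 that partially reorders a list so its first k entries are the k smallest, in average O(n) time:

import random

def quickselect(items, k):
    """Partially reorder `items` in place so items[:k] are the k smallest
    (in arbitrary order).  Average O(n), like C++'s nth_element."""
    lo, hi = 0, len(items) - 1
    while lo < hi:
        pivot = items[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:               # Hoare partition around `pivot`
            while items[i] < pivot: i += 1
            while items[j] > pivot: j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1; j -= 1
        if k - 1 <= j:
            hi = j                  # k-th smallest lies in the left part
        elif k - 1 >= i:
            lo = i                  # k-th smallest lies in the right part
        else:
            break                   # it landed between j and i: done

# usage: pair each squared distance with its point, then select
# pairs = [(dist_sq(p, target), p) for p in points]
# quickselect(pairs, 10); nearest10 = sorted(pairs[:10])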

Intramundane answered 21/3, 2010 at 6:30 Comment(0)

This question needs further definition.

1) The decision among algorithms that pre-index the data depends very much on whether you can hold the whole data set in memory.

With kd-trees and octrees you will not have to hold the data in memory, and performance benefits from that fact, not only because the memory footprint is lower but simply because you will not have to read the whole file.

With brute force you have to read the whole file and recompute the distances for each new point you search for.

Still, this might not be important to you.

2) Another factor is how many times you will have to search for a point.

As stated by J.F. Sebastian, sometimes brute force is faster even on large data sets, but take care to note that his benchmarks measure reading the whole data set from disk (which is not necessary once a kd-tree or octree is built and written somewhere) and that they measure only one search.
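
As a sketch of that precompute-once scenario in Python 3 (this assumes a SciPy version whose cKDTree supports pickling, which reasonably recent releases do):

import pickle
import numpy
from scipy.spatial import cKDTree

# one-time setup: parse the text file, build the tree, save it
points = numpy.loadtxt('million_3D_points.txt')
with open('points.kdtree', 'wb') as f:
    pickle.dump(cKDTree(points), f)

# every later search: load the prebuilt tree and query, no rebuild
with open('points.kdtree', 'rb') as f:
    tree = pickle.load(f)
distances, indices = tree.query((69.06310224, 2.23409409, 50.41979143), k=10)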

Senzer answered 21/3, 2010 at 14:11 Comment(0)
