3D Clustering Algorithm

Problem Statement: I have the following problem:

There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. An additional condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain many points.

Goal: To find an algorithm that can scale well to many processors and has a small memory requirement.

Thoughts: Regular spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points per region may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.

Stitt answered 14/8, 2010 at 5:30 Comment(2)
This sounds like the 3D variant of the set covering problem! :-) - Despairing
Your problem reminds me of "Vector Quantization", which may give you some ideas: data-compression.com/vq.shtml . At first glance, the problem shouldn't be difficult to solve if you remove the restriction that "the distance between any two points of those top N points must be greater than R" - this restriction causes the major problem and will require a lot of thinking to overcome. - Carminacarminative

Use an Octree. For 3D data with a limited value domain, it scales very well to huge data sets.

Many of the other methods mentioned here, such as locality-sensitive hashing, are approximate versions designed for much higher dimensionality, where you can no longer split sensibly.

Splitting at each level into 8 bins (2^d for d = 3) works very well. Since you can stop when there are too few points in a cell and build a deeper tree where there are a lot of points, this should fit your requirements quite well.

For more details, see Wikipedia:

https://en.wikipedia.org/wiki/Octree

Alternatively, you could try to build an R-tree, but the R-tree tries to balance itself, which makes it harder to find the densest areas. For your particular task, the Octree's lack of balancing (usually a drawback) is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found in approximately the same time. For your problem, however, you are only interested in the dense areas, which will be found on the longest paths in the Octree, without even having to look at the actual points yet!
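
A minimal sketch of this idea in Python (the class name and the max_depth/split_threshold parameters are made up for illustration): an octree node that stores only a point count and splits deeper only where it gets crowded, so the longest paths end at the densest cells.

```python
import numpy as np

class OctreeNode:
    """Counting-only octree: stores per-cell point counts, never the points themselves."""

    def __init__(self, center, half_size, depth=0):
        self.center = np.asarray(center, dtype=float)  # midpoint of this cubic cell
        self.half_size = half_size                     # half the cell's side length
        self.depth = depth
        self.count = 0                                 # points routed through this cell
        self.children = {}                             # octants, created lazily

    def insert(self, point, max_depth=12, split_threshold=1000):
        self.count += 1
        # Refine only where the data is dense: descend once this cell has seen
        # enough points and the depth limit is not reached.  Points seen before
        # the split stay counted only here, which is fine for a rough density map.
        if self.depth < max_depth and self.count > split_threshold:
            octant = tuple(point[i] > self.center[i] for i in range(3))
            if octant not in self.children:
                offset = np.where(octant, self.half_size / 2, -self.half_size / 2)
                self.children[octant] = OctreeNode(self.center + offset,
                                                   self.half_size / 2,
                                                   self.depth + 1)
            self.children[octant].insert(point, max_depth, split_threshold)

    def dense_leaves(self, out=None):
        """Deep leaves with high counts mark the dense regions of the cloud."""
        if out is None:
            out = []
        if not self.children:
            out.append((self.count, self.depth, tuple(self.center)))
        else:
            for child in self.children.values():
                child.dense_leaves(out)
        return out
```

Sorting dense_leaves() by count (and depth) gives candidate centres that can then be checked exactly against the distance-R and mutual-separation constraints.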

Baylor answered 4/9, 2012 at 6:21 Comment(0)

I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.

I think it's worth investigating locality-sensitive hashing. Dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm so that the bucket size is defined in terms of R, then for a given set of points divided into buckets, the points satisfying your criteria are likely to be found in the fullest buckets.

Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging you could apply different bucket sizes/offsets to remove this effect (e.g. merge spatially equivalent buckets as well as adjacent buckets). This method should keep memory requirements small: you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets.
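
As a rough illustration of the map/reduce idea (Python; a plain grid hash with cell side R stands in for the LSH step, and all function names are made up):

```python
from collections import Counter

def bucket_key(point, R):
    """Integer grid cell of side length R that contains a 3D point."""
    return (int(point[0] // R), int(point[1] // R), int(point[2] // R))

def count_buckets(points, R):
    """Map step: each worker counts bucket occupancy on its own slice of the points."""
    counts = Counter()
    for p in points:
        counts[bucket_key(p, R)] += 1
    return counts

def merge_counts(partial_counts):
    """Reduce step: spatially identical buckets from different workers are summed."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

def candidate_buckets(total, top_k=100):
    """Keep only the fullest buckets as candidate dense regions.  Points near a
    candidate still need an exact neighbour count over the 27 surrounding cells
    to handle the edge cases that span bucket boundaries."""
    return total.most_common(top_k)
```

Only the small count dictionaries move between workers in the reduce step, so the memory footprint stays far below the size of the point set itself.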

If you're looking for some kind of heuristic, then I think this approach will immediately yield something resembling a "good" solution - that is, it will give you a small number of probable points which you can check against your criteria. If you're looking for an exact answer, then you're going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.

Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.

Premonish answered 14/8, 2010 at 22:41 Comment(5)
This is indeed very similar to the maximum coverage problem. The objective function is different, though: the objective here is to minimize Sum((Ci - Ct/K)^2) for i = 1, ..., K, where K is the number of partitions, Ci is the number of points in set i, and Ct is the total number of points. - Stitt
Ci is not exactly the quantity we want to optimize, but it should be close enough. Ideally, Ci should also include the number of points in the nearest-neighbor cells on the partition surface. Since the cell size is R, the distance calculation only needs to extend to a cell's nearest neighbors. - Stitt
One idea I have in mind now is to create LxMxN cells (each cell has side length R). The number of points can easily be recorded for each cell, and then a clustering algorithm can be used to find the dense clusters. Since there are far too many points, it is infeasible to run the clustering algorithm on individual points. However, we can reduce the resolution of the counts in the LxMxN grid by dividing the counts by an arbitrary number, for instance Ct/(LMN), and then a greedy algorithm can be used to make the partition. Not sure if that is on the right track. - Stitt
I'm not quite sure what those first two comments were in reference to. However, the last approach you described is basically exactly what I was suggesting - some kind of spatial division (e.g. LSH) that lets you count points in buckets to locate plausible solutions, which you can then check (also, if your bucket size is in terms of R, then you can just compute normal distance metrics on points in the current bucket and the adjacent buckets). - Premonish
The first two comments are about how to partition the data after placing it into the grid. - Stitt

Here are some possible parts of a solution. There are various choices at each stage, which will depend on Ncluster, on how fast the data changes, and on what you want to do with the means.

3 steps: quantize, box, K-means.

1) Quantize: reduce the input XYZ coordinates to, say, 8 bits each, by taking 2^8 percentiles of X, Y, Z separately. This will speed up the whole flow without much loss of detail. You could sort all 1G points, or just a random 1M of them, to get 8-bit boundaries x0 < x1 < ... < x256, y0 < y1 < ... < y256, z0 < z1 < ... < z256, with roughly 2^(30-8) points in each range. To map a float X to an 8-bit x, an unrolled binary search is fast; see Bentley, Programming Pearls, p. 95.
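
As a sketch of this step (Python with NumPy; np.searchsorted stands in for the unrolled binary search, and the function names are illustrative):

```python
import numpy as np

def build_quantizer(sample, n_bits=8):
    """Per-axis percentile boundaries estimated from a random sample of the points."""
    # 2**n_bits - 1 interior cut points per axis -> equal-count (not equal-width) bins.
    qs = np.linspace(0, 100, 2**n_bits + 1)[1:-1]
    return [np.percentile(sample[:, axis], qs) for axis in range(3)]

def quantize(points, boundaries):
    """Map float XYZ coordinates to n_bits-wide integer codes per axis."""
    codes = np.empty(points.shape, dtype=np.uint8)
    for axis in range(3):
        codes[:, axis] = np.searchsorted(boundaries[axis], points[:, axis])
    return codes
```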

Added: Kd-trees split any point cloud into different-sized boxes, each with ~ Leafsize points; that is much better than splitting X, Y, Z as above. But as far as I know, you'd have to roll your own Kd-tree code to split down to only, say, the first 16M boxes, and keep only counts, not the points.

2) Box: count the number of points in each 3D box [xj .. xj+1, yj .. yj+1, zj .. zj+1]. The average box will have 2^(30-3*8) points; the distribution will depend on how clumpy the data is. If some boxes are too big or get too many points, you could a) split them into 8, or b) track the centre of the points in each box; otherwise just take the box midpoints.
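
Step 2 then reduces to a single bincount over flattened box indices, as in this small sketch (same assumptions as above):

```python
import numpy as np

def box_counts(codes, n_bits=8):
    """Count points per (x, y, z) box by packing the three per-axis codes
    into one integer index and applying a bincount."""
    flat = (codes[:, 0].astype(np.int64) << (2 * n_bits)) \
         | (codes[:, 1].astype(np.int64) << n_bits) \
         | codes[:, 2].astype(np.int64)
    # 2^(3*8) ~= 16.8M boxes; as int64 counts this is roughly 134 MB.
    return np.bincount(flat, minlength=2 ** (3 * n_bits))
```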

3) K-means clustering on the 2^(3*8) box centres. (Google parallel "k means" -> 121k hits.) This depends strongly on K, a.k.a. Ncluster, and also on your radius R. A rough approach would be to grow a heap of the, say, 27*Ncluster boxes with the most points, then take the biggest ones subject to your radius constraint. (I like to start with a minimum spanning tree, then remove the K-1 longest links to get K clusters.) See also Color quantization.
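
The "heap of the fullest boxes, then enforce the radius constraint" part might look like this (Python sketch; box_centres, n_cluster and R are illustrative names):

```python
import heapq
import numpy as np

def top_boxes_with_radius(counts, box_centres, n_cluster, R):
    """Keep the ~27*Ncluster fullest boxes, then greedily pick boxes whose
    centres are pairwise more than R apart, in decreasing order of count."""
    candidates = heapq.nlargest(27 * n_cluster, range(len(counts)),
                                key=counts.__getitem__)
    chosen = []
    for idx in candidates:
        centre = box_centres[idx]
        if all(np.linalg.norm(centre - box_centres[j]) > R for j in chosen):
            chosen.append(idx)
        if len(chosen) == n_cluster:
            break
    return chosen
```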

I'd make Nbit, here 8, a parameter from the beginning.

What is your Ncluster ?

Added: if your points are moving in time, see collision-detection-of-huge-number-of-circles on SO.

Roca answered 6/9, 2010 at 10:38 Comment(0)

I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (i.e. 3D box). After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.

Here you can see how it was extended to model the point distribution using a planar model.

Reorganization answered 4/9, 2013 at 16:2 Comment(0)

Just an idea: create a graph with the given points as vertices and an edge between two points whenever their distance is < R.

Creating this kind of graph is similar to a spatial decomposition. Your questions can be answered with local searches in the graph: the first is finding the vertices with maximum degree, the second is finding a maximal set of maximum-degree vertices that are mutually unconnected.

I think the creation of the graph and the search can be made parallel. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce the memory needed.
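
A rough single-machine sketch of this idea (Python; a grid of cell size R keeps the graph implicit so edges never have to be stored, and all names are illustrative):

```python
from collections import defaultdict
import numpy as np

def neighbour_counts(points, R):
    """Degree of each vertex in the implicit graph: the number of other points
    within distance R.  A grid of cell size R limits the distance checks to
    the 27 surrounding cells."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple((p // R).astype(int))].append(i)
    degree = np.zeros(len(points), dtype=np.int64)
    for i, p in enumerate(points):
        cx, cy, cz = (p // R).astype(int)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                        if j != i and np.linalg.norm(points[j] - p) <= R:
                            degree[i] += 1
    return degree

def pick_top_n(points, degree, R, n):
    """Greedy selection of high-degree vertices that are pairwise unconnected,
    i.e. more than R apart."""
    chosen = []
    for i in np.argsort(-degree):
        if all(np.linalg.norm(points[i] - points[j]) > R for j in chosen):
            chosen.append(i)
        if len(chosen) == n:
            break
    return chosen
```

Splitting the domain and running neighbour_counts per sub-volume (with an R-wide halo around each sub-volume for the boundary points) is one way to keep the per-process memory small.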

Klansman answered 29/7, 2011 at 11:22 Comment(0)
