Hierarchical clustering of 1 million objects

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.

hcluster had trouble with 18k objects. Orange was able to cluster 18k objects in seconds, but failed with 100k objects (it saturated memory and eventually crashed).

I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB swap on Ubuntu 11.10.

Mellott answered 6/2, 2012 at 7:40 Comment(4)
Are your points in 2d, 3d, 10d, 128d? -- Manno
@Denis I don't understand what you mean by that. As such, the limitation seems to stem from the fact that an n×n distance matrix for 1M objects can't fit into memory, and each of the clustering libraries I named above (Orange and scipy) takes an in-memory distance matrix as input (which is not possible to provide for 1M objects...). -- Mellott
The points/objects are simple text files that I am trying to cluster based on the text they contain... can you also explain to me whether this is 2d or what? Thanks. -- Mellott
@user940154 The idea of 2d or nd is: in which feature dimension do you operate? So, if you attached 'length' and 'starting letter' features to each word, that would be a 2d feature space. High-dimensional and sparse data require special data structures for efficient clustering. -- Hardily
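As a quick check of the memory point raised in the comments, here is the back-of-the-envelope arithmetic (a sketch assuming 8-byte double-precision entries, the NumPy/scipy default):

    # Back-of-the-envelope: a full n x n double-precision distance matrix
    n = 1_000_000
    full_bytes = n * n * 8            # 8 bytes per double entry
    print(full_bytes / 1e12)          # -> 8.0, i.e. about 8 terabytes
    # even the condensed upper-triangle form, n*(n-1)/2 entries, is ~4 TB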

To beat O(n^2), you'll first have to reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:

  • build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth

  • first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres (a sketch follows below).
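
A minimal sketch of the second approach, using scikit-learn's MiniBatchKMeans for the flat stage and scipy for the small hierarchical stage (array shapes and parameter values here are illustrative assumptions, not recommendations):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for 1M feature vectors; substitute your real document features.
    X = np.random.rand(1_000_000, 50).astype(np.float32)

    # Stage 1: flat-cluster into ~1000 piles -- roughly O(n * k) work,
    # and no n x n distance matrix is ever materialised.
    km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, random_state=0)
    labels = km.fit_predict(X)

    # Stage 2: hierarchical tree over the 1000 centres only -- a
    # 1000 x 1000 problem that any hcluster-style tool handles easily.
    Z = linkage(km.cluster_centers_, method="average")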

How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves?
What software are you using, and how many hours/days do you have to do the clustering?

For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- not your case. I know hardly anything about clustering text; locality-sensitive hashing?
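Since locality-sensitive hashing is only name-dropped here, a hedged MinHash-style sketch of the idea for text: documents with similar token sets tend to collide in the same buckets, yielding candidate pairs without an all-pairs comparison. The hash scheme and the band/row sizes below are illustrative choices:

    import hashlib

    def minhash_signature(tokens, num_hashes=64):
        # One MinHash value per seeded hash function over the token set;
        # similar sets agree on many signature positions.
        return [
            min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)
        ]

    def lsh_buckets(signatures, bands=16, rows=4):
        # Band the signatures; documents agreeing on any full band become
        # candidate neighbours (verify candidates with a real similarity).
        buckets = {}
        for doc_id, sig in signatures.items():
            for b in range(bands):
                key = (b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets.setdefault(key, []).append(doc_id)
        return buckets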

Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.
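A minimal usage sketch (eps and min_samples are illustrative and data-dependent; see the comment below this answer about the implementation's O(n^2) behaviour, so try a subsample first):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(10_000, 20)    # stand-in features for a subsample
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
    # label -1 marks noise; non-negative labels number the clusters found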

Added: see also
google-all-pairs-similarity-search, "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
SO hierarchical-clusterization-heuristics

Manno answered 27/2, 2012 at 14:22 Comment(2)
I don't think there is a general way to beat O(n^2) for hierarchical clustering. You can do some things for the particular case of single-link (see my reply), and of course you can use other algorithms (e.g. DBSCAN), which is much more sensible for data this large anyway than hierarchical clustering. Note that scikit-learn's DBSCAN is O(n^2), as AFAIK it does not use indexes. -- Shenika
On O(n^2): if you accept higher error rates, you can sample (my first trivial suggestion) or use LSH. There are lots of papers on fast clustering, some of them write-only. On hierarchical clustering I agree, but it would be nice if the OP said how big a tree he or she wants, and why. -- Manno

The problem probably is that they will try to compute the full 2D distance matrix -- for 1 million objects that is 10^6 × 10^6 entries × 8 bytes, i.e. about 8 TB naively with double precision -- and then their algorithm will run in O(n^3) time anyway.

You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow, and the results are usually not at all convincing, in particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.

If you really want to continue with hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be roughly a million times as fast as the naive O(n^3) algorithm. I don't know whether they already have CLINK, too, and I'm not sure there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
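For illustration, SLINK fits in a few lines; the following is a didactic Python sketch of Sibson's pointer-representation algorithm (O(n^2) time, O(n) memory), not ELKI's tuned implementation -- pure Python would of course be far too slow at n = 1M:

    import numpy as np

    def slink(X):
        # Sibson's SLINK: single-link hierarchy in pointer representation.
        # Point j merges into the cluster of pi[j] at dendrogram height lam[j].
        n = len(X)
        pi = np.zeros(n, dtype=int)          # pi[0] = 0
        lam = np.full(n, np.inf)
        M = np.empty(n)
        for i in range(1, n):
            pi[i], lam[i] = i, np.inf
            M[:i] = np.sqrt(((X[:i] - X[i]) ** 2).sum(axis=1))
            for j in range(i):
                if lam[j] >= M[j]:
                    M[pi[j]] = min(M[pi[j]], lam[j])
                    lam[j], pi[j] = M[j], i
                else:
                    M[pi[j]] = min(M[pi[j]], M[j])
            for j in range(i):               # repoint stale pointers
                if lam[j] >= lam[pi[j]]:
                    pi[j] = i
        return pi, lam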

Consider using other algorithms. k-means, for example, scales very well with the number of objects (it's just usually not very good either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low-dimensional, they can be accelerated quite well with an appropriate index structure: with an index offering O(log n) query time, they should run in O(n log n), which can make a huge difference for large data sets. I've personally used OPTICS on a 110k-image data set without problems, so I can imagine it scales up well to 1 million objects on your system.
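A small sketch of the index idea: for low-dimensional data, a k-d tree answers the eps-range queries that DBSCAN and OPTICS are built on in roughly O(log n) each (scipy's cKDTree shown; the sizes and radius are illustrative):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.rand(100_000, 3)                    # low-dimensional points
    tree = cKDTree(X)                                 # build once: O(n log n)
    neighbours = tree.query_ball_point(X[0], r=0.05)  # one eps-range query
    # one such query per point gives the ~O(n log n) total mentioned above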

Shenika answered 6/2, 2012 at 8:59 Comment(0)
