K-means with really large matrix

I have to perform k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know if I can use R or Weka to perform this. My computer is a multiprocessor machine with 8 GB of RAM and hundreds of GB of free disk space.

I have enough disk space for the calculations, but loading such a matrix seems to be a problem with R (I don't think the bigmemory package would help me: a big matrix automatically uses all my RAM and then my swap file if there is not enough space).
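
For scale, a quick back-of-the-envelope check (assuming the values are stored as R's standard 8-byte doubles) shows why a dense in-memory matrix is hopeless here:

    # Memory needed for a dense 300,000 x 100,000 numeric (double) matrix
    n_cells <- 300000 * 100000      # 3e10 cells
    n_cells * 8 / 1024^3            # ~223 GiB -- far beyond 8 GB of RAM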

So my question is: what software should I use (possibly in combination with other packages or custom settings)?

Thanks for helping me.

Note: I use Linux.

Fotina answered 16/6, 2011 at 13:8 Comment(3)
The problem you're likely to run into with R is that storing this data in a matrix limits indices to the maximum integer value (2,147,483,647), and you have more elements than that. This isn't a memory limitation, but a limitation that results from using integers to index the data. Can you sample the matrix instead?Salable
Why do you want to cluster all 300,000 objects at once? Why not take a smaller sample, cluster that, and then assign the remaining objects to their nearest cluster? (A sketch of this approach follows these comments.)Dichlorodiphenyltrichloroethane
How many clusters are you looking for? Are there samples with known clustering, for validation?Workmanship
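
As suggested in the comment above, one workable pattern is: fit k-means on a sample that fits in memory, then stream the remaining rows through in chunks and assign each to its nearest centroid. Below is a minimal, untested sketch in R of that idea; the file name, sample size, chunk size and k are all made up, and with 100,000 columns you would almost certainly want feature selection or projection first so that even the sample stays small.

    # Sketch: cluster a manageable sample, then assign the rest chunk by chunk.
    # Note: the full matrix has 3e10 cells, more than .Machine$integer.max
    # (2,147,483,647), so it cannot even be indexed as a single R matrix.

    k        <- 10      # number of clusters (illustrative)
    sample_n <- 2000    # rows used to fit the centroids
    chunk_n  <- 1000    # rows read per pass afterwards

    con  <- file("big_matrix.csv", open = "r")          # hypothetical file
    samp <- as.matrix(read.csv(con, header = FALSE, nrows = sample_n))
    fit  <- kmeans(samp, centers = k)

    # Squared Euclidean distance of each row to each centroid,
    # then pick the nearest centroid per row.
    nearest <- function(chunk, centers) {
      d <- apply(centers, 1, function(ctr) rowSums(sweep(chunk, 2, ctr)^2))
      max.col(-d)
    }

    labels <- fit$cluster
    repeat {
      chunk <- try(as.matrix(read.csv(con, header = FALSE, nrows = chunk_n)),
                   silent = TRUE)
      if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
      labels <- c(labels, nearest(chunk, fit$centers))
    }
    close(con)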

Does it have to be K-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes - your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step - and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to efficiently perform this computation on large scale data (it can utilise multiple CPUs and multiple machines).

There is still an issue with the data size, as most clustering algorithms will require you to perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?
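
To make the data-transformation step concrete, here is a rough sketch in R, under the assumption that you are working on a sample or an already-pruned matrix m with objects as rows: compute row-by-row correlations, keep only strong edges, write them in mcl's simple label ("ABC") format, and run mcl. The threshold, inflation value and file names are illustrative; for the full data set the dedicated tools shipped with MCL, mentioned above, are the better route.

    # Sketch: sparse similarity network from (a sample of) the data, then MCL.
    ids <- rownames(m)
    if (is.null(ids)) ids <- as.character(seq_len(nrow(m)))

    sim <- cor(t(m), method = "pearson")      # row-by-row Pearson correlation
    sim[lower.tri(sim, diag = TRUE)] <- NA    # keep each pair only once
    idx <- which(sim >= 0.7, arr.ind = TRUE)  # prune weak edges (threshold 0.7)

    edges <- data.frame(a = ids[idx[, 1]], b = ids[idx[, 2]], w = sim[idx])
    write.table(edges, "graph.abc", sep = "\t",
                quote = FALSE, row.names = FALSE, col.names = FALSE)

    # Label-format input; -I (inflation) controls cluster granularity.
    system("mcl graph.abc --abc -I 2.0 -o clusters.out")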

Spiros answered 16/6, 2011 at 14:25 Comment(2)
micans +1 MCL, +1 if I could for "How do you compute the distance between two vectors?" -- important. Delphine, first experiment on sample training sets that run quickly, see how they cluster.Workmanship
+1 not for reciprocation's sake, but for the smaller sample training sets - quite important. Delphine, it would be nice if you engaged a bit more. I wonder whether your dataset is really best described as units, each having 100K attributes - hence my question whether it is sparse (zero-rich). If that is the case, the distance between vectors might in fact be closer to an overlap-type distance between sets, which in my view strengthens the case for a network-based approach.Spiros

I keep the link (it can be useful to the specific user), but I agree with Gavin's comment! To perform k-means clustering on big data you can use the rxKmeans function implemented in Revolution R Enterprise, a proprietary implementation of R (I know this can be a problem); this function seems to be capable of managing that kind of data.
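
For orientation only, here is a sketch of what this might look like with the RevoScaleR package that ships with Revolution R Enterprise. I am writing the argument names from memory and they may differ between versions; the file names and variables (V1, V2, ...) are placeholders.

    # Hypothetical sketch -- RevoScaleR operates on on-disk .xdf files,
    # processing them in blocks instead of loading everything into RAM.
    library(RevoScaleR)

    # Convert the raw data to .xdf (file names made up for illustration).
    rxImport(inData = "big_matrix.csv", outFile = "big_matrix.xdf")

    # Out-of-core k-means; the formula lists the clustering variables.
    fit <- rxKmeans(~ V1 + V2 + V3, data = "big_matrix.xdf", numClusters = 10)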

Genus answered 16/6, 2011 at 13:35 Comment(1)
On StackOverflow, simple answers that employ links off site are frowned upon, at best. What happens if that page moves or becomes unavailable? Try to include sufficient information in your answer so that it stands on its own - by all means attribute where the idea came from but don't just use a link in an answer.Dichlorodiphenyltrichloroethane

Since we know nothing at all about the data, nor the questioner's goals for it, just a couple of general links:
I. Guyon's video lectures — many papers and books too.
feature selection on stats.stackexchange

Workmanship answered 22/6, 2011 at 14:55 Comment(0)

Check out Mahout, it will do k-means on a large data set:

http://mahout.apache.org/

Suspensor answered 14/9, 2012 at 22:15 Comment(0)
