Parallel k-means in R
I am trying to understand how to parallelize some of my code using R. In the following example, I want to use k-means to cluster data with 2, 3, 4, 5, and 6 centers, while using 20 iterations. Here is the code:

library(parallel)
library(BLR)

data(wheat)

parallel.function <- function(i) {
    kmeans( X[1:100,100], centers=?? , nstart=i )
}

out <- mclapply( c(5, 5, 5, 5), FUN=parallel.function )

How can we parallelize over both the iterations and the number of centers simultaneously? And how can I keep track of the outputs, assuming I want to retain all of the k-means results across all iterations and centers, just to learn how?

Wrongdoer answered 6/12, 2013 at 5:49 Comment(1)
Another option is using the biganalytics package; on page 4 of the documentation you can find the bigkmeans() function. – Uzziel
This looked very simple to me at first... and then I tried it. After a lot of monkey typing and face-palming during my lunch break, however, I arrived at this:

library(parallel)
library(BLR)

data(wheat)

mc = mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)

The trick is that the data matrix is passed by name (x = X), so the elements of 2:6 fall through to the first unmatched formal, centers. It looks right, though I didn't check how sensible the clustering was.

> summary(mc)
     Length Class  Mode
[1,] 9      kmeans list
[2,] 9      kmeans list
[3,] 9      kmeans list
[4,] 9      kmeans list
[5,] 9      kmeans list

On reflection the command syntax seems sensible, although a lot of other stuff that failed seemed reasonable too... The examples in the help documentation are maybe not that great.
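To keep track of which result used which number of centers (one of the original questions), one option, not in the answer above, is to name the result list and then pull out a summary statistic per fit. A sketch, using iris as stand-in data since wheat's X may not be loaded:

```r
library(parallel)

X <- as.matrix(iris[, 1:4])  # stand-in data; the answer uses X from data(wheat)

# As above: elements of 2:6 bind to `centers` because the data is passed by name
mc <- mclapply(2:6, function(x, centers) kmeans(x, centers), x = X)
names(mc) <- paste0("centers_", 2:6)

# total within-cluster sum of squares for each choice of k
sapply(mc, function(km) km$tot.withinss)
```

Naming the list means the mapping from parameters to results survives any later reordering or subsetting.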

Hope it helps.

EDIT: As requested, here is the same approach over two variables, nstart and centers:

(pars = expand.grid(i=1:3, cent=2:4))

  i cent
1 1    2
2 2    2
3 3    2
4 1    3
5 2    3
6 3    3
7 1    4
8 2    4
9 3    4

L = list()
# zikes, horrible -- appending an empty list coerces each row of pars into a list
pars2 = apply(pars, 1, append, L)
mc = mclapply(pars2, function(x, pars) kmeans(x, centers = pars$cent, nstart = pars$i), x = X)

> summary(mc)
      Length Class  Mode
 [1,] 9      kmeans list
 [2,] 9      kmeans list
 [3,] 9      kmeans list
 [4,] 9      kmeans list
 [5,] 9      kmeans list
 [6,] 9      kmeans list
 [7,] 9      kmeans list
 [8,] 9      kmeans list
 [9,] 9      kmeans list

How'd you like them apples?
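A tidier way to run over the same parameter grid, not in the original answer, is parallel::mcmapply, which iterates over the columns of pars in lockstep and avoids the list-building step (again with iris standing in for the wheat data):

```r
library(parallel)

X <- as.matrix(iris[, 1:4])  # stand-in data
pars <- expand.grid(i = 1:3, cent = 2:4)

# One kmeans fit per row of pars: nstart from pars$i, centers from pars$cent
mc <- mcmapply(function(i, cent) kmeans(X, centers = cent, nstart = i),
               pars$i, pars$cent, SIMPLIFY = FALSE)
length(mc)  # one kmeans object per row of pars, i.e. 9
```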

Poppycock answered 6/12, 2013 at 13:59 Comment(6)
Stephen Henderson, thank you so much for your answer. However, the challenge, at least for me, is to parallelize over both the iterations and the number of clusters simultaneously, i.e. kmeans(x, centers, nstart = ?). Again, thank you; I appreciate your help. – Wrongdoer
@Wrongdoer Challenge accepted! – Poppycock
NB: for a sensible speed-up you should control how many cores you are actually using, based on what you have and a bit of testing... – Poppycock
Stephen Henderson: Very interesting answer; I learned something new from you today. I will apply your idea in one of my functions that requires two for loops and "takes forever". I will accept your answer later today. – Wrongdoer
Stephen Henderson: Can we exchange emails? I am trying to apply what you just did in my real-life function, and it looks like I am missing something. Can I share what I did with you so we can work the problem together? Here is my email: ielbasyoni@gmail. I will understand if you don't have time. Thanks again. – Wrongdoer
@Wrongdoer My mail is on my profile. If you send a brief reproducible version, I'll have a look... no promises though, I haven't used mclapply much either. – Poppycock
There's a CRAN package called knor, derived from a research paper, that improves performance using a memory-efficient variant of Elkan's pruning algorithm. It's an order of magnitude faster than everything else in these answers.

install.packages("knor")
require(knor)
iris.mat <- as.matrix(iris[,1:4])
k <- length(unique(iris[, dim(iris)[2]])) # Number of unique classes
nthread <- 4
kms <- Kmeans(iris.mat, k, nthread=nthread)
Icaria answered 2/5, 2018 at 21:27 Comment(1)
Thanks for pointing this out. Knor is fast! I highly recommend this for anyone reading this thread. Now blowing up far fewer HPC nodes for far less time. – Deluca
You may use parallel to try k-means from several different random starting points on multiple cores.

The code below is an example (K = the k in k-means, N = number of random starting points, C = number of cores you would like to use):

suppressMessages( library("Matrix") )
suppressMessages( library("irlba") )
suppressMessages( library("stats") )
suppressMessages( library("cluster") )
suppressMessages( library("fpc") )
suppressMessages( library("parallel") )

# Calculate k-means results
calcKMeans <- function(matrix, K, N, C){
  # Run k-means in parallel from N random starting points, split across C cores
  results <- mclapply(rep(N %/% C, C), FUN=function(nstart) kmeans(matrix, K, iter.max=15, nstart=nstart), mc.cores=C)
  # Keep the solution with the smallest total within-cluster sum of squares
  tmp <- sapply(results, function(r) r[['tot.withinss']])
  km <- results[[which.min(tmp)]]
  # km contains: cluster, centers, totss, withinss, tot.withinss, betweenss, size
  return(km)
}

runKMeans <- function(fin_uf, K, N, C, 
                      #fout_center, fout_label, fout_size, 
                      fin_record=NULL, fout_prediction=NULL){
  uf = read.table(fin_uf)
  km = calcKMeans(uf, K, N, C)
  rm(uf)
  #write.table(km$cluster, file=fout_label, row.names=FALSE, col.names=FALSE)
  #write.table(km$center, file=fout_center, row.names=FALSE, col.names=FALSE)
  #write.table(km$size, file=fout_size, row.names=FALSE, col.names=FALSE)
  str(km)

  return(km$center)
}
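A self-contained sketch of the same pick-the-best-restart idea, using iris as stand-in data (the variable names here are illustrative, not from the answer above):

```r
library(parallel)

mat <- as.matrix(iris[, 1:4])
K <- 3   # clusters
N <- 8   # total random starts
C <- 2   # cores (mclapply forks; on Windows use mc.cores = 1)

# Split the N starts across C cores, then keep the best solution
results <- mclapply(rep(N %/% C, C),
                    function(nstart) kmeans(mat, K, iter.max = 15, nstart = nstart),
                    mc.cores = C)
best <- results[[which.min(sapply(results, `[[`, "tot.withinss"))]]
sum(best$size)  # all 150 iris rows are assigned to some cluster
```

Each worker runs its own share of the restarts, and only the fit with the smallest tot.withinss is kept, which is exactly what kmeans does internally with a large nstart, just spread over cores.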

Hope it helps!

Civilize answered 22/5, 2014 at 21:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.