Using R parallel to speed up bootstrap

I would like to speed up my bootstrap function, which works perfectly fine on its own. I read that since R 2.14 there is a package called parallel, but I find it very hard for somebody with little computer-science background to actually implement it. Maybe somebody can help.

So here we have a bootstrap:

n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b<-numeric()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
  print(paste('Run',i,sep=" "))
}

The goal is to use parallel processing / exploit the multiple cores of my PC. I am running R under Windows. Thanks!

EDIT (after reply by Noah)

The following syntax can be used for testing:

library(foreach)
library(parallel)
library(doParallel)
registerDoParallel(cores=detectCores(all.tests=TRUE))
n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
start1<-Sys.time()
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}
end1<-Sys.time()
boot_b<-numeric()
start2<-Sys.time()
for(i in 1:boot){
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  boot_b[i]<-lm(y~x,bootstrap_data)$coef[2]
}
end2<-Sys.time()
start1-end1
start2-end2
as.numeric(start1-end1)/as.numeric(start2-end2)

However, on my machine the simple R code runs faster. Is this one of the known side effects of parallel processing, i.e. the overhead of spawning the worker processes adds to the runtime of 'simple' tasks like this one?

Edit: On my machine the parallel code takes about 5 times longer than the 'simple' code. This factor apparently does not change as I increase the complexity of the task (e.g. increasing boot or n). So maybe there is an issue with the code or with my machine (Windows-based processing?).

Concoction answered 12/4, 2013 at 18:26 Comment(0)

Try the boot package. It is well optimized and has a parallel argument. The tricky thing with this package is that you have to write a new function to calculate your statistic, one that accepts the data you are working on and a vector of indices with which to resample the data. So, starting from where you define data, you could do something like this:

library(boot)

# Define a function that takes the data set and a vector of row indices
# (supplied by boot) and returns the statistic of interest - here the slope
slopeFun <- function(df, i) {
  # df must be a data frame; i is the vector of row indices that boot will pass
  xResamp <- df[i, ]
  lm(y ~ x, data = xResamp)$coef[2]
}

# Then carry out the resampling
b <- boot(data, slopeFun, R=1000, parallel="multicore")

b$t is a vector of the resampled statistic, and boot has lots of nice methods to easily do stuff with it - for instance plot(b)

Note that the parallel methods depend on your platform. On your Windows machine, you'll need to use parallel="snow".
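
For example, on a Windows machine the call could look roughly like the sketch below (ncpus and cl are boot's arguments for controlling the snow cluster; here I simply let boot spin up a local cluster for the duration of the call):

library(boot)
library(parallel)   # only needed here for detectCores()

# "multicore" is unavailable on Windows, so use a snow cluster instead
b <- boot(data, slopeFun, R=1000, parallel="snow", ncpus=detectCores())

plot(b)                     # diagnostic plots of the bootstrap distribution
boot.ci(b, type="perc")     # percentile confidence interval for the slope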

Lyontine answered 12/4, 2013 at 19:48 Comment(2)
Your solution is speeding it up by something like 25% on my machine. That's nice. In the real application I am looking at, the problem is much more complex, because my function returns a list of parameters, which all have to be bootstrapped. Therefore I am looking for a direct implementation of parallel. – Concoction
@tomka, boot still works well in this situation. The function you write can return a vector of parameters you are interested in. Indeed, this is one of the strengths of boot - the ability to write any arbitrary function yourself. – Cantaloupe

I haven't tested foreach with the parallel backend on Windows, but I believe this will work for you:

library(foreach)
library(doSNOW)

cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl=cl)

n<-1000
boot<-1000
x<-rnorm(n,0,1)
y<-rnorm(n,1+2*x,2)
data<-data.frame(x,y)
boot_b <- foreach(i=1:boot, .combine=c) %dopar% {
  bootstrap_data<-data[sample(nrow(data),nrow(data),replace=T),]
  unname(lm(y~x,bootstrap_data)$coef[2])
}

stopCluster(cl)   # shut the workers down when you're done
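
If you'd rather stick with the parallel / doParallel packages from your edit, the Windows equivalent is to create an explicit PSOCK cluster and register it. A sketch (untested on Windows from my side):

library(foreach)
library(doParallel)

cl <- makeCluster(detectCores())   # PSOCK cluster; this is what works on Windows
registerDoParallel(cl)

boot_b <- foreach(i = 1:boot, .combine = c) %dopar% {
  bootstrap_data <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
  unname(lm(y ~ x, bootstrap_data)$coef[2])
}

stopCluster(cl)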
Bimolecular answered 12/4, 2013 at 18:50 Comment(8)
Thanks, I submitted the suggested syntax for testing (edited code above). It now uses 100% of my CPU (i.e. all processors). However, this is slower than doing it without parallel processing, see above. – Concoction
It would be great if you could give any additional suggestions on the timing problem, i.e. why is your suggestion not speeding it up? Thanks. – Concoction
Hmm. Interesting. On my machine (8 HT cores, 8 GB RAM, Ubuntu 12.04), I got a speedup of about 3.4x with little RAM usage. I'm not as familiar with multithreading in a Windows environment. Here are some things to try: – Bimolecular
1) Pass a number to the cores argument of registerDoParallel, instead of trying to automatically detect them. 2) Sort the Task Manager decreasing on CPU usage and see when the processes start spinning up. If they're not starting, or if they're hanging, it may indicate that there are problems using the parallel package. You might try snow instead. – Bimolecular
I will check this. Meanwhile, I read that multicore does not work (well/not at all?) on Windows machines. Maybe this is related? And just to be sure: the factor of 3.4 you are giving, is that computed from my code above? Because as.numeric(start1-end1)/as.numeric(start2-end2) will give you the factor of how much longer the parallel processing takes. Or did you take the inverse? – Concoction
1) No difference. 2) My CPU has two cores; three R processes are started which do not hang and together use 100% of the CPU. – Concoction
?registerDoParallel says the right argument for Windows is a snow cluster object (as produced by makeCluster in snow), rather than # cores. Its default behavior on Windows when no cluster object is specified is to create a 3-worker socket cluster (ergo 3 processes). I can't replicate your problem without Windows, but I noticed that snow seems to have more overhead than multicore (the default parallel backend for *nix platforms). It may not be worth it to multithread with only two processors. You could look at running a cloud cluster (check out AWS.tools). Otherwise, I don't know. – Bimolecular
I edited the code to change the parallel backend to snow explicitly. See if it's faster this way or if it can be made so. Interestingly, when I first ran it on my machine it was slightly slower in parallel than in sequence. Then I stopped the cluster (stopCluster), ran the same code again and it was slightly faster in parallel than in sequence. Can't explain it. For a speed check I'm just comparing the system.times of each. – Bimolecular

I think the main problem is that you have a lot of small tasks. In some cases, you can improve performance by using task chunking, which results in fewer but larger data transfers between the master and the workers and is often more efficient:

library(foreach)
library(itertools)   # provides idiv()

# Split the boot iterations into one chunk per registered worker; each worker
# runs its chunk sequentially and returns a vector of slope estimates
boot_b <- foreach(b=idiv(boot, chunks=getDoParWorkers()), .combine='c') %dopar% {
  sapply(1:b, function(i) {
    bdata <- data[sample(nrow(data), nrow(data), replace=T),]
    lm(y~x, bdata)$coef[[2]]
  })
}

I like using the idiv function (from the itertools package) for this, but you could instead use b=rep(boot/detectCores(), detectCores()) if you like; see the sketch below.
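
A minimal sketch of that rep() variant, assuming boot divides evenly by the number of workers (I use getDoParWorkers() rather than detectCores() so the chunk count matches whatever backend you registered):

library(foreach)

# one equally sized chunk of iterations per registered worker
chunk_sizes <- rep(boot / getDoParWorkers(), getDoParWorkers())

boot_b <- foreach(b = chunk_sizes, .combine = 'c') %dopar% {
  sapply(seq_len(b), function(i) {
    bdata <- data[sample(nrow(data), nrow(data), replace = TRUE), ]
    lm(y ~ x, bdata)$coef[[2]]
  })
}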

Tread answered 20/4, 2013 at 14:12 Comment(0)

This is an old question, but I think a lot of this can be made more efficient using data.table. The benefits will not really be noticed until larger data sets are used. I'm putting this answer here to help others that may have to bootstrap larger data sets.

library(data.table)
library(ggplot2)  # used for the density plot below
setDT(data) # convert data.frame to data.table by reference
system.time({
  b <- rbindlist(
    lapply(
      1:boot,
      function(i) {
        data.table(
          # store the statistic
          'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
          # store the iteration
          'iteration' = i
        )
      }
    )
  )
})
# 1.66 seconds on my system
ggplot(b) + geom_density(aes(x = statistic))

You could then further improve performance by making use of the parallel package.

library(parallel)
cl <- makeCluster(detectCores())            # use all cores on machine, can change this

clusterExport(                              # give it the variables it needs
  cl,
  c(
    "data"
  ),
  envir = environment()
)
clusterEvalQ(                               # give it the libraries it needs
  cl,
  c(
    library(data.table)
  )
)

system.time({
  b <- rbindlist(
    parLapply(              # this is changed to be in parallel
      cl,                   # give it the cluster you created earlier
      1:boot,
      function(i) {
        data.table(
          'statistic' = lm(y ~ x, data=data[sample(.N, .N, replace = T)])$coef[[2]],
          'iteration' = i
        )
      }
    )
  )
})

stopCluster(cl)
# 0.47 seconds on my machine
Valencia answered 16/11, 2021 at 12:52 Comment(0)
