Fully reproducible parallel models using caret

When I run 2 random forests in caret, I get the exact same results if I set a random seed:

library(caret)
library(doParallel)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] TRUE

However, if I register a parallel back-end to speed up the modeling, I get a different result each time I run the model:

cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

set.seed(42)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

stopCluster(cl)

> all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01813729"
[2] "Component 3: Mean relative difference: 0.02271638"

Is there any way to fix this issue? One suggestion was to use the doRNG package, but train uses nested loops, which currently aren't supported:

library(doRNG)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
registerDoRNG()

set.seed(42)
myControl <- trainControl(method='cv', index=createFolds(iris$Species))

set.seed(42)
> model1 <- train(Species~., iris, method='rf', trControl=myControl)
Error in list(e1 = list(args = seq(along = resampleIndex)(), argnames = "iter",  : 
  nested/conditional foreach loops are not supported yet.
See the package's vignette for a work around.

UPDATE: I thought this problem could be solved using doSNOW and clusterSetupRNG, but I couldn't quite get there.

set.seed(42)
library(caret)
library(doSNOW)
cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)

myControl <- trainControl(method='cv', index=createFolds(iris$Species))

clusterSetupRNG(cl, seed=rep(12345,6))
a <- clusterCall(cl, runif, 10000)
model1 <- train(Species~., iris, method='rf', trControl=myControl)

clusterSetupRNG(cl, seed=rep(12345,6))
b <- clusterCall(cl, runif, 10000)
model2 <- train(Species~., iris, method='rf', trControl=myControl)

all.equal(a, b)
[1] TRUE
all.equal(predict(model1, type='prob'), predict(model2, type='prob'))
[1] "Component 2: Mean relative difference: 0.01890339"
[2] "Component 3: Mean relative difference: 0.01656751"

stopCluster(cl)

What's special about foreach, and why doesn't it use the seeds I initialized on the cluster? Objects a and b are identical, so why aren't model1 and model2?
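
As a minimal sketch of what I suspect is going on (this simplified loop is mine, not part of the code above): with plain %dopar%, each task draws from whatever RNG state its worker happens to be in, and the task-to-worker assignment can change between runs, so seeding the workers once doesn't pin down the per-task results. doRNG's %dorng% operator instead gives every iteration its own reproducible stream, which does work for a simple, non-nested loop:

library(doParallel)
library(doRNG)

cl <- makeCluster(2)
registerDoParallel(cl)

# Plain %dopar%: results depend on which worker picks up which task
set.seed(42)
a <- unlist(foreach(i = 1:4) %dopar% runif(1))
set.seed(42)
b <- unlist(foreach(i = 1:4) %dopar% runif(1))
identical(a, b)   # typically FALSE

# %dorng%: each iteration gets an independent stream seeded from the master
set.seed(42)
x <- unlist(foreach(i = 1:4) %dorng% runif(1))
set.seed(42)
y <- unlist(foreach(i = 1:4) %dorng% runif(1))
identical(x, y)   # TRUE

stopCluster(cl)
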

Overcritical answered 15/11, 2012 at 18:01

Comments:
- Perhaps this question will provide some useful information...? (Ecker)
- It does provide useful information. Unfortunately, using snow would require modifying the caret source code, and using doRNG fails. (Overcritical)
- Nowadays one can use library(doMC) - see caret.r-forge.r-project.org/parallel.html (Dogmatize)

One easy way to run fully reproducible models in parallel mode with the caret package is to use the seeds argument of trainControl. The code below resolves the question above; see the trainControl help page for further details.

library(doParallel); library(caret)

# create a list of seeds, one element per resampling iteration
set.seed(123)

# length is (n_repeats * n_resamples) + 1
seeds <- vector(mode = "list", length = 11)

# 3 is the number of tuning parameter values tried (mtry for rf), here ncol(iris) - 2
for(i in 1:10) seeds[[i]] <- sample.int(n = 1000, 3)

# one seed for the final model
seeds[[11]] <- sample.int(1000, 1)

# control list
myControl <- trainControl(method = 'cv', seeds = seeds, index = createFolds(iris$Species))

# run the models in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
model1 <- train(Species ~ ., iris, method = 'rf', trControl = myControl)

model2 <- train(Species ~ ., iris, method = 'rf', trControl = myControl)
stopCluster(cl)

# compare
all.equal(predict(model1, type = 'prob'), predict(model2, type = 'prob'))
[1] TRUE
Umbra answered 24/2, 2014 at 13:14

Comments:
- This is new functionality in the caret package since I asked the question. Thanks for keeping me up-to-date! (Overcritical)
- @Umbra I have a question: what if I set seeds=NA in the trainControl function? (Leatriceleave)

caret uses the foreach package to parallelize. There is most likely a way to set the seed at each iteration, but we would need to set up more options in train.

Alternatively, you could create a custom modeling function that mimics the internal one for random forests and set the seed yourself.
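
To make that concrete, here is a rough sketch of the second idea using caret's current custom-model interface (a list passed to the method argument); this interface postdates this answer, and the model list, fixed seed value, and mtry grid below are illustrative only, not caret's actual rf module:

library(caret)
library(randomForest)

# A custom method that mimics the built-in rf model but calls set.seed()
# right before each randomForest() fit, so the RNG state at fit time does
# not depend on which worker runs the resample.
rfSeeded <- list(
  label      = "Random Forest (seeded fit)",
  library    = "randomForest",
  type       = "Classification",
  parameters = data.frame(parameter = "mtry",
                          class     = "numeric",
                          label     = "#Randomly Selected Predictors"),
  grid    = function(x, y, len = NULL, search = "grid")
    data.frame(mtry = c(2, 3, 4)),
  fit     = function(x, y, wts, param, lev, last, classProbs, ...) {
    set.seed(1234)  # illustrative fixed seed, applied before every fit
    randomForest::randomForest(x, y, mtry = param$mtry, ...)
  },
  predict = function(modelFit, newdata, submodels = NULL)
    predict(modelFit, newdata),
  prob    = function(modelFit, newdata, submodels = NULL)
    predict(modelFit, newdata, type = "prob"),
  levels  = function(x) x$classes,
  sort    = function(x) x[order(x$mtry), ]
)

# used like any other method, e.g. with the myControl object from the question
model <- train(Species ~ ., iris, method = rfSeeded, trControl = myControl)

Fixing the same seed for every resample is crude but enough to make repeated runs match; the seeds argument of trainControl is the cleaner route in newer caret versions.
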

Max

Gregor answered 12/12, 2012 at 16:53

Which version of caret were you using?

@BBrill's answer is correct. However, since v6.0.64 (Jan 15, 2016), caret takes this issue into account. You may provide your own trControl$seeds, but you don't have to: if trControl$seeds is NULL, caret will generate the seeds for you automatically, which ensures reproducibility even for parallel training.

This behavior can be found at https://github.com/topepo/caret/commit/9f375a1704e413d0806b73ab8891c7fadc39081c

Pull request: https://github.com/topepo/caret/pull/353

Related code snippets:

    if(is.null(trControl$seeds) || all(is.na(trControl$seeds)))  {
      seeds <- sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + 1L)
      seeds <- lapply(seq(from = 1L, to = length(seeds), by = nrow(trainInfo$loop)),
                      function(x) { seeds[x:(x+nrow(trainInfo$loop)-1L)] })
      seeds[[num_rs + 1L]] <- seeds[[num_rs + 1L]][1L]
      trControl$seeds <- seeds
    } else {
      (... omitted ...)
    }
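
As a quick check of that behavior (assuming caret >= 6.0.64 and leaving trControl$seeds at its default NULL), re-running the original parallel example with the same seed set before each train() call should now yield identical models:

library(caret); library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)

set.seed(42)
myControl <- trainControl(method = 'cv', index = createFolds(iris$Species))

# seeds is left NULL, so caret draws it from the current RNG state;
# setting the same seed before each call therefore produces the same seeds
set.seed(42)
model1 <- train(Species ~ ., iris, method = 'rf', trControl = myControl)

set.seed(42)
model2 <- train(Species ~ ., iris, method = 'rf', trControl = myControl)

stopCluster(cl)

all.equal(predict(model1, type = 'prob'), predict(model2, type = 'prob'))
# expected with this fix: TRUE
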

For more details, you may consult the commit and pull request linked above.

Pina answered 3/12, 2021 at 14:53
