I think it's very important to allow the user to register their own parallel backend. The doParallel
backend is very portable, but what if they want to run your function on multiple nodes of a cluster? What if they want to set the makeCluster
"outfile" option? It's unfortunate if making the parallel support transparent also makes it useless for many of your users.
I suggest that you use the getDoParRegistered function to see if the user has already registered a parallel backend, and only register one for them if they haven't.
Here's an example:
library(doParallel)
parfun <- function(n=10, parallel=FALSE,
                   cores=getOption('mc.cores', 2L)) {
  if (parallel) {
    # honor registration made by user, and only create and register
    # our own cluster object once
    if (! getDoParRegistered()) {
      cl <- makePSOCKcluster(cores)
      registerDoParallel(cl)
      message('Registered doParallel with ',
              cores, ' workers')
    } else {
      message('Using ', getDoParName(), ' with ',
              getDoParWorkers(), ' workers')
    }
    `%d%` <- `%dopar%`
  } else {
    message('Executing parfun sequentially')
    `%d%` <- `%do%`
  }
  foreach(i=seq_len(n), .combine='c') %d% {
    Sys.sleep(1)
    i
  }
}
This is written so that it only runs in parallel if parallel=TRUE, even if they registered a parallel backend:
> parfun()
Executing parfun sequentially
[1] 1 2 3 4 5 6 7 8 9 10
If parallel=TRUE and they haven't registered a backend, then it will create and register a cluster object for them:
> parfun(parallel=TRUE, cores=3)
Registered doParallel with 3 workers
[1] 1 2 3 4 5 6 7 8 9 10
If parfun is called with parallel=TRUE again, it will use the previously registered cluster:
> parfun(parallel=TRUE)
Using doParallelSNOW with 3 workers
[1] 1 2 3 4 5 6 7 8 9 10
This can be refined in many ways: it's just a simple demonstration. But at least it provides a convenience without preventing users from registering a different backend with custom options that might be necessary in their environment.
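To illustrate the kind of custom registration this leaves room for, here is a minimal sketch of what a user might run before calling parfun. The worker count of 2 and the empty outfile value are only illustrative choices, and the commented-out hostnames are hypothetical.

```r
library(doParallel)

# User-side setup: create a cluster with custom options before calling parfun.
# outfile = '' asks the workers not to discard their output.
cl <- makePSOCKcluster(2, outfile = '')
registerDoParallel(cl)

# On a multi-node setup the user could instead pass hostnames
# (hypothetical names, shown for illustration only):
# cl <- makePSOCKcluster(c('node1', 'node2'))

# parfun(parallel = TRUE) would now find this backend via getDoParRegistered()
# and use it rather than creating its own cluster.
message('Using ', getDoParName(), ' with ', getDoParWorkers(), ' workers')

# The user created the cluster, so the user is also the one who stops it.
stopCluster(cl)
```

Because the user holds the cl variable here, cleanup is entirely in their hands.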
Note that the choice of a default number of cores/workers is also a tricky issue, and one that the CRAN maintainers care about. That is why I didn't make the default number of cores detectCores(). Instead, I'm using the method used by mclapply, although perhaps a different option name should be used.
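One way to use a different option name while still honoring mc.cores is to chain the lookups. This is only a sketch: the helper name defaultCores and the option name 'parfun.cores' are made up for illustration.

```r
# Hypothetical helper: resolve the default number of workers.
# 'parfun.cores' is an assumed, package-specific option name; if it is unset,
# fall back to 'mc.cores' (the option mclapply consults), and finally to 2.
defaultCores <- function() {
  cores <- getOption('parfun.cores', getOption('mc.cores', 2L))
  as.integer(cores)
}
```

A user could then write options(parfun.cores = 8L) once per session, and the function signature would become cores=defaultCores().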
Concerning stopCluster
Note that this example will sometimes create a new cluster object, but it never stops it via a call to stopCluster. The reason is that creating cluster objects can be expensive, so I like to reuse them for multiple foreach loops rather than create and destroy them each time. I'd rather leave stopping the cluster to the user. In this example, however, there isn't a way for the user to do that, since they don't have access to the cl variable.
There are three ways to handle this:
- Call stopCluster in parfun whenever makePSOCKcluster is called;
- Write an additional function that allows the user to stop the implicitly created cluster object (equivalent to the stopImplicitCluster function in the doParallel package);
- Don't worry about the implicitly created cluster object.
I would probably choose the second option for my own code, but that would significantly complicate this example. It's already rather complicated.
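For reference, a rough sketch of the second option might look like the following. The names .parfun_state, registerImplicitCluster and stopParfunCluster are hypothetical, and the registration logic is pulled out of parfun into its own helper to keep the sketch short.

```r
library(doParallel)

# Hypothetical private environment that remembers the cluster we created
.parfun_state <- new.env(parent = emptyenv())

# Hypothetical helper: create and register a cluster only if the user
# has not registered a backend, and remember it for later cleanup
registerImplicitCluster <- function(cores = getOption('mc.cores', 2L)) {
  if (!getDoParRegistered()) {
    .parfun_state$cl <- makePSOCKcluster(cores)
    registerDoParallel(.parfun_state$cl)
  }
  invisible(NULL)
}

# Hypothetical helper: stop the implicitly created cluster, if there is one
stopParfunCluster <- function() {
  cl <- .parfun_state$cl
  if (!is.null(cl)) {
    stopCluster(cl)
    .parfun_state$cl <- NULL
    # Register the sequential backend so that %dopar% does not try to
    # use the cluster that was just stopped
    registerDoSEQ()
  }
  invisible(NULL)
}
```

Clusters registered by the user are never touched, because only the cluster stored in .parfun_state is stopped. One caveat: I believe getDoParRegistered() still returns TRUE after registerDoSEQ(), so a later parallel call would run on the sequential backend instead of creating a new cluster; handling that case is left out of the sketch.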