What is the best practice for making functions in my R package parallelizable?

Asked 21/2, 2017 at 17:33 Answered 6/3, 2017 at 8:48

Solved r parallel-processing r-package parallel-foreach

I have developed an R package that contains embarrassingly parallel functions.

I would like to implement parallelization for these functions in a way that is transparent to the user, regardless of his/her OS (at least ideally).

I have looked around to see how other package authors have implemented foreach-based parallelism. For example, Max Kuhn's caret package imports foreach to use %dopar% but relies on the user to specify a parallel backend. (Several examples use doMC, which doesn't work on Windows.)

Noting that doParallel works for Windows and Linux/OSX and uses the built-in parallel package (see comments here for useful discussion), does it make sense to import doParallel and have my functions call registerDoParallel() whenever the user specifies parallel=TRUE as an argument?

Ritchie answered 21/2, 2017 at 17:33 Comment(0)

I think it's very important to allow the user to register their own parallel backend. The doParallel backend is very portable, but what if they want to run your function on multiple nodes of a cluster? What if they want to set the makeCluster "outfile" option? It's unfortunate if making the parallel support transparent also makes it useless for many of your users.

I suggest that you use the getDoParRegistered function to see if the user has already registered a parallel backend, and only register one for them if they haven't.

Here's an example:

library(doParallel)
parfun <- function(n=10, parallel=FALSE,
                   cores=getOption('mc.cores', 2L)) {
  if (parallel) {
    # honor registration made by user, and only create and register
    # our own cluster object once
    if (! getDoParRegistered()) {
      cl <- makePSOCKcluster(cores)
      registerDoParallel(cl)
      message('Registered doParallel with ',
              cores, ' workers')
    } else {
      message('Using ', getDoParName(), ' with ',
              getDoParWorkers(), ' workers')
    }
    `%d%` <- `%dopar%`
  } else {
    message('Executing parfun sequentially')
    `%d%` <- `%do%`
  }

  foreach(i=seq_len(n), .combine='c') %d% {
    Sys.sleep(1)
    i
  }
}

This is written so that it only runs in parallel if parallel=TRUE, even if the user has already registered a parallel backend:

> parfun()
Executing parfun sequentially
 [1]  1  2  3  4  5  6  7  8  9 10

If parallel=TRUE and they haven't registered a backend, then it will create and register a cluster object for them:

> parfun(parallel=TRUE, cores=3)
Registered doParallel with 3 workers
 [1]  1  2  3  4  5  6  7  8  9 10

If parfun is called with parallel=TRUE again, it will use the previously registered cluster:

> parfun(parallel=TRUE)
Using doParallelSNOW with 3 workers
 [1]  1  2  3  4  5  6  7  8  9 10

This can be refined in many ways: it's just a simple demonstration. But at least it provides a convenience without preventing users from registering a different backend with custom options that might be necessary in their environment.

Note that the choice of a default number of cores/workers is also a tricky issue, and one that the CRAN maintainers care about. That is why I didn't make the default number of cores detectCores(). Instead, I'm using the method used by mclapply, although perhaps a different option name should be used.
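As a small illustration of that default, here is a sketch of how the option lookup behaves. The helper name default_cores is hypothetical and only wraps the same getOption call used in parfun; it is not part of any package.

```r
# Hypothetical helper: resolve the worker count the same way parfun does,
# from the 'mc.cores' option with a conservative fallback of 2.
default_cores <- function() {
  getOption('mc.cores', 2L)
}

old <- options(mc.cores = NULL)  # make sure the option is unset for the demo
default_cores()                  # no option set, so the fallback is used
options(mc.cores = 4L)           # a user explicitly opts in to more workers
default_cores()                  # the user's setting now takes precedence
options(old)                     # restore whatever was set before
```

The point of the design is that the package never grabs every core on its own; a user who wants more workers sets the option (or passes cores) explicitly.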

Concerning stopCluster

Note that this example will sometimes create a new cluster object, but it never stops it via a call to stopCluster. The reason is that creating cluster objects can be expensive, so I like to reuse them for multiple foreach loops rather than create and destroy them each time. I'd rather leave stopping the cluster to the user. In this example, however, there is no way for the user to do that, since they don't have access to the cl variable.

There are three ways to handle this:

1. Call stopCluster in parfun whenever makePSOCKcluster is called;
2. Write an additional function that allows the user to stop the implicitly created cluster object (equivalent to the stopImplicitCluster function in the doParallel package);
3. Don't worry about the implicitly created cluster object.

I would probably choose the second option for my own code, but that would significantly complicate this example. It's already rather complicated.
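For readers who want to see roughly what the second option could look like, here is a minimal sketch, not a definitive implementation. The names .parfun_env, ensure_backend and stop_parfun_cluster are hypothetical. It assumes the package keeps the cluster it created in a private environment, and it re-registers the sequential backend after stopping the cluster so that later %dopar% loops do not hit a dead cluster.

```r
library(doParallel)  # also attaches foreach and parallel

# Hypothetical package-private environment that remembers the cluster
# created on the user's behalf.
.parfun_env <- new.env(parent = emptyenv())

# Create and register a cluster only if the user has not registered a
# parallel backend (or only the sequential doSEQ backend is registered).
ensure_backend <- function(cores = getOption('mc.cores', 2L)) {
  if (!getDoParRegistered() || getDoParName() == 'doSEQ') {
    .parfun_env$cl <- makePSOCKcluster(cores)
    registerDoParallel(.parfun_env$cl)
  }
  invisible(getDoParWorkers())
}

# Stop the cluster created by ensure_backend(), if there is one, and fall
# back to the sequential backend. Returns TRUE if a cluster was stopped.
stop_parfun_cluster <- function() {
  cl <- .parfun_env$cl
  if (is.null(cl)) {
    return(invisible(FALSE))
  }
  stopCluster(cl)
  .parfun_env$cl <- NULL
  registerDoSEQ()
  invisible(TRUE)
}
```

In parfun, the block that calls makePSOCKcluster and registerDoParallel would be replaced by a call to ensure_backend(cores), and the user would call stop_parfun_cluster() when finished. A backend registered by the user is never touched, because only the cluster stored in .parfun_env is stopped.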

Stearoptene answered 22/2, 2017 at 16:36 Comment(4)

This looks like the right balance of support for easy, built-in parallelization and support for more advanced users. Thanks very much for your time. – Ritchie 22/2, 2017 at 17:17

How would you stopCluster(cl) then? Leave it up to the user? – Proselytism 6/7, 2017 at 20:22

@JPMac How about on.exit(stopCluster(cl)) after cl <- makePSOCKcluster(cores) – Seyler 7/8, 2017 at 7:57

@JPMac I updated my answer to address the issue of stopCluster. – Stearoptene 7/8, 2017 at 14:10

As the author of the future package, I recommend that you look at it. The future package unifies all of parallel's parallel / cluster functions into a single API.

https://cran.r-project.org/package=future

It is designed such that you as a developer write your code once and the user decides on the back end, e.g. plan(multiprocess), plan(cluster, workers = c("n1", "n3", "remote.server.org")) etc.

If the user has access to an HPC cluster with one of the common schedulers such as Slurm, TORQUE / PBS, or SGE, then they can use the future.BatchJobs package, which implements the future API on top of BatchJobs, e.g. plan(batchjobs_slurm). Your code remains the same. (There will soon also be a future.batchtools package on top of batchtools.)
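To make the "write once, let the user pick the back end" idea concrete, here is a minimal sketch using only future(), value() and plan() from the future package. The function names slow_square and parfun_future are hypothetical stand-ins for a package's own code.

```r
library(future)

# Hypothetical stand-in for an expensive, independent computation.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Package code: create one future per task and collect the values.
# Nothing in here says how or where the futures are evaluated.
parfun_future <- function(n = 4) {
  fs <- lapply(seq_len(n), function(i) future(slow_square(i)))
  vapply(fs, value, numeric(1))
}

# User code: the back end is chosen with plan(), outside the package.
plan(sequential)                    # evaluate in the current R session
parfun_future()
# plan(multisession, workers = 2)   # or: background R sessions
# parfun_future()                   # same call, now evaluated in parallel
```

The package function stays unchanged whichever plan the user selects, which is the separation of concerns described above.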

Pearlinepearlman answered 6/3, 2017 at 8:48 Comment(0)
