R foreach: from single-machine to cluster

The following (simplified) script works fine on the master node of a Unix cluster (4 virtual cores).

library(foreach)
library(doParallel)

nc = detectCores()
cl = makeCluster(nc)
registerDoParallel(cl)

foreach(i = 1:nrow(data_frame_1), .packages = c("package_1", "package_2"), .export = c("variable_1", "variable_2")) %dopar% {

    row_temp = data_frame_1[i, ]
    # placeholder for the real worker function
    some_function(argument_1 = row_temp, argument_2 = variable_1, argument_3 = variable_2)

}

stopCluster(cl)

I would like to take advantage of the 16 nodes in the cluster (16 * 4 virtual cores in total).

I guess all I need to do is change the parallel backend specified by makeCluster. But how should I do that? The documentation is not very clear.

Based on this quite old (2013) post http://www.r-bloggers.com/the-wonders-of-foreach/ it seems that I should change the default cluster type (sock or MPI; which one, and would that work on Unix?).

EDIT

From this vignette by the authors of foreach:

By default, doParallel uses multicore functionality on Unix-like systems and snow functionality on Windows. Note that the multicore functionality only runs tasks on a single computer, not a cluster of computers. However, you can use the snow functionality to execute on a cluster, using Unix-like operating systems, Windows, or even a combination.

What does "you can use the snow functionality" mean? How should I do that?

Lisa answered 22/4, 2016 at 12:38 Comment(2)
I am not using a for loop... – Lisa
I would recommend downloading Revolution R Open, which has improved statistical libraries and multi-core support. Check my answer on this post for more information about RRO. This is not directly related to your question, but it will speed up any mathematical calculations and packages that can use multiple cores. – Burgos

The parallel package is a merger of the multicore and snow packages, but if you want to run on multiple nodes, you have to make use of the "snow functionality" in parallel (that is, the part of parallel that was derived from snow). Practically speaking, that means you need to call makeCluster with the "type" argument set to either "PSOCK", "SOCK", "MPI" or "NWS", because those are the only cluster types in the current version of parallel that support execution on multiple nodes. If you're using a cluster that is managed by knowledgeable HPC sysadmins, you should use "MPI"; otherwise, it may be easier to use "PSOCK" (or "SOCK" if you have a particular reason to use the "snow" package).

If you choose to create an "MPI" cluster, you should execute the R script via the mpirun command with the "-n 1" option, with the first argument to makeCluster set to the number of workers that should be spawned. (If you don't know what that means, you may not want to use this approach.)
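For example, a minimal sketch of that setup (assuming Rmpi and snow are installed; the script name is hypothetical):

# launched from the shell with something like:
#   mpirun -n 1 R --slave -f script.R
library(doParallel)

cl = makeCluster(16, type = "MPI")  # spawn 16 MPI workers across the allocated nodes
registerDoParallel(cl)

# ... foreach(...) %dopar% { ... } as in the question ...

stopCluster(cl)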

If you choose to create a "PSOCK" or "SOCK" cluster, the first argument to makeCluster must be a vector of hostnames, and makeCluster will start workers on those nodes via the "ssh" command when makeCluster is executed. That means you must have ssh daemons running on all of the specified hosts.

I've written much more on this subject elsewhere, but hopefully this will help you get started.

Supplication answered 25/4, 2016 at 2:30 Comment(6)
It was a pain to install the Rmpi package, but I finally succeeded. When I follow the recommendation from the answer below and create the parallel backend with makeCluster(16, type = "MPI"), the R session is terminated and I get a seg fault error (on the linux shell). So it seems that the only way to make it work is to use mpirun as you said, but that seems complicated. – Lisa
@Lisa If R seg faults, I would guess that Rmpi isn't correctly installed. Did you try using a PSOCK cluster with a vector of hostnames, or did you run into ssh problems? – Supplication
@Lisa You have to execute your R script via mpirun if you want to use multiple nodes with Rmpi. If you execute makeCluster(16, type = "MPI") from your R script without using mpirun, all 16 workers will start on the local machine. I suppose that could even be the cause of the seg fault. – Supplication
Hum, I see. Thanks much for the feedback; I will try to learn how to use mpirun. – Lisa
@Lisa If you're using Open MPI (which I recommend for Rmpi), you can use a command such as mpirun -n 1 -H "localhost,n1,n2" R --slave -f script.R where "n1" and "n2" are legal hostnames. This causes mpirun to execute your script only on "localhost", but enables Rmpi to spawn workers on hosts "n1" and "n2" as well. Start with some simple tests, such as makeCluster(2, type = "MPI"). – Supplication
Thank you very much, this is very helpful! Let me try; I'll keep you posted. – Lisa

Here's a partial answer that may point you in the right direction.

Based on this quite old (2013) post http://www.r-bloggers.com/the-wonders-of-foreach/ it seems that I should change the default type (fork to MPI, but why? Would that work on Unix?)

fork is a way of spawning background processes on POSIX systems. On a single node with n cores, you can spawn n processes in parallel and do work. This doesn't work across multiple machines, as they don't share memory; you need a way to get data between them.
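For instance (a sketch, not from the original answer), fork-based parallelism on a single machine looks like this:

library(parallel)

# mclapply forks the current R process; the children share the parent's
# memory copy-on-write, which is why forking cannot span machines
res = mclapply(1:8, function(i) i^2, mc.cores = 4)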

MPI (Message Passing Interface) is a portable way for processes to communicate, so an MPI cluster can work across nodes.

What does "you can use the snow functionality" mean? How should I do that?

snow is a separate package. To make a 16-worker MPI cluster (one worker per node) with snow, do cl <- makeCluster(16, type = "MPI"), but you need to be running R in the right environment, as described in Steve Weston's answer and in his answer to a similar question here. (Once you get it running you may also need to modify your loop to use the 4 cores on each node; one alternative is sketched below.)
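One alternative (an assumption on my part, not from the answers above) is to ask for one worker per virtual core instead of one per node, so that the foreach loop itself needs no nesting:

library(doParallel)

# 16 nodes * 4 virtual cores = 64 MPI workers, one per core
cl = makeCluster(64, type = "MPI")
registerDoParallel(cl)

stopCluster(cl)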

Hopson answered 22/4, 2016 at 21:29 Comment(1)
When I execute the command makeCluster(16, type = "MPI") the R session is terminated and I get a seg fault error (on the linux shell). Do you know how to fix that? – Lisa
