How to set up workers for parallel processing in R using snowfall and multiple Windows nodes?
I’ve successfully used snowfall to set up a cluster on a single server with 16 processors.

require(snowfall)
if (sfIsRunning() == TRUE) sfStop()

number.of.cpus <- 15
sfInit(parallel = TRUE, cpus = number.of.cpus)
stopifnot( sfCpus() == number.of.cpus )
stopifnot( sfParallel() == TRUE )

# Print the hostname for each cluster member
sayhello <- function()
{
    info <- Sys.info()[c("nodename", "machine")]
    paste("Hello from", info[1], "with CPU type", info[2])
}
names <- sfClusterCall(sayhello)
print(unlist(names))

Now I am looking for complete instructions on how to move to a distributed model. I have four different Windows machines with a total of 16 cores that I would like to use as a 16-node cluster. So far, I understand that I could manually set up a SOCK connection or leverage MPI. While it appears possible, I haven’t found clear and complete directions for either.
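For orientation, the non-manual SOCK route I am aware of would look roughly like this. This is a sketch only; the host addresses, the Rscript path, and the library path are placeholders for my setup, and it assumes each worker machine is reachable and has snow installed:

```r
# Sketch: a socket cluster spanning several Windows hosts (placeholder IPs and paths).
library(snow)

# Four cores on each of two example machines.
hosts <- rep(c("172.01.01.03", "172.01.01.04"), each = 4)

cl <- makeCluster(hosts, type = "SOCK",
                  rscript = "C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
                  snowlib = "C:/Rlibs")

# Ask every worker where it is running, then shut down.
clusterCall(cl, function() Sys.info()[["nodename"]])
stopCluster(cl)
```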

The SOCK route appears to depend on a script shipped with the snow library (located via the snowlib option). I can generate a stub of the worker launch command from the master side with the following code:

winOptions <-
    list(host="172.01.01.03",
         rscript="C:/Program Files/R/R-2.7.1/bin/Rscript.exe",
         snowlib="C:/Rlibs")

cl <- makeCluster(c(rep(list(winOptions), 2)), type = "SOCK", manual = TRUE)

It yields the following:

Manually start worker on 172.01.01.03 with
     "C:/Program Files/R/R-2.7.1/bin/Rscript.exe"
      C:/Rlibs/snow/RSOCKnode.R
      MASTER=Worker02 PORT=11204 OUT=/dev/null SNOWLIB=C:/Rlibs

It feels like a reasonable start. I found code for RSOCKnode.R on GitHub under the snow package:

local({
    master <- "localhost"
    port <- ""
    snowlib <- Sys.getenv("R_SNOW_LIB")
    outfile <- Sys.getenv("R_SNOW_OUTFILE") ##**** defaults to ""; document

    args <- commandArgs()
    pos <- match("--args", args)
    args <- args[-(1 : pos)]
    for (a in args) {
        pos <- regexpr("=", a)
        name <- substr(a, 1, pos - 1)
        value <- substr(a,pos + 1, nchar(a))
        switch(name,
               MASTER = master <- value,
               PORT = port <- value,
               SNOWLIB = snowlib <- value,
               OUT = outfile <- value)
    }

    if (! (snowlib %in% .libPaths()))
        .libPaths(c(snowlib, .libPaths()))
    library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
    library(snow)

    if (port == "") port <- getClusterOption("port")

    sinkWorkerOutput(outfile)
    cat("starting worker for", paste(master, port, sep = ":"), "\n")
    slaveLoop(makeSOCKmaster(master, port))
})

It’s not clear how to actually start a SOCK listener on the workers, unless it is buried in snow::recvData.

Looking into the MPI route, as far as I can tell, Microsoft MPI version 7 is a starting point. However, I could not find a Windows alternative for sfCluster. I was able to start the MPI service, but it does not appear to listen on port 22, and no amount of bashing against it with snowfall::makeCluster has yielded a result. I’ve disabled the firewall and tried testing with makeCluster and connecting directly to the worker from the master with PuTTY.


Is there a comprehensive, step-by-step guide to setting up a snowfall cluster on Windows workers that I’ve missed? I am fond of snowfall::sfClusterApplyLB and would like to continue using it, but if there is an easier solution, I’d be willing to change course. Looking into Rmpi and parallel, I found alternative solutions for the master side of the work, but still little to no specific detail on how to set up workers running Windows.
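For reference, this is the load-balanced pattern I want to keep: a minimal sketch on a single-machine cluster, where the sleep stands in for tasks of uneven duration (which is what makes the load-balanced variant worthwhile):

```r
# Sketch: sfClusterApplyLB hands each task to whichever worker is free,
# so uneven task durations do not leave workers idle.
require(snowfall)
sfInit(parallel = TRUE, cpus = 4)

results <- sfClusterApplyLB(1:20, function(i) {
    Sys.sleep(runif(1, 0, 0.1))  # placeholder for real, uneven work
    i^2
})

sfStop()
```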

Due to the nature of the work environment, neither moving to AWS nor moving to Linux is an option.


Awad answered 30/3, 2016 at 1:0 Comment(6)
I'm going to try to give you a real answer to this, but I will also mention that there's a Debian-based Linux distribution for distributed/cluster HPC modeling, created by a really nice econometrician formerly from UC Davis, called PelicanHPC. He quit supporting it a couple of years ago, but it still works great and is available on DistroWatch.org. There's also Rocks Cluster on DistroWatch, but I haven't gone far with that distro because their logo of a bunch of snakes on startup bothers me.Pelagianism
For the purpose of clustering with R, what is the difference between the latest Ubuntu distribution and the PelicanHPC distribution? If there is no path forward with Windows, this may be the only course.Awad
I'm only really familiar with PelicanHPC in terms of Linux cluster HPC-ing. However, since Pelican is built specifically for that purpose, it's meant to be run live, and there are super simple setup instructions for connecting the worker nodes when it boots up. A 10-year-old could set up a cluster that way in a matter of minutes. I'm not sure what you'd need to do for Ubuntu. To be fair, Pelican is geared more toward Octave (MATLAB) than R, but I believe R is also installed by default. However, I do still think I can get you a Windows solution, I just need some time.Pelagianism
@Awad As someone who has done multi-threaded, distributed computing with R on Windows, I first want to say: I'm sorry. I have found the support and documentation for this on Windows to be very poor. My next question: do you truly need to spawn the tasks all from the same computer, or can you subset the data and have each computer work on its own subset? I have had to do this kind of thing in production and it was outrageously painful to debug and maintain. I know this does not answer your question, but I wanted to warn you that R, Windows, and distributed computing will be painful.Metacarpus
@Awad Would a solution with doRedis meet your requirements? I ask as doRedis can work locally to take advantage of multicore systems, and can also farm tasks out to remote R instances (“workers”). It’s straightforward to add or remove workers at runtime, even mid-job, to adapt to changing work conditions or speed up job processing. It works across Windows, Linux and Mac OS X. bigcomputing.com/doRedis.html Do you just want a specific step-by-step guide to setting up a snowfall cluster on Windows? Just checking before responding in detail.Sweeney
I know it's (also) not snowfall related, but I have had good experiences with the future package (cran.r-project.org/web/packages/future/future.pdf) in that respect (setting up workers and utilizing them, etc.)Hora

Several options for HPC infrastructure were considered: MPICH, Open MPI, and MS MPI. I initially tried MPICH2, but gave up because the latest stable Windows release (1.4.1) dates back to 2013 and has seen no support since. Open MPI does not support Windows, which left only the MS MPI option.

Unfortunately snowfall does not support MS MPI, so I decided to go with the pbdMPI package, which supports MS MPI by default. pbdMPI implements the SPMD paradigm, in contrast with Rmpi, which uses manager/worker parallelism.
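The SPMD difference in a nutshell: every rank runs the same script and branches on its own rank, rather than a manager process spawning and feeding workers. A minimal sketch, assuming pbdMPI is installed on every node:

```r
# SPMD sketch: every MPI rank executes this same script;
# behavior branches on the rank number.
library(pbdMPI, quiet = TRUE)
init()

if (comm.rank() == 0) {
    # Rank 0 acts as a "manager" only by convention.
    comm.cat("Total ranks:", comm.size(), "\n", quiet = TRUE)
}

finalize()
```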

MS MPI installation, configuration, and execution

  1. Install MS MPI v.10.1.2 on all machines of the future Windows HPC cluster.
  2. Create a directory accessible to all nodes, where the R scripts and resources will reside, for example \\HeadMachine\SharedDir.
  3. Check that the MS MPI Launch Service (MsMpiLaunchSvc) is running on all nodes.
  4. Check that MS MPI has the rights to run the R application on all nodes on behalf of the same user, e.g. SharedUser. The user name and the password must be the same on all machines.
  5. Check that R is launched on behalf of the SharedUser user.
  6. Finally, execute mpiexec with the following options:

mpiexec.exe -n %1 -machinefile "C:\MachineFileDir\hosts.txt" -pwd SharedUserPassword -wdir "\\HeadMachine\SharedDir" Rscript hello.R

where

  • -wdir is the network path to the directory with shared resources.
  • -pwd is the password of the SharedUser user, for example SharedUserPassword.
  • -machinefile is the path to the hosts.txt text file, for example C:\MachineFileDir\hosts.txt. The file must be readable from the head node at the specified path, and it contains the list of IP addresses of the nodes on which the R script is to be run.
  7. As a result of Step 6, MPI will log in as SharedUser with the password SharedUserPassword and execute copies of the R processes on each computer listed in the hosts.txt file.
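As a slightly more useful payload than hello.R, launched with the same mpiexec line, each rank can work on its own slice of the tasks and rank 0 can gather the results. This is a sketch only: the squaring stands in for real work, and get.jid and gather are the pbdMPI helpers for index assignment and collection.

```r
# chunks.R -- sketch: split tasks across ranks, gather results on rank 0.
library(pbdMPI, quiet = TRUE)
init()

tasks <- 1:32
mine  <- get.jid(length(tasks))                  # task indices owned by this rank
partial <- sapply(tasks[mine], function(i) i^2)  # placeholder work

collected <- gather(partial, rank.dest = 0)      # NULL on non-root ranks
if (comm.rank() == 0)
    comm.cat("Collected", length(unlist(collected)), "results\n", quiet = TRUE)

finalize()
```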

Details

hello.R:

library(pbdMPI, quiet = TRUE)
init()
cat("Hello World from process", comm.rank(), "of", comm.size(), "!\n")
finalize()

hosts.txt

The hosts.txt file, the MPI machine file, is a text file whose lines contain the network names of the computers on which the R scripts will be launched. On each line, the computer name is followed by a space (for MS MPI) and then the number of MPI processes to launch, usually equal to the number of processors on that node.

Sample of hosts.txt with three nodes having 2 processors each:

192.168.0.1 2
192.168.0.2 2
192.168.0.3 2
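With that machine file, the launch line from Step 6 instantiates as follows (6 processes total for three nodes with 2 processors each; the password and paths are, of course, site-specific):

```
mpiexec.exe -n 6 -machinefile "C:\MachineFileDir\hosts.txt" -pwd SharedUserPassword -wdir "\\HeadMachine\SharedDir" Rscript hello.R
```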
Taveda answered 31/1, 2022 at 13:52 Comment(0)
