Communication of parallel processes: what are my options?


I'm trying to dig a bit deeper into parallelization of R routines.

What are my options for a bunch of "worker" processes with regard to

the communication between the respective workers?
the communication of the workers with the "master" process?

AFAIU, there's no such thing as a "shared environment/shared memory" that both the master as well as all worker processes have access to, right?

The best idea I came up with so far is to base the communication on reading and writing JSON documents to the hard drive. That's probably a bad idea ;-) I chose .json over .Rdata files because JSON seems to be used for inter-software communication a lot, so I thought to go with that "standard".

Looking forward to learning about better options!

FYI: I'm usually parallelizing based on functions of the base package parallel and the contrib package snowfall, mainly relying on the function sfClusterApplyLB() to get the job done.

EDIT

I should have stated that I'm running on Windows, but Linux-based answers/hints are also very much appreciated!

Carping answered 20/7, 2012 at 16:3


For communication between processes, a kind of fun place to start is the help page ?socketConnections and the code in the chunk marked "## Not run:". So start an R process and run

 con1 <- socketConnection(port = 6011, server=TRUE)

This process is acting as a server, listening on a particular port for some information. Now start a second R process and enter

 con2 <- socketConnection(Sys.info()["nodename"], port = 6011)

con2 in process 2 has made a socket connection with con1 on process 1. Back at con1, write out the R object LETTERS

writeLines(LETTERS, con1)

and retrieve them on con2.

readLines(con2)

So you've communicated between processes without writing to disk. Some important concepts are also implicit here, e.g., blocking vs. non-blocking connections. The communication is not limited to processes on the same machine, provided the ports are accessible across whatever network the computers are on. This is the basis for makePSOCKcluster in the parallel package, with the addition that process 1 actually uses the system command and a script in the parallel package to start process 2. The object returned by makePSOCKcluster is subsettable, so you can dedicate a fraction of your cluster to solving a particular task. In principle you could arrange for the spawned nodes to communicate with one another independently of the node that did the spawning.
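To make the point about subsetting concrete, here is a minimal sketch. It assumes four local workers can be started on your machine; the worker count, the split into two halves, and the toy tasks are all just illustrative choices.

```r
library(parallel)

# Start four local worker processes; each one talks to the master over a socket.
cl <- makePSOCKcluster(4)

# Subset the cluster: workers 1-2 handle one task, workers 3-4 another.
squares     <- parLapply(cl[1:2], 1:4, function(i) i^2)
worker_pids <- clusterCall(cl[3:4], Sys.getpid)

# Shut down all four workers.
stopCluster(cl)
```

Note that the two calls above still run one after the other from the master's point of view; the subsetting only controls which workers take part in each call.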

An interesting exercise is to do the same using the fork-like commands in the parallel package (on non-Windows). A high-level version of this is in the help page ?mcparallel, e.g.,

 p <- mcparallel(1:10)
 q <- mcparallel(1:20)
 # wait for both jobs to finish and collect all results
 res <- mccollect(list(p, q))

but this builds on top of lower-level sendMaster and friends (peek at the mcparallel and mccollect source code).
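As a small illustration of results flowing back from forked children to the master, the sketch below runs two jobs of different duration and collects both. The job contents are made up for the example, and the platform guard is my addition, since forking is not available on Windows.

```r
library(parallel)

if (.Platform$OS.type == "unix") {
  # Fork two child processes; each evaluates its expression independently.
  slow <- mcparallel({ Sys.sleep(1); "slow job done" })
  fast <- mcparallel("fast job done")

  # Block until both children have finished; results follow the job order
  # given here, not the order in which the children finished.
  results <- unname(mccollect(list(slow, fast)))
} else {
  # No fork on Windows, so there is nothing to collect.
  results <- NULL
}
```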

The Rmpi package takes an approach like the PSOCK example, where the manager uses scripts to spawn workers, with communication using MPI rather than sockets. But a different approach, worthy of a weekend project if you have a functioning MPI implementation, is to implement a script that does the same calculation on different data and then collates the results onto a single node, using commands like mpi.comm.rank, mpi.barrier, mpi.send.Robj, and mpi.recv.Robj.

A fun weekend project would use the parallel package to implement a workflow that involves parallel computation but not of the mclapply variety, e.g., where one process harvests data from a web site and then passes it to another process that draws pretty pictures. The input to the first process might well be JSON, but the communication within R is probably much more appropriately done with R data objects.
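To sketch what "R data objects" could mean in practice: serialize() turns any R object into bytes and unserialize() rebuilds it, so nothing needs to be translated to and from JSON. The payload below is entirely hypothetical, and the round trip happens within one process just to keep the example self-contained.

```r
# A hypothetical payload that one process wants to hand to another.
payload <- list(
  source  = "harvester",
  fetched = as.Date("2012-07-20"),
  values  = data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
)

# With connection = NULL, serialize() returns the bytes as a raw vector;
# given an open binary-mode connection instead, it writes them to it.
bytes <- serialize(payload, connection = NULL)

# The receiving side rebuilds the object, classes and attributes included.
received <- unserialize(bytes)
identical(received, payload)
```

Between two real processes you would presumably hand serialize() a binary-mode socket connection rather than NULL, and call unserialize() on the connection at the other end.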

Tannenberg answered 20/7, 2012 at 21:39


As detailed on the CRAN Task View for High-Performance Computing, the Rdsm package by Norm Matloff offers shared memory communication.

Windgall answered 20/7, 2012 at 16:8

