makePSOCKcluster hangs on win x64 after calling system

I am experiencing a hard-to-debug problem with makePSOCKcluster from the parallel package on R x64 on Windows. It does not happen on R i386 on Windows, nor on any OS X or Linux system. Unfortunately, it does not happen consistently either, only occasionally and seemingly at random.

What happens is that the makePSOCKcluster function times out and freezes the R session, but only if earlier in the session some (arbitrary) system() calls were performed. The video and script below illustrate the problem more clearly.

Some stuff I tried without success:

  • Disable antivirus/firewalls.
  • Waiting a couple of seconds between calling system and makePSOCKcluster.
  • Using different system calls.

How can I narrow this down further? Here is the script used in the video:

# Returns TRUE if `command` can be run through system() without an error.
cmd_exists <- function(command) {
  iswin <- identical(.Platform$OS.type, "windows")
  if (iswin) {
    # show.output.on.console is only passed on Windows
    test <- suppressWarnings(try(system(command, intern = TRUE, ignore.stdout = TRUE, ignore.stderr = TRUE, show.output.on.console = FALSE), silent = TRUE))
  } else {
    test <- suppressWarnings(try(system(command, intern = TRUE, ignore.stdout = TRUE, ignore.stderr = TRUE), silent = TRUE))
  }
  !is(test, "try-error")
}

# A few arbitrary system() calls, followed by the cluster creation that hangs
options(hasgit = cmd_exists("git --version"))
options(haspandoc = cmd_exists("pandoc --version"))
options(hastex = cmd_exists("texi2dvi --version"))
cluster <- parallel::makePSOCKcluster(1)
Standoffish answered 26/6, 2013 at 5:25 Comment(2)
+1 for the out of the box video idea ...Potshot
So if you remove the options() calls, there is no problem? You could try testing whether one call in particular triggers it. You could also look at the makePSOCKcluster implementation and see where it hangs.Iago

makePSOCKcluster, or more generally makeCluster, can hang for any number of reasons while creating the so-called worker processes. Each worker is a new R session started with the Rscript command. That session runs the .slaveRSOCK function, which opens a socket connection back to the master and then calls the slaveLoop function, where it executes the tasks the master sends to it. When something goes wrong while starting any of the worker processes, the master hangs in socketConnection, waiting for the worker to connect even though that worker may have died or may never have been created successfully.
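Since every worker begins as a plain Rscript session, a quick first check is whether Rscript starts and exits cleanly on its own. The following is only a minimal sketch of that check: the variable names are illustrative, and the assumption is that the Rscript binary sits in the `bin` directory of the running R installation.

```r
# Sketch: check that a bare Rscript session starts and exits by itself.
# The names `rscript`, `out` and `elapsed` are illustrative only.
rscript <- file.path(R.home("bin"), "Rscript")
if (identical(.Platform$OS.type, "windows")) {
  rscript <- paste0(rscript, ".exe")
}

started <- Sys.time()
# stdout/stderr = TRUE capture everything the child session prints,
# including any errors raised by startup files.
out <- system2(rscript, c("-e", shQuote("cat('worker-ok')")),
               stdout = TRUE, stderr = TRUE)
elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))

print(out)
cat("Rscript startup took", round(elapsed, 2), "seconds\n")
```

If this call itself stalls, is unusually slow, or prints errors, the problem probably lies in R's startup rather than in the cluster code.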

Using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, then go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
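As a rough sketch of the outfile approach on its own, the snippet below sends the worker's output to a log file and reads it back. The log location in `tempdir()` and the name `log_file` are assumptions made for illustration; any writable path should do.

```r
library(parallel)

# Hypothetical log location; any writable path should work.
log_file <- file.path(tempdir(), "worker-log.txt")

# outfile redirects the worker's stdout and stderr into the log file.
cl <- makePSOCKcluster(1, outfile = log_file)

# Make the worker print something so the log has content to inspect.
clusterEvalQ(cl, cat("hello from worker\n"))
stopCluster(cl)

# Inspect whatever the worker wrote (startup messages, errors, output).
print(readLines(log_file))
```

If makePSOCKcluster hangs here, inspect the log file from another session or a text editor. If it stays empty or shows nothing useful, manual mode is the next step.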

Here's an example:

> library(parallel)

> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
   '/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

Next open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.

To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions.

Adrianadriana answered 27/6, 2014 at 15:46 Comment(2)
I had the same problem, and the hint "which involves starting new R sessions using the Rscript command" was very useful. When something prevents R from starting up cleanly (e.g. an inaccessible network drive or an error in the .Rsite file), makePSOCKcluster hangs while creating the cluster. In my case R was hanging almost unnoticeably at startup because of a network problem; after removing the incorrect network path I could use makePSOCKcluster without problems.Dicho
@tomka, what was the network problem in your case?Homozygote
