Results of workers not returned properly - snow - debug

I'm using the snow package in R to execute a function on a SOCK cluster of multiple machines (3) running Linux. I tried running the code with both parLapply and clusterApply.

In case of any error at the worker level, the results of the worker nodes are not returned properly to the master, which makes debugging very hard. I'm currently logging every heartbeat of the worker nodes independently using futile.logger, and the logs suggest the results are computed properly. But when I try to print the result on the master node (after receiving the output from the workers), I get an error: Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: missing value where TRUE/FALSE needed.

Is there any way to debug the results of the workers more deeply?
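For context, the setup is roughly along the lines of the sketch below (the hostnames, worker function, log file paths and computation are placeholders, not the real code):

library(snow)
library(futile.logger)

# three Linux machines reachable over passwordless SSH (hostnames are placeholders)
cl <- makeSOCKcluster(c("node1", "node2", "node3"))
clusterEvalQ(cl, library(futile.logger))

worker_task <- function(i) {
  # each worker logs its heartbeats to its own file
  flog.appender(appender.file(sprintf("worker_%d.log", Sys.getpid())))
  flog.info("starting task %d", i)
  res <- sqrt(i)                       # placeholder for the real computation
  flog.info("finished task %d, result %f", i, res)
  res
}

r <- parLapply(cl, 1:10, worker_task)  # same failure with clusterApply
stopCluster(cl)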

Apure asked 3/6, 2013 at 11:38 Comment(7)
First order of business would be to run the code (with a reduced number of iterations) without parallelization and debug it. Have you done that? – Bullhorn
Check that some of the workers aren't actually computing NA or NULL as their results. That sort of thing would log fine, but the reduce or aggregate step will fail when it tries to return to the master; the error you are seeing could be something like that. Can you compute sequentially and see the actual result of each batch or chunk? Also check traceback(). – Vaudeville
@Bullhorn: Thanks for your comment. Yes, I did that; it works fine without parallelization. Also, if it helps, the workers are managed via passwordless SSH login (using authorized keys). I am not able to reproduce this error. – Apure
@Vaudeville: Thanks for your reply. I'm logging the results of the workers as well, and they are computed fine. I cannot do a traceback() since the R sessions created for the workers are closed after the job is done, and I don't want to keep unneeded sessions alive. – Apure
Is it possible to do a sequential run on just your local machine? That could rule some things out. Does it fail on the workers every time, or just some of the time? Also double-check that your base R and any packages being used are the same versions on every machine. – Vaudeville
@Vaudeville: There is no problem in the sequential case. It fails every time, but at random points: sometimes the code runs smoothly for a while and then throws this error, sometimes it fails earlier. – Apure
Are you deploying from the cluster/server itself, or from a local machine that farms the work out? – Vaudeville

The checkForRemoteErrors function is called by parLapply and clusterApply to check for task errors, and it will throw an error if any of the tasks failed. Unfortunately, although it displays the error message, it doesn't provide any information about what worker code caused the error. But if you modify your worker/task function to catch errors, you can retain some extra information that may be helpful in determining where the error occurred.

For example, here's a simple snow program that fails. Note that it uses outfile='' when creating the cluster so that output from the program is displayed, which by itself is a very useful debugging technique:

library(snow)
cl <- makeSOCKcluster(2, outfile='')
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}
r <- parLapply(cl, 1:2, problem)

When you execute this, you see the error message from checkForRemoteErrors and some other messages, but nothing that tells you that the if statement caused the error. To catch errors when calling problem, we define workerfun:

workerfun <- function(i) {
  tryCatch({
    problem(i)
  },
  error=function(e) {
    print(e)
    stop(e)
  })
}

Now we execute workerfun with parLapply instead of problem, first exporting problem to the workers:

clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun)

Among the other messages, we now see

<simpleError in if (NA) j <- 999 else j <- i: missing value where TRUE/FALSE needed>

which includes the actual if statement that generated the error. Of course, it doesn't tell you the file name and line number of the expression, but it's often enough to let you solve the problem.
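If the failing expression alone isn't enough, a variant of the same idea (a sketch, not part of the original answer) is to use withCallingHandlers so the worker prints the full call stack before the error propagates back to the master; with outfile='' that output appears in your terminal:

workerfun2 <- function(i) {
  withCallingHandlers(
    problem(i),
    error=function(e) {
      # runs on the worker before the error propagates back to the master
      cat("task", i, "failed:", conditionMessage(e), "\n")
      print(sys.calls())  # call stack at the point of the error
    }
  )
}

clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun2)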

Cantor answered 5/6, 2013 at 15:56 Comment(3)
I tried your above solution but it doesn't produce any additional error message. I know my underlying function works fine because I can run it using apply with no errors, and the parallelisation works fine for another instance of the underlying function; I just get checkForRemoteErrors(val): 4 nodes produced errors; first error: subscript out of bounds for one particular instance. Any suggestions? – Choroid
I'm having a similar issue where the above works, but only for 1:2. 1:3 says one node failed, 1:4 says 2, 1:6 says 3, and 1:10 or above says 4 nodes. I think in my case it's down to a limit on an API being accessed. – Guillema
It actually works and this is incredibly useful for debugging parallel code. Thanks, +1 – Fusee

Check the range of your observations and how they vary. I have noticed that when the observations have many decimal places (4, 5, 6), it throws glm.nb off. To solve this, I just round the observations to 2 decimal places.
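A minimal sketch of that rounding step (the data frame dat and the model formula are hypothetical):

library(MASS)

# round the response to 2 decimal places before fitting, as described above
dat$y <- round(dat$y, 2)
fit <- glm.nb(y ~ x1 + x2, data = dat)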

Scrimpy answered 5/9, 2014 at 20:31 Comment(0)
