Loop inside a foreach loop using doParallel

Asked 19/5, 2017 at 16:23 Answered 27/5, 2017 at 11:10

I have a function that contains a loop

myfun = function(z1.d, r, rs){
  x = z1.d[,r]
  or.d = order(as.vector(x), decreasing=TRUE)[rs]
  zz1.d = as.vector(x)
  r.l = zz1.d[or.d]

  y=vector()
  for (i in 1:9)
  {
    if(i<9) y[i]=mean( x[(x[,r] >= r.l[i] & x[,r] < r.l[i+1]),r] ) else{
      y[i] =  mean( z1.d[(x >= r.l[9]),r] )}
  }
  return(y)
}

rs is a numeric vector, z1.d is a zoo object, and y is also a numeric vector.

When I try to run the function inside a parallel loop:

cls = makePSOCKcluster(8)
registerDoParallel(cls)

rlarger.d.1  = foreach(r=1:dim(z1.d)[2], .combine = "cbind") %dopar% {    
  myfun(z1.d, r, rs)}

stopCluster(cls)

I get the following error:

Error in { : task 1 failed - "incorrect number of dimensions"

I don't know why, but I noticed that if I take the loop out of my function, it does not give an error.

Also, if I run the exact same code with %do% instead of %dopar% (so not running in parallel), it works fine (slow, but without errors).

EDIT: as requested here is a sample of the parameters:

dim(z1.d)
[1] 8766  107
> z1.d[1:4,1:6]
                    AU_10092 AU_10622 AU_12038 AU_12046 AU_13017 AU_14015
1966-01-01 23:00:00       NA       NA       NA    1.816        0    4.573
1966-01-02 23:00:00       NA       NA       NA    9.614        0    4.064
1966-01-03 23:00:00        0       NA       NA    0.000        0    0.000
1966-01-04 23:00:00        0       NA       NA    0.000        0    0.000

> rs
[1] 300 250 200 150 100  75  50  30  10

r is defined in the foreach loop

Hayashi answered 19/5, 2017 at 16:23 Comment(7)

A sample input of parameters z1.d, rs, r would be helpful. – Scrivens 22/5, 2017 at 11:49

@Hayashi - What operating system are you running on? In the context of parallel execution, this point matters, as Windows, Linux and macOS in some cases have different parallel implementations exposed via R. – Mesonephros 22/5, 2017 at 16:45

I am running it on Windows – Hayashi 24/5, 2017 at 11:58

I am not very familiar with foreach, but usually, when working with parallel cores, variables need to be "sent" to the cores' environments. In your case, I do not see where you declare z1.d and rs in the cores' environments. As I said, I don't really know foreach, but I would use something like: rlarger.d.1 = foreach(r=1:dim(z1.d)[2], z1.d = z1.d, rs = rs, .combine = "cbind") %dopar% { myfun(z1.d, r, rs)}. By the way, the varying parameter of a function, like r here, is usually listed first in the function's parameters: myfun = function(r, z1.d, rs). – Elbow 25/5, 2017 at 15:10

Are you on Windows or on another operating system? – Rysler 25/5, 2017 at 21:20

@Hayashi Can you provide a sample dataset with 'dput()'? Looking at the code, my sense is that the issue is with partitioning the data across the parallel processes you are creating. ref: #24066136 – Mesonephros 26/5, 2017 at 5:19

@Hayashi what are you actually trying to achieve 'in plain words'? There is probably a way to do it fast without messing with parallelisation, which never worked well in Windows anyways;) – Production 26/5, 2017 at 10:35

The error occurs because zoo is not loaded on your workers. The workers therefore don't know how to handle zoo objects properly; they treat them as matrices, which do not behave the same way when subset. So the quick fix for your stated problem is to add .packages = "zoo" to your foreach call.
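As a minimal sketch of that fix, assuming the foreach, doParallel and zoo packages are installed: the toy series `z` and the result name `col_means` are made up for illustration and are not the asker's data.

```r
library(foreach)
library(doParallel)
library(zoo)

# Small synthetic zoo series (hypothetical data, 10 rows x 4 columns)
set.seed(1)
z <- zoo(matrix(runif(40), nrow = 10, ncol = 4))

cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# .packages = "zoo" loads zoo on every worker before the loop body runs,
# so zoo methods such as `[` and coredata() are available there
col_means <- foreach(r = seq_len(ncol(z)), .combine = "c",
                     .packages = "zoo") %dopar% {
  mean(coredata(z[, r]))
}

stopCluster(cl)
```

The same `.packages` argument can be added unchanged to the original call that uses `.combine = "cbind"`.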

In my opinion, you don't even need parallel computation. You can speed up your function dramatically by using numeric vectors instead of zoo objects:

# sample time series to match your object's size
set.seed(1234)
z.test <- as.zoo(replicate(107,sample(c(NA,runif(1000,0,10)),size = 8766, replace = TRUE)))

myfun_new <-  function(z, r, rs){
  x <-  as.numeric(z[,r])
  r.l <- x[order(x, decreasing=TRUE)[rs]]
  res_dim <- length(rs)
  y=numeric(res_dim)
  for (i in 1:res_dim){
    if(i< res_dim){ 
      y[i] <- mean( x[(x >= r.l[i] & x < r.l[i+1])], na.rm = TRUE ) 
    }else{
      y[i] <-   mean( x[(x >= r.l[res_dim])] , na.rm = TRUE)
    }
  }
  return(y)
}
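To make the binning logic easier to follow, here is a small worked example in base R. The values of `x` and `rs` are invented, and `upper` is a helper name introduced only for this sketch; it replaces the if/else by giving the last bin an open upper bound.

```r
# Toy column with one missing value (hypothetical data)
x  <- c(5, 1, 9, 3, 7, NA)
rs <- c(4, 2)  # ranks, counted from the largest value

# Thresholds: the 4th-largest and the 2nd-largest values of x
# (order() puts NA last by default, so it is never picked here)
r.l <- x[order(x, decreasing = TRUE)[rs]]

# Upper bound of each bin; the last bin has no upper limit
upper <- c(r.l[-1], Inf)

# Mean of the values falling into each bin, ignoring NA
y <- sapply(seq_along(r.l), function(i) {
  mean(x[x >= r.l[i] & x < upper[i]], na.rm = TRUE)
})

print(r.l)  # 3 7
print(y)    # 4 8
```

Here the first bin holds 5 and 3 (mean 4) and the last bin holds 9 and 7 (mean 8), which is what the loop in myfun_new computes for the same inputs.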

Simple timings show the improvement:

system.time({
  cls = makePSOCKcluster(4)
  registerDoParallel(cls)
  rlarger.d.1 = foreach(r=1:dim(z.test)[2],.packages = "zoo", .combine = "cbind") %dopar% { 
    myfun(z.test, r, rs)}
  stopCluster(cls)
})
##    user  system elapsed 
##    0.08    0.10   10.93
system.time({
  res <-sapply(1:dim(z.test)[2], function(r){myfun_new(z.test, r, rs)})
})
##    user  system elapsed 
##    0.48    0.21    0.68

The results are the same (only the column names differ):

all.equal(res, rlarger.d.1, check.attributes = FALSE)
## [1] TRUE

Balderas answered 27/5, 2017 at 11:10 Comment(1)

Thanks! Your suggestion is a much more efficient way of doing it! – Hayashi 29/5, 2017 at 19:10

It seems like there is an error in your function code.

In line 2 you create a 1-dimensional object

x = z1.d[,r]

In line 9 you treat it like a 2-dimensional one

x[some_logic, r]

That is why you get the "incorrect number of dimensions" error. I do not know, however, why it works in the %do% variant.
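The mismatch can be reproduced in base R with a plain numeric vector. This is only an illustration with made-up values; the exact wording of the message depends on the R locale.

```r
x <- c(4, 8, 15, 16)  # plain numeric vector: it has no dim attribute

# Two indices on a 1-dimensional object raise an error;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(
  x[x > 5, 1],
  error = function(e) conditionMessage(e)
)
print(msg)

# A single logical index works as expected
kept <- x[x > 5]
print(kept)
```

In an English locale, `msg` should match the text reported in the question.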

In any case, you need to replace the code inside the for loop with:

if(i<9) y[i]=mean( x[(x >= r.l[i] & x < r.l[i+1])] ) else{
      y[i] =  mean( x[(x >= r.l[9])] )}

Or with:

if(i<9) y[i]=mean( z1.d[(x >= r.l[i] & x < r.l[i+1]),r] ) else{
      y[i] =  mean( z1.d[(x >= r.l[9]),r] )}

As you did not provide a reproducible example, I did not test it.

Arnaud answered 26/5, 2017 at 8:18 Comment(0)
