doParallel (package) foreach does not work for large iteration counts in R
I'm running the following code (extracted from doParallel's vignette) on a PC (Linux) with 4 physical and 8 logical cores.

Running the code with iter=1e+6 or less, everything is fine, and I can see from CPU usage that all cores are employed in the computation. However, with a larger number of iterations (e.g. iter=4e+6), parallel computing seems not to work: when I monitor CPU usage, only one core is involved in the computations (at 100% usage).

Example1

require("doParallel")
require("foreach")
registerDoParallel(cores=8)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
iter=4e+6
ptime <- system.time({
    r <- foreach(i=1:iter, .combine=rbind) %dopar% {
        ind <- sample(100, 100, replace=TRUE)
        result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
        coefficients(result1)
    }
})[3]

Do you have any idea what could be the reason? Could memory be the cause?

I googled around and found THIS, which is relevant to my question, but the difference is that I don't get any kind of error, and that OP apparently solved the problem by providing the necessary packages inside the foreach loop. As can be seen, no packages are used inside my loop.
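For reference, here is a back-of-the-envelope estimate of the size of the combined result (a sketch only: it covers just the final matrix, while foreach additionally buffers each task's result as a separate small vector until everything is combined, which can be far larger in total):

```r
# Rough size of just the final combined result: iter rows of 2 double
# coefficients. This is an estimate, not a measurement of foreach's
# actual peak memory use.
iter <- 4e+6
res_size <- object.size(matrix(0.0, nrow=iter, ncol=2))
print(res_size, units="MB")  # roughly 61 MB for the matrix alone
```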

UPDATE1

My problem is still not solved. Based on my experiments, I don't think memory could be the reason. The system has 8 GB of memory, on which I run the following simple parallel iteration (over all 8 logical cores):

Example2

require("doParallel")
require("foreach")

registerDoParallel(cores=8)
iter=4e+6
ptime <- system.time({
    r <- foreach(i=1:iter, .combine=rbind) %dopar% {
        i
    }
})[3]

This code runs without problems, but when I monitor CPU usage, just one core (out of 8) is at 100%.

UPDATE2

As for Example2, @SteveWeston (thanks for pointing this out) stated that (in comments) : "The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations."

However, Example1 still remains unsolved. When I run it and monitor the processes with htop, here is what happens in more detail:

Let's name all 8 created processes p1 through p8. The status (column S in htop) for p1 is R meaning that it's running and remains unchanged. However, for p2 up to p8, after some minutes, the status changes to D (i.e. uninterruptible sleep) and, after some minutes, again changes to Z (i.e. terminated but not reaped by its parent). Do you have any idea why this happens?

Scrimmage answered 10/6, 2016 at 14:36 Comment(13)
Have you tried explicitly creating and registering the cluster? For example cl <- makePSOCKcluster(8); registerDoParallel(cl)?Conklin
Even with iter=15e+6? Could you perhaps comment your hardware specifications (CPU and memory) and type of OS on which you run the code?Scrimmage
Yea I just copied and pasted your code and it ran until I killed it. I saw the 8 Rscript.exe processes in my Resource Monitor, though they were only using a small amount of the CPU core they were running on (but that's not necessarily a problem -- it can fluctuate due to inefficient chunking). I was running it on Windows Server 2008 with 24 physical CPUs.Conklin
Yes, I also tried creating and registering the clusters as you pointed out, to no avail. registerDoParallel() is also OK, as it's written in the vignette document provided by the package authors. I have the feeling that it has something to do with memory.Scrimmage
Yea, it probably does have to do with memory or compute resources. So do you really need that many iterations? If so, can you achieve the same result by chunking it into steps with fewer iterations and then combining (perhaps averaging) the results? I'm not sure what your use case is, but for Machine Learning you can often save model progress to storage memory (i.e. a checkpoint) so that you can break apart training big neural nets and other models into multiple sessions.Conklin
Yes, I do need that many: not as many as 15e+6, but about 3.8e+6. I actually tried to test with a toy example (like the above) whether everything goes fine before running my experiments. In my experiments, I need to repeat foreach around 100 times. So I think I have to break down the 100 repetitions rather than foreach itself.Scrimmage
I am getting it to run all 8 (after 10-15 seconds of it getting itself sorted) on Windows with I7 4910 running your above code verbatim.Uchida
Are you confirming that your workers are registered with getDoParRegistered() and getDoParWorkers()?Uchida
Yes, I do. For getDoParRegistered() I get TRUE and 8 for getDoParWorkers().Scrimmage
The example in your update is suffering from having tiny tasks. Only the master has any real work to do, which consists of sending tasks and processing results. That's fundamentally different than the problem with the original example which did use multiple cores on a smaller number of iterations.Casket
@SteveWeston If so, then given that I have 8 GB of memory available, and taking into account the size of the objects created inside and outside the loop, why shouldn't the original example occupy all CPUs at 100% for, e.g., iter=3e+6?Scrimmage
@SteveWeston You may again check my update.Scrimmage
I'm starting to think there is a timeout for the client processes. This could be in the client, or even in something network security, which could be shutting down atypical ports after a certain time.Velazquez
I think you're running low on memory. Here's a modified version of that example that should work better when you have many tasks. It uses doSNOW rather than doParallel because doSNOW allows you to process the results with the combine function as they're returned by the workers. This example writes those results to a file in order to use less memory; it then reads the results back into memory at the end using a ".final" function, but you could skip that if you don't have enough memory.

library(doSNOW)
library(tcltk)
nw <- 4  # number of workers
cl <- makeSOCKcluster(nw)
registerDoSNOW(cl)

x <- iris[which(iris[,5] != 'setosa'), c(1,5)]
niter <- 15e+6
chunksize <- 4000  # may require tuning for your machine
maxcomb <- nw + 1  # this count includes fobj argument
totaltasks <- ceiling(niter / chunksize)

comb <- function(fobj, ...) {
  for(r in list(...))
    writeBin(r, fobj)
  fobj
}

final <- function(fobj) {
  close(fobj)
  t(matrix(readBin('temp.bin', what='double', n=niter*2), nrow=2))
}

mkprogress <- function(total) {
  pb <- tkProgressBar(max=total,
                      label=sprintf('total tasks: %d', total))
  function(n, tag) {
    setTkProgressBar(pb, n,
      label=sprintf('last completed task: %d of %d', tag, total))
  }
}
opts <- list(progress=mkprogress(totaltasks))
resultFile <- file('temp.bin', open='wb')

r <-
  foreach(n=idiv(niter, chunkSize=chunksize), .combine='comb',
          .maxcombine=maxcomb, .init=resultFile, .final=final,
          .inorder=FALSE, .options.snow=opts) %dopar% {
    do.call('c', lapply(seq_len(n), function(i) {
      ind <- sample(100, 100, replace=TRUE)
      result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
      coefficients(result1)
    }))
  }
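The write-then-read round trip at the heart of this approach can be checked in isolation (a standalone sketch, independent of the cluster code, using a temp file instead of 'temp.bin'):

```r
# Minimal check of the writeBin/readBin pattern used above: stream numeric
# results to a binary file as they arrive, then reconstruct them as a
# 2-column matrix, just as the 'comb' and 'final' functions do.
f <- tempfile()
fobj <- file(f, open='wb')
writeBin(c(1.5, 2.5), fobj)   # pretend coefficients from task 1
writeBin(c(3.5, 4.5), fobj)   # pretend coefficients from task 2
close(fobj)
m <- t(matrix(readBin(f, what='double', n=4), nrow=2))
m  # 2 x 2 matrix: one row per task, in the order written
```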

I included a progress bar since this example takes several hours to execute.

Note that this example also uses the idiv function from the iterators package to increase the amount of work in each of the tasks. This technique is called chunking, and often improves the parallel performance. However, using idiv messes up the task indices, since the variable i is now a per-task index rather than a global index. For a global index, you can write a custom iterator that wraps idiv:

idivix <- function(n, chunkSize) {
  i <- 1
  it <- idiv(n, chunkSize=chunkSize)
  nextEl <- function() {
    m <- nextElem(it)  # may throw 'StopIteration'
    value <- list(i=i, m=m)
    i <<- i + m
    value
  }
  obj <- list(nextElem=nextEl)
  class(obj) <- c('abstractiter', 'iter')
  obj
}

The values emitted by this iterator are lists, each containing a starting index and a count. Here's a simple foreach loop that uses this custom iterator:

r <- 
  foreach(a=idivix(10, chunkSize=3), .combine='c') %dopar% {
    do.call('c', lapply(seq(a$i, length.out=a$m), function(i) {
      i
    }))
  }

Of course, if the tasks are compute intensive enough, you may not need chunking and can use a simple foreach loop as in the original example.

Casket answered 12/6, 2016 at 1:11 Comment(8)
Thanks for your solution but when I run your code I get: Warning: progress function failed: object 'pb' not foundScrimmage
@m0h3n You can always disable the progress bar by not using the foreach .options.snow option. Are pb and progress both defined in the global environment?Casket
Yes, I just copy/pasted your code and ran it. But look at the comb function, its parameters, and the for loop inside. Is that normal?Scrimmage
@m0h3n The comb function uses the first argument differently than the subsequent arguments which requires the use of the foreach .init argument. I've included examples of this pattern as examples in some of the foreach-related packages.Casket
I'm wondering if you run your code before posting here. I get this: Error: object 'opts' not found. Please try to clean your environment and run the code you posted here.Scrimmage
@m0h3n It works on my Mac laptop and my Linux desktop, and I got a friend to test it to verify that it doesn't just work for me. Since 'opts' is defined after 'mkprogress' and before 'resultFile', the error message that you report makes no sense to me.Casket
OK, my bad, sorry. As I was running your code via PuTTY, I used to get that error, seemingly because of that. I tried to run it on another PC (locally) and it's fine now. How can we get those indices (i.e. 1,2,3,4,...,niter) stored in r as the final result? Because in my experiments, I need to work with these indices; the question stated here is just a toy example.Scrimmage
Thanks for your update. You might check my update (original question) as well.Scrimmage

At first I thought you were running into memory problems, because submitting many tasks does use more memory, and that can eventually bog down the master process; my original answer therefore shows several techniques for using less memory. However, it now sounds like there is a startup and a shutdown phase during which only the master process is busy, while the workers are busy for some period in the middle. I think the issue is that the tasks in this example aren't really very compute intensive, so when you have a lot of tasks, you start to really notice the startup and shutdown times. I timed the actual computations and found that each task only takes about 3 milliseconds. In the past, you wouldn't have gotten any benefit from parallel computing with tasks that small. Now, depending on your machine, you can get some benefit, but the overhead is significant, and with a great many tasks you really notice that overhead.
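A sketch of how such a per-task timing can be measured (your numbers will differ from mine; the task body is copied from the question, and suppressWarnings is added because this toy glm fit produces convergence warnings):

```r
# Time the body of one foreach task by repeating it many times and
# averaging; the per-task cost is what must amortize the scheduling
# overhead of sending the task and collecting its result.
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
reps <- 200
elapsed <- system.time({
  for (k in seq_len(reps)) {
    ind <- sample(100, 100, replace=TRUE)
    suppressWarnings(
      result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    )
    coefficients(result1)
  }
})[["elapsed"]]
elapsed / reps  # seconds per task; a few milliseconds on typical hardware
```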

I still think that my other answer works well for this problem, but since you have enough memory, it's overkill. The most important technique is chunking. Here is an example that uses chunking with minimal changes to the original example:

require("doParallel")
nw <- 8
registerDoParallel(nw)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
niter <- 4e+6
r <- foreach(n=idiv(niter, chunks=nw), .combine='rbind') %dopar% {
  do.call('rbind', lapply(seq_len(n), function(i) {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)
  }))
}

Note that this does the chunking slightly differently than my other answer: it uses only one task per worker, via the idiv chunks option rather than the chunkSize option. This reduces the amount of work done by the master and is a good strategy if you have enough memory.
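The difference between the two idiv options can be seen directly with a standalone sketch (exact piece sizes come from idiv's internal balancing, so I only assert their count and sum):

```r
library(iterators)

# Drain an iterator into a plain vector (helper for this demo only;
# idiv signals the end of iteration with a 'StopIteration' condition).
collect <- function(it) {
  out <- c()
  repeat {
    v <- tryCatch(nextElem(it), error = function(e) NULL)
    if (is.null(v)) break
    out <- c(out, v)
  }
  out
}

# chunks=4: exactly 4 pieces, i.e. one task per worker
p1 <- collect(idiv(10, chunks = 4))
length(p1)  # 4
sum(p1)     # 10

# chunkSize=3: as many pieces as needed, none larger than 3
p2 <- collect(idiv(10, chunkSize = 3))
sum(p2)     # 10
```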

Casket answered 15/6, 2016 at 14:46 Comment(2)
Thanks, voted up. I will finalize this question soon.Scrimmage
If the parallel approach in Example1 does not work because the tasks are not compute intensive, then why does it work (in parallel, I mean) for small iteration counts and not for large ones?Scrimmage

© 2022 - 2024 — McMap. All rights reserved.