Parallel computation: Loading packages in each thread only once

I am currently working with some large datasets, so parallelizing the workflows is the only way to go.

I need to load some packages in each thread once, at the beginning (i.e. for (this.thread in threads) { # load some packages }).

Unfortunately, I'm not sure how to do that.

The following code illustrates my problem, where I am trying to use the pipe operator from magrittr inside a %dopar% loop:


library(parallel)
library(doParallel)
library(foreach)
library(magrittr)


# Generate some random data and function :
# -----------------------------------------

randomData = runif(10^3)
randomFunction = function(x) {x * (2^x) } 

randomData[1] %>% randomFunction #Works



# And now ... The parallel part :
# --------------------------------

myCluster = makeCluster(6)
registerDoParallel(myCluster)


# Test that the do par is up and running: 
foreach(i = randomData) %dopar% { i }


# Use magrittr pipe operator: 
# Error in { : task 1 failed - "could not find function "%>%""
foreach(i = randomData) %dopar% { i %>% randomFunction }


# Load the library inside the loop body (i.e. length(randomData) times!)
# Apart from calling library() (length(randomData) - numberOfThreads) times more than needed,
# it works nicely
foreach(i = randomData) %dopar% { library(magrittr);  i %>% randomFunction }


# Now try again without re-loading:
# Works nicely, since the previous loop already loaded magrittr on the workers
foreach(i = randomData) %dopar% { i %>% randomFunction }


Any ideas?

Birth answered 1/12, 2015 at 13:50 Comment(4)
@VeerendraGadekar, I did generate some random data in the script above. My problem is not in running parallel loops. I am trying to avoid loading the packages n times, where n = the length of my big data. Hope that clarifies it a bit more. - Birth
Two calls to library() are about as costly as one (R checks to see if the library is already loaded, and if so does nothing), so no need to sweat it. Go with your "it works nicely" solution. - Oke
Thanks for your comment @VeerendraGadekar. Actually, my problem isn't specifically with magrittr or the pipe operator, but with the concept in general. For example, I am using some interpolation functions from other packages. I'm using magrittr here just for illustration. - Birth
@MartinMorgan, you are absolutely right. The benchmark results of your idea: initial load: 31.5 milliseconds, subsequent loads: 133 MICROseconds. Thanks again. - Birth

The doParallel package depends on parallel, so loading it also gives you parallel's handy low-level functions, including clusterCall, which executes a function once on each node.

I had the exact same problem and solved it by doing:

library(doParallel)
myCluster = makeCluster(6)
registerDoParallel(myCluster)
clusterCall(myCluster, function() library(magrittr))
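
A related option from the same parallel package is clusterEvalQ, which evaluates an expression once on every worker. The sketch below is only an illustration under a few assumptions: it uses two workers instead of six, the name cl is made up for the example, and the base package splines stands in for whatever package you actually need (magrittr in the question). It also checks each worker's search path to confirm the package really is attached there.

```r
library(parallel)

# Two workers keep the sketch light; the snippet above uses six.
cl <- makeCluster(2)

# clusterEvalQ() evaluates the expression once on each worker.
# "splines" is a stand-in for the package you need (e.g. magrittr).
invisible(clusterEvalQ(cl, library(splines)))

# Ask every worker whether the package is now on its search path.
attached <- clusterCall(cl, function() "package:splines" %in% search())

stopCluster(cl)
```

Here attached is a list with one logical value per worker, so all(unlist(attached)) tells you whether the one-time load reached every node.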

You can also use the argument .packages:

foreach(i = 1:5, .packages = "magrittr") %dopar% {i %>% runif}
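
For completeness, here is a self-contained sketch of the .packages route, assuming foreach and doParallel are installed. The names cl and res, the two-worker cluster, and the use of the base package splines (with its bs function) as the stand-in package are all choices made for this example, not part of the original answer.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(2)
registerDoParallel(cl)

# .packages asks foreach to load the listed packages on the workers
# before the loop body is evaluated, so bs() can be called unqualified.
# "splines" is only a stand-in for the package your loop body needs.
res <- foreach(i = 3:5, .packages = "splines", .combine = c) %dopar% {
  ncol(bs(seq(0, 1, length.out = 10), df = i))
}

stopCluster(cl)
```

The loop body never calls library() itself, and stopCluster releases the workers once the results are back.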
Anstice answered 1/12, 2015 at 21:23 Comment(1)
Perfect! Didn't know about that! Thanks @mkemp6! - Birth

You could try this:

foreach(i = randomData,.packages=c("magrittr")) %dopar% {
  i %>% randomFunction
}
Dunedin answered 6/9, 2016 at 6:53 Comment(0)
