Run several R functions in parallel
I have a dataset with a few numeric columns and over 100 million rows as a data.table object. I would like to do grouped operations on some of the columns based on other columns. For example, count the unique elements of column "a" for each category in column "d".

my_data[, a_count := uniqueN(col_a), by = col_d]

I have many of these operations, which are independent of each other, so it would be great to run them in parallel. I found the following piece of code, which runs different functions in parallel.

library(parallel)
library(data.table)

fun1 = function(x){
  x[, a_count := uniqueN(col_a), by = col_d]
  return(x[, .(callId, a_count)])
}
fun2 = function(x){
  x[, b_count := uniqueN(col_b), by = col_d]
  return(x[, .(callId, b_count)])
}
fun3 = function(x){
  x[, c_count := uniqueN(col_c), by = col_d]
  return(x[, .(callId, c_count)])
}

tasks = list(job1 = function(x) fun1(x),
             job2 = function(x) fun2(x),
             job3 = function(x) fun3(x))

cl = makeCluster(3)
# attach data.table on each worker; exporting 'data.table' and 'uniqueN'
# as objects does not make the package available there
clusterEvalQ(cl, library(data.table))
clusterExport(cl, c('fun1', 'fun2', 'fun3', 'my_data'))

out = clusterApply(
  cl,
  tasks,
  function(f) f(my_data)
)
stopCluster(cl)

How can I improve this solution? For example, it would be great to pass only the essential columns to each function rather than the entire data.table.
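One way to do that (a sketch, not a tested solution; it assumes the column names used above and the fun1-fun3 definitions from the question) is to build the small two- or three-column subsets on the master and ship only those to the workers, for example with clusterMap:

library(parallel)
library(data.table)

# build the small subsets on the master; only these travel to the workers
subsets = list(
  my_data[, .(callId, col_a, col_d)],
  my_data[, .(callId, col_b, col_d)],
  my_data[, .(callId, col_c, col_d)]
)

cl = makeCluster(3)
clusterEvalQ(cl, library(data.table))

# pair each function with its subset; fun1-fun3 are serialized with the call,
# so no clusterExport of my_data is needed
out = clusterMap(cl, function(f, d) f(d), list(fun1, fun2, fun3), subsets)
stopCluster(cl)

The subsets are still materialized on the master, but each worker now receives only the columns its task needs instead of all of my_data.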

Thordis asked 29/5, 2018 at 21:54. Comments (4):
What's wrong with your current solution? - Bremer
It passes the entire my_data data.table to all functions, which causes memory limitations. One improvement would be to pass only the two essential columns to each function. - Thordis
If you use FORK clusters and don't modify the data, I think you don't make any copy. - Bremer
You can pass the essential columns to each function so that a copy of the data is avoided. You could also try using { in the j of a data.table, if that helps. - Martymartyn
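To illustrate the FORK suggestion from the comments, here is a minimal sketch using mclapply (Unix-alikes only), reusing fun1-fun3 and my_data from the question. A forked child shares the parent's memory copy-on-write, so my_data is not serialized or copied to the workers; note that the := inside each function then only creates the new column in that child's copy.

library(parallel)
library(data.table)

setDTthreads(1)  # keep data.table single-threaded inside forked children

# mclapply forks the current R process, so my_data is shared
# copy-on-write instead of being shipped to each worker
out = mclapply(
  list(fun1, fun2, fun3),
  function(f) f(my_data),
  mc.cores = 3
)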
