I have a dataset with a few numeric columns and over 100 million rows as a data.table object. I would like to do grouped operations on some of the columns based on other columns, for example counting the unique elements of column "a" for each category in column "d".
my_data[, a_count := uniqueN(col_a), by = col_d]
I have many of these operations, which are independent of each other, and it would be great to run them in parallel. I found the following piece of code, which runs different functions in parallel.
library(parallel)
library(data.table)

fun1 = function(x) {
  # count unique col_a values per col_d group, assigned by reference
  x[, a_count := uniqueN(col_a), by = col_d]
  return(x[, .(callId, a_count)])
}
fun2 = function(x) {
  x[, b_count := uniqueN(col_b), by = col_d]
  return(x[, .(callId, b_count)])
}
fun3 = function(x) {
  x[, c_count := uniqueN(col_c), by = col_d]
  return(x[, .(callId, c_count)])
}

tasks = list(job1 = fun1,
             job2 = fun2,
             job3 = fun3)

cl = makeCluster(3)
clusterEvalQ(cl, library(data.table))  # load data.table on each worker (exporting 'uniqueN' by name is not enough)
clusterExport(cl, 'my_data')           # the functions travel inside `tasks`
out = clusterApply(
  cl,
  tasks,
  function(f) f(my_data)
)
stopCluster(cl)
How can I improve this solution? For example, it would be great to pass only the essential columns to each function rather than the entire data.table.
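One way I can think of to send only the essential columns is to pair each function with the column set it needs and subset on the master before dispatch. This is only a sketch building on the functions above; specs and payloads are names I made up, and with 100 million rows even a three-column subset is still a lot of data to serialize to each worker.

library(parallel)
library(data.table)

# Pair each counting function with just the columns it needs.
specs = list(
  job1 = list(fun = fun1, cols = c("callId", "col_a", "col_d")),
  job2 = list(fun = fun2, cols = c("callId", "col_b", "col_d")),
  job3 = list(fun = fun3, cols = c("callId", "col_c", "col_d"))
)

# Build the narrow per-task tables on the master; clusterApply serializes
# each list element to its worker, so only the needed columns travel.
payloads = lapply(specs, function(s)
  list(fun = s$fun, data = my_data[, s$cols, with = FALSE]))

cl = makeCluster(3)
clusterEvalQ(cl, library(data.table))
out = clusterApply(cl, payloads, function(p) p$fun(p$data))
stopCluster(cl)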
If you use FORK clusters and don't modify the data, I think you don't make any copy. – Bremer
… in a data.table, if that helps. – Martymartyn
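To expand on the FORK suggestion: on Unix-alikes, parallel::mclapply forks the master process, so the children share my_data copy-on-write and nothing needs to be serialized, provided the workers only read it. Here is a sketch under that assumption; I compute the counts with plain grouped aggregations instead of :=, since assigning by reference would write to the shared pages and trigger copies. Note this returns one row per col_d group rather than one per callId, so the output shape differs from the functions above.

library(parallel)
library(data.table)

# Read-only jobs: grouped aggregations with no := assignment,
# so the forked children never write to the shared copy of my_data.
jobs = list(
  a = function(d) d[, .(a_count = uniqueN(col_a)), by = col_d],
  b = function(d) d[, .(b_count = uniqueN(col_b)), by = col_d],
  c = function(d) d[, .(c_count = uniqueN(col_c)), by = col_d]
)

# mclapply forks the current process (not available on Windows)
out = mclapply(jobs, function(f) f(my_data), mc.cores = 3)

One caveat: data.table is itself multithreaded, so three forked jobs each using OpenMP threads can oversubscribe the machine; as far as I know data.table falls back to a single thread after a fork, but setDTthreads() lets you control this explicitly if you see contention.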