The code below produces different results on Windows and Ubuntu. I understand this is because the two platforms handle parallel processing differently.
Summarizing: I cannot insert / rbind data in parallel on Linux (mclapply, mcmapply), while I can on Windows.
Thanks to @Hong Ooi for pointing out that mclapply does not run in parallel on Windows; the question below is still valid.
Of course, there are no multiple inserts into the same data.frame; each insert is performed into a separate data.frame.
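The platform difference can be reproduced on a smaller scale. The following is a minimal sketch, not the code from this question: it assumes a plain environment (the hypothetical `store`) as a stand-in for an R6 object, since both have reference semantics, and it forces two forked workers where forking is available.

```r
library(parallel)

# Hypothetical stand-in for an R6 instance: a plain environment.
store <- new.env()
store$rows <- 0L

# mclapply() only accepts mc.cores = 1 on Windows, so fall back there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each worker updates `store` and returns the value it sees.
seen_in_worker <- mclapply(1:2, function(k) {
  store$rows <- store$rows + 10L
  store$rows
}, mc.cores = cores)

unlist(seen_in_worker)  # values observed inside the workers
store$rows              # value left in the parent process
```

On a fork-based platform each child works on its own copy of `store`, so the parent should still hold 0 afterwards. On Windows the call runs sequentially in the parent process, so the modification persists.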
library(R6)
library(parallel)
# storage objects generator
cl <- R6Class(
  classname = "cl",
  public = list(
    data = data.frame(NULL),
    initialize = function() invisible(self),
    insert = function(x) self$data <- rbind(self$data, x)
  )
)
N <- 4L # number of entities
i <- setNames(seq_len(N),paste0("n",seq_len(N)))
# random data.frames
set.seed(1)
ldt <- lapply(i, function(i) data.frame(replicate(sample(3:10,1),sample(letters,1e5,rep=TRUE))))
# entity storage
lcl1 <- lapply(i, function(i) cl$new())
lcl2 <- lapply(i, function(i) cl$new())
lcl3 <- lapply(i, function(i) cl$new())
# insert data
invisible({
  mclapply(names(i), FUN = function(n) lcl1[[n]]$insert(ldt[[n]]))
  mcmapply(FUN = function(dt, cl) cl$insert(dt), ldt, lcl2, SIMPLIFY = FALSE)
  lapply(names(i), FUN = function(n) lcl3[[n]]$insert(ldt[[n]]))
})
### Windows
sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
# n1 n2 n3 n4
# 100000 100000 100000 100000
sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
# n1 n2 n3 n4
# 100000 100000 100000 100000
sapply(lcl3, function(cl) nrow(cl$data)) # lapply
# n1 n2 n3 n4
# 100000 100000 100000 100000
### Unix
sapply(lcl1, function(cl) nrow(cl$data)) # mclapply
#n1 n2 n3 n4
# 0 0 0 0
sapply(lcl2, function(cl) nrow(cl$data)) # mcmapply
#n1 n2 n3 n4
# 0 0 0 0
sapply(lcl3, function(cl) nrow(cl$data)) # lapply
# n1 n2 n3 n4
# 100000 100000 100000 100000
And the question: How can I rbind in parallel into separate data.frames on a Linux platform?
P.S. Off-memory storage such as SQLite cannot be considered a solution in my case.
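One possible direction, sketched under assumptions rather than offered as a definitive solution: let each worker perform the rbind and return the combined data.frame, then assign the result in the parent process. The names `ids`, `stores`, and `chunks` are hypothetical stand-ins (plain environments instead of R6 instances, tiny data.frames instead of `ldt`).

```r
library(parallel)

# Hypothetical stand-ins for the storage objects and the input data.
ids <- c(n1 = "n1", n2 = "n2")
stores <- lapply(ids, function(id) {
  e <- new.env()
  e$data <- data.frame(x = integer(0))
  e
})
chunks <- lapply(ids, function(id) data.frame(x = 1:5))

# mclapply() only accepts mc.cores = 1 on Windows, so fall back there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# The rbind runs in the workers, which RETURN the combined data.frame
# instead of modifying the storage object in place.
combined <- mclapply(
  ids,
  function(id) rbind(stores[[id]]$data, chunks[[id]]),
  mc.cores = cores
)

# The assignment happens in the parent, so it is visible afterwards.
for (id in ids) stores[[id]]$data <- combined[[id]]

sapply(stores, function(s) nrow(s$data))
```

The trade-off is that every result is serialized back to the parent, so for large data.frames the transfer cost may offset part of the gain from running the rbind in parallel.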
data.table package. I know my tip doesn't answer your question directly, but still might help performance. The data.table package plays nicer with large data sets (GBs range) than base R's data.frames. – Prebendary
data.table package and its different extensions... – Bureaucratize