Consider a standard grouped operation on a data.frame:
library(plyr)
library(doMC)
library(MASS) # for example
nc <- 12
registerDoMC(nc)
d <- data.frame(x = c("data", "more data"), g = c("group1", "group2"))
y <- "some global object"
res <- ddply(d, .(g), function(d_group) {
  # slow, complicated operations on d_group
}, .parallel = FALSE)
It's trivial to take advantage of a multi-core setup by simply writing .parallel = TRUE instead. This is one of my favorite features of plyr.
But with plyr being deprecated (I think) and essentially replaced by dplyr, purrr, etc., the solution to parallel processing has become significantly more verbose:
library(dplyr)
library(multidplyr)
library(parallel)
library(MASS) # for example
nc <- 12
d <- tibble(x = c("data", "more data"), g = c("group1", "group2"))
y <- "some global object"
cl <- create_cluster(nc)
set_default_cluster(cl)
cluster_library(cl, packages = c("MASS"))
cluster_copy(cl, obj = y)
d_parts <- d %>% partition(g, cluster = cl)
# slow, complicated operations on d_parts would go here
res <- d_parts %>% collect() %>% ungroup()
rm(d_parts)
rm(cl)
You can imagine how long this example could get, considering that each package and object you need inside the loop needs its own cluster_* command to copy it onto the nodes. The non-parallelized plyr-to-dplyr translation is just a simple dplyr::group_by construction, and it's unfortunate that there's no terse way to enable parallel processing on it. So, my questions are:
- Is this actually the preferred way to translate my code from plyr to dplyr?
- What sort of magic is happening behind the scenes in plyr that makes it so easy to turn on parallel processing? Is there a reason this capability would be particularly difficult to add to dplyr and that's why it doesn't exist yet?
- Are my two examples fundamentally different in terms of how the code is executed?
The plyr example uses doMC, a multicore backend for foreach, which means forking. Your multidplyr example uses create_cluster, which defaults to parallel::makePSOCKcluster, that is, a Parallel SOCKet Cluster. – Thew
partition() without setting up a cluster in advance: plyr relies on a previously registered foreach backend (see print(plyr:::setup_parallel)), whereas multidplyr::partition() without a cluster relies on create_cluster() implicitly, but would probably detect another backend if one is already registered (I haven't checked, though; see print(multidplyr:::cluster_exists)). The first examples of the multidplyr vignette illustrate this capability of simply calling partition() without previous setup. – Thew
multidplyr doesn't allow forking the way plyr does, only PSOCK. – Thew
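To make the fork versus PSOCK distinction from the comments concrete, here is a minimal sketch using only base R's parallel package. It does not use plyr or multidplyr, and it is not how either package is implemented. The helper describe_group and the vector groups are hypothetical names made up for this illustration. The point it tries to show is why the PSOCK route needs explicit copying of y while the forking route does not.

```r
library(parallel)

y <- "some global object"
groups <- c("group1", "group2")

# Hypothetical stand-in for the per-group work; it depends on the global y
describe_group <- function(g) paste(g, "sees", y)

# PSOCK cluster: each worker is a fresh R session, so y has to be
# copied to the workers explicitly before it can be used there
cl <- makePSOCKcluster(2)
clusterExport(cl, "y", envir = environment())
res_psock <- parLapply(cl, groups, describe_group)
stopCluster(cl)

# Forking (Unix-alikes only): child processes inherit the parent's
# workspace, so y is visible without any export step
if (.Platform$OS.type == "unix") {
  res_fork <- mclapply(groups, describe_group, mc.cores = 2)
}
```

This would explain why the forking setup behind .parallel = TRUE needs no per-object bookkeeping, while a socket-based cluster needs a copy step for every package and object used on the nodes.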