R: speeding up "group by" operations
I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr's ddply() function which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel but if this one step were faster I could greatly decrease the number of nodes I need.

Here's a reasonable simplification of what I am trying to do:

library(Hmisc)   # for wtd.mean()
library(plyr)    # for ddply()

# Set up some example data
year <-    sample(1970:2008, 1e6, rep=T)
state <-   sample(1:50, 1e6, rep=T)
group1 <-  sample(1:6, 1e6, rep=T)
group2 <-  sample(1:3, 1e6, rep=T)
myFact <-  rnorm(1e6, 15, 1e6)
weights <- rnorm(1e6)
myDF <- data.frame(year, state, group1, group2, myFact, weights)

# this is the step I want to make faster
system.time(
  aggregateDF <- ddply(myDF, c("year", "state", "group1", "group2"),
                       function(df) wtd.mean(df$myFact, weights = df$weights))
)

All tips or suggestions are appreciated!

Decrepitude answered 10/9, 2010 at 14:39 Comment(2)
Not related to performance, but check out weighted.mean in base R. – Azotic
Oh, that's handy. You can see I learned R by googling for what I need to do ;) – Decrepitude
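
To illustrate the comment above (my own addition, not part of the thread): base R's weighted.mean can stand in for Hmisc::wtd.mean in this call, assuming the default normwt behaviour is what you want.

# Hedged example: base weighted.mean as a drop-in for Hmisc::wtd.mean here.
weighted.mean(myDF$myFact, w = myDF$weights)           # base R
Hmisc::wtd.mean(myDF$myFact, weights = myDF$weights)   # Hmisc; same value with default normwt = FALSE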
Instead of the normal R data frame, you can use an immutable data frame, which returns pointers to the original when you subset it and can be much faster:

idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed 
# 18.032   0.416  19.250 

If I were to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({
  # Compute a single integer group id per row
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  # Work on a numeric matrix: matrix subsetting is much cheaper than data frame subsetting
  data <- as.matrix(myDF[c("myFact", "weights")])
  # Row indices belonging to each group
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  # Recover one row of grouping labels per group and bind on the results
  labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
    c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

# user  system elapsed 
# 2.04    0.29    2.33 

It's so much faster because it avoids copying the data, only extracting the subset needed for each computation when it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.
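
As a rough illustration of the second point (a sketch of my own, not from the original answer), you can time repeated row-subsetting of the two representations directly:

# Assumed illustration: matrix row-subsetting vs data frame row-subsetting.
m    <- as.matrix(myDF[c("myFact", "weights")])
df2  <- myDF[c("myFact", "weights")]
rows <- sample(nrow(myDF), 100)
system.time(for (i in 1:2000) m[rows, ])    # matrix: cheap
system.time(for (i in 1:2000) df2[rows, ])  # data frame: noticeably slower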

Azotic answered 10/9, 2010 at 15:25 Comment(8)
idata.frame was added in plyr 1.0. – Azotic
I had messed around with making indexes and such with data.table and had all but given up on that idea. I was hoping for a 50% improvement. This far exceeds my expectations. – Decrepitude
Having a little trouble making this run right... but I'm learning as I go... I had changed data to myDF but am not sure where the issue is. – Decrepitude
The code above seems to be missing the creation of the matrix 'data' (if I'm reading this right); possibly a data <- as.matrix(myDF[5:6]) at the top? – Decrepitude
I know this is from 4 years ago so it's a long shot, but I have this question and now the indices code doesn't seem to run. Anyone else have this issue? Have updates to plyr changed how plyr:::split_indices works? I'm getting this error: Error in plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, : unused argument (ids) – Toxicogenic
@Toxicogenic you're much better off using dplyr now – Azotic
@Azotic Thanks for that; it's hard to keep up with the latest methods for data management and manipulation. Any suggestions of a good forum or something? – Toxicogenic
@Toxicogenic the #rstats hashtag on Twitter, maybe – Azotic
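
Picking up the comment above, a dplyr equivalent would look roughly like this (a hedged sketch using current dplyr syntax, not part of the original answer):

# Sketch only: dplyr grouped weighted mean, equivalent in spirit to the ddply call.
library(dplyr)
aggregateDF <- myDF %>%
  group_by(year, state, group1, group2) %>%
  summarise(V1 = weighted.mean(myFact, weights), .groups = "drop")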
Further 2x speedup and more concise code:

library(data.table)
dtb <- data.table(myDF, key="year,state,group1,group2")
system.time( 
  res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)] 
)
#   user  system elapsed 
#  0.950   0.050   1.007 

My first post, so please be nice ;)


Since data.table v1.9.2, the setDT function is exported, which converts a data.frame to a data.table by reference (in keeping with data.table parlance, all set* functions modify the object by reference). This means no unnecessary copying, and it is therefore fast. You can time it, but the overhead is negligible.

require(data.table)
system.time({
  setDT(myDF)
  res <- myDF[, weighted.mean(myFact, weights), 
             by=list(year, state, group1, group2)] 
})
#   user  system elapsed 
#  0.970   0.024   1.015 

This is as opposed to 1.264 seconds with OP's solution above, where data.table(.) is used to create dtb.
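
To make the copy-vs-reference point concrete (my own sketch, not from the answer), data.table's address() shows that data.table() creates a new object while setDT() converts the existing one in place:

library(data.table)
df1 <- data.frame(x = 1:5)
dt1 <- data.table(df1)          # copies: dt1 lives at a new address
c(address(df1), address(dt1))   # different addresses
df2 <- data.frame(x = 1:5)
setDT(df2)                      # converts df2 by reference; no copy
class(df2)                      # "data.table" "data.frame"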

Rebec answered 29/10, 2010 at 20:37 Comment(2)
Good post! Thanks for the answer. To be consistent with the other methods, however, the step that creates the data table and index should be inside the system.time() call. – Decrepitude
Indeed, but it remains the fastest. It would be nice to have an option in ddply to operate on data.tables, or to use data.tables under the hood (I just discovered data.table while looking for solutions to this very problem, but I would prefer a more ddply-like syntax for this case). – Rebec
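
On the wish in the comment above for a ddply-like syntax over data.table: the dtplyr package (which did not exist at the time) now offers roughly that. A hedged sketch, assuming current dtplyr/dplyr:

# Sketch only: dplyr verbs translated to data.table by dtplyr.
library(dtplyr)
library(dplyr)
res <- myDF %>%
  lazy_dt() %>%
  group_by(year, state, group1, group2) %>%
  summarise(V1 = weighted.mean(myFact, weights)) %>%
  as.data.frame()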
I would profile with base R

# Build a single grouping key, then compute the weighted mean as a ratio of grouped sums
g <- with(myDF, paste(year, state, group1, group2))
x <- with(myDF, c(tapply(weights * myFact, g, sum) / tapply(weights, g, sum)))
# Attach the grouping columns back onto the per-group results
aggregateDF <- myDF[match(names(x), g), c("year", "state", "group1", "group2")]
aggregateDF$V1 <- x

On my machine it takes 5 seconds, compared to 67 seconds with the original code.

EDIT: Just found another speedup with the rowsum function:

# rowsum() computes both grouped sums in a single pass
g <- with(myDF, paste(year, state, group1, group2))
X <- with(myDF, rowsum(data.frame(a = weights * myFact, b = weights), g))
x <- X$a / X$b
aggregateDF2 <- myDF[match(rownames(X), g), c("year", "state", "group1", "group2")]
aggregateDF2$V1 <- x

It takes 3 seconds!

Cassidycassie answered 10/9, 2010 at 16:4 Comment(6)
The second one takes 5 seconds on my computer, so plyr is still narrowly beating base ;) (plus it orders the rows correctly). – Azotic
But thanks for the pointer to rowsum - it's so hard to keep up with the plethora of aggregation functions in base R. – Azotic
I knew there had to be a tapply way of doing this as well, but I was struggling to figure it out. I generally have this struggle with the apply family. – Decrepitude
@Azotic Agree. Some time ago I found a replacement for apply(X, 1, which.max) in max.col. I wonder how many other functions of this type exist (like pmax/pmin), optimized for matrix objects at the .Internal level. – Cassidycassie
@Marek: See 4dpiecharts.com/2010/09/14/… – Scout
@Richie It's exactly what I plan to do over the weekend :) And the simplicity of this code is outstanding. – Cassidycassie
Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.

Here's the llply example, but the same should apply to ddply:

  library(plyr)   # llply/ddply; .parallel needs a registered foreach backend
  x <- seq_len(20)
  wait <- function(i) Sys.sleep(0.1)
  system.time(llply(x, wait))
  #  user  system elapsed 
  # 0.007   0.005   2.005 

  library(doMC)
  registerDoMC(2) 
  system.time(llply(x, wait, .parallel = TRUE))
  #  user  system elapsed 
  # 0.020   0.011   1.038 

Edit:

Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.

groups <- unique(myDF[, c("year", "state", "group1", "group2")])

system.time(
  aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
    df.tmp <- myDF[myDF$year   == groups[i, "year"]   &
                   myDF$state  == groups[i, "state"]  &
                   myDF$group1 == groups[i, "group1"] &
                   myDF$group2 == groups[i, "group2"], ]
    cbind(groups[i, ], wtd.mean(df.tmp$myFact, weights = df.tmp$weights))
  }))
)

aggregateDF <- data.frame()
system.time(
  for (i in 1:nrow(groups)) {
    df.tmp <- myDF[myDF$year   == groups[i, "year"]   &
                   myDF$state  == groups[i, "state"]  &
                   myDF$group1 == groups[i, "group1"] &
                   myDF$group2 == groups[i, "group2"], ]
    aggregateDF <- rbind(aggregateDF,
                         data.frame(cbind(groups[i, ],
                                          wtd.mean(df.tmp$myFact, weights = df.tmp$weights))))
  }
)
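
For completeness, a sketch of the by() approach dismissed above (my addition; as the answer warns, expect it to be slow at this size):

# Hedged sketch only: by() splits the data frame per group combination and applies the function.
system.time(
  res_by <- by(myDF, myDF[c("year", "state", "group1", "group2")],
               function(df) wtd.mean(df$myFact, weights = df$weights))
)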
Zenithal answered 10/9, 2010 at 14:49 Comment(1)
That helps me in the single-machine case, but I'm already blowing this out to Hadoop and oversubscribing each node (more processes than processors). But I'm very pleased to see parallelization making it into plyr! – Decrepitude
I usually use an index vector with tapply when the function being applied has multiple vector args:

system.time(
  tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')],
         function(s) weighted.mean(myDF$myFact[s], myDF$weights[s]))
)
# user  system elapsed 
# 1.36    0.08    1.44 

I use a simple wrapper which is equivalent but hides the mess:

tmapply(list(myDF$myFact, myDF$weights), myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)

Edited to include tmapply for comment below:

tmapply = function(XS, INDEX, FUN, ..., simplify = TRUE) {
  FUN = match.fun(FUN)
  if (!is.list(XS))
    XS = list(XS)
  # tapply over row indices, then subset each input vector and call FUN on the pieces
  tapply(1:length(XS[[1L]]), INDEX, function(s, ...)
    do.call(FUN, c(lapply(XS, `[`, s), list(...))), ..., simplify = simplify)
}
Jural answered 14/9, 2010 at 19:21 Comment(3)
Just to add: as.data.frame(as.table(RESULTS)) is an easy way to create a data.frame from the output. – Cassidycassie
Is this the tmapply that you're using? stat.ethz.ch/pipermail/r-help/2002-October/025773.html – Zenithal
There the m is being used to refer to "matrix". In my case it was meant more as a mashup of tapply and mapply, i.e. when you want to calculate a grouped aggregate using multiple inputs (cor, weighted.mean, etc.). – Jural
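
Following the first comment above, one way (a sketch of my own, not from the thread) to turn the tapply output back into a data frame:

# Sketch: convert the multi-dimensional tapply result to a long data frame;
# empty group combinations come back as NA and are dropped here.
res <- tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')],
              function(s) weighted.mean(myDF$myFact[s], myDF$weights[s]))
aggregateDF <- na.omit(as.data.frame(as.table(res), responseName = "V1"))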
Probably the fastest solution is to use collapse::fgroup_by. It's 8x faster than data.table:

library(collapse)
myDF %>% 
  fgroup_by(year, state, group1, group2) %>% 
  fsummarise(myFact = fmean(myFact, w = weights))

bm <- bench::mark(
  collapse = myDF %>% 
    fgroup_by(year, state, group1, group2) %>% 
    fsummarise(myFact = fmean(myFact, w = weights)),
  data.table = dtb[, weighted.mean(myFact, weights), by = list(year, state, group1, group2)],
  check = FALSE)

#> bm
#  expression      min   median  itr/sec  mem_alloc  gc/sec  n_itr  n_gc  total_time
#1 collapse      101ms    105ms     9.10     8.84MB    0         5     0       549ms
#2 data.table    852ms    852ms     1.17    24.22MB    2.35      1     2       852ms
Spermatozoid answered 27/4, 2023 at 15:50 Comment(0)
