Apply several summary functions (sum, mean, etc.) on several variables by group in one call
Asked Answered
B

7

109

I have the following data frame

x <- read.table(text = "  id1 id2 val1 val2
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8", header = TRUE)

I want to calculate the mean of val1 and val2 grouped by id1 and id2, and simultaneously count the number of rows for each id1-id2 combination. I can perform each calculation separately:

# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)

# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)

In order to do both calculations in one call, I tried

do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))

However, I get a garbled output along with a warning:

#     m   n
# id1 1   2
# id2 1   1
#     1.5 2
#     2   2
#     3.5 2
#     3   2
#     6.5 2
#     8   2
#     7   2
#     6   2
# Warning message:
#   In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
#   number of columns of result is not a multiple of vector length (arg 1)

I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

How can I use aggregate or other functions to perform several calculations in one call?

Bunnybunow answered 21/8, 2012 at 22:55 Comment(1)
Beside aggregate mentioned in the answers there are also by and tapply.Eightfold
F
183

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
#   id1 id2 val1.mn val1.n val2.mn val2.n
# 1   a   x     1.5    2.0     6.5    2.0
# 2   b   x     2.0    2.0     8.0    2.0
# 3   a   y     3.5    2.0     7.0    2.0
# 4   b   y     3.0    2.0     6.0    2.0

This creates a dataframe with two id columns and two matrix columns:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame':   4 obs. of  4 variables:
 $ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
 $ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"

As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) 
    )
'data.frame':   4 obs. of  6 variables:
 $ id1    : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2    : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1.mn: num  1.5 2 3.5 3
 $ val1.n : num  2 2 2 2
 $ val2.mn: num  6.5 8 7 6
 $ val2.n : num  2 2 2 2

This is the syntax for multiple variables on the LHS:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
Friedrick answered 21/8, 2012 at 23:7 Comment(13)
Thanks much. As a side note, how do I get aggregate to sum up just one column. If I have several numerical columns, I don't want it summing columns I don't want it to. I could of course throw away the columns after the aggregation is done, but the CPU cycles would already be spent then.Bunnybunow
You only give it the factors to be grouped on and the columns to be aggregated. Possibly use negative column indexing in data or put the columns you want on the LHS of the formula. (See edit.)Friedrick
I encountered the bug that user2659402 mentioned in his update while using RStudio 0.98.1014 on a windows 7 machine. If you output the data frame to the console as shown it appears normal, however if you save it into d, and then try to access d$val1.mn, it returns NULL. d also appears malformed if you run view(d). Using the code in the update fixed it.Genotype
The reason you are having difficulty is that the "vals" are being returned as matrices with two columns each, rather than as ordinary columns. Try d$val1[ , ""mn"] and do look at the structure with str.Friedrick
You can bind the columns which contain matrices back into the data frame: agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x))) by using agg_df <- do.call(data.frame, agg). See also here.Vomitory
Also see the accepted answer here:#32653928.Gent
Not sure what point this comment is delivering. Are you saying that question should be marked as a duplicate of this question?Friedrick
Now that I reread, it's a duplicate of the UPDATE. I too had no success with my data until I used object <- as.data.frame(as.list(aggregate(data.frame))). The explanation for why that is can be found in the link I posted. Now that I tried it with the data linked in this question, your answer works fine. Not sure what is unique about the datasets used by me and others experiencing this issue.Gent
The UPDATE was wrong when it was written, and it's been wrong every time I have tested it in the years that followed. Perhaps the persons with this problem should spend some time doing better investigation and documentation with str and dput. I used (implicitly) aggregate.formula so it's possible that some issues with environments and search paths could have arisen, ... but NEED a MCVE.Friedrick
This solution works perfectly if your functions return a single value. Things get complicated if they return a list, and you also have to extract parts of it (so that they're nicely arranged). Here is a solution for this case:do.call( data.frame, aggregate( . ~ id1 + id2, data = x, FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) ). (z.test is from the library TeachingDemos.)Tamarin
I used with to show what is the comfortable way if we also have to extract part of a returned value.Tamarin
Another sidenote: if you want the output formed so that results for different variables appear below each other (i.e. in long format) the best is perhaps to melt the data frame beforehand: do.call( data.frame, aggregate( value ~ variable + id1 + id2, data = melt( x, id.vars = c( "id1", "id2" ) ), FUN = function( x ) do.call( c, lapply( c( z.test, t.test ), function( fun ) with( fun( x, stdev = 1 ), c( p.value = p.value, cilwr = conf.int[1] ) ) ) ) ) ).Tamarin
I think there is no an alternative in the answers and comments to apply a function to multiple variables, e.g. aggregate(cbind(numb, pot_id ) ~ year + project, data = dat_tot, FUN = function(x, y) sum(x)/length(y)), where numb = x and pot_id = y. Like as I wrote it the function doesn't work. Is something like that possible?Gehenna
C
31

Given this in the question :

I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

Then in data.table (1.9.4+) you could try :

> DT
   id1 id2 val1 val2
1:   a   x    1    9
2:   a   x    2    4
3:   a   y    3    5
4:   a   y    4    9
5:   b   x    1    7
6:   b   y    4    4
7:   b   x    3    9
8:   b   y    2    8

> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)]   # simplest
   id1 id2  V1  V2 N
1:   a   x 1.5 6.5 2
2:   a   y 3.5 7.0 2
3:   b   x 2.0 8.0 2
4:   b   y 3.0 6.0 2

> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)]  # named
   id1 id2 val1.m val2.m count
1:   a   x    1.5    6.5     2
2:   a   y    3.5    7.0     2
3:   b   x    2.0    8.0     2
4:   b   y    3.0    6.0     2

> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)]   # mean over all columns
   id1 id2 val1 val2 count
1:   a   x  1.5  6.5     2
2:   a   y  3.5  7.0     2
3:   b   x  2.0  8.0     2
4:   b   y  3.0  6.0     2

For timings comparing aggregate (used in question and all 3 other answers) to data.table see this benchmark (the agg and agg.x cases).

Cephalization answered 22/8, 2012 at 8:49 Comment(0)
A
16

Using the dplyr package you could achieve this by using summarise_all. With this summarise-function you can apply other functions (in this case mean and n()) to each of the non-grouping columns:

x %>%
  group_by(id1, id2) %>%
  summarise_all(funs(mean, n()))

which gives:

     id1    id2 val1_mean val2_mean val1_n val2_n
1      a      x       1.5       6.5      2      2
2      a      y       3.5       7.0      2      2
3      b      x       2.0       8.0      2      2
4      b      y       3.0       6.0      2      2

If you don't want to apply the function(s) to all non-grouping columns, you specify the columns to which they should be applied or by excluding the non-wanted with a minus using the summarise_at() function:

# inclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), funs(mean, n()))

# exclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), funs(mean, n()))
Applied answered 12/12, 2015 at 14:20 Comment(0)
D
12

You could add a count column, aggregate with sum, then scale back to get the mean:

x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x,FUN = sum)
agg
#   id1 id2 val1 val2 count
# 1   a   x    3   13     2
# 2   b   x    4   16     2
# 3   a   y    7   14     2
# 4   b   y    6   12     2

agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
#   id1 id2 val1 val2 count
# 1   a   x  1.5  6.5     2
# 2   b   x  2.0  8.0     2
# 3   a   y  3.5  7.0     2
# 4   b   y  3.0  6.0     2

It has the advantage of preserving your column names and creating a single count column.

Donall answered 21/8, 2012 at 23:13 Comment(0)
E
10

Perhaps you want to merge?

x.mean <- aggregate(. ~ id1+id2, p, mean)
x.len  <- aggregate(. ~ id1+id2, p, length)

merge(x.mean, x.len, by = c("id1", "id2"))

  id1 id2 val1.x val2.x val1.y val2.y
1   a   x    1.5    6.5      2      2
2   a   y    3.5    7.0      2      2
3   b   x    2.0    8.0      2      2
4   b   y    3.0    6.0      2      2
Erdman answered 21/8, 2012 at 23:6 Comment(0)
M
4

You can also use the plyr::each() to introduce multiple functions:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = plyr::each(avg = mean, n = length))
Matusow answered 7/7, 2017 at 11:35 Comment(0)
A
2

After dplyr version 1.0.0, the above summarize_all and summarize_at functions were superseded by summarize(across(...)), where you can select columns to operate on (val1:val2 here).

We can also supply a list of functions in across, and set column names with glue specification ({.col} = original column name, {.fn} = function name in the list).

More information of across can be found in the official documentation.

library(dplyr)

x %>% group_by(id1, id2) %>% 
  summarize(across(val1:val2, list(mean = mean, n = length), .names = "{.col}_{.fn}"))

# A tibble: 4 × 6
# Groups:   id1 [2]
  id1   id2   val1_mean val1_n val2_mean val2_n
  <chr> <chr>     <dbl>  <int>     <dbl>  <int>
1 a     x           1.5      2       6.5      2
2 a     y           3.5      2       7        2
3 b     x           2        2       8        2
4 b     y           3        2       6        2
Arther answered 8/5, 2022 at 10:45 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.