Group by multiple columns and sum other multiple columns

I have a data frame with about 200 columns. I want to group it by the first 10 or so columns, which are factors, and sum the remaining columns.

I have a list of the column names I want to group by and a list of the columns I want to aggregate.

The output should be a data frame with the same columns as the input, with one row per group.

Is there a solution using data.table, plyr, or any other package?

Mccaffrey answered 21/11, 2011 at 13:38
23

The data.table way is:

DT[, lapply(.SD,sum), by=list(col1,col2,col3,...)]

or

DT[, lapply(.SD,sum), by=colnames(DT)[1:10]]

where .SD is the (S)ubset of (D)ata excluding group columns. (Aside: If you need to refer to group columns generically, they are in .BY.)
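Since the question mentions having the grouping and aggregation columns as lists of names, here is a minimal sketch of the same idea driven by character vectors. The toy table and the names group_cols and sum_cols are hypothetical, chosen only for illustration; .SDcols restricts .SD to the columns to be summed.

```r
library(data.table)

# Hypothetical toy data: two grouping columns and two value columns
DT <- data.table(g1 = c("a", "a", "b"),
                 g2 = c("x", "x", "y"),
                 v1 = 1:3,
                 v2 = c(10, 20, 30))

# Column names held in character vectors (assumed names, not from the question)
group_cols <- c("g1", "g2")
sum_cols   <- setdiff(names(DT), group_cols)

# Sum every column in sum_cols within each combination of group_cols
res <- DT[, lapply(.SD, sum), by = group_cols, .SDcols = sum_cols]
print(res)
```

The result keeps the grouping columns followed by the summed columns, one row per group.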

Piceous answered 21/11, 2011 at 14:1
23

Note: summarise_each and funs are deprecated in current dplyr. See below for a more modern answer using dplyr::across.

The dplyr way would be:

library(dplyr)
df %>%
  group_by(col1, col2, col3) %>%
  summarise_each(funs(sum))

You can restrict which columns summarise_each summarises, or exclude some of them, using the selection helpers documented in ?dplyr::select.

Windhoek answered 22/10, 2015 at 15:4
20

In base R this would be...

aggregate( as.matrix(df[,11:200]), as.list(df[,1:10]), FUN = sum)

EDIT: The aggregate function has come a long way since I wrote this. None of the casting above is necessary.

aggregate( df[,11:200], df[,1:10], FUN = sum )

And there are a variety of ways to write this. Assuming the first 10 columns are named a1 through a10 I like the following, even though it is verbose.

aggregate(. ~ a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 + a9 + a10, data = df, FUN = sum)

(You could also build the formula string with paste and convert it with formula or as.formula.)
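As a rough sketch of that idea, the formula can be assembled from a character vector of grouping column names. The toy data frame and the name group_cols are hypothetical and only two grouping columns are used to keep it short.

```r
# Hypothetical toy data: two grouping columns and two value columns
df <- data.frame(a1 = c("x", "x", "y"),
                 a2 = c("p", "p", "q"),
                 v1 = 1:3,
                 v2 = c(10, 20, 30))

# Names of the grouping columns (assumed for illustration)
group_cols <- c("a1", "a2")

# Build ". ~ a1 + a2" as a string, then convert it to a formula
f <- as.formula(paste(". ~", paste(group_cols, collapse = " + ")))

# Sum all remaining columns within each group
res <- aggregate(f, data = df, FUN = sum)
print(res)
```

With ten grouping columns, only group_cols changes; the rest stays the same.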

Graben answered 21/11, 2011 at 14:40
19

This seems like a task for ddply (I use the 'baseball' dataset which is included with plyr):

library(plyr)
groupColumns = c("year","team")
dataColumns = c("hr", "rbi","sb")
res = ddply(baseball, groupColumns, function(x) colSums(x[dataColumns]))
head(res)

This returns, for each combination of the groupColumns, the sum of every column listed in dataColumns.

Yakka answered 21/11, 2011 at 13:50
10

Using plyr::ddply:

library(plyr)
ddply(dtfr, .(name1, name2, namex), numcolwise(sum))
Fourthclass answered 21/11, 2011 at 13:46
9

Let's consider this example:

df <- data.frame(a = 'a', b = c('a', 'a', 'b', 'b', 'b'), c = 1:5, d = 11:15,
                 stringsAsFactors = TRUE)

Update dplyr 1.1.0 onwards

You may use pick to select the grouping columns:

library(dplyr)

df %>% 
  group_by(pick(where(is.factor))) %>% 
  summarise(across(everything(), sum))

Or use the .by argument.

df %>% summarise(across(everything(), sum), .by = where(is.factor))

Before dplyr 1.1.0

The _all, _at and _if scoped verbs are now superseded by across. To group by all the factor columns and sum all the other columns:

library(dplyr)

df %>% 
   group_by(across(where(is.factor))) %>% 
   summarise(across(everything(), sum))

#  a     b         c     d
#  <fct> <fct> <int> <int>
#1 a     a         3    23
#2 a     b        12    42

To group by all factor columns and sum only the numeric columns:

df %>% 
  group_by(across(where(is.factor))) %>% 
  summarise(across(where(is.numeric), sum))

We can also select by position, but the positions need care: inside summarise, across does not count the grouping columns, so 1:2 in the second across below refers to the first two non-grouping columns (c and d).

df %>% group_by(across(1:2)) %>% summarise(across(1:2, sum))
Fugitive answered 25/6, 2020 at 1:38
2

Another dplyr approach that is generic (it does not need a list of column names):

df %>% group_by_if(is.factor) %>% summarize_if(is.numeric,sum,na.rm = TRUE)
Thereinto answered 19/3, 2018 at 17:43