Summing rows based on specific factor combinations

Asked 3/5, 2012 at 3:16 Answered 11/12, 2018 at 23:24

This is probably a silly question, but I have read through Crawley's chapter on dataframes and scoured the internet and haven't yet been able to make anything work.

Here is a sample dataset similar to mine:

> data<-data.frame(site=c("A","A","A","A","B","B"), plant=c("buttercup","buttercup",
"buttercup","rose","buttercup","rose"), treatment=c(1,1,2,1,1,1), 
plant_numb=c(1,1,2,1,1,2), fruits=c(1,2,1,4,3,2),seeds=c(45,67,32,43,13,25))
> data
  site     plant treatment plant_numb fruits seeds
1    A buttercup         1          1      1    45
2    A buttercup         1          1      2    67
3    A buttercup         2          2      1    32
4    A      rose         1          1      4    43
5    B buttercup         1          1      3    13
6    B      rose         1          2      2    25

What I would like to do is sum "seeds" and "fruits" across all rows that share the same site & plant & treatment & plant_numb combination. Ideally, this would reduce the number of rows but preserve the original columns, i.e. I need the above example to look like this:

  site     plant treatment plant_numb fruits seeds
1    A buttercup         1          1      3   112
2    A buttercup         2          2      1    32
3    A      rose         1          1      4    43
4    B buttercup         1          1      3    13
5    B      rose         1          2      2    25

This example is pretty basic (my dataset is ~5000 rows), and although here you only see two rows that are required to be summed, the numbers of rows that need to be summed vary, and range from 1 to ~45.

I've tried rowsum() and tapply() with pretty dismal results so far (the errors are telling me that these functions are not meaningful for factors), so if you could even point me in the right direction, I would greatly appreciate it!

Thanks so much!

Escharotic answered 3/5, 2012 at 3:16 Comment(2)

look at the plyr and data.table tag. Lots of questions basically address this. Good luck! – Trunk 3/5, 2012 at 3:51

See also 4dpiecharts.com/2011/12/16/… – Loge 3/5, 2012 at 10:13

Hopefully the following code is fairly self-explanatory. It uses the base function "aggregate": for each unique combination of site, plant, treatment, and plant_numb, it computes the sum of fruits and the sum of seeds.

# Load your data
data <- data.frame(site=c("A","A","A","A","B","B"), plant=c("buttercup","buttercup",
"buttercup","rose","buttercup","rose"), treatment=c(1,1,2,1,1,1), 
plant_numb=c(1,1,2,1,1,2), fruits=c(1,2,1,4,3,2),seeds=c(45,67,32,43,13,25)) 

# Summarize your data
aggregate(cbind(fruits, seeds) ~ 
      site + plant + treatment + plant_numb, 
      sum, 
      data = data)
#  site     plant treatment plant_numb fruits seeds
#1    A buttercup         1          1      3   112
#2    B buttercup         1          1      3    13
#3    A      rose         1          1      4    43
#4    B      rose         1          2      2    25
#5    A buttercup         2          2      1    32

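The order of the rows changes (aggregate sorts the result by the grouping variables, with the first one, site, varying fastest and the last one, plant_numb, varying slowest), but hopefully that isn't too much of a concern.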
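If the original row order does matter, the aggregated result can be re-sorted afterwards with order(). This is a minimal sketch using the same sample data; the name totals is only for illustration.

```r
# Same sample data as in the question
data <- data.frame(
  site       = c("A", "A", "A", "A", "B", "B"),
  plant      = c("buttercup", "buttercup", "buttercup", "rose", "buttercup", "rose"),
  treatment  = c(1, 1, 2, 1, 1, 1),
  plant_numb = c(1, 1, 2, 1, 1, 2),
  fruits     = c(1, 2, 1, 4, 3, 2),
  seeds      = c(45, 67, 32, 43, 13, 25)
)

# Sum fruits and seeds within each grouping combination
totals <- aggregate(cbind(fruits, seeds) ~ site + plant + treatment + plant_numb,
                    data = data, FUN = sum)

# Re-sort by the grouping columns (site first), then reset the row names
totals <- totals[order(totals$site, totals$plant,
                       totals$treatment, totals$plant_numb), ]
rownames(totals) <- NULL
totals
```

After sorting, the rows should appear in the same order as in the desired output from the question.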

An alternative way to do this would be to use ddply from the plyr package.

library(plyr)
ddply(data, .(site, plant, treatment, plant_numb), 
      summarize, 
      fruits = sum(fruits), 
      seeds = sum(seeds))
#  site     plant treatment plant_numb fruits seeds
#1    A buttercup         1          1      3   112
#2    A buttercup         2          2      1    32
#3    A      rose         1          1      4    43
#4    B buttercup         1          1      3    13
#5    B      rose         1          2      2    25

Betweenwhiles answered 3/5, 2012 at 3:45 Comment(3)

Awesome - I was just playing with aggregate after I asked the question, but you've sped me along mightily. Thanks for your help. One more question, though: when I enter the code as you've shown, I'm getting the error "Error in as.data.frame.default(x) : cannot coerce class "formula" into a data.frame". Any ideas on making it work? – Escharotic 3/5, 2012 at 4:0

Both, unfortunately. I'm getting the same error message for both the example and my actual data sets (without spaces): > aggregate(cbind(fruits, seeds) ~ site + plant + treatment + plant_numb, sum, data = data) Error in as.data.frame.default(x) : cannot coerce class "formula" into a data.frame – Escharotic 3/5, 2012 at 4:12

The plyr solution should still work I would guess. But it sounds like you don't have a formula version of aggregate. Which version of R are you using? I think aggregate has allowed formula input since 2.11 – Betweenwhiles 3/5, 2012 at 4:16
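If the formula interface is not available in your version of R, the data-frame method of aggregate should do the same job. The sketch below assumes the sample data from the question and passes the columns to sum and the grouping columns separately; totals is just an illustrative name.

```r
# Same sample data as in the question
data <- data.frame(
  site       = c("A", "A", "A", "A", "B", "B"),
  plant      = c("buttercup", "buttercup", "buttercup", "rose", "buttercup", "rose"),
  treatment  = c(1, 1, 2, 1, 1, 1),
  plant_numb = c(1, 1, 2, 1, 1, 2),
  fruits     = c(1, 2, 1, 4, 3, 2),
  seeds      = c(45, 67, 32, 43, 13, 25)
)

# First argument: the columns to summarize.
# by: a list (here a data frame) of grouping columns.
totals <- aggregate(data[c("fruits", "seeds")],
                    by  = data[c("site", "plant", "treatment", "plant_numb")],
                    FUN = sum)
totals
```

The grouping columns come first in the result, followed by the summed fruits and seeds.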

And for completeness, here is the data.table solution, as suggested by @Chase. For larger datasets this will probably be the fastest method:

library(data.table)
data.dt <- data.table(data)
setkey(data.dt, site)
data.dt[, lapply(.SD, sum), by = list(site, plant, treatment, plant_numb)]

     site     plant treatment plant_numb fruits seeds
[1,]    A buttercup         1          1      3   112
[2,]    A buttercup         2          2      1    32
[3,]    A      rose         1          1      4    43
[4,]    B buttercup         1          1      3    13
[5,]    B      rose         1          2      2    25

The lapply(.SD, sum) part sums all the columns that are not part of the grouping set (i.e. the columns not listed in the by argument). This works here because every remaining column is numeric.
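If the real dataset has other non-grouping columns that should not be summed (or that are not numeric), .SDcols can limit .SD to the columns you name. This is a sketch assuming the data.table package is installed and using the sample data from the question. Setting a key is not required for grouping with by, so it is left out here.

```r
library(data.table)

# Same sample data as in the question
data <- data.frame(
  site       = c("A", "A", "A", "A", "B", "B"),
  plant      = c("buttercup", "buttercup", "buttercup", "rose", "buttercup", "rose"),
  treatment  = c(1, 1, 2, 1, 1, 1),
  plant_numb = c(1, 1, 2, 1, 1, 2),
  fruits     = c(1, 2, 1, 4, 3, 2),
  seeds      = c(45, 67, 32, 43, 13, 25)
)

data.dt <- as.data.table(data)

# .SDcols restricts .SD to fruits and seeds, so only those columns are summed
totals <- data.dt[, lapply(.SD, sum),
                  by = .(site, plant, treatment, plant_numb),
                  .SDcols = c("fruits", "seeds")]
totals
```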

Margrettmarguerie answered 3/5, 2012 at 4:33 Comment(0)

Just to update this answer a long time later, the dplyr/tidyverse solution would be:

library(tidyverse)

data %>% 
  group_by(site, plant, treatment, plant_numb) %>% 
  summarise(fruits=sum(fruits), seeds=sum(seeds))
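With several grouping variables, summarise() by default returns a result that is still grouped by all but the last one. The variant below is a sketch that assumes dplyr 1.0.0 or later, where across() and the .groups argument are available. It sums both columns in one call and returns an ungrouped result.

```r
library(dplyr)

# Same sample data as in the question
data <- data.frame(
  site       = c("A", "A", "A", "A", "B", "B"),
  plant      = c("buttercup", "buttercup", "buttercup", "rose", "buttercup", "rose"),
  treatment  = c(1, 1, 2, 1, 1, 1),
  plant_numb = c(1, 1, 2, 1, 1, 2),
  fruits     = c(1, 2, 1, 4, 3, 2),
  seeds      = c(45, 67, 32, 43, 13, 25)
)

totals <- data %>%
  group_by(site, plant, treatment, plant_numb) %>%
  # across() applies sum to each listed column; .groups = "drop" ungroups the result
  summarise(across(c(fruits, seeds), sum), .groups = "drop")
totals
```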

Robinet answered 11/12, 2018 at 23:24 Comment(0)
