cumsum using ddply
Asked Answered
S

3

13

I need to use group by in levels with ddply or aggregate if that's easier. I am not really sure how to do this as I need to use cumsum as my aggregate function. This is what my data looks like:

level1      level2  hour     product 
A           tea     0          7
A           tea     1          2
A           tea     2          9
A           coffee  17         7
A           coffee  18         2
A           coffee  20         4
B           coffee  0          2
B           coffee  1          3
B           coffee  2          4
B           tea     21         3
B           tea     22         1

expected output:

A     tea     0   7
A     tea     1   9
A     tea     2   18
A     coffee  17  7
A     coffee  18  9
A     coffee  20  13
B     coffee  0   2
B     coffee  1   5
B     coffee  2   9
B     tea     21  3
B     tea     22  4

I tried using

ddply(dd,c("level1","level2","hour"),summarise,cumsum(product))

but that doesn't sum up which I think is because the hour column is being used for group by and its being split by that..I think.. I am not sure I completely understand how aggregate works here. Is there any way I could get the required output using aggregate or ddply?

Selia answered 21/2, 2013 at 10:11 Comment(1)
I don't think you meant to summarize by level1+level2+hour, because then, your aggregation would be different than what you've shown.Febrifacient
F
16

Here's a solution in base R using ave and within:

within(mydf, {
  cumsumProduct <- ave(product, level1, level2, FUN = cumsum)
})
#    level1 level2 hour product cumsumProduct
# 1       A    tea    0       7             7
# 2       A    tea    1       2             9
# 3       A    tea    2       9            18
# 4       A coffee   17       7             7
# 5       A coffee   18       2             9
# 6       A coffee   20       4            13
# 7       B coffee    0       2             2
# 8       B coffee    1       3             5
# 9       B coffee    2       4             9
# 10      B    tea   21       3             3
# 11      B    tea   22       1             4

Of course, if you wanted to drop the existing product column, you can change the command to the following to overwrite the current "product" column:

within(mydf, {
  product <- ave(product, level1, level2, FUN = cumsum)
})

Your current approach doesn't work in part because you've included "hour" as one of your grouping variables. In other words, it is seeing the combination of "A + tea + 0" as different from "A + tea + 1", but from your desired output, you seem to simply want the combination of "A + tea" to be the group.

aggregate won't work as you expect, because it will condense everything into a data.frame with the same number of rows as the number of unique combinations of "level1" and "level2", in this case, 4 rows. The aggregated column would be a list. The values would be correct, but it would be less useful.

Here's aggregate and its output:

> aggregate(product ~ level1 + level2, mydf, cumsum)
  level1 level2  product
1      A coffee 7, 9, 13
2      B coffee  2, 5, 9
3      A    tea 7, 9, 18
4      B    tea     3, 4
Febrifacient answered 21/2, 2013 at 10:21 Comment(4)
thanks! this works :) So when you give product,level1 and level2 as the three columns are you telling it to select these 3 and group by what is different?Selia
ave just applies the function to one variable (the first one) grouped by as many of the other variables as necessary. Because of this, FUN must be used to call whatever function you want to apply to the first variable.Febrifacient
@AnandaMahto the disadvantage of teaching people is that it does not generalise as well to new situations, and it's also really inefficient if the combination of grouping variables is sparse (unfortunately it's a simple bug that R-core hasn't fixed).Wizardly
@AnandaMahto Thank you for the detailed explanation! It makes much more sense :)Selia
P
7

you should use transform instead of summarise:

# you should probably order your `level2` first
dd$level2 <- factor(dd$level2, levels=c("tea", "coffee"))
# and transform using level1 and level2 alone, not hour
# if you use hour, the groups will be for each row
ddply(dd, .(level1, level2), transform, product=cumsum(product))

#    level1 level2 hour product
# 1       A    tea    0       7
# 2       A    tea    1       9
# 3       A    tea    2      18
# 4       A coffee   17       7
# 5       A coffee   18       9
# 6       A coffee   20      13
# 7       B    tea   21       3
# 8       B    tea   22       4
# 9       B coffee    0       2
# 10      B coffee    1       5
# 11      B coffee    2       9
Petronius answered 21/2, 2013 at 10:19 Comment(3)
Thank you for your answer...but what if I have more than 2 levels in level2. There can easily be a scenario that I might have around 10 or more levels, will i have to put them in a vector each time?Selia
dd$level2 <- factor(dd$level2, levels=dd$levels2[!duplicated(dd$levels2)], ordered=T)Petronius
This doesn't work for me. Is it because I am not using factors?Refresher
P
0

Folks referencing this now may wish to avoid using plyr::ddply. Here is a dplyr solution.

library(dplyr)
output <- dd %>%
  group_by(level1, level2) %>%
  mutate(product_sum = cumsum(product)) %>%
  ungroup()
output

Note that product_sum is where the new column is named. You could use product instead. Then, the original product column would be overwritten as it appears the original question may have preferred to do that and not have both columns in the output.

Data:

level1 <- c(rep("A",6), rep("B",5))
level2 <- c(rep("tea",3), rep("coffee",6), rep("tea",2))
hour <- c(0,1,2,17,18,20,0,1,2,21,22)
product <- c(7,2,9,7,2,4,2,3,4,3,1)

dd <- data.frame(level1, level2, hour, product)
Press answered 24/2, 2022 at 17:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.