R: Calculate cumulative sums and counts since the last occurrence of a value

given the simplified data

set.seed(13)

user_id = rep(1:2, each = 10)
order_id = sample(1:20, replace = FALSE)
cost = round(runif(20, 1.5, 75),1)
category = sample( c("apples", "pears", "chicken"), 20, replace = TRUE)
pit = rep(c(0,0,0,0,1), 4)

df = data.frame(cbind(user_id, order_id, cost, category, pit))

user_id order_id cost category pit
      1       15 11.6    pears   0
      1        5 41.7   apples   0
      1        8 51.3  chicken   0
      1        2 40.3    pears   0
      1       16  7.9    pears   1
      1        1 47.1  chicken   0
      1        9  3.8   apples   0
      1       10 35.4   apples   0
      1       11 25.8  chicken   0
      1       20 48.1  chicken   1
      2        7 32.6    pears   0
      2       18 31.3    pears   0
      2       14   69   apples   0
      2        4 60.9  chicken   0
      2       13 41.2   apples   1
      2       17  9.4    pears   0
      2       19 34.9   apples   0
      2        6  5.3    pears   0
      2        3 57.3   apples   0
      2       12  7.7   apples   1

I'd like to create columns with cumulative sum of cost and a count of distinct categories since the last time pit == 1. So the result would look like this:

user_id order_id cost category pit cum_cost distinct_categories
      1       15 11.6    pears   0     11.6                   1
      1        5 41.7   apples   0     53.3                   2
      1        8 51.3  chicken   0    104.6                   3
      1        2 40.3    pears   0    144.9                   3
      1       16  7.9    pears   1    152.8                   3
      1        1 47.1  chicken   0     47.1                   1
      1        9  3.8   apples   0     50.9                   2
      1       10 35.4   apples   0     86.3                   2
      1       11 25.8  chicken   0    112.1                   3
      1       20 48.1  chicken   1    160.2                   3
      2        7 32.6    pears   0     32.6                   1
      2       18 31.3    pears   0     63.9                   1
      2       14   69   apples   0    132.9                   2
      2        4 60.9  chicken   0    193.8                   3
      2       13 41.2   apples   1    235.0                   3
      2       17  9.4    pears   0      9.4                   1
      2       19 34.9   apples   0     44.3                   2
      2        6  5.3    pears   0     49.6                   2
      2        3 57.3   apples   0    106.9                   2
      2       12  7.7   apples   1    114.6                   2

Ideally, the solution would be in dplyr, but I'm open to other packages / approaches. Big thanks for your help! Kasia

library(dplyr) df %>% group_by(user_id, ind= lag(cumsum(pit), default=0)) %>% mutate(cum_cost = cumsum(cost), distinct_categories = cummax(match(category, unique(category)))) # user_id order_id cost category pit ind cum_cost distinct_categories # <int> <int> <dbl> <chr> <int> <dbl> <dbl> <int> #1 1 3 49.8 apples 0 0 49.8 1 #2 1 13 14.8 chicken 0 0 64.6 2 #3 1 18 11.4 apples 0 0 76.0 2 #4 1 15 52.6 chicken 0 0 128.6 2 #5 1 11 13.6 chicken 1 0 142.2 2 #6 1 19 26.9 chicken 0 1 26.9 1 #7 1 2 54.9 chicken 0 1 81.8 1 #8 1 1 70.6 chicken 0 1 152.4 1 #9 1 10 55.0 chicken 0 1 207.4 1 #10 1 12 19.7 chicken 1 1 227.1 1 #11 2 8 40.0 pears 0 2 40.0 1 #12 2 16 37.4 pears 0 2 77.4 1 #13 2 20 70.5 pears 0 2 147.9 1 #14 2 5 63.8 apples 0 2 211.7 2 #15 2 14 31.9 apples 1 2 243.6 2 #16 2 17 9.1 chicken 0 3 9.1 1 #17 2 4 21.9 pears 0 3 31.0 2 #18 2 7 52.3 apples 0 3 83.3 3 #19 2 9 43.3 chicken 0 3 126.6 3 #20 2 6 9.9 pears 1 3 136.5 3

Recommended topics

Hot tags