Given the simplified data:
set.seed(13)
user_id = rep(1:2, each = 10)
order_id = sample(1:20, replace = FALSE)
cost = round(runif(20, 1.5, 75),1)
category = sample( c("apples", "pears", "chicken"), 20, replace = TRUE)
pit = rep(c(0,0,0,0,1), 4)
# build the data frame directly; wrapping the columns in cbind() would coerce them all to character
df = data.frame(user_id, order_id, cost, category, pit)
user_id order_id cost category pit
1 15 11.6 pears 0
1 5 41.7 apples 0
1 8 51.3 chicken 0
1 2 40.3 pears 0
1 16 7.9 pears 1
1 1 47.1 chicken 0
1 9 3.8 apples 0
1 10 35.4 apples 0
1 11 25.8 chicken 0
1 20 48.1 chicken 1
2 7 32.6 pears 0
2 18 31.3 pears 0
2 14 69 apples 0
2 4 60.9 chicken 0
2 13 41.2 apples 1
2 17 9.4 pears 0
2 19 34.9 apples 0
2 6 5.3 pears 0
2 3 57.3 apples 0
2 12 7.7 apples 1
I'd like to add two columns: a cumulative sum of cost and a running count of distinct categories, both restarting after each row where pit == 1 (the pit == 1 row itself still counts toward the block it closes). So the result would look like this:
user_id order_id cost category pit cum_cost distinct_categories
1 15 11.6 pears 0 11.6 1
1 5 41.7 apples 0 53.3 2
1 8 51.3 chicken 0 104.6 3
1 2 40.3 pears 0 144.9 3
1 16 7.9 pears 1 152.8 3
1 1 47.1 chicken 0 47.1 1
1 9 3.8 apples 0 50.9 2
1 10 35.4 apples 0 86.3 2
1 11 25.8 chicken 0 112.1 3
1 20 48.1 chicken 1 160.2 3
2 7 32.6 pears 0 32.6 1
2 18 31.3 pears 0 63.9 1
2 14 69 apples 0 132.9 2
2 4 60.9 chicken 0 193.8 3
2 13 41.2 apples 1 235.0 3
2 17 9.4 pears 0 9.4 1
2 19 34.9 apples 0 44.3 2
2 6 5.3 pears 0 49.6 2
2 3 57.3 apples 0 106.9 2
2 12 7.7 apples 1 114.6 2
Ideally, the solution would be in dplyr, but I'm open to other packages / approaches. Big thanks for your help!
Kasia
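
One possible approach, offered as a rough sketch rather than a definitive answer: build a block id that increases on the row after each pit == 1, then take running totals within each block. The sketch assumes the blocks should also be kept separate per user_id (the expected output is consistent with that, but the question does not say so explicitly). `block` is a hypothetical helper column name, and the data is typed in from the printed table rather than regenerated from the seed, since `sample()` may return different values across R versions.

```r
library(dplyr)

# Data copied from the table in the question (not regenerated from set.seed).
df <- data.frame(
  user_id  = rep(1:2, each = 10),
  order_id = c(15, 5, 8, 2, 16, 1, 9, 10, 11, 20,
               7, 18, 14, 4, 13, 17, 19, 6, 3, 12),
  cost     = c(11.6, 41.7, 51.3, 40.3, 7.9, 47.1, 3.8, 35.4, 25.8, 48.1,
               32.6, 31.3, 69, 60.9, 41.2, 9.4, 34.9, 5.3, 57.3, 7.7),
  category = c("pears", "apples", "chicken", "pears", "pears",
               "chicken", "apples", "apples", "chicken", "chicken",
               "pears", "pears", "apples", "chicken", "apples",
               "pears", "apples", "pears", "apples", "apples"),
  pit      = rep(c(0, 0, 0, 0, 1), 4),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Assumption: blocks never span two users.
  group_by(user_id) %>%
  # lag() shifts pit down by one row, so the counter increases on the row
  # AFTER each pit == 1 and the pit == 1 row stays in the block it closes.
  mutate(block = cumsum(lag(pit, default = 0))) %>%
  group_by(user_id, block) %>%
  mutate(
    cum_cost = cumsum(cost),
    # !duplicated() is TRUE the first time a category appears in the block,
    # so its cumulative sum is a running count of distinct categories.
    distinct_categories = cumsum(!duplicated(category))
  ) %>%
  ungroup() %>%
  select(-block)

print(as.data.frame(result))
```

On the table above this should reproduce the cum_cost and distinct_categories columns shown in the question. If the reset should ignore user_id, dropping the first `group_by(user_id)` and grouping by `block` alone would be the corresponding change.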