Cumulative sum with lag
I have a very large dataset that looks simplified like this:

row.    member_id   entry_id    comment_count   timestamp
1       1            a              4           2008-06-09 12:41:00
2       1            b              1           2008-07-14 18:41:00
3       1            c              3           2008-07-17 15:40:00
4       2            d              12          2008-06-09 12:41:00
5       2            e              50          2008-09-18 10:22:00
6       3            f              0           2008-10-03 13:36:00

I can aggregate the counts with the following code:

transform(df, aggregated_count = ave(comment_count, member_id, FUN = cumsum))

But I want the cumulative sum lagged by 1, i.e. cumsum should ignore the current row. The result should be:

row.    member_id   entry_id     comment_count  timestamp             previous_comments
1       1            a              4           2008-06-09 12:41:00        0
2       1            b              1           2008-07-14 18:41:00        4
3       1            c              3           2008-07-17 15:40:00        5
4       2            d              12          2008-06-09 12:41:00        0
5       2            e              50          2008-09-18 10:22:00        12
6       3            f              0           2008-10-03 13:36:00        0

Any idea how I can do this in R? Maybe even with a lag bigger than 1?


Data for reproducibility:

# dput(df)
structure(list(member_id = c(1L, 1L, 1L, 2L, 2L, 3L), entry_id = c("a", 
"b", "c", "d", "e", "f"), comment_count = c(4L, 1L, 3L, 12L, 
50L, 0L), timestamp = c("2008-06-09 12:41:00", "2008-07-14 18:41:00", 
"2008-07-17 15:40:00", "2008-06-09 12:41:00", "2008-09-18 10:22:00", 
"2008-10-03 13:36:00")), .Names = c("member_id", "entry_id", 
"comment_count", "timestamp"), row.names = c("1", "2", "3", "4", 
"5", "6"), class = "data.frame")
Behalf answered 25/12, 2014 at 17:06 Comment(1)
Seems like you already wrote the correct code out in a sentence, hint hint :) – Scarrow
You can prepend 0 as the first element and drop the last element with head(x, -1):

transform(df, previous_comments=ave(comment_count, member_id, 
          FUN = function(x) cumsum(c(0, head(x, -1)))))
#  member_id entry_id comment_count           timestamp previous_comments
#1         1        a             4 2008-06-09 12:41:00                 0
#2         1        b             1 2008-07-14 18:41:00                 4
#3         1        c             3 2008-07-17 15:40:00                 5
#4         2        d            12 2008-06-09 12:41:00                 0
#5         2        e            50 2008-09-18 10:22:00                12
#6         3        f             0 2008-10-03 13:36:00                 0
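For the lag bigger than 1 that the question asks about, the same idea generalizes: pad with k leading zeros and keep only the first length(x) cumulative sums, which also handles groups shorter than k. A sketch; the helper lag_cumsum is hypothetical, not from any package:

```r
# Generalized lagged cumsum: k leading zeros, truncated to the group length.
# lag_cumsum is a made-up helper name, not part of base R or any package.
lag_cumsum <- function(x, k = 1) {
  cs <- c(rep(0, k), cumsum(x))
  cs[seq_along(x)]
}

transform(df, previous_comments = ave(comment_count, member_id,
          FUN = function(x) lag_cumsum(x, k = 2)))
```

With k = 2, member 1's rows get 0, 0, 4, and single-row groups like member 3 still get 0 rather than an error from a length mismatch.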
Exterminatory answered 25/12, 2014 at 17:17 Comment(0)
You could use lag from dplyr; its n argument controls the size of the lag:

library(dplyr)
df %>% 
    group_by(member_id) %>%
    mutate(previous_comments = lag(cumsum(comment_count), n = 1, default = 0))
 #    member_id entry_id comment_count           timestamp previous_comments
 #1         1        a             4 2008-06-09 12:41:00                 0
 #2         1        b             1 2008-07-14 18:41:00                 4
 #3         1        c             3 2008-07-17 15:40:00                 5
 #4         2        d            12 2008-06-09 12:41:00                 0
 #5         2        e            50 2008-09-18 10:22:00                12
 #6         3        f             0 2008-10-03 13:36:00                 0

Or using data.table:

 library(data.table)
  setDT(df)[,previous_comments:=c(0,cumsum(comment_count[-.N])) , member_id]
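In later data.table versions (1.9.6+, released after this answer), shift() expresses the same lag directly: n sets the lag size and fill = 0 supplies the leading zeros. A sketch, assuming the df from the question:

```r
library(data.table)

# shift() lags the running sum within each member_id group;
# n = 1 is the lag size, fill = 0 pads the start of each group.
setDT(df)[, previous_comments := shift(cumsum(comment_count), n = 1, fill = 0),
          by = member_id]
```

Unlike the comment_count[-.N] idiom above, changing the lag to 2 is just n = 2, with no extra index bookkeeping.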
Giacopo answered 25/12, 2014 at 17:10 Comment(0)
Just subtract comment_count from the ave result:

transform(df, 
  aggregated_count = ave(comment_count, member_id, FUN = cumsum) - comment_count)
Winfrid answered 25/12, 2014 at 21:54 Comment(0)
