R: cumulative sum over rolling date range
Asked Answered
P

4

14

In R, how can I calculate cumsum for a defined time period prior to the row being calculate? Prefer dplyr if possible.

For example, if the period was 10 days, then the function would achieve cum_rolling10:

date    value   cumsum  cum_rolling10
1/01/2000   9   9       9
2/01/2000   1   10      10
5/01/2000   9   19      19
6/01/2000   3   22      22
7/01/2000   4   26      26
8/01/2000   3   29      29
13/01/2000  10  39      29
14/01/2000  9   48      38
18/01/2000  2   50      21
19/01/2000  9   59      30
21/01/2000  8   67      38
25/01/2000  5   72      24
26/01/2000  1   73      25
30/01/2000  6   79      20
31/01/2000  6   85      18
Paregmenon answered 25/9, 2017 at 1:26 Comment(0)
M
21

A solution using dplyr, tidyr, lubridate, and zoo.

library(dplyr)
library(tidyr)
library(lubridate)
library(zoo)

dt2 <- dt %>%
  mutate(date = dmy(date)) %>%
  mutate(cumsum = cumsum(value)) %>%
  complete(date = full_seq(date, period = 1), fill = list(value = 0)) %>%
  mutate(cum_rolling10 = rollapplyr(value, width = 10, FUN = sum, partial = TRUE)) %>%
  drop_na(cumsum)
dt2
# A tibble: 15 x 4
         date value cumsum cum_rolling10
       <date> <dbl>  <int>         <dbl>
 1 2000-01-01     9      9             9
 2 2000-01-02     1     10            10
 3 2000-01-05     9     19            19
 4 2000-01-06     3     22            22
 5 2000-01-07     4     26            26
 6 2000-01-08     3     29            29
 7 2000-01-13    10     39            29
 8 2000-01-14     9     48            38
 9 2000-01-18     2     50            21
10 2000-01-19     9     59            30
11 2000-01-21     8     67            38
12 2000-01-25     5     72            24
13 2000-01-26     1     73            25
14 2000-01-30     6     79            20
15 2000-01-31     6     85            18

DATA

dt <- structure(list(date = c("1/01/2000", "2/01/2000", "5/01/2000", 
"6/01/2000", "7/01/2000", "8/01/2000", "13/01/2000", "14/01/2000", 
"18/01/2000", "19/01/2000", "21/01/2000", "25/01/2000", "26/01/2000", 
"30/01/2000", "31/01/2000"), value = c(9L, 1L, 9L, 3L, 4L, 3L, 
10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)), .Names = c("date", "value"
), row.names = c(NA, -15L), class = "data.frame")
Musky answered 25/9, 2017 at 2:38 Comment(2)
This works great for me too, but I am running into errors when applying this to data with an additional field that indicated groups. I add group_by(id) where id refers to the field containing the group ids. Any thoughts on how to solve this?Keely
really nice example. To make it a bit more "realworld problem", assuming we have the record at person x date grain. we then need to group by person, arrange by date, than apply the rollapply() in mutate(), which is still at person grain. Further more, if we want a trialing window type of view, we then need to first apply a lag() function.Pursley
S
9

I recommend using runner package designed to calculate functions on rolling/running windows. You can achieve this by using sum_run - one liner here:

library(runner)
library(dplyr)

df %>%
  mutate(
    cum_rolling_10 = sum_run(
      x = df$value, 
      k = 10, 
      idx = as.Date(df$date, format = "%d/%m/%Y"))
  )


df

#          date value cum_rolling_10
# 1   1/01/2000     9              9
# 2   2/01/2000     1             10
# 3   5/01/2000     9             19
# 4   6/01/2000     3             22
# 5   7/01/2000     4             26
# 6   8/01/2000     3             29
# 7  13/01/2000    10             29
# 8  14/01/2000     9             38
# 9  18/01/2000     2             21
# 10 19/01/2000     9             30
# 11 21/01/2000     8             38
# 12 25/01/2000     5             24
# 13 26/01/2000     1             25
# 14 30/01/2000     6             20
# 15 31/01/2000     6             18

Enjoy!

Sherman answered 14/1, 2020 at 15:35 Comment(2)
Is there a way to use these runner functions where you can exclude the calculation if the minimum timestamp range is not met within the window size? For example, here there is a 10-day window, so I would want an NA for cum_rolling_10 up until row/observation 7, because there is actually a time range that is 10 days before 13/01/2000 represented in the dataset (even though 3/01/2000 isn't specifically present in the data). So, then for row 7 it would calculate the rolling sum of the value column for 10 days prior to that row's date (row 7, row 6, row 5, row 4 and row 3) like currently.Talbot
So, something like if there's not a time stamp present in the data less than or equal to current row's time stamp - window size, don't perform the calculation.Talbot
A
2

Use slide_index_sum() from slider, which is designed to have the same API as purrr.

library(slider)
library(dplyr)

df <- tibble(
  date = c(
    "1/01/2000", "2/01/2000", "5/01/2000", "6/01/2000", "7/01/2000", 
    "8/01/2000", "13/01/2000", "14/01/2000", "18/01/2000", "19/01/2000", 
    "21/01/2000", "25/01/2000", "26/01/2000", "30/01/2000", "31/01/2000"
  ),
  value = c(9L, 1L, 9L, 3L, 4L, 3L, 10L, 9L, 2L, 9L, 8L, 5L, 1L, 6L, 6L)
)

df <- mutate(df, date = as.Date(date, format = "%d/%m/%Y"))

df %>%
  mutate(
    cumsum = cumsum(value),
    cum_rolling10 = slide_index_sum(value, date, before = 9L)
  )
#> # A tibble: 15 × 4
#>    date       value cumsum cum_rolling10
#>    <date>     <int>  <int>         <dbl>
#>  1 2000-01-01     9      9             9
#>  2 2000-01-02     1     10            10
#>  3 2000-01-05     9     19            19
#>  4 2000-01-06     3     22            22
#>  5 2000-01-07     4     26            26
#>  6 2000-01-08     3     29            29
#>  7 2000-01-13    10     39            29
#>  8 2000-01-14     9     48            38
#>  9 2000-01-18     2     50            21
#> 10 2000-01-19     9     59            30
#> 11 2000-01-21     8     67            38
#> 12 2000-01-25     5     72            24
#> 13 2000-01-26     1     73            25
#> 14 2000-01-30     6     79            20
#> 15 2000-01-31     6     85            18
Apache answered 16/9, 2022 at 21:3 Comment(2)
Updated URL for slider: slider.r-lib.orgRedfield
Thanks @FrancisBarton, I updated it in the answer itselfApache
T
1

this solution will avoid memory overhead, and migrate to sparklyr will be easy.

lag = 7

    dt %>%
  mutate(date = dmy(date)) %>%
  mutate(order = datediff(date,min(date)) %>% 
  arrange(desc(order)) %>% 
  mutate(n_order = lag(order + lag,1L,default = 0)) %>% 
  mutate(b_order = ifelse(order - n_order >= 0,order,-1)) %>% 
  mutate(m_order = cummax(b_order)) %>% 
  group_by(m_order) %>% 
  mutate(rolling_value = cumsum(value))
Tolerable answered 13/8, 2018 at 2:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.