Efficient way to compute a running total over the last 365-day window
This is what my data frame looks like:

library(data.table)

df <- fread('
                Name  EventType  Date  SalesAmount RunningTotal Runningtotal(prior365Days)
                John    Email      1/1/2014      0          0            0
                John    Sale       2/1/2014     10          10           10
                John    Sale       7/1/2014     20          30           30
                John    Sale       4/1/2015     30          60           50 
                John    Webinar    5/1/2015      0          60           50
                Tom     Email      1/1/2014      0          0            0
                Tom     Sale       2/1/2014     15          15           15
                Tom     Sale       7/1/2014     10          25           25
                Tom     Sale       4/1/2015     25          50           35 
                Tom     Webinar    5/1/2015      0          50           35
                ')
df[, Date := as.Date(Date, format = "%m/%d/%Y")]

The last column is my desired output: the cumulative sum of SalesAmount (for each Name) over a rolling 365-day window. I computed it with the help of @6pool, whose solution was:

library(dplyr)

# note: the data frame built above calls this column Date, not EventDate,
# and it was already converted to Date class above
df <- df %>%
  group_by(Name) %>%
  arrange(Date) %>%
  mutate(day = Date - Date[1])

f <- Vectorize(function(i)
  sum(df[df$Name[i] == df$Name & df$day[i] - df$day >= 0 &
           df$day[i] - df$day <= 365, "SalesAmount"]),
  vectorize.args = "i")
df$RunningTotal365 <- f(1:nrow(df))

However, df$RunningTotal365 <- f(1:nrow(df)) is taking a long time (over 1.5 days so far), as my data frame has over 1.5 million rows; f re-scans the whole data frame for every single row, so the work grows quadratically with the row count. "rollapply" was suggested in my initial question, but I have struggled to figure out how to use it in this instance. Kindly help.
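For scale, here is a minimal sketch of a vectorized per-Name alternative (illustrative only, not from the original post): a running cumsum minus the cumsum just before the 365-day cutoff gives each window total, and findInterval locates the cutoff without re-scanning. Note that findInterval's left.open argument needs R >= 3.3.0.

# sketch (assumption: rows sorted by Date within each Name)
roll365 <- function(amount, date) {
  cs <- cumsum(amount)
  # for each row, count the rows strictly older than date - 365 days
  k <- findInterval(as.numeric(date) - 365, as.numeric(date), left.open = TRUE)
  cs - c(0, cs)[k + 1]   # window sum = cumsum now minus cumsum before cutoff
}
setDT(df)                # ensure df is a data.table again
setkey(df, Name, Date)   # sort rows by Date within each Name
df[, RT365 := roll365(SalesAmount, Date), by = Name]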

Harri asked 2/6, 2015 at 15:17 Comment(5)
I wonder if you've tried implementing something of this. – Fallow
@DavidArenburg Hi David, I implemented your solution (from the link) to a different problem. A cumulative running total in the last 365-day window for a given Name is a totally separate problem. – Harri
Yeah, I meant I wonder if you tried modifying it in order to solve your new problem. That would probably be the fastest solution by far. – Fallow
@DavidArenburg Thanks David, I never really thought of that. I was wondering what the two temporary datasets would be in this case that I could join later. Any help is much appreciated! – Harri
I feel like you accepted an answer awfully quickly. A 30-second runtime is great, but there's also an RcppRoll package; someone familiar with its (probably faster) methods might now be disinclined to answer. (Not me; I just found it by googling.) – Wesleyanism
Give this a try:

DF <- read.table(text = "Name  EventType  EventDate  SalesAmount RunningTotal Runningtotal(prior365Days)
John    Email      1/1/2014      0          0            0
John    Sale       2/1/2014     10          10           10
John    Sale       7/1/2014     20          30           30
John    Sale       4/1/2015     30          60           50 
John    Webinar    5/1/2015      0          60           50
Tom     Email      1/1/2014      0          0            0
Tom     Sale       2/1/2014     15          15           15
Tom     Sale       7/1/2014     10          25           25
Tom     Sale       4/1/2015     25          50           35 
Tom     Webinar    5/1/2015      0          50           35", header = TRUE)


fun <- function(x, date, thresh) {
  # assumes rows are sorted by date (ensured by setkey below)
  D <- as.matrix(dist(date))  # pairwise distance in days between event dates
  D <- D <= thresh            # TRUE where a pair is within the window
  D[lower.tri(D)] <- FALSE    # mask future rows: don't sum towards the future
  R <- D * x                  # FALSE is treated as 0
  colSums(R)                  # column j = sum of x over the prior 365 days
}


library(data.table)
setDT(DF)
DF[, EventDate := as.Date(EventDate, format = "%m/%d/%Y")]
setkey(DF, Name, EventDate)

DF[, RT365 := fun(SalesAmount, EventDate, 365), by = Name]

#    Name EventType  EventDate SalesAmount RunningTotal Runningtotal.prior365Days. RT365
# 1: John     Email 2014-01-01           0            0                          0     0
# 2: John      Sale 2014-02-01          10           10                         10    10
# 3: John      Sale 2014-07-01          20           30                         30    30
# 4: John      Sale 2015-04-01          30           60                         50    50
# 5: John   Webinar 2015-05-01           0           60                         50    50
# 6:  Tom     Email 2014-01-01           0            0                          0     0
# 7:  Tom      Sale 2014-02-01          15           15                         15    15
# 8:  Tom      Sale 2014-07-01          10           25                         25    25
# 9:  Tom      Sale 2015-04-01          25           50                         35    35
#10:  Tom   Webinar 2015-05-01           0           50                         35    35
Grill answered 2/6, 2015 at 15:48 Comment(10)
Is it feasible for 1.5 million rows? – Heptameter
@Heptameter 1.5MM is a very small data set. Any vectorised code can handle it easily. And by "vectorized" I don't mean the Vectorize function. – Fallow
@DavidArenburg D <- as.matrix(dist(date)) is roughly of size 1.5M by 1.5M. I have no idea how much memory that takes. – Heptameter
@DavidArenburg Thank you so much! This worked like magic, Roland. It took less than 30 seconds. Can one of you kindly shed some light on why the initial solution I employed is so slow? – Harri
That's what's called vectorized code, as opposed to the for loop that for some reason was called Vectorize by the R devs. See "Circle 3" in that book. – Fallow
@DavidArenburg Thank you, David. Let's say I have one more column called Fund. It has Fund A and B, which appear only when EventType == Sale. How do I modify Roland's solution if I want the cumulative sum only for Fund A, for example? – Harri
@Harri Hm, I'm not sure. If you pass SalesAmount*(Fund=="A") to Roland's fun instead of SalesAmount, it seems like that should be the correct cumulative sum, since it will have zeros wherever the Fund is not A. – Wesleyanism
@Heptameter The function is obviously not suitable for really long input vectors. However, I would be very surprised if there were time series longer than a few years per user in the dataset, and sales usually are much less frequent than daily. – Grill
@Wesleyanism R <- D * x * (df$Fund == "A") gives me a "longer object length is not a multiple of shorter object length" error. Kindly help. – Harri
@Harri I don't really follow. I meant DF[, newvar := fun(SalesAmount*(Fund=="A"), EventDate, 365), by = Name]. If that's not what you were looking for and Roland can't answer it briefly, you might need to post a new question. – Wesleyanism
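The extension discussed in this thread boils down to zeroing out the amounts of the other funds before they enter the window sum. A minimal sketch, assuming a hypothetical Fund column on DF:

# hypothetical Fund column: amounts of funds other than "A" become 0,
# so only Fund A sales enter each 365-day window sum
DF[, RT365_FundA := fun(SalesAmount * (Fund == "A"), EventDate, 365), by = Name]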
Here's an approach using the foverlaps function from the data.table package:

require(data.table)
setDT(df)[, end := Date           # Date was converted to Date class above
        ][, start := end - 365L]  # each row's 365-day look-back window
setkey(df, Name, start, end)
olaps = foverlaps(df, df, nomatch=0L, which=TRUE)
olaps = olaps[xid >= yid, .(ans = sum(df$SalesAmount[yid])), by=xid]

df[olaps$xid, Runningtotal := olaps$ans]

You can remove the start and end columns, if necessary, by doing:

df[, c("start", "end") := NULL]

It would be nice to know how fast or slow this is.
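One way to report that is to wrap the overlap-join steps in base R's system.time (a sketch; variable names as above):

system.time({
  olaps = foverlaps(df, df, nomatch=0L, which=TRUE)
  olaps = olaps[xid >= yid, .(ans = sum(df$SalesAmount[yid])), by=xid]
  df[olaps$xid, Runningtotal := olaps$ans]
})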

Johnie answered 2/6, 2015 at 18:59 Comment(6)
It is very fast. It took me less than 45 seconds. Thank you, Arun. However, I am struggling with extending the solution to a minor problem. Let's say I have one more column called Fund. It has Fund A and B, which appear only when EventType == Sale. How do I modify Roland's solution or your solution if I want the cumulative sum only for Fund A, for example? – Harri
1. Could you provide a link to your bigger data (if you can't share, could you generate artificial data with similar settings)? 2. What extension are you struggling with? – Johnie
Please edit your post with this, and show the desired output (or better yet, post it as a new question, as it seems different from this one). Could you share the bigger data? I'd like to see if any improvements are possible. – Johnie
@Wesleyanism I got the solution from Frank. Thank you guys a lot. – Harri
@Harri So Arun's solution (less than 45 seconds according to you) is slower than my solution (30 seconds according to you)? Can you provide exact timings (use system.time)? Can you tell us how many unique names you have in your data? – Grill
@Grill I had about 15K unique names and 1.5 million rows, with many more event types than sales, webinars and emails. I modified my data a lot after implementing the solution (including subsetting) and did not keep the original data frame object. Next time I will use system.time rather than my subjective reading of the clock. – Harri
Using the newer non-equi join feature in data.table:

df1 = df[.(iName = Name, start = Date - 365L, end = Date),
         on = .(Name = iName, Date >= start, Date <= end),
         nomatch = 0, allow.cartesian = TRUE][
       , .(MyTotal = sum(SalesAmount)), by = .(Name, Date = Date.1)]

df[df1, on = .(Name, Date)]
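If you would rather add the total to df by reference than materialise the join result, an update join does it (a sketch using standard data.table syntax):

# write MyTotal straight into df by reference
df[df1, MyTotal := i.MyTotal, on = .(Name, Date)]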
Harri answered 4/4, 2018 at 22:54 Comment(0)
