Efficient way to compute a running total over the last 365-day window
This is what my data frame looks like:

library(data.table)

df <- fread('
                Name  EventType  Date  SalesAmount RunningTotal Runningtotal(prior365Days)
                John    Email      1/1/2014      0          0            0
                John    Sale       2/1/2014     10          10           10
                John    Sale       7/1/2014     20          30           30
                John    Sale       4/1/2015     30          60           50 
                John    Webinar    5/1/2015      0          60           50
                Tom     Email      1/1/2014      0          0            0
                Tom     Sale       2/1/2014     15          15           15
                Tom     Sale       7/1/2014     10          25           25
                Tom     Sale       4/1/2015     25          50           35 
                Tom     Webinar    5/1/2015      0          50           35
                ')
df[, Date := as.Date(Date, format = "%m/%d/%Y")]

The last column is my desired output: the cumulative sum of SalesAmount (for each Name) over a rolling 365-day window. I computed it with the help of @6pool, whose solution was:

library(dplyr)

# note: the data frame built above calls this column Date, not EventDate,
# and it was already converted to Date class above
df <- df %>%
  group_by(Name) %>%
  arrange(Date) %>%
  mutate(day = Date - Date[1])

f <- Vectorize(function(i)
  sum(df[df$Name[i] == df$Name & df$day[i] - df$day >= 0 &
           df$day[i] - df$day <= 365, "SalesAmount"]),
  vectorize.args = "i")
df$RunningTotal365 <- f(1:nrow(df))

However, df$RunningTotal365 <- f(1:nrow(df)) is taking a long time (over 1.5 days so far), as my data frame has over 1.5 million rows; f re-scans the whole data frame for every single row, so the work grows quadratically with the row count. "rollapply" was suggested in my initial question, but I have struggled to figure out how to use it in this instance. Kindly help.
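For scale, here is a minimal sketch of a vectorized per-Name alternative (illustrative only, not from the original post): a running cumsum minus the cumsum just before the 365-day cutoff gives each window total, and findInterval locates the cutoff without re-scanning. Note that findInterval's left.open argument needs R >= 3.3.0.

# sketch (assumption: rows sorted by Date within each Name)
roll365 <- function(amount, date) {
  cs <- cumsum(amount)
  # for each row, count the rows strictly older than date - 365 days
  k <- findInterval(as.numeric(date) - 365, as.numeric(date), left.open = TRUE)
  cs - c(0, cs)[k + 1]   # window sum = cumsum now minus cumsum before cutoff
}
setDT(df)                # ensure df is a data.table again
setkey(df, Name, Date)   # sort rows by Date within each Name
df[, RT365 := roll365(SalesAmount, Date), by = Name]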

Harri asked 2/6, 2015 at 15:17 Comment(5)
I wonder if you've tried implementing something of this. – Fallow
@DavidArenburg Hi David, I implemented your solution (from the link) to a different problem. A cumulative running total in the last 365-day window for a given Name is a totally separate problem. – Harri
Yeah, I meant I wonder if you tried modifying it in order to solve your new problem. That would probably be the fastest solution by far. – Fallow
@DavidArenburg Thanks David, I never really thought of that. I was wondering what the two temporary datasets would be in this case that I could join later. Any help is much appreciated! – Harri
I feel like you accepted an answer awfully quickly. A 30-second runtime is great, but there's also an RcppRoll package; someone familiar with its (probably faster) methods might now be disinclined to answer. (Not me; I just found it by googling.) – Wesleyanism
Give this a try:

DF <- read.table(text = "Name  EventType  EventDate  SalesAmount RunningTotal Runningtotal(prior365Days)
John    Email      1/1/2014      0          0            0
John    Sale       2/1/2014     10          10           10
John    Sale       7/1/2014     20          30           30
John    Sale       4/1/2015     30          60           50 
John    Webinar    5/1/2015      0          60           50
Tom     Email      1/1/2014      0          0            0
Tom     Sale       2/1/2014     15          15           15
Tom     Sale       7/1/2014     10          25           25
Tom     Sale       4/1/2015     25          50           35 
Tom     Webinar    5/1/2015      0          50           35", header = TRUE)


fun <- function(x, date, thresh) {
  # assumes rows are sorted by date (ensured by setkey below)
  D <- as.matrix(dist(date))  # pairwise distance in days between event dates
  D <- D <= thresh            # TRUE where a pair is within the window
  D[lower.tri(D)] <- FALSE    # mask future rows: don't sum towards the future
  R <- D * x                  # FALSE is treated as 0
  colSums(R)                  # column j = sum of x over the prior 365 days
}


library(data.table)
setDT(DF)
DF[, EventDate := as.Date(EventDate, format = "%m/%d/%Y")]
setkey(DF, Name, EventDate)

DF[, RT365 := fun(SalesAmount, EventDate, 365), by = Name]

#    Name EventType  EventDate SalesAmount RunningTotal Runningtotal.prior365Days. RT365
# 1: John     Email 2014-01-01           0            0                          0     0
# 2: John      Sale 2014-02-01          10           10                         10    10
# 3: John      Sale 2014-07-01          20           30                         30    30
# 4: John      Sale 2015-04-01          30           60                         50    50
# 5: John   Webinar 2015-05-01           0           60                         50    50
# 6:  Tom     Email 2014-01-01           0            0                          0     0
# 7:  Tom      Sale 2014-02-01          15           15                         15    15
# 8:  Tom      Sale 2014-07-01          10           25                         25    25
# 9:  Tom      Sale 2015-04-01          25           50                         35    35
#10:  Tom   Webinar 2015-05-01           0           50                         35    35
Grill answered 2/6, 2015 at 15:48 Comment(10)
Is it feasible for 1.5 million rows? – Heptameter
@Heptameter 1.5MM is a very small data set. Any vectorised code can handle it easily. And by "vectorized" I don't mean the Vectorize function. – Fallow
@DavidArenburg D <- as.matrix(dist(date)) is roughly of size 1.5M by 1.5M. I have no idea how much memory that takes. – Heptameter
@DavidArenburg Thank you so much! This worked like magic, Roland. It took less than 30 seconds. Can one of you kindly shed some light on why the initial solution I employed is so slow? – Harri
That's what's called vectorized code, as opposed to the for loop that for some reason was called Vectorize by the R devs. See "Circle 3" in that book. – Fallow
@DavidArenburg Thank you, David. Let's say I have one more column called Fund. It has Fund A and B, which appear only when EventType == Sale. How do I modify Roland's solution if I want the cumulative sum only for Fund A, for example? – Harri
@Harri Hm, I'm not sure. If you pass SalesAmount*(Fund=="A") to Roland's fun instead of SalesAmount, it seems like that should be the correct cumulative sum, since it will have zeros wherever the Fund is not A. – Wesleyanism
@Heptameter The function is obviously not suitable for really long input vectors. However, I would be very surprised if there were time series longer than a few years per user in the dataset, and sales usually are much less frequent than daily. – Grill
@Wesleyanism R <- D * x * (df$Fund == "A") gives me a "longer object length is not a multiple of shorter object length" error. Kindly help. – Harri
@Harri I don't really follow. I meant DF[, newvar := fun(SalesAmount*(Fund=="A"), EventDate, 365), by = Name]. If that's not what you were looking for and Roland can't answer it briefly, you might need to post a new question. – Wesleyanism
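The extension discussed in this thread boils down to zeroing out the amounts of the other funds before they enter the window sum. A minimal sketch, assuming a hypothetical Fund column on DF:

# hypothetical Fund column: amounts of funds other than "A" become 0,
# so only Fund A sales enter each 365-day window sum
DF[, RT365_FundA := fun(SalesAmount * (Fund == "A"), EventDate, 365), by = Name]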
Here's an approach using the foverlaps function from the data.table package:

require(data.table)
setDT(df)[, end := Date           # Date was converted to Date class above
        ][, start := end - 365L]  # each row's 365-day look-back window
setkey(df, Name, start, end)
olaps = foverlaps(df, df, nomatch=0L, which=TRUE)
olaps = olaps[xid >= yid, .(ans = sum(df$SalesAmount[yid])), by=xid]

df[olaps$xid, Runningtotal := olaps$ans]

You can remove the start and end columns, if necessary, by doing:

df[, c("start", "end") := NULL]

It would be nice to know how fast or slow this is.
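One way to report that is to wrap the overlap-join steps in base R's system.time (a sketch; variable names as above):

system.time({
  olaps = foverlaps(df, df, nomatch=0L, which=TRUE)
  olaps = olaps[xid >= yid, .(ans = sum(df$SalesAmount[yid])), by=xid]
  df[olaps$xid, Runningtotal := olaps$ans]
})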

Johnie answered 2/6, 2015 at 18:59 Comment(6)
It is very fast. It took me less than 45 seconds. Thank you, Arun. However, I am struggling with extending the solution to a minor problem. Let's say I have one more column called Fund. It has Fund A and B, which appear only when EventType == Sale. How do I modify Roland's solution or your solution if I want the cumulative sum only for Fund A, for example? – Harri
1. Could you provide a link to your bigger data (if you can't share, could you generate artificial data with similar settings)? 2. What extension are you struggling with? – Johnie
Please edit your post with this, and show the desired output (or better yet, post it as a new question, as it seems different from this one). Could you share the bigger data? I'd like to see if any improvements are possible. – Johnie
@Wesleyanism I got the solution from Frank. Thank you guys a lot. – Harri
@Harri So Arun's solution (less than 45 seconds according to you) is slower than my solution (30 seconds according to you)? Can you provide exact timings (use system.time)? Can you tell us how many unique names you have in your data? – Grill
@Grill I had about 15K unique names and 1.5 million rows, with many more event types than sales, webinars and emails. I modified my data a lot after implementing the solution (including subsetting) and did not keep the original data frame object. Next time I will use system.time rather than my subjective reading of the clock. – Harri
Using the newer non-equi join feature in data.table:

df1 = df[.(iName = Name, start = Date - 365L, end = Date),
         on = .(Name = iName, Date >= start, Date <= end),
         nomatch = 0, allow.cartesian = TRUE][
       , .(MyTotal = sum(SalesAmount)), by = .(Name, Date = Date.1)]

df[df1, on = .(Name, Date)]
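If you would rather add the total to df by reference than materialise the join result, an update join does it (a sketch using standard data.table syntax):

# write MyTotal straight into df by reference
df[df1, MyTotal := i.MyTotal, on = .(Name, Date)]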
Harri answered 4/4, 2018 at 22:54 Comment(0)
