Relative windowed running sum through a data.table non-equi join
I have a data set (customerId, transactionDate, productId, purchaseQty) loaded into a data.table. For each row, I want to calculate the sum and the mean of purchaseQty over the prior 45 days.

        productId customerID transactionDate purchaseQty
 1:    870826    1186951      2016-03-28      162000
 2:    870826    1244216      2016-03-31        5000
 3:    870826    1244216      2016-04-08        6500
 4:    870826    1308671      2016-03-28      221367
 5:    870826    1308671      2016-03-29       83633
 6:    870826    1308671      2016-11-29       60500
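
For reference, a minimal copy-pasteable construction of this data (assuming transactionDate is stored as class Date, as suggested in the comments below):

library(data.table)
DT <- data.table(
  productId       = rep(870826L, 6L),
  customerID      = c(1186951L, 1244216L, 1244216L, 1308671L, 1308671L, 1308671L),
  transactionDate = as.Date(c("2016-03-28", "2016-03-31", "2016-04-08",
                              "2016-03-28", "2016-03-29", "2016-11-29")),
  purchaseQty     = c(162000L, 5000L, 6500L, 221367L, 83633L, 60500L)
)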

I'm looking for an output like this:

    productId customerID transactionDate purchaseQty    sumWindowPurchases
 1:    870826    1186951      2016-03-28      162000                162000
 2:    870826    1244216      2016-03-31        5000                  5000
 3:    870826    1244216      2016-04-08        6500                 11500
 4:    870826    1308671      2016-03-28      221367                221367
 5:    870826    1308671      2016-03-29       83633                305000
 6:    870826    1308671      2016-11-29       60500                 60500

So sumWindowPurchases contains the sum of purchaseQty for that customer/product over the 45-day window ending at the current transaction date. Once I have that working, throwing in the mean and the other calculations I need should be trivial.

I went back to my SQL roots and thought of a self join:

select   DT.customerId, DT.transactionDate, DT.productId, sum(DT1.purchaseQty)
from     DT
         inner join DT as DT1 on 
             DT.customerId = DT1.customerId
             and DT.productId = DT1.productId
             and DT1.transactionDate between dateadd(day, -45, DT.transactionDate) and DT.transactionDate
group by DT.customerId, DT.transactionDate, DT.productId

Trying to translate that into R using data.table syntax, I was hoping to do something like this:

DT1 <- DT #alias. have confirmed this is just a pointer
DT[DT1[DT1$transactionDate >= DT$transactionDate - 45],
   .(sum(DT1$purchaseQty)), 
   by = .(DT$customerId , DT$transactionDate ), 
   on = .(customerId , DT1$transactionDate <= DT$TransactionDate), 
   allow.cartesian = TRUE]

I guess I have a two-part question. What is the "R way" to do this? Is a data.table self-join the correct approach, or would I be better off trying to use the Reduce function?

I suspect the self-join is the only way to get the rolling 45-day window in there. So part 2 is that I need some help with the data.table syntax to explicitly reference which source table a column comes from, since it is a self-join and both sides have the same column names.

I've been studying the answers that Frank linked to and have come up with this expression:

DT[.(p = productId, c = customerID, t = transactionDate, start = transactionDate - 45),
        on = .(productId==p, customerID==c, transactionDate<=t, transactionDate>=start),
        allow.cartesian = TRUE, nomatch = 0]

which produces this output:

   productId customerID transactionDate purchaseQty transactionDate.1
1:    870826    1186951      2016-03-28      162000        2016-02-12
2:    870826    1244216      2016-03-31        5000        2016-02-15
3:    870826    1244216      2016-04-08        5000        2016-02-23
4:    870826    1244216      2016-04-08        6500        2016-02-23
5:    870826    1308671      2016-03-28      221367        2016-02-12
6:    870826    1308671      2016-03-29      221367        2016-02-13
7:    870826    1308671      2016-03-29       83633        2016-02-13
8:    870826    1308671      2016-11-29       60500        2016-10-15

This is very close to what I need for my final step. If I could sum the purchase quantities of this output, grouped by customer/product/transactionDate.1, I would have something useful. However, I can't get the syntax down for that, nor do I understand where the transactionDate.1 name is coming from.
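
Roughly, what I am trying to write is something like this (untested sketch; in the join result above, the transactionDate column carries the i-side value t, i.e. the current transaction date, so grouping on it should give one row per original transaction):

res <- DT[.(p = productId, c = customerID, t = transactionDate, start = transactionDate - 45),
          on = .(productId==p, customerID==c, transactionDate<=t, transactionDate>=start),
          allow.cartesian = TRUE, nomatch = 0]
# group by customer/product/current date and sum the matched purchase quantities
res[, .(sumWindowPurchases = sum(purchaseQty)), by = .(productId, customerID, transactionDate)]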

Melicent answered 6/12, 2016 at 23:56 Comment(4)
Do either of these offer a solution: #23598235 ...or... #27512104 – Drily
One option might be the sqldf package, which would allow you to use your SQL knowledge. It allows SQL commands on a data frame (a data.table counts). – Aldus
The IRanges approach looks very interesting. I will have to play with that and see if it fits my needs. Using sqldf is not an option for me. – Melicent
Fyi, the first step would be to use R's Date or IDate class instead of a string. Also, it's generally recommended to make a more easily reproduced example (one that could be copy-pasted into R) and to show explicitly what the desired output is. I think maybe a self-join is possible with data.table using a "non-equi" join, something like this: stackoverflow.com/a/38033399 – Betweenwhiles
First, we find how many transaction dates occur in the 45-day window prior to (and including) the current date, for each customer/product:

setDT(df)
# note: transactionDate must be sorted ascending within each group for findInterval()
df[, n := 1:.N - findInterval(transactionDate - 45, transactionDate), by = .(productId, customerID)]
df
#   productId customerID transactionDate purchaseQty n
#1:    870826    1186951      2016-03-28      162000 1
#2:    870826    1244216      2016-03-31        5000 1
#3:    870826    1244216      2016-04-08        6500 2
#4:    870826    1308671      2016-03-28      221367 1
#5:    870826    1308671      2016-03-29       83633 2
#6:    870826    1308671      2016-11-29       60500 1
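
To see what findInterval() contributes here, take customer 1308671's dates on their own (a small illustration, not part of the original pipeline):

d <- as.Date(c("2016-03-28", "2016-03-29", "2016-11-29"))
findInterval(d - 45, d)                 # 0 0 2 : dates falling on or before each date minus 45 days
seq_along(d) - findInterval(d - 45, d)  # 1 2 1 : rows landing inside each 45-day window (the n above)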

Next, we compute a rolling sum of purchaseQty with the variable window size n, adapting a great answer from here:

g <- function(x, window){
  b_pos <- seq_along(x) - window + 1  # begin position of each window
  cum <- cumsum(x)
  # sum(x[b_pos:i]) via a telescoping cumsum; b_pos never crosses a group
  # boundary because n is at most the within-group row number
  cum - cum[b_pos] + x[b_pos]
}
df[, sumWindowPurchases := g(purchaseQty, n)][, n := NULL]
df
#   productId customerID transactionDate purchaseQty sumWindowPurchases
#1:    870826    1186951      2016-03-28      162000             162000
#2:    870826    1244216      2016-03-31        5000               5000
#3:    870826    1244216      2016-04-08        6500              11500
#4:    870826    1308671      2016-03-28      221367             221367
#5:    870826    1308671      2016-03-29       83633             305000
#6:    870826    1308671      2016-11-29       60500              60500
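
The windowed mean asked about in the question follows from the same machinery, since n is exactly the number of rows in each window. A possible sketch (my extension, not in the answer as originally posted):

df[, n := 1:.N - findInterval(transactionDate - 45, transactionDate), by = .(productId, customerID)]
df[, `:=`(sumWindowPurchases  = g(purchaseQty, n),
          meanWindowPurchases = g(purchaseQty, n) / n)][, n := NULL]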

Data

structure(list(productId = c(870826L, 870826L, 870826L, 870826L, 
870826L, 870826L), customerID = c(1186951L, 1244216L, 1244216L, 
1308671L, 1308671L, 1308671L), transactionDate = structure(c(16888, 
16891, 16899, 16888, 16889, 17134), class = "Date"), purchaseQty = c(162000L, 
5000L, 6500L, 221367L, 83633L, 60500L)), .Names = c("productId", 
"customerID", "transactionDate", "purchaseQty"), row.names = c("1:", 
"2:", "3:", "4:", "5:", "6:"), class = "data.frame")
Participle answered 8/12, 2016 at 5:5 Comment(3)
Some very nice lateral thinking. Initial timing on sample data shows the interval-based solution to be almost twice as fast as the join. Thank you. I will try to come back with full timings after a production run. – Melicent
Can be simplified a lot with a self-join and by=.EACHI. – Mirianmirielle
@Arun, I think I finally got the self-join with .EACHI and the aggregate working (see other answer). This is so much like SQL, but so different at the same time. My brain hurts. Very powerful. Thank you for building this. – Melicent
This also works, and it could be considered simpler. It has the advantage of not requiring a sorted input set, and it has fewer dependencies.

I still don't understand why it produces two transactionDate columns in the output. This seems to be a byproduct of the "on" clause; in fact, the output columns appear to be all of the elements of the on clause (without their alias names), with the sum appended after them.

DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),
    on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
    .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
Melicent answered 9/12, 2016 at 2:37 Comment(1)
Adding date range columns to the original data table allows you to simply do df[df,.....] – Mirianmirielle
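
That comment might look like the following sketch (one interpretation, not the commenter's own code): store the window bounds on DT itself, so the i-table is just DT.

DT[, `:=`(tmin = transactionDate - 45, tmax = transactionDate)]
DT[DT,
   on = .(productId, customerID, transactionDate <= tmax, transactionDate >= tmin),
   .(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]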
