Compute rolling sum by id variables, with missing timepoints
Asked Answered
G

4

16

I'm trying to learn R and there are a few things I've done for 10+ years in SAS that I cannot quite figure out the best way to do in R. Take this data:

 id  class           t  count  desired
 --  -----  ----------  -----  -------
  1      A  2010-01-15      1        1
  1      A  2010-02-15      2        3
  1      B  2010-04-15      3        3
  1      B  2010-09-15      4        4
  2      A  2010-01-15      5        5
  2      B  2010-06-15      6        6
  2      B  2010-08-15      7       13
  2      B  2010-09-15      8       21

I want to calculate the column desired as a rolling sum by id, class, and within a 4 months rolling window. Notice that not all months are present for each combination of id and class.

In SAS I'd typically do this in one of 2 ways:

  1. RETAIN plus a by id & class.
  2. PROC SQL with a left join from df as df1 to df as df2 on id, class and the df1.d-df2.d within the appropriate window

What is the best R approach to this type of problem?

t <- as.Date(c("2010-01-15","2010-02-15","2010-04-15","2010-09-15",
               "2010-01-15","2010-06-15","2010-08-15","2010-09-15"))
class <- c("A","A","B","B","A","B","B","B")
id <- c(1,1,1,1,2,2,2,2)
count <- seq(1,8,length.out=8)
desired <- c(1,3,3,4,5,6,13,21)
df <- data.frame(id,class,t,count,desired)
Godsey answered 30/5, 2013 at 15:26 Comment(11)
what's d? its definition isn't included in your setup code.Dukes
Take a look at the zoo package. It can do these rolling summaries on time aligned data fairly easily. If you're comfortable with sql, you can use the sqldf package.Poised
@MatthewPlourde, I think @Godsey mean df <- data.frame(t,class, id, count ,desired)Terresaterrestrial
One of the things with learning R from SAS (or the other way around) is to remember that they work sufficiently differently that things you need PROCs for in SAS (like PROC MEANS for a sum or mean across rows) are trivial to do in 'data step' R (ie, direct matrix programming); but the other way around as well. Things that are easy to do with the automatic looping through SAS rows are not easy to do with 'data step' R, but have packages (ie, procs) to handle.Bedard
if you're comfortable with proc sql then check videos #027 and #028 for an intro to sqldf and merge with R :)Piled
@Justin, this question finally made me interested enough to look into zoo, but it wasn't obvious to me how to use it here to get a rolling sum in the desired fashion, specifically how to handle either the missing values or the first few data points in the desired way. Any suggestions?Bierman
@Aaron off hand, I don't know exactly. Generally, I find myself using zoo and data.table together for the split-apply-combine strategy when I have sparse dates...Poised
Yes, sorry, I started with t and changed it to d...Godsey
I'm already using the sqldf package. For one thing, I still haven't quite figured out how to use R functions within a sqldf query, the same way I could use one of many SAS functions within PROC SQL. In this case, my preferred solution would involve using an R version of SAS' INTCK function to do dates arithmetic in a more sophisticated way than described in my original example. But since I'm learning R from zero, I'd have a preference for learning how to do things the R way (see Joe's comment)Godsey
Hi @ADJ, I'm glad my solution worked for you, but I'd really like to see if anyone has any better ideas. I'm going to change the title and add a bounty; if you have objection to my new title, feel free to change it back.Bierman
Honestly, for stuff like this, I'd just keep using SAS. :)Caulk
B
5

I'm almost embarrassed to post this. I'm usually pretty good as these, but there's got to be a better way.

This first uses zoo's as.yearmon to get the dates in terms of just month and year, then reshapes it to get one column for each id/class combination, then fills in with zeros before, after, and for missing months, then uses zoo to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21
Bierman answered 30/5, 2013 at 22:1 Comment(0)
M
18

Here are a few solutions:

1) zoo Using ave, for each group create a monthly series, m, by merging the original series, z, with a grid, g. Then calculate the rolling sum and retain only the original time points:

library(zoo)
f <- function(i) { 
    z <- with(df[i, ], zoo(count, t))
    g <- zoo(, seq(start(z), end(z), by = "month"))
    m <- merge(z, g)
    window(rollapplyr(m, 4, sum, na.rm = TRUE, partial = TRUE), time(z))
}
df$desired <- ave(1:nrow(df), df$id, df$class, FUN = f)

which gives:

> df
  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note We have assumed the times are ordered within each group (as in the question). If that is not so then sort df first.

2) sqldf

library(sqldf)
sqldf("select id, class, a.t, a.'count', sum(b.'count') desired 
   from df a join df b 
   using(id, class) 
   where a.t - b.t between 0 and 100
   group by id, class, a.t")

which gives:

  id class          t count desired
1  1     A 2010-01-15     1       1
2  1     A 2010-02-15     2       3
3  1     B 2010-04-15     3       3
4  1     B 2010-09-15     4       4
5  2     A 2010-01-15     5       5
6  2     B 2010-06-15     6       6
7  2     B 2010-08-15     7      13
8  2     B 2010-09-15     8      21

Note: If the merge should be too large to fit into memory then use sqldf("...", dbname = tempfile()) to cause the intermediate results to be stored in a database which it creates on the fly and automatically destroys afterwards.

3) Base R The sqldf solution motivates this base R solution which just translates the SQL into R:

m <- merge(df, df, by = 1:2)
s <- subset(m, t.x - t.y >= 0 & t.x - t.y <= 100)
ag <- aggregate(count.y ~ t.x + class + id, s, sum)
names(ag) <- c("t", "class", "id", "count", "desired")

The result is:

> ag
           t class id count desired
1 2010-01-15     A  1     1       1
2 2010-02-15     A  1     2       3
3 2010-04-15     B  1     3       3
4 2010-09-15     B  1     4       4
5 2010-01-15     A  2     5       5
6 2010-06-15     B  2     6       6
7 2010-08-15     B  2     7      13
8 2010-09-15     B  2     8      21

Note: This does do a merge in memory which might be a problem if the data set is very large.

UPDATE: Minor simplifications of first solution and also added second solution.

UPDATE 2: Added third solution.

Mammillate answered 4/6, 2013 at 19:32 Comment(3)
Nice! Good use of ave, which I perhaps don't use as often as I should, plus a couple ways of using zoo that are new to me. Thanks!Bierman
Also thanks for your work on the zoo package -- it's appreciated!Bierman
Bounty awarded, to a well-deserving answer. Thanks!Bierman
B
5

I'm almost embarrassed to post this. I'm usually pretty good as these, but there's got to be a better way.

This first uses zoo's as.yearmon to get the dates in terms of just month and year, then reshapes it to get one column for each id/class combination, then fills in with zeros before, after, and for missing months, then uses zoo to get the rolling sum, then pulls out just the desired months and merges back with the original data frame.

library(reshape2)
library(zoo)
df$yearmon <- as.yearmon(df$t)
dfa <- dcast(id + class ~ yearmon, data=df, value.var="count")
ida <- dfa[,1:2]
dfa <- t(as.matrix(dfa[,-c(1:2)]))
months <- with(df, seq(min(yearmon)-3/12, max(yearmon)+3/12, by=1/12))
dfb <- array(dim=c(length(months), ncol(dfa)), 
             dimnames=list(paste(months), colnames(dfa)))
dfb[rownames(dfa),] <- dfa
dfb[is.na(dfb)] <- 0
dfb <- rollsumr(dfb,4, fill=0)
rownames(dfb) <- paste(months)
dfb <- dfb[rownames(dfa),]
dfc <- cbind(ida, t(dfb))
dfc <- melt(dfc, id.vars=c("class", "id"))
names(dfc)[3:4] <- c("yearmon", "desired2")
dfc$yearmon <- as.yearmon(dfc$yearmon)
out <- merge(df,dfc)

> out
  id class  yearmon          t count desired desired2
1  1     A Feb 2010 2010-02-15     2       3        3
2  1     A Jan 2010 2010-01-15     1       1        1
3  1     B Apr 2010 2010-04-15     3       3        3
4  1     B Sep 2010 2010-09-15     4       4        4
5  2     A Jan 2010 2010-01-15     5       5        5
6  2     B Aug 2010 2010-08-15     7      13       13
7  2     B Jun 2010 2010-06-15     6       6        6
8  2     B Sep 2010 2010-09-15     8      21       21
Bierman answered 30/5, 2013 at 22:1 Comment(0)
D
4

A farily efficient answer to this problem could be found using the data.table library.

##Utilize the data.table package
library("data.table")
data <- data.table(t,class,id,count,desired)[order(id,class)]

##Assign each customer an ID
data[,Cust_No:=.GRP,by=c("id","class")]

##Create "list" of comparison dates and values
Ref <- data[,list(Compare_Value=list(I(count)),Compare_Date=list(I(t))), by=c("id","class")]

##Compare two lists and see of the compare date is within N days
data$Roll.Val <- mapply(FUN = function(RD, NUM) {
  d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
  sum((d <= 0 & d >= -124)*Ref$Compare_Value[[NUM]])
}, RD = data$t,NUM=data$Cust_No)

##Print out data
data <- data[,list(id,class,t,count,desired,Roll.Val)][order(id,class)]
data

id class          t count desired Roll.Val
1:  1     A 2010-01-15     1       1        1
2:  1     A 2010-02-15     2       3        3
3:  1     B 2010-04-15     3       3        3
4:  1     B 2010-09-15     4       4        4
5:  2     A 2010-01-15     5       5        5
6:  2     B 2010-06-15     6       6        6
7:  2     B 2010-08-15     7      13       13
8:  2     B 2010-09-15     8      21       21
Danny answered 17/4, 2014 at 14:21 Comment(1)
This is a 124 days rolling time period. Obviosuly this isn't exactly 4 months, but the code can be easily modified.Danny
D
0

With runner package one can calculate everything on rolling windows. Below example of using sum_run

library(runner)
df %>%
  group_by(id) %>%
  mutate(
    output = sum_run(count, k = 30*4, idx = t)   
  )

# <dbl> <fct> <date>     <dbl>   <dbl>  <dbl>
#     1 A     2010-01-15     1       1      1
#     1 A     2010-02-15     2       3      3
#     1 B     2010-04-15     3       3      6
#     1 B     2010-09-15     4       4      4
#     2 A     2010-01-15     5       5      5
#     2 B     2010-06-15     6       6      6
#     2 B     2010-08-15     7      13     13
#     2 B     2010-09-15     8      21     21
Desolation answered 16/1, 2020 at 19:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.