R data.table grouping for lagged regression

Asked 9/7, 2012 at 15:3 Answered 7/1, 2015 at 19:26

table with data (its a data.table object) that looks like the following :

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-02        1  0.003
   3: 2011-01-03        1  0.005
   4: 2011-01-04        1  0.007
   5: 2011-01-05        1  0.009
   6: 2011-01-06        1  0.011
   7: 2011-01-01        2  0.013
   8: 2011-01-02        2  0.015
   9: 2011-01-03        2  0.017
  10: 2011-01-04        2  0.019
  11: 2011-01-05        2  0.021
  12: 2011-01-06        2  0.023
  13: 2011-01-01        3  0.025
  14: 2011-01-02        3  0.027
  15: 2011-01-03        3  0.029
  16: 2011-01-04        3  0.031
  17: 2011-01-05        3  0.033
  18: 2011-01-06        3  0.035

The above can be created as :

DT = data.table(
   date=rep(as.Date('2011-01-01')+0:5,3) , 
   stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
  logret=seq(0.001, by=0.002, len=18));

setkeyv(DT,c('stock_id','date'))

Of course the real table is larger with many more stock_ids and dates. The aim to to reshape this data table such that I can run a regression of all stockid log_returns with their corresponding log_returns with a lag of 1 day (or prior traded day in case of weekends).

The final results would look like :

      date         stock_id logret lagret
   1: 2011-01-01        1  0.001    NA
   2: 2011-01-02        1  0.003    0.001
   3: 2011-01-03        1  0.005    0.003
   ....
  16: 2011-01-04        3  0.031  0.029
  17: 2011-01-05        3  0.033  0.031
  18: 2011-01-06        3  0.035  0.033

I'm finding this data structure really tricky to build without mixing up my stockid.

Backdate answered 9/7, 2012 at 15:3 Comment(11)

lag between 2011-04-01 and 2011-04-04 is not 1 day. – Atalayah 9/7, 2012 at 15:11

You are absolutely correct and to clarify - its because of the weekends when the stock markets are closed. So its effectively lag of prior date. – Backdate 9/7, 2012 at 15:14

setkey(stockid,date). Then add the lagged column using := and roll on date-1. Then do the regression by stock. – Gaspard 9/7, 2012 at 15:22

Hi Matthew which package is roll in ? I'm fairly new to R, converting from Matlab. – Backdate 9/7, 2012 at 15:54

@Backdate Search ?data.table for roll. – Gaspard 9/7, 2012 at 17:18

@Matthew I've been scratching my head around this for a while - there are no detailed examples highlighting exactly how the "roll" feature works. It sounds great in the overview (perhaps exactly what I need for this situation). But after having gone through the vignette, FAQ, help page and the intro document, I still can't figure out the exact syntax required. – Backdate 10/7, 2012 at 17:16

The closest I've come is to figure out that this is some type of self join : DT[DT,lagret:=log_return,roll=TRUE] fills the lagret variable with dates close to epoch. I've eliminated the obvious stuff (i.e. Date are actually dates not strings, the keys are set correctly) – Backdate 10/7, 2012 at 17:26

That's pretty close. Try DT[list(id,date-1),lagret:=log_return,roll=TRUE]. – Gaspard 10/7, 2012 at 20:29

@Matthew I tried this but am not able to see the lagret field that should have been created. More details on the original post. – Backdate 12/7, 2012 at 10:13

Hm. Try DT[,lagret:=DT[list(id,date-1),logret,roll=TRUE][[3L]]]. Which version of data.table are you using? A full reproducible example would be nice - something pastable into the R session. – Gaspard 12/7, 2012 at 10:41

@MatthewDowle I'm using 1.8.1. This latest version seems to work on my simple case. I'm still trying to adapt it to my larger data which doesnt have continuous dates. – Backdate 12/7, 2012 at 13:39

Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.

What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx has to be filled up with whatever you want to have in column lagret. So if we just want a new column that gives us the rows, we could just call

DT[, lagret := seq(from=1, to=nrow(DT))]

Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date, but since we want the previous value of each stock, we use date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
    stock_id       date logret
 1:        1 2010-12-31     NA
 2:        1 2011-01-01  0.001
 3:        1 2011-01-02  0.003
 4:        1 2011-01-03  0.005
 5:        1 2011-01-04  0.007
 6:        1 2011-01-05  0.009
...

As you can see, we now have what we want. logret is now lagged by one period. But we actually want that in a new column lagret in DT, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
...

This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011     NA
 5: 2011-01-01        2  0.013     NA
...

As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011  0.005
 5: 2011-01-01        2  0.013     NA
...

Just have a look on the documentation on how roll=TRUE works exactly. In a nutshell: If it can't find the previous value (here logret for the 5th of January), it just takes the last available one (here from the 3rd of January).

Dualism answered 11/9, 2012 at 9:52 Comment(4)

+10. Couldn't have explained it better myself. The [[3L]] ugliness (having to hard code the 3L to ignore the grouping columns in the result) should be faster and more convenient when FR#1757 Add drop to [.data.table is implemented. – Gaspard 11/9, 2012 at 11:5

It seems more intuitive to use [,logret] rather than [[3L]]. Is there a reason you shouldn't? In addition to being more intuitive, using [,logret] allows you to change column orders in the future without having to change a column number reference. (Of course if the column names change, then you'd have to update the column name reference, but at least it should be more obvious...) – Nutcracker 4/2, 2013 at 16:30

Good point, I just used [[3L]] because this was the original proposal by Matthew and in my answer, I wanted to explain on how it works. However, I'm not sure if your option always used to work. If it does now, it is definitely the cleaner syntax, I agree. – Dualism 5/2, 2013 at 8:46

something must have changed in data.table, because it seems that this answer doesn't work. there seems to be a mistake --it should be either DT[,lagret:=DT[list(stock_id,date-1),logret]] or DT[,lagret:=DT[list(stock_id,date-1)][[3L]]], but not DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]] – Boddie 25/3, 2015 at 5:8

Update:

In the current development version of data.table, v1.9.5, shift() is implemented #965, which takes two types at the moment type = "lag" (default) and type = "lead". See ?shift for more on usage.

With this, we can simply do:

# type="lag" may be omitted, as it is the default.
require(data.table) ## 1.9.5+
DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
#           date stock_id logret lagret
#  1: 2011-01-01        1  0.001     NA
#  2: 2011-01-02        1  0.003  0.001
#  3: 2011-01-03        1  0.005  0.003
#  4: 2011-01-04        1  0.007  0.005
#  5: 2011-01-05        1  0.009  0.007
#  6: 2011-01-06        1  0.011  0.009
#  7: 2011-01-01        2  0.013     NA
#  8: 2011-01-02        2  0.015  0.013
#  9: 2011-01-03        2  0.017  0.015
# 10: 2011-01-04        2  0.019  0.017
# 11: 2011-01-05        2  0.021  0.019
# 12: 2011-01-06        2  0.023  0.021
# 13: 2011-01-01        3  0.025     NA
# 14: 2011-01-02        3  0.027  0.025
# 15: 2011-01-03        3  0.029  0.027
# 16: 2011-01-04        3  0.031  0.029
# 17: 2011-01-05        3  0.033  0.031
# 18: 2011-01-06        3  0.035  0.033

Maiamaiah answered 7/1, 2015 at 19:26 Comment(2)

This is fine if your dates are regularly spaced (i.e. if they don't have 'gaps'). The accepted answer allows for such gaps. – Selfaddressed 13/12, 2018 at 1:47

But this won't handle irregular time series, and would give wrong lags, correct? – Transpadane 3/5, 2020 at 22:24

Thanks to Matthew Dowle's advice, I was able to use the following :

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]

Results are :

             date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
 6: 2011-01-06        1  0.011  0.009
 7: 2011-01-01        2  0.013     NA
 8: 2011-01-02        2  0.015  0.013
 9: 2011-01-03        2  0.017  0.015
10: 2011-01-04        2  0.019  0.017
11: 2011-01-05        2  0.021  0.019
12: 2011-01-06        2  0.023  0.021
13: 2011-01-01        3  0.025     NA
14: 2011-01-02        3  0.027  0.025
15: 2011-01-03        3  0.029  0.027
16: 2011-01-04        3  0.031  0.029
17: 2011-01-05        3  0.033  0.031
18: 2011-01-06        3  0.035  0.033

Backdate answered 12/7, 2012 at 13:38 Comment(3)

Good, glad something works. I've raised FR#2142 to add tests and examples for the first cleaner syntax above in commments. – Gaspard 12/7, 2012 at 14:6

Thank you again @MatthewDowle data.table a brilliant piece of software and hopefully as I wade through it for my research I'd add more scenarios that are not as well documented. Kudos on the effort to write it and to guide us newbies. – Backdate 12/7, 2012 at 14:11

could someone please explain how this answer works? i'm having difficulty understanding what is going on. – Smelter 23/8, 2012 at 5:56

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Update:

Recommended topics

Hot tags