R data.table grouping for lagged regression
Asked Answered
B

3

13

table with data (its a data.table object) that looks like the following :

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-02        1  0.003
   3: 2011-01-03        1  0.005
   4: 2011-01-04        1  0.007
   5: 2011-01-05        1  0.009
   6: 2011-01-06        1  0.011
   7: 2011-01-01        2  0.013
   8: 2011-01-02        2  0.015
   9: 2011-01-03        2  0.017
  10: 2011-01-04        2  0.019
  11: 2011-01-05        2  0.021
  12: 2011-01-06        2  0.023
  13: 2011-01-01        3  0.025
  14: 2011-01-02        3  0.027
  15: 2011-01-03        3  0.029
  16: 2011-01-04        3  0.031
  17: 2011-01-05        3  0.033
  18: 2011-01-06        3  0.035

The above can be created as :

DT = data.table(
   date=rep(as.Date('2011-01-01')+0:5,3) , 
   stock_id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3),
  logret=seq(0.001, by=0.002, len=18));

setkeyv(DT,c('stock_id','date'))

Of course the real table is larger with many more stock_ids and dates. The aim to to reshape this data table such that I can run a regression of all stockid log_returns with their corresponding log_returns with a lag of 1 day (or prior traded day in case of weekends).

The final results would look like :

      date         stock_id logret lagret
   1: 2011-01-01        1  0.001    NA
   2: 2011-01-02        1  0.003    0.001
   3: 2011-01-03        1  0.005    0.003
   ....
  16: 2011-01-04        3  0.031  0.029
  17: 2011-01-05        3  0.033  0.031
  18: 2011-01-06        3  0.035  0.033

I'm finding this data structure really tricky to build without mixing up my stockid.

Backdate answered 9/7, 2012 at 15:3 Comment(11)
lag between 2011-04-01 and 2011-04-04 is not 1 day.Atalayah
You are absolutely correct and to clarify - its because of the weekends when the stock markets are closed. So its effectively lag of prior date.Backdate
setkey(stockid,date). Then add the lagged column using := and roll on date-1. Then do the regression by stock.Gaspard
Hi Matthew which package is roll in ? I'm fairly new to R, converting from Matlab.Backdate
@Backdate Search ?data.table for roll.Gaspard
@Matthew I've been scratching my head around this for a while - there are no detailed examples highlighting exactly how the "roll" feature works. It sounds great in the overview (perhaps exactly what I need for this situation). But after having gone through the vignette, FAQ, help page and the intro document, I still can't figure out the exact syntax required.Backdate
The closest I've come is to figure out that this is some type of self join : DT[DT,lagret:=log_return,roll=TRUE] fills the lagret variable with dates close to epoch. I've eliminated the obvious stuff (i.e. Date are actually dates not strings, the keys are set correctly)Backdate
That's pretty close. Try DT[list(id,date-1),lagret:=log_return,roll=TRUE].Gaspard
@Matthew I tried this but am not able to see the lagret field that should have been created. More details on the original post.Backdate
Hm. Try DT[,lagret:=DT[list(id,date-1),logret,roll=TRUE][[3L]]]. Which version of data.table are you using? A full reproducible example would be nice - something pastable into the R session.Gaspard
@MatthewDowle I'm using 1.8.1. This latest version seems to work on my simple case. I'm still trying to adapt it to my larger data which doesnt have continuous dates.Backdate
D
21

Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.

What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

DT[, lagret := xxx]

where xxx has to be filled up with whatever you want to have in column lagret. So if we just want a new column that gives us the rows, we could just call

DT[, lagret := seq(from=1, to=nrow(DT))]

Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date, but since we want the previous value of each stock, we use date-1. Note that we have to set the keys first to do such a join:

setkeyv(DT,c('stock_id','date'))
DT[list(stock_id,date-1)]
    stock_id       date logret
 1:        1 2010-12-31     NA
 2:        1 2011-01-01  0.001
 3:        1 2011-01-02  0.003
 4:        1 2011-01-03  0.005
 5:        1 2011-01-04  0.007
 6:        1 2011-01-05  0.009
...

As you can see, we now have what we want. logret is now lagged by one period. But we actually want that in a new column lagret in DT, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret:

DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
...

This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

DT <- DT[-c(4, 5)]
setkeyv(DT,c('stock_id','date'))
DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011     NA
 5: 2011-01-01        2  0.013     NA
...

As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE:

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
          date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-06        1  0.011  0.005
 5: 2011-01-01        2  0.013     NA
...

Just have a look on the documentation on how roll=TRUE works exactly. In a nutshell: If it can't find the previous value (here logret for the 5th of January), it just takes the last available one (here from the 3rd of January).

Dualism answered 11/9, 2012 at 9:52 Comment(4)
+10. Couldn't have explained it better myself. The [[3L]] ugliness (having to hard code the 3L to ignore the grouping columns in the result) should be faster and more convenient when FR#1757 Add drop to [.data.table is implemented.Gaspard
It seems more intuitive to use [,logret] rather than [[3L]]. Is there a reason you shouldn't? In addition to being more intuitive, using [,logret] allows you to change column orders in the future without having to change a column number reference. (Of course if the column names change, then you'd have to update the column name reference, but at least it should be more obvious...)Nutcracker
Good point, I just used [[3L]] because this was the original proposal by Matthew and in my answer, I wanted to explain on how it works. However, I'm not sure if your option always used to work. If it does now, it is definitely the cleaner syntax, I agree.Dualism
something must have changed in data.table, because it seems that this answer doesn't work. there seems to be a mistake --it should be either DT[,lagret:=DT[list(stock_id,date-1),logret]] or DT[,lagret:=DT[list(stock_id,date-1)][[3L]]], but not DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]Boddie
M
4

Update:

In the current development version of data.table, v1.9.5, shift() is implemented #965, which takes two types at the moment type = "lag" (default) and type = "lead". See ?shift for more on usage.

With this, we can simply do:

# type="lag" may be omitted, as it is the default.
require(data.table) ## 1.9.5+
DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
#           date stock_id logret lagret
#  1: 2011-01-01        1  0.001     NA
#  2: 2011-01-02        1  0.003  0.001
#  3: 2011-01-03        1  0.005  0.003
#  4: 2011-01-04        1  0.007  0.005
#  5: 2011-01-05        1  0.009  0.007
#  6: 2011-01-06        1  0.011  0.009
#  7: 2011-01-01        2  0.013     NA
#  8: 2011-01-02        2  0.015  0.013
#  9: 2011-01-03        2  0.017  0.015
# 10: 2011-01-04        2  0.019  0.017
# 11: 2011-01-05        2  0.021  0.019
# 12: 2011-01-06        2  0.023  0.021
# 13: 2011-01-01        3  0.025     NA
# 14: 2011-01-02        3  0.027  0.025
# 15: 2011-01-03        3  0.029  0.027
# 16: 2011-01-04        3  0.031  0.029
# 17: 2011-01-05        3  0.033  0.031
# 18: 2011-01-06        3  0.035  0.033
Maiamaiah answered 7/1, 2015 at 19:26 Comment(2)
This is fine if your dates are regularly spaced (i.e. if they don't have 'gaps'). The accepted answer allows for such gaps.Selfaddressed
But this won't handle irregular time series, and would give wrong lags, correct?Transpadane
B
2

Thanks to Matthew Dowle's advice, I was able to use the following :

DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]

Results are :

             date stock_id logret lagret
 1: 2011-01-01        1  0.001     NA
 2: 2011-01-02        1  0.003  0.001
 3: 2011-01-03        1  0.005  0.003
 4: 2011-01-04        1  0.007  0.005
 5: 2011-01-05        1  0.009  0.007
 6: 2011-01-06        1  0.011  0.009
 7: 2011-01-01        2  0.013     NA
 8: 2011-01-02        2  0.015  0.013
 9: 2011-01-03        2  0.017  0.015
10: 2011-01-04        2  0.019  0.017
11: 2011-01-05        2  0.021  0.019
12: 2011-01-06        2  0.023  0.021
13: 2011-01-01        3  0.025     NA
14: 2011-01-02        3  0.027  0.025
15: 2011-01-03        3  0.029  0.027
16: 2011-01-04        3  0.031  0.029
17: 2011-01-05        3  0.033  0.031
18: 2011-01-06        3  0.035  0.033
Backdate answered 12/7, 2012 at 13:38 Comment(3)
Good, glad something works. I've raised FR#2142 to add tests and examples for the first cleaner syntax above in commments.Gaspard
Thank you again @MatthewDowle data.table a brilliant piece of software and hopefully as I wade through it for my research I'd add more scenarios that are not as well documented. Kudos on the effort to write it and to guide us newbies.Backdate
could someone please explain how this answer works? i'm having difficulty understanding what is going on.Smelter

© 2022 - 2024 — McMap. All rights reserved.