Create lagged variable in unbalanced panel data in R

I'd like to create a variable containing the value of a variable in the previous year within a group.

     id   date        value
1     1   1992          4.1  
2     1     NA          4.5  
3     1   1991          3.3  
4     1   1990          5.3  
5     1   1994          3.0  
6     2   1992          3.2  
7     2   1991          5.2  

value_lagged should be missing when the previous year is missing within a group - either because it is the first date within the group (as in rows 4 and 7), or because there are year gaps in the data (as in row 5). Also, value_lagged should be missing when the current date is missing (as in row 2).

The desired output is:

     id   date    value    value_lagged  
1     1   1992      4.1             3.3
2     1     NA      4.5              NA
3     1   1991      3.3             5.3
4     1   1990      5.3              NA
5     1   1994      3.0              NA
6     2   1992      3.2             5.2
7     2   1991      5.2              NA

For now, in R, I use the data.table package:

 DT = data.table(id    = c(1,1,1,1,1,2,2),
                 date  = c(1992,NA,1991,1990,1994,1992,1991),
                 value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2)
                )
 setkey(DT, id, date)
 DT[, value_lagged := DT[J(id, date - 1), value]]
 DT[is.na(date), value_lagged := NA]

It's fast but it seems somewhat error prone to me. I'd like to know if there are better alternatives using data.table, dplyr, or any other package. Thanks a lot!


In Stata, one would do:

    tsset id date
    gen value_lagged=L.value
Iowa answered 3/9, 2014 at 16:12 Comment(12)
unless you specifically want rows with a missing value to not have a matching lagged value, you likely meant to use is.na(date) not is.na(value)Silberman
@Iowa seems like you already have a decent solution - what exactly do you want to improve about it?Worsted
My solution seems somewhat convoluted and error prone to me - but maybe that's because I'm new to R. And even if it is a decent solution, there might be simpler ways to do it! In any case, it's an important question (at least for Stata users), so I thought there should be a definitive question/answer on Stack Overflow about it.Iowa
Here's a slight rewrite of your final two steps that's perhaps slightly less error prone (because you're not repeating DT multiple times), but I don't really see what else you want besides a function that's specifically designed to do what you want (which is what your Stata solution really is): DT[J(id, date + 1, val = value), val_lag := i.val][is.na(date), val_lag := NA]Worsted
Thanks! Wait, why doesn't it give rows with date==1995?Iowa
This is so weird. So D1=DT[J(id, date + 1, val = value)], D1[, val_lag:=val] does not give the same result as DT[J(id, date + 1, val = value)][, val_lag:=val]?Iowa
Well, I understand now - even though it's still a bit puzzling. Would you know whether one of the two syntaxes, DT[J(id, date + 1, val = value), val_lag := i.val] or DT[, value_lagged := DT[list(id, date - 1), value]], is preferable for what data.table does? Or are they completely equivalent?Iowa
@Iowa they are not equivalent DT[J(...), a := b] does the assignments for the matching portion of DT from the merge, whereas DT[J(...)] may be larger or smaller than DT itself.Worsted
the DT[J(...), a := b] syntax is better in that it avoids using the data.table name twice, which can be a source of errors; it's also going to be a tiny bit faster, but that's probably irrelevant for any taskWorsted
You should post yours as the answer (for data.table), as that's the way to go about it. There's no copy being made there at all, which is also a requirement from your Q.Lashanda
@Worsted actually I've recently realized your answer does not work in case there are no missing dates and no corresponding observation within each group (for instance -8 in my example)Iowa
@Lashanda wrote another method as an answer!Iowa

Create a function tlag, which lags a vector given a vector of times, and use it within groups defined by id:

library(dplyr)
tlag <- function(x, n = 1L, time) { 
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
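For reference, a quick self-contained check (a sketch that rebuilds the question's example data as df) showing that this reproduces the expected value_lagged column:

```r
library(dplyr)

# Rebuild the question's example data
df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2),
                 date  = c(1992, NA, 1991, 1990, 1994, 1992, 1991),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))

tlag <- function(x, n = 1L, time) {
  # match() finds the row whose time equals time - n within the group;
  # incomparables = NA stops an NA time from matching another NA
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

out <- df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
# value_lagged: 3.3 NA 5.3 NA NA 5.2 NA
```

The match() call handles both failure modes at once: a year gap produces no match (NA index), and an NA date produces NA arithmetic that is excluded by incomparables = NA.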
Iowa answered 29/9, 2014 at 20:21 Comment(7)
N=2e6L is quite small. 0.1 vs 0.23 seconds isn't that impressive.. Could you please try on 2e7L or even 2e8L?Lashanda
It's not impressive indeed. The point was really to have a more readable solution, and it turns out that there is no performance cost at 2e6. I've just rerun the test: tlag is still slightly faster at 1e7, but pure data.table is twice as fast at 1e8Iowa
The problem with readability is that it is not a proper measure; it varies from one person to another. I very much like the pure data.table solution, for example :).Lashanda
Readability varies from one person to the next, but that doesn't mean it's not a proper measure, right? Don't you agree that, for a colleague who browses my code, the tlag solution is clearer to read? What about confusing date - 1 and date + 1?Iowa
Would you post data.table's solution as an answer? I'd suppress it from my question, so that people would see three different answers for creating lagged variablesIowa
I don't think that's necessary. You've shown the solution as your attempt (which you felt could be improved - hence the question) and have posted a version which suits you better. And it's a great answer (I +1'd it already).Lashanda
This is good and I upvoted it, but there's no reason to define a custom lag function when a dozen major packages already contain one, including n-period lags.Amathiste

I'd probably tackle this using a join:

library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2), 
  date = c(1992, NA, 1991, 1990, 1994, 1992, 1991), 
  value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2)
)


last_year <- df %>% 
  filter(!is.na(date)) %>%
  mutate(date = date + 1, lagged_value = value, value = NULL)

df %>%
  left_join(last_year)
#> Joining by: c("id", "date")
#>   id date value lagged_value
#> 1  1 1992   4.1          3.3
#> 2  1   NA   4.5           NA
#> 3  1 1991   3.3          5.3
#> 4  1 1990   5.3           NA
#> 5  1 1994   3.0           NA
#> 6  2 1992   3.2          5.2
#> 7  2 1991   5.2           NA
Aerophagia answered 3/9, 2014 at 18:53 Comment(0)

Using data.table v1.9.5+, where joins no longer need keys to be set, this can be done as follows:

require(data.table) # v1.9.5+
DT[!is.na(date), value_lagged := 
         .SD[.(id = id, date = date - 1), value, on = c("id", "date")]]
#    id date value value_lagged
# 1:  1 1992   4.1          3.3
# 2:  1   NA   4.5           NA
# 3:  1 1991   3.3          5.3
# 4:  1 1990   5.3           NA
# 5:  1 1994   3.0           NA
# 6:  2 1992   3.2          5.2
# 7:  2 1991   5.2           NA

It's a variation of your idea. The trick is to use is.na() directly in i and to use .SD in j instead of DT. I've used the on= syntax, but the same idea can of course be implemented by setting keys as well.

Lashanda answered 30/7, 2015 at 21:13 Comment(2)
Hi! With data.table 1.9.5, installed 11-sep-2015, this gives the error "Error in [.data.table(.SD, .(id = id, date = date - 1), value, on = c("id", : unused argument (on = c("id", "date"))"Paring
@Paring Use remove.packages() to uninstall, reinstall and try again.Lashanda

An elegant and fast general solution to computations on irregular time series and unbalanced panels is now offered by the collapse package with the functions flag, fdiff and fgrowth. See a general answer for lagging an unbalanced panel here.

Now in your specific application there is an additional, rare complication: the panel is not only unbalanced, but your time variable contains a missing value, which means you don't know the time period in which that record was observed. In such cases just applying collapse::flag won't do, but you can generate a new id variable that places the missing value in a separate group with collapse::seqid. So my solution would be:

library(collapse)  
DF = data.frame(id    = c(1,1,1,1,1,2,2),
                date  = c(1992,NA,1991,1990,1994,1992,1991),
                value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2))

settransform(DF, l_value = flag(value, 1, g = seqid(date, order(id, date)), t = date))
DF
#>   id date value l_value
#> 1  1 1992   4.1     3.3
#> 2  1   NA   4.5      NA
#> 3  1 1991   3.3     5.3
#> 4  1 1990   5.3      NA
#> 5  1 1994   3.0      NA
#> 6  2 1992   3.2     5.2
#> 7  2 1991   5.2      NA

Created on 2021-07-10 by the reprex package (v0.3.0)

I am quite confident that this is still faster than data.table, but I have not tested this. Again, this data is rather odd; for most cases where the panel is unbalanced but the records are all identified by id and time, a simple flag(value, 1, id, as.integer(date)) will do fine and be strikingly fast. Note that you gain efficiency by ensuring the time variable is integer, as flag coerces non-integer time variables to factor, which can also remove the irregularity.
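To illustrate that simpler, fully identified case, a minimal sketch (DF2 is a hypothetical variant of the question's data with the NA date dropped, so id and date identify every record; the date column is kept integer to avoid the factor coercion mentioned above):

```r
library(collapse)

DF2 <- data.frame(id    = c(1, 1, 1, 1, 2, 2),
                  date  = c(1992L, 1991L, 1990L, 1994L, 1992L, 1991L),
                  value = c(4.1, 3.3, 5.3, 3.0, 3.2, 5.2))

# Fully identified lag within id along date: the gap at 1994
# (no 1993 record for id 1) yields NA rather than the 1992 value
DF2$l_value <- flag(DF2$value, 1, g = DF2$id, t = DF2$date)
DF2
# l_value: 3.3 5.3 NA NA 5.2 NA
```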

Ripen answered 10/7, 2021 at 18:30 Comment(0)
