Create lagged variable in unbalanced panel data in R

I'd like to create a variable containing the value of a variable in the previous year within a group.

     id   date        value
1     1   1992          4.1  
2     1     NA          4.5  
3     1   1991          3.3  
4     1   1990          5.3  
5     1   1994          3.0  
6     2   1992          3.2  
7     2   1991          5.2  

value_lagged should be missing when the previous year is missing within a group - either because it is the first date within the group (as in rows 4 and 7), or because there are year gaps in the data (as in row 5). Also, value_lagged should be missing when the current date is missing (as in row 2).

The desired output is:

     id   date    value    value_lagged  
1     1   1992      4.1             3.3
2     1     NA      4.5              NA
3     1   1991      3.3             5.3
4     1   1990      5.3              NA
5     1   1994      3.0              NA
6     2   1992      3.2             5.2
7     2   1991      5.2              NA

For now, in R, I use the data.table package:

 DT = data.table(id    = c(1,1,1,1,1,2,2),
                 date  = c(1992,NA,1991,1990,1994,1992,1991),
                 value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2)
                )
 setkey(DT, id, date)
 DT[, value_lagged := DT[J(id, date - 1), value]]
 DT[is.na(date), value_lagged := NA]

It's fast but it seems somewhat error prone to me. I'd like to know if there are better alternatives using data.table, dplyr, or any other package. Thanks a lot!


In Stata, one would do:

    tsset id date
    gen value_lagged=L.value
Iowa answered 3/9, 2014 at 16:12 Comment(12)
unless you specifically want rows with a missing value to not have a matching lagged value, you likely meant to use is.na(date) not is.na(value)Silberman
@Iowa seems like you already have a decent solution - what exactly do you want to improve about it?Worsted
My solution seems somewhat convoluted and error prone to me - but maybe that's because I'm new to R. And even if it is a decent solution, there might be simpler ways to do it! In any case, it's an important question (at least for Stata users), so I thought there should be a definitive question/answer on Stack Overflow about it.Iowa
Here's a slight rewrite of your final two steps that's perhaps slightly less error prone (because you're not repeating DT multiple times), but I don't really see what else you want besides a function that's specifically designed to do what you want (which is what your Stata solution really is): DT[J(id, date + 1, val = value), val_lag := i.val][is.na(date), val_lag := NA]Worsted
Thanks! Wait, why doesn't it give rows with date==1995?Iowa
This is so weird. So D1=DT[J(id, date + 1, val = value)], D1[, val_lag:=val] does not give the same result as DT[J(id, date + 1, val = value)][, val_lag:=val]?Iowa
Well, I understand now - even though it's still a bit puzzling. Would you know whether one of the two syntaxes, DT[J(id, date + 1, val = value), val_lag := i.val] or DT[, value_lagged := DT[list(id, date - 1), value]], is preferable for what data.table does? Or are they completely equivalent?Iowa
@Iowa they are not equivalent DT[J(...), a := b] does the assignments for the matching portion of DT from the merge, whereas DT[J(...)] may be larger or smaller than DT itself.Worsted
the DT[J(...), a := b] syntax is better in that it avoids using the data.table name twice, which can be a source of errors; it's also going to be a tiny bit faster, but that's probably irrelevant for any taskWorsted
You should post yours as the answer (for data.table), as that's the way to go about it. There's no copy being made there at all, which is also a requirement from your Q.Lashanda
@Worsted actually I've recently realized your answer does not work in case there are no missing dates and no corresponding observation within each group (for instance -8 in my example)Iowa
@Lashanda wrote another method as an answer!Iowa

Create a function tlag, which lags a vector given a vector of times, and use it within groups defined by id:

library(dplyr)
tlag <- function(x, n = 1L, time) { 
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
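For reference, a quick self-contained check (a sketch that rebuilds the question's example data as df) showing that this reproduces the expected value_lagged column:

```r
library(dplyr)

# Rebuild the question's example data
df <- data.frame(id    = c(1, 1, 1, 1, 1, 2, 2),
                 date  = c(1992, NA, 1991, 1990, 1994, 1992, 1991),
                 value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2))

tlag <- function(x, n = 1L, time) {
  # match() finds the row whose time equals time - n within the group;
  # incomparables = NA stops an NA time from matching another NA
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

out <- df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
# value_lagged: 3.3 NA 5.3 NA NA 5.2 NA
```

The match() call handles both failure modes at once: a year gap produces no match (NA index), and an NA date produces NA arithmetic that is excluded by incomparables = NA.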
Iowa answered 29/9, 2014 at 20:21 Comment(7)
N=2e6L is quite small. 0.1 vs 0.23 seconds isn't that impressive.. Could you please try on 2e7L or even 2e8L?Lashanda
It's not impressive indeed. The point was really to have a more readable solution, and it turns out that there is no performance cost at 2e6. I've just rerun the test: tlag is still slightly faster at 1e7, but pure data.table is twice as fast at 1e8Iowa
The problem with readability is that it is not a proper measure; it varies from one person to another. I very much like the pure data.table solution, for example :).Lashanda
Readability varies from one person to the next, but that doesn't mean it's not a proper measure, right? Don't you agree that, for a colleague who browses my code, the tlag solution is clearer to read? What about confusing date - 1 and date + 1?Iowa
Would you post data.table's solution as an answer? I'd suppress it from my question, so that people would see three different answers for creating lagged variablesIowa
I don't think that's necessary. You've shown the solution as your attempt (which you felt could be improved - hence the question) and have posted a version which suits you better. And it's a great answer (I +1'd it already).Lashanda
This is good and I upvoted it, but there's no reason to define a custom lag function when a dozen major packages already contain one, including n-period lags.Amathiste

I'd probably tackle this using a join:

library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2), 
  date = c(1992, NA, 1991, 1990, 1994, 1992, 1991), 
  value = c(4.1, 4.5, 3.3, 5.3, 3.0, 3.2, 5.2)
)


last_year <- df %>% 
  filter(!is.na(date)) %>%
  mutate(date = date + 1, lagged_value = value, value = NULL)

df %>%
  left_join(last_year)
#> Joining by: c("id", "date")
#>   id date value lagged_value
#> 1  1 1992   4.1          3.3
#> 2  1   NA   4.5           NA
#> 3  1 1991   3.3          5.3
#> 4  1 1990   5.3           NA
#> 5  1 1994   3.0           NA
#> 6  2 1992   3.2          5.2
#> 7  2 1991   5.2           NA
Aerophagia answered 3/9, 2014 at 18:53 Comment(0)

Using data.table v1.9.5+, where joins no longer need keys to be set, this can be done as follows:

require(data.table) # v1.9.5+
DT[!is.na(date), value_lagged := 
         .SD[.(id = id, date = date - 1), value, on = c("id", "date")]]
#    id date value value_lagged
# 1:  1 1992   4.1          3.3
# 2:  1   NA   4.5           NA
# 3:  1 1991   3.3          5.3
# 4:  1 1990   5.3           NA
# 5:  1 1994   3.0           NA
# 6:  2 1992   3.2          5.2
# 7:  2 1991   5.2           NA

It's a variation of your idea. The trick is to use is.na() directly in i and to use .SD in j instead of DT. I've used the on= syntax, but the same idea can of course be implemented by setting keys as well.

Lashanda answered 30/7, 2015 at 21:13 Comment(2)
Hi! With data.table 1.9.5, installed 11-sep-2015, this gives the error "Error in [.data.table(.SD, .(id = id, date = date - 1), value, on = c("id", : unused argument (on = c("id", "date"))"Paring
@Paring Use remove.packages() to uninstall, reinstall and try again.Lashanda

An elegant and fast general solution to computations on irregular time series and unbalanced panels is now offered by the collapse package with the functions flag, fdiff and fgrowth. See a general answer for lagging an unbalanced panel here.

Now in your specific application there is an additional, rare complication: the panel is not only unbalanced, but your time variable contains a missing value, which means you don't know the time period in which that record was observed. In such cases just applying collapse::flag won't do, but you can generate a new id variable that places the missing value in a separate group with collapse::seqid. So my solution would be:

library(collapse)  
DF = data.frame(id    = c(1,1,1,1,1,2,2),
                date  = c(1992,NA,1991,1990,1994,1992,1991),
                value = c(4.1,4.5,3.3,5.3,3.0,3.2,5.2))

settransform(DF, l_value = flag(value, 1, g = seqid(date, order(id, date)), t = date))
DF
#>   id date value l_value
#> 1  1 1992   4.1     3.3
#> 2  1   NA   4.5      NA
#> 3  1 1991   3.3     5.3
#> 4  1 1990   5.3      NA
#> 5  1 1994   3.0      NA
#> 6  2 1992   3.2     5.2
#> 7  2 1991   5.2      NA

Created on 2021-07-10 by the reprex package (v0.3.0)

I am quite confident that this is still faster than data.table, but I have not tested this. Again, this data is rather odd; for most cases where the panel is unbalanced but the records are all identified by id and time, a simple flag(value, 1, id, as.integer(date)) will do fine and be strikingly fast. Note that you gain efficiency by ensuring the time variable is integer, as flag coerces non-integer time variables to factor, which can also remove the irregularity.
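To illustrate that simpler, fully identified case, a minimal sketch (DF2 is a hypothetical variant of the question's data with the NA date dropped, so id and date identify every record; the date column is kept integer to avoid the factor coercion mentioned above):

```r
library(collapse)

DF2 <- data.frame(id    = c(1, 1, 1, 1, 2, 2),
                  date  = c(1992L, 1991L, 1990L, 1994L, 1992L, 1991L),
                  value = c(4.1, 3.3, 5.3, 3.0, 3.2, 5.2))

# Fully identified lag within id along date: the gap at 1994
# (no 1993 record for id 1) yields NA rather than the 1992 value
DF2$l_value <- flag(DF2$value, 1, g = DF2$id, t = DF2$date)
DF2
# l_value: 3.3 5.3 NA NA 5.2 NA
```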

Ripen answered 10/7, 2021 at 18:30 Comment(0)
