Lagging Forward in plm

This is a very simple question, but I haven't been able to find a definitive answer, so I thought I would ask it. I use the plm package for dealing with panel data. I am attempting to use the lag function to lag a variable FORWARD in time (the default is to retrieve the value from the previous period, and I want the value from the NEXT). I found a number of old articles/questions (circa 2009) suggesting that this is possible by using k=-1 as an argument. However, when I attempt this, I get an error.

Sample code:

library(plm)
df<-as.data.frame(matrix(c(1,1,1,2,2,3,20101231,20111231,20121231,20111231,20121231,20121231,50,60,70,120,130,210),nrow=6,ncol=3))
names(df)<-c("individual","date","data")
df$date<-as.Date(as.character(df$date),format="%Y%m%d")
df.plm<-pdata.frame(df,index=c("individual","date"))

Lagging:

lag(df.plm$data,0)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31 
         50           60           70          120          130          210

lag(df.plm$data,1)
##returns
1-2010-12-31 1-2011-12-31 1-2012-12-31 2-2011-12-31 2-2012-12-31 3-2012-12-31 
         NA           50           60           NA          120           NA

lag(df.plm$data,-1)
##returns
Error in rep(1, ak) : invalid 'times' argument

I've also read that plm.data has replaced pdata.frame for some applications in plm. However, plm.data doesn't seem to work with the lag function at all:

df.plm<-plm.data(df,indexes=c("individual","date"))
lag(df.plm$data,1)
##returns
[1]  50  60  70 120 130 210
attr(,"tsp")
[1] 0 5 1

I would appreciate any help. If anyone has another suggestion for a package to use for lagging, I'm all ears. However, I do love plm because it automagically deals with lagging across multiple individuals and skips gaps in the time series.

Lacunar answered 23/10, 2012 at 19:3 Comment(5)
I don't know that package, but lag is a generic from the stats package, so the relevant code will be plm:::lag.pseries, which may not be coded to handle negative values for k. – Trilbie
Type help(package=plm) and read that lag.pseries has its second argument assigned to "k", so you should try to name your 'lag' argument (and k will default to 1). – Rubric
DWin - naming the argument (lag(df.plm$data,k=-1)) results in the same error. GSee - there don't appear to be any restrictions on what k can be, but the function does use the length of the vector, so you might be correct. – Lacunar
Lagging forward (leading values) has now been implemented in the development version of plm (r-forge.r-project.org/R/?group_id=406). – Sami
Thanks Helix, that is very good to know! – Lacunar

EDIT2: lagging forward (= leading values) is implemented in plm CRAN releases >= 1.6-4. The functions are lead() and lag() (the latter with a negative integer for leading values).

Watch out for other attached packages that use the same function names (dplyr, for example, also exports lag and lead). To be sure, you can refer to the function by its full namespace, e.g., plm::lead.

Examples from ?plm::lead:

# First, create a pdata.frame
data("EmplUK", package = "plm")
Em <- pdata.frame(EmplUK)

# Then extract a series, which becomes additionally a pseries
z <- Em$output
class(z)

# compute negative lags (= leading values)
lag(z, -1)
lead(z, 1) # same as line above
identical(lead(z, 1), lag(z, -1)) # TRUE
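
Applied to data shaped like the question's, this could look roughly as follows. It is only a sketch, assuming plm >= 1.6-4 is installed; the data frame is a made-up copy of the question's data, and the integer year column is an assumption that replaces the original dates:

```r
library(plm)

# Made-up copy of the question's data, with an integer "year" index
# (an assumption for this sketch) instead of the original Date column.
df <- data.frame(
    individual = c(1, 1, 1, 2, 2, 3),
    year       = c(2010L, 2011L, 2012L, 2011L, 2012L, 2012L),
    data       = c(50, 60, 70, 120, 130, 210)
)
df.plm <- pdata.frame(df, index = c("individual", "year"))

# Value from the NEXT period within each individual; NA where there is none.
next_data <- plm::lead(df.plm$data, 1)
next_data
```

The last period of each individual has no following observation, so it comes back as NA.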
Sami answered 18/8, 2015 at 22:44 Comment(0)

The collapse package on CRAN has a C++-based function flag, plus the associated lag/lead operators L and F. It supports sequences of lags and leads in a single call (positive and negative n values) and works with plm's pseries and pdata.frame classes. Performance: roughly 100x faster than plm and 10x faster than data.table (the fastest in R at the time of writing). Example:

library(collapse)
pwlddev <- plm::pdata.frame(wlddev, index = c("iso3c", "year"))
head(flag(pwlddev$LIFEEX, -1:1))     # A sequence of lags and leads
             F1     --     L1
ABW-1960 66.074 65.662     NA
ABW-1961 66.444 66.074 65.662
ABW-1962 66.787 66.444 66.074
ABW-1963 67.113 66.787 66.444
ABW-1964 67.435 67.113 66.787
ABW-1965 67.762 67.435 67.113

head(L(pwlddev$LIFEEX, -1:1))        # Same as above
head(L(pwlddev, -1:1, cols = 9:12))  # Computing on columns 9 through 12
         iso3c year F1.PCGDP PCGDP L1.PCGDP F1.LIFEEX LIFEEX L1.LIFEEX F1.GINI GINI L1.GINI
ABW-1960   ABW 1960       NA    NA       NA    66.074 65.662        NA      NA   NA      NA
ABW-1961   ABW 1961       NA    NA       NA    66.444 66.074    65.662      NA   NA      NA
ABW-1962   ABW 1962       NA    NA       NA    66.787 66.444    66.074      NA   NA      NA
ABW-1963   ABW 1963       NA    NA       NA    67.113 66.787    66.444      NA   NA      NA
ABW-1964   ABW 1964       NA    NA       NA    67.435 67.113    66.787      NA   NA      NA
ABW-1965   ABW 1965       NA    NA       NA    67.762 67.435    67.113      NA   NA      NA
         F1.ODA ODA L1.ODA
ABW-1960     NA  NA     NA
ABW-1961     NA  NA     NA
ABW-1962     NA  NA     NA
ABW-1963     NA  NA     NA
ABW-1964     NA  NA     NA
ABW-1965     NA  NA     NA


library(microbenchmark)
library(data.table)
microbenchmark(plm_class = flag(pwlddev), 
               ad_hoc = flag(wlddev, g = wlddev$iso3c, t = wlddev$year), 
               data.table = qDT(wlddev)[, shift(.SD), by = iso3c]) 

Unit: microseconds
       expr      min        lq      mean   median         uq      max neval cld
  plm_class  462.313  512.5145  1044.839  551.562   637.6875 15913.17   100  a 
     ad_hoc  443.124  519.6550  1127.363  559.817   701.0545 34174.05   100  a 
 data.table 7477.316 8070.3785 10126.471 8682.184 10397.1115 33575.18   100   b
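
For a plain data frame like the one in the question, the g/t interface used in the ad_hoc benchmark line should also give a one-period lead. This is a sketch, assuming collapse is installed; the data frame is a made-up copy of the question's data with an integer year column that I added for illustration:

```r
library(collapse)

# Made-up copy of the question's data; the integer "year" column is an
# assumption for this sketch and replaces the original Date column.
df <- data.frame(
    individual = c(1, 1, 1, 2, 2, 3),
    year       = c(2010L, 2011L, 2012L, 2011L, 2012L, 2012L),
    data       = c(50, 60, 70, 120, 130, 210)
)

# n = -1 requests a lead of one period; g is the panel id, t the time variable.
df$data_next <- flag(df$data, -1, g = df$individual, t = df$year)
df
```

No pdata.frame is needed here, because the grouping and time variables are passed directly.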
Mudskipper answered 1/9, 2020 at 21:4 Comment(0)

I had this same problem and couldn't find a good solution in plm or any other package. ddply was tempting (e.g. s5 = ddply(df, .(country,year), transform, lag=lag(df[, "value-to-lag"], lag=3))), but I couldn't get the NAs in my lagged column to line up properly for lags other than one.

I wrote a brute force solution that iterates over the dataframe row-by-row and populates the lagged column with the appropriate value. It's horrendously slow (437.33s for my 13000x130 dataframe vs. 0.012s for turning it into a pdata.frame and using lag) but it got the job done for me. I thought I would share it here because I couldn't find much information elsewhere on the internet.

In the function below:

  • df is your dataframe. The function returns df with a new column containing the forward values.
  • group is the column name of the grouping variable for your panel data. For example, I had longitudinal data on multiple countries, and I used "Country.Name" here.
  • x is the column you want to generate lagged values from, e.g. "GDP"
  • forwardx is the (new) column that will contain the forward lags, e.g. "GDP.next.year".
  • lag is the number of periods into the future. For example, if your data were taken in annual intervals, using lag=5 would set forwardx to the value of x five years later.

add_forward_lag <- function(df, group, x, forwardx, lag) {
    # assumes the rows of df are sorted by time within each group
    n <- nrow(df)
    # start with NA everywhere: rows at the end of a group (and the last
    # `lag` rows of df) have no forward observation
    df[, forwardx] <- NA
    # seq_len() avoids the backwards 1:0 sequence when lag >= nrow(df)
    for (i in seq_len(max(n - lag, 0))) {
        if (as.character(df[i, group]) == as.character(df[i + lag, group])) {
            # same group: put forward observation in forwardx
            df[i, forwardx] <- df[i + lag, x]
        }
    }
    return(df)
}

See the sample output below, using the built-in DNase dataset. This doesn't make sense in the context of the dataset, but it lets you see what the columns do.

data(DNase)  # DNase is a dataset in the datasets package, not a package itself
add_forward_lag(DNase, "Run", "density", "lagged_density",3)

Grouped Data: density ~ conc | Run
     Run    conc    density lagged_density
1     1  0.04882812   0.017  0.124
2     1  0.04882812   0.018  0.206
3     1  0.19531250   0.121  0.215
4     1  0.19531250   0.124  0.377
5     1  0.39062500   0.206  0.374
6     1  0.39062500   0.215  0.614
7     1  0.78125000   0.377  0.609
8     1  0.78125000   0.374  1.019
9     1  1.56250000   0.614  1.001
10    1  1.56250000   0.609  1.334
11    1  3.12500000   1.019  1.364
12    1  3.12500000   1.001  1.730
13    1  6.25000000   1.334  1.710
14    1  6.25000000   1.364     NA
15    1 12.50000000   1.730     NA
16    1 12.50000000   1.710     NA
17    2  0.04882812   0.045  0.123
18    2  0.04882812   0.050  0.225
19    2  0.19531250   0.137  0.207

Given how long this takes, you may prefer a different approach: instead of leading one variable, backward-lag all of your other variables by the same number of periods, which plm's lag already handles.
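
If the speed of the loop is the main problem, the same per-group shift can be sketched without iterating over rows, using base R's ave(). Like the function above, this assumes the rows are sorted by time within each group and that there are no gaps in the time series; lead_by_group is a hypothetical name, not a function from any package.

```r
# Hypothetical helper, not part of plm or any other package.
# Assumes rows are sorted by time within each group and lag >= 0.
lead_by_group <- function(df, group, x, forwardx, lag) {
    # Within each group, take the value `lag` positions further down;
    # indexing past the end of the group yields NA.
    df[[forwardx]] <- ave(df[[x]], df[[group]],
                          FUN = function(v) v[seq_along(v) + lag])
    df
}

# Made-up example data shaped like the question's.
df <- data.frame(
    individual = c(1, 1, 1, 2, 2, 3),
    data       = c(50, 60, 70, 120, 130, 210)
)
out <- lead_by_group(df, "individual", "data", "data_next", 1)
out
```

Because the shift happens inside each group, no value is ever pulled across a group boundary, which is what the group comparison in the loop was checking for.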

Hoberthobey answered 15/10, 2013 at 19:16 Comment(1)
Thanks Katrina! Interesting approach. I have actually since ceased using plm for lagging and leading. I now use the data.table approach in #11398271, and it works well and is very fast. – Lacunar
