How to create a lag variable within each group?

Asked 10/10, 2014 at 4:33 Answered 27/6, 2018 at 15:54

I have a data.table:

require(data.table)

set.seed(1)
data <- data.table(time = c(1:3, 1:4),
                   groups = c(rep(c("b", "a"), c(3, 4))),
                   value = rnorm(7))

data
#    groups time      value
# 1:      b    1 -0.6264538
# 2:      b    2  0.1836433
# 3:      b    3 -0.8356286
# 4:      a    1  1.5952808
# 5:      a    2  0.3295078
# 6:      a    3 -0.8204684
# 7:      a    4  0.4874291

I want to compute a lagged version of the "value" column, within each level of "groups".

The result should look like

#   groups time      value  lag.value
# 1      a    1  1.5952808         NA
# 2      a    2  0.3295078  1.5952808
# 3      a    3 -0.8204684  0.3295078
# 4      a    4  0.4874291 -0.8204684
# 5      b    1 -0.6264538         NA
# 6      b    2  0.1836433 -0.6264538
# 7      b    3 -0.8356286  0.1836433

I have tried to use lag directly:

data$lag.value <- lag(data$value)

...which clearly wouldn't work.

I have also tried:

unlist(tapply(data$value, data$groups, lag))
 a1         a2         a3         a4         b1         b2         b3 
 NA -0.1162932  0.4420753  2.1505440         NA  0.5894583 -0.2890288

Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table?

Insurable answered 10/10, 2014 at 4:33 Comment(5)

sorry, combine with group_by – Cogitation 10/10, 2014 at 4:39

unlist(by(data, data$groups, function(x) c(NA, head(x$value, -1)))) would be a base way – Woken 10/10, 2014 at 4:50

@Insurable If you have just one column to do lag and the dataset is not that big, there won't be much difference in efficiency between base R, plyr, data.table methods. – Cupp 10/10, 2014 at 4:50

@Cupp Understand. However I actually simplied it. I actually need it for many columns and general solutions are preferred for the benefit of other useRs – Insurable 10/10, 2014 at 4:53

@Insurable I updated for multiple columns. Regarding why lag is slow, it must depend on the code in lag. You can check getAnywhere('lag.default')[1] – Cupp 10/10, 2014 at 5:4

122

You could do this within data.table

 library(data.table)
 data[, lag.value:=c(NA, value[-.N]), by=groups]
  data
 #   time groups       value   lag.value
 #1:    1      a  0.02779005          NA
 #2:    2      a  0.88029938  0.02779005
 #3:    3      a -1.69514201  0.88029938
 #4:    1      b -1.27560288          NA
 #5:    2      b -0.65976434 -1.27560288
 #6:    3      b -1.37804943 -0.65976434
 #7:    4      b  0.12041778 -1.37804943

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
 data
#    time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

Update

From data.table versions >= v1.9.5, we can use shift with type as lag or lead. By default, the type is lag.

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
#   time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

If you need the reverse, use type=lead

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
  #  time groups      value     value1      value2 lead.value lead.value1
  #1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814
  #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884
  #3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA
  #4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432
  #5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406
  #6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999
  #7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA
 #   lead.value2
 #1: -0.04493361
 #2: -0.01619026
 #3:          NA
 #4:  0.82122120
 #5:  0.59390132
 #6:  0.91897737
 #7:          NA

data

 set.seed(1)
 data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
             value = rnorm(7), value1=rnorm(7), value2=rnorm(7))

Cupp answered 10/10, 2014 at 4:40 Comment(12)

Am wondering why data[, lag.value:=lag(value)), by=groups] which gives the same result is slower than your solution? – Insurable 10/10, 2014 at 4:51

How would I do this, but in reverse? In other words, instead of lagging by one (taking the previous row) it would be ahead by one (taking the following row value)? Thank you for the great entry! – Tact 9/5, 2015 at 21:19

Is it also possible to lag by more than one value? (i.e. getting data[, lag.value.1:=c(NA, lag.value[-.N]), by=groups] without calculating lag.value?) – Pleonasm 30/7, 2015 at 8:47

@Pleonasm I didn't understand what you wanted. In the post it shows shift(.SD) which is calculating the lag for more than one column by specifying the columns in the .SDcols. DId you meant to get two lags for a single column. In that case data[, shift(value, 1:2), by=groups] – Cupp 30/7, 2015 at 9:14

@Cupp from your reply is assume the solution I am looking for is going to be data[, shift(value, 2), by=groups] (but I can't the devel version to work yet). But to clarify, my question was how to lag a single column by 2 values rather than by 1 value. The first 3 rows of the output I need are:

data.table(time = c(1,2,3), groups = c(a,a,a), value = c(0.02779005, 0.88029938, -1.69514201), lag.value = c(NA, NA, 0.02779005))

– Pleonasm 30/7, 2015 at 10:6

@Pleonasm Yes, the data[, shift(value, 2), by=groups] get the result, but if you are not able to do that, then we hve to lag by 2. i.e. dt[, lag(value, 2), by=groups] – Cupp 30/7, 2015 at 10:11

@Cupp thanks, I'll see if I can make it work. I posted the question here in the mean time (#31721523)! – Pleonasm 30/7, 2015 at 10:16

When i try this I get Error:Unused arguments (by=groups, .SDcols=nm1) What am i doing wrong? – Russellrusset 19/7, 2022 at 21:6

@Russellrusset not clear. I tested it once again and it is working fine for me. Please check if you have a data.table object or not – Cupp 20/7, 2022 at 15:7

Sorry about that. I did convert it to data.table object, but I am now getting the error: cannot coerce type 'closure' to vector of type 'list' This is the code Im using:

nm1 <- grep("Time$", colnames(df), value=TRUE) nm2 <- paste("lag", nm1, sep="_") df[, (nm2) := shift(.SD), by=groups, .SDcols=nm1]

The columns im lagging are of POSIXct datatype, could that be the issue? – Russellrusset 21/7, 2022 at 19:53

@Russellrusset is the error based on the same example data in the post or a different one (as the code is working fine with example). If it is POSIXlt, then it is a list column – Cupp 21/7, 2022 at 20:2

The data is different as I’m doing the operation on timestamps column of data type POSIXct. I guess the approach should be different in that case. I’ll try posting it as a separate question. Thank you for being kind enough to follow up :) – Russellrusset 23/7, 2022 at 3:7

104

Using package dplyr:

library(dplyr)
data <- 
    data %>%
    group_by(groups) %>%
    mutate(lag.value = dplyr::lag(value, n = 1, default = NA))

gives

> data
Source: local data table [7 x 4]
Groups: groups

  time groups       value   lag.value
1    1      a  0.07614866          NA
2    2      a -0.02784712  0.07614866
3    3      a  1.88612245 -0.02784712
4    1      b  0.26526825          NA
5    2      b  1.23820506  0.26526825
6    3      b  0.09276648  1.23820506
7    4      b -0.09253594  0.09276648

As noted by @BrianD, this implicitly assumes that value is sorted by group already. If not, either sort it by group, or use the order_by argument in lag. Also note that due to an existing issue with some versions of dplyr, for safety, arguments and the namespace should be explicitly given.

Cogitation answered 10/10, 2014 at 4:38 Comment(7)

How do you use this while looping over all the variables you need to create a lag for ? – Carabineer 23/3, 2017 at 22:26

do you mean you have multiple columns you wish to do the lag operation over? Check out mutate_each, mutate_all, mutate_at etc commands – Cogitation 23/3, 2017 at 22:34

does this solution assume that the source dataset is pre-sorted appropriately? – Mccarthyism 7/7, 2017 at 17:24

@BrianD yes it does, but this is implicit in the OP's comment that they want value lagged by group. – Cogitation 10/7, 2017 at 4:30

@Cogitation I was just thinking that if the time variable wasn't sorted ahead of time (as might be the case in other users datasets who are reading this), there is no explicit sorting in this code. Might be safer to specify the sort order explicitly like: lag(value, 1, order_by=time) – Mccarthyism 10/7, 2017 at 20:30

@BrianD I do not think there is any confusion as lag in my mind means take previous values and shift them by n positions, but it is useful to note that you can pass an ordering argument to lag, thanks. – Cogitation 10/7, 2017 at 23:37

@Cogitation Is it possible to do some arithmetic operation within lag function? For example, can I subtract every row from the value in the first row - groupwise? – Winebibber 1/4, 2022 at 2:8

I wanted to complement the previous answers by mentioning two ways in which I approach this problem in the important case when you are not guaranteed that each group has data for every time period. That is, you still have a regularly spaced time series, but there might be missings here and there. I will focus on two ways to improve the dplyr solution.

We start with the same data that you used...

library(dplyr)
library(tidyr)

set.seed(1)
data_df = data.frame(time   = c(1:3, 1:4),
                     groups = c(rep(c("b", "a"), c(3, 4))),
                     value  = rnorm(7))
data_df
#>   time groups      value
#> 1    1      b -0.6264538
#> 2    2      b  0.1836433
#> 3    3      b -0.8356286
#> 4    1      a  1.5952808
#> 5    2      a  0.3295078
#> 6    3      a -0.8204684
#> 7    4      a  0.4874291

... but now we delete a couple of rows

data_df = data_df[-c(2, 6), ]
data_df
#>   time groups      value
#> 1    1      b -0.6264538
#> 3    3      b -0.8356286
#> 4    1      a  1.5952808
#> 5    2      a  0.3295078
#> 7    4      a  0.4874291

Simple `dplyr` solution no longer works

data_df %>% 
  arrange(groups, time) %>% 
  group_by(groups) %>% 
  mutate(lag.value = lag(value)) %>% 
  ungroup()
#> # A tibble: 5 x 4
#>    time groups  value lag.value
#>   <int> <fct>   <dbl>     <dbl>
#> 1     1 a       1.60     NA    
#> 2     2 a       0.330     1.60 
#> 3     4 a       0.487     0.330
#> 4     1 b      -0.626    NA    
#> 5     3 b      -0.836    -0.626

You see that, although we don't have the value for the case (group = 'a', time = '3'), the above still shows a value for the lag in the case of (group = 'a', time = '4'), which is actually the value at time = 2.

Correct `dplyr` solution

The idea is that we add the missing (group, time) combinations. This is VERY memory-inefficient when you have lots of possible (groups, time) combinations, but the values are sparsely captured.

dplyr_correct_df = expand.grid(
  groups = sort(unique(data_df$groups)),
  time   = seq(from = min(data_df$time), to = max(data_df$time))
) %>% 
  left_join(data_df, by = c("groups", "time")) %>% 
  arrange(groups, time) %>% 
  group_by(groups) %>% 
  mutate(lag.value = lag(value)) %>% 
  ungroup()
dplyr_correct_df
#> # A tibble: 8 x 4
#>   groups  time   value lag.value
#>   <fct>  <int>   <dbl>     <dbl>
#> 1 a          1   1.60     NA    
#> 2 a          2   0.330     1.60 
#> 3 a          3  NA         0.330
#> 4 a          4   0.487    NA    
#> 5 b          1  -0.626    NA    
#> 6 b          2  NA        -0.626
#> 7 b          3  -0.836    NA    
#> 8 b          4  NA        -0.836

Notice that we now have a NA at (group = 'a', time = '4'), which should be the expected behaviour. Same with (group = 'b', time = '3').

Tedious but also correct solution using the class `zoo::zooreg`

This solution should work better in terms of memory when the amount of cases is very large, because instead of filling the missing cases with NA's, it uses indices.

library(zoo)

zooreg_correct_df = data_df %>% 
  as_tibble() %>% 
  # nest the data for each group
  # should work for multiple groups variables
  nest(-groups, .key = "zoo_ob") %>%
  mutate(zoo_ob = lapply(zoo_ob, function(d) {

    # create zooreg objects from the individual data.frames created by nest
    z = zoo::zooreg(
      data      = select(d,-time),
      order.by  = d$time,
      frequency = 1
    ) %>% 
      # calculate lags
      # we also ask for the 0'th order lag so that we keep the original value
      zoo:::lag.zooreg(k = (-1):0) # note the sign convention is different

    # recover df's from zooreg objects
    cbind(
      time = as.integer(zoo::index(z)),
      zoo:::as.data.frame.zoo(z)
    )

  })) %>% 
  unnest() %>% 
  # format values
  select(groups, time, value = value.lag0, lag.value = `value.lag-1`) %>% 
  arrange(groups, time) %>% 
  # eliminate additional periods created by lag
  filter(time <= max(data_df$time))
zooreg_correct_df
#> # A tibble: 8 x 4
#>   groups  time   value lag.value
#>   <fct>  <int>   <dbl>     <dbl>
#> 1 a          1   1.60     NA    
#> 2 a          2   0.330     1.60 
#> 3 a          3  NA         0.330
#> 4 a          4   0.487    NA    
#> 5 b          1  -0.626    NA    
#> 6 b          2  NA        -0.626
#> 7 b          3  -0.836    NA    
#> 8 b          4  NA        -0.836

Finally, lets check that both correct solutions are actually equal:

all.equal(dplyr_correct_df, zooreg_correct_df)
#> [1] TRUE

Thousandth answered 27/6, 2018 at 15:54 Comment(3)

Has anything happened to dplyr? Using either of the two solutions does not lag anything in my case. It just replicates the original values in a different column – Bathroom 30/9, 2022 at 10:1

The dplyr version still works for me as of this moment, except for one minor change, I need to indicate the seq "by" parameter, which can be explored with ?seq.Date. I note this operation is EXTREMELY common. – Consols 11/2, 2023 at 1:52

The example as it is right now still works for me. The fix @Consols used is necessary if you have Date data instead of ints (as in this example). There is no default increment for building sequences of Dates. – Thousandth 12/2, 2023 at 18:13

In base R, this will do the job:

data$lag.value <- c(NA, data$value[-nrow(data)])
data$lag.value[which(!duplicated(data$groups))] <- NA

The first line adds a string of lagged (+1) observations. The second string corrects the first entry of each group, as the lagged observation is from previous group.

Note that data is of format data.frame to not use data.table.

Stereoisomer answered 14/4, 2017 at 13:32 Comment(0)

If you wanted to make sure that you avoided any issue with ordering the data, you can do this, using dplyr, manually with something like:

df <- data.frame(Names = c(rep('Dan',50),rep('Dave',100)),
            Dates = c(seq(1,100,by=2),seq(1,100,by=1)),
            Values = rnorm(150,0,1))

df <- df %>% group_by(Names) %>% mutate(Rank=rank(Dates),
                                    RankDown=Rank-1)

df <- df %>% left_join(select(df,Rank,ValueDown=Values,Names),by=c('RankDown'='Rank','Names')
) %>% select(-Rank,-RankDown)

head(df)

Or alternatively I like the idea of putting it in a function with a chosen grouping variable(s), ranking column (like Date or otherwise), and chosen number of lags. This also requires lazyeval as well as dplyr.

groupLag <- function(mydf,grouping,ranking,lag){
  df <- mydf
  groupL <- lapply(grouping,as.symbol)

  names <- c('Rank','RankDown')
  foos <- list(interp(~rank(var),var=as.name(ranking)),~Rank-lag)

  df <- df %>% group_by_(.dots=groupL) %>% mutate_(.dots=setNames(foos,names))

  selectedNames <- c('Rank','Values',grouping)
  df2 <- df %>% select_(.dots=selectedNames)
  colnames(df2) <- c('Rank','ValueDown',grouping)

  df <- df %>% left_join(df2,by=c('RankDown'='Rank',grouping)) %>% select(-Rank,-RankDown)

  return(df)
}

groupLag(df,c('Names'),c('Dates'),1)

Hazlitt answered 24/1, 2018 at 10:25 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Update

data

Simple dplyr solution no longer works

Correct dplyr solution

Tedious but also correct solution using the class zoo::zooreg

Recommended topics

Hot tags

Simple `dplyr` solution no longer works

Correct `dplyr` solution

Tedious but also correct solution using the class `zoo::zooreg`