Use a value from the previous row in an R data.table calculation

Asked 4/2, 2013 at 14:59 Answered 29/1, 2020 at 19:52

I want to create a new column in a data.table calculated from the current value of one column and the previous of another. Is it possible to access previous rows?

E.g.:

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

Ist answered 4/2, 2013 at 14:59 Comment(1)

I usually set a key to my data.tables: DT <- data.table(A=..., key = "A") – Packhorse 21/6, 2016 at 8:8

116

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

See history for previous answers.

Tawanda answered 4/2, 2013 at 15:2 Comment(8)

Does that .N hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files... – Cinder 4/2, 2013 at 15:24

@SlowLearner: You might also find .I useful, which holds the row indices for the rows in the curren group. – Manard 4/2, 2013 at 16:2

Use seq_len(.N - 1) instead of 1:(.N-1). This avoids problems associated with 1:0. – Epochmaking 4/2, 2013 at 19:8

+1 for the .SD example--I was trying to use a lapply and getting funky results. this is much simpler. – Homogeneity 26/4, 2015 at 22:41

Where can I find an updated pdf with all this new information ? The official 1.9.4 vignettes and webminars don't include it. And the Rmd 1.9.5 vignettes are not comfortable and don't include it either. – Turnkey 30/4, 2015 at 16:45

@Tawanda , your first solution says "Error in C + shift(B, 1L, type = "lag") : non-numeric argument to binary operator" in my data.table 1.9.5 – Turnkey 30/4, 2015 at 16:51

@skan, update to the latest commit of data.table... Made changes yesterday. If using devel, better to watch the project on GitHub – Tawanda 30/4, 2015 at 17:10

Note - for me at least - using shift was much slower that using 'slicing' with -N - see timings here (for a data.table of 1,5 mio. rows) #58373127 – Vassalize 21/10, 2019 at 10:30

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540

Scalene answered 27/4, 2015 at 1:52 Comment(0)

Several folks have answered the specific question. See the code below for a general purpose function that I use in situations like this that may be helpful. Rather than just getting the prior row, you can go as many rows in the "past" or "future" as you'd like.

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithul eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA

Emergence answered 1/8, 2014 at 16:24 Comment(3)

This is a brilliant answer, I'm annoyed that I've already upvoted the other answers because this is a far more general answer. In fact, I'm going to use it in my geneorama package (if you don't mind). – Doud 3/11, 2014 at 19:52

Sure, go for it. I was hoping to get some free time and submit it as a pull request to the data.table package, but alas... – Emergence 3/11, 2014 at 20:33

A similar function called shift has been added to data.table as of version 1.9.5. See the updated answer from @Arun. – Emergence 19/2, 2015 at 18:53

Based on @Steve Lianoglou 's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

And avoid using seq_len or head or any other function.

Fendig answered 4/5, 2014 at 4:25 Comment(2)

Nice - however this would not work if you wanted to find the previous within a group. – Hydromedusa 2/9, 2014 at 18:39

@Hydromedusa you are right. If subsetting by group I would replace .I with seq_len(.N) – Fendig 15/2, 2015 at 20:0

Following Arun's solution, a similar results can be obtained without referring to to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540

Scrummage answered 4/2, 2013 at 15:53 Comment(2)

Is there a reason to prefer one method to another? Or is it simply an aesthetic difference? – Ist 4/2, 2013 at 16:10

I think that in this scenario (i.e. where .N is readily available) it is mostly aesthetic choice. I am not aware of any important difference. – Scrummage 4/2, 2013 at 16:24

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))`
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column
df$D <- c(NA, head(df$B, invshift)+tail(df$C, invshift))`

Here invshift, the number of rows minus 1, is 4. nrow(df) provides you with the number of rows in a data frame or in a vector. Similarly, if you want to take still earlier values, subtract from nrow 2, 3, ...etc, and also put NA's accordingly at the beginning.

Goatsbeard answered 5/7, 2018 at 10:51 Comment(0)

-2

it can be done in a loop.

# Create the column D
DT$D <- 0
# for every row in DT
for (i in 1:length(DT$A)) {
  if(i==1) {
    #using NA at first line
    DT[i,4] <- NA
  } else {
    #D = C + BPreviousRow
    DT[i,4] <- DT[i,3] + DT[(i-1), 2]   
  }
}

Using a for, you can even use the previous value of the row of this new column DT[(i-1), 4]

Diplex answered 29/1, 2020 at 19:52 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags