Use a value from the previous row in an R data.table calculation
Asked Answered
I

7

99

I want to create a new column in a data.table calculated from the current value of one column and the previous of another. Is it possible to access previous rows?

E.g.:

> DT <- data.table(A=1:5, B=1:5*10, C=1:5*100)
> DT
   A  B   C
1: 1 10 100
2: 2 20 200
3: 3 30 300
4: 4 40 400
5: 5 50 500
> DT[, D := C + BPreviousRow] # What is the correct code here?

The correct answer should be

> DT
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540
Ist answered 4/2, 2013 at 14:59 Comment(1)
I usually set a key to my data.tables: DT <- data.table(A=..., key = "A")Packhorse
T
116

With shift() implemented in v1.9.6, this is quite straightforward.

DT[ , D := C + shift(B, 1L, type="lag")]
# or equivalently, in this case,
DT[ , D := C + shift(B)]

From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead". It enables very convenient usage along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

See history for previous answers.

Tawanda answered 4/2, 2013 at 15:2 Comment(8)
Does that .N hold the current row number or something? Sorry to ask here, but I can't seem to find it in the help files...Cinder
@SlowLearner: You might also find .I useful, which holds the row indices for the rows in the curren group.Manard
Use seq_len(.N - 1) instead of 1:(.N-1). This avoids problems associated with 1:0.Epochmaking
+1 for the .SD example--I was trying to use a lapply and getting funky results. this is much simpler.Homogeneity
Where can I find an updated pdf with all this new information ? The official 1.9.4 vignettes and webminars don't include it. And the Rmd 1.9.5 vignettes are not comfortable and don't include it either.Turnkey
@Tawanda , your first solution says "Error in C + shift(B, 1L, type = "lag") : non-numeric argument to binary operator" in my data.table 1.9.5Turnkey
@skan, update to the latest commit of data.table... Made changes yesterday. If using devel, better to watch the project on GitHubTawanda
Note - for me at least - using shift was much slower that using 'slicing' with -N - see timings here (for a data.table of 1,5 mio. rows) #58373127Vassalize
S
57

Using dplyr you could do:

mutate(DT, D = lag(B) + C)

Which gives:

#   A  B   C   D
#1: 1 10 100  NA
#2: 2 20 200 210
#3: 3 30 300 320
#4: 4 40 400 430
#5: 5 50 500 540
Scalene answered 27/4, 2015 at 1:52 Comment(0)
E
24

Several folks have answered the specific question. See the code below for a general purpose function that I use in situations like this that may be helpful. Rather than just getting the prior row, you can go as many rows in the "past" or "future" as you'd like.

rowShift <- function(x, shiftLen = 1L) {
  r <- (1L + shiftLen):(length(x) + shiftLen)
  r[r<1] <- NA
  return(x[r])
}

# Create column D by adding column C and the value from the previous row of column B:
DT[, D := C + rowShift(B,-1)]

# Get the Old Faithul eruption length from two events ago, and three events in the future:
as.data.table(faithful)[1:5,list(eruptLengthCurrent=eruptions,
                                 eruptLengthTwoPrior=rowShift(eruptions,-2), 
                                 eruptLengthThreeFuture=rowShift(eruptions,3))]
##   eruptLengthCurrent eruptLengthTwoPrior eruptLengthThreeFuture
##1:              3.600                  NA                  2.283
##2:              1.800                  NA                  4.533
##3:              3.333               3.600                     NA
##4:              2.283               1.800                     NA
##5:              4.533               3.333                     NA
Emergence answered 1/8, 2014 at 16:24 Comment(3)
This is a brilliant answer, I'm annoyed that I've already upvoted the other answers because this is a far more general answer. In fact, I'm going to use it in my geneorama package (if you don't mind).Doud
Sure, go for it. I was hoping to get some free time and submit it as a pull request to the data.table package, but alas...Emergence
A similar function called shift has been added to data.table as of version 1.9.5. See the updated answer from @Arun.Emergence
F
13

Based on @Steve Lianoglou 's comment above, why not just:

DT[, D:= C + c(NA, B[.I - 1]) ]
#    A  B   C   D
# 1: 1 10 100  NA
# 2: 2 20 200 210
# 3: 3 30 300 320
# 4: 4 40 400 430
# 5: 5 50 500 540

And avoid using seq_len or head or any other function.

Fendig answered 4/5, 2014 at 4:25 Comment(2)
Nice - however this would not work if you wanted to find the previous within a group.Hydromedusa
@Hydromedusa you are right. If subsetting by group I would replace .I with seq_len(.N)Fendig
S
9

Following Arun's solution, a similar results can be obtained without referring to to .N

> DT[, D := C + c(NA, head(B, -1))][]
   A  B   C   D
1: 1 10 100  NA
2: 2 20 200 210
3: 3 30 300 320
4: 4 40 400 430
5: 5 50 500 540
Scrummage answered 4/2, 2013 at 15:53 Comment(2)
Is there a reason to prefer one method to another? Or is it simply an aesthetic difference?Ist
I think that in this scenario (i.e. where .N is readily available) it is mostly aesthetic choice. I am not aware of any important difference.Scrummage
G
2

Here is my intuitive solution:

#create data frame
df <- data.frame(A=1:5, B=seq(10,50,10), C=seq(100,500, 100))`
#subtract the shift from num rows
shift  <- 1 #in this case the shift is 1
invshift <- nrow(df) - shift
#Now create the new column
df$D <- c(NA, head(df$B, invshift)+tail(df$C, invshift))`

Here invshift, the number of rows minus 1, is 4. nrow(df) provides you with the number of rows in a data frame or in a vector. Similarly, if you want to take still earlier values, subtract from nrow 2, 3, ...etc, and also put NA's accordingly at the beginning.

Goatsbeard answered 5/7, 2018 at 10:51 Comment(0)
D
-2

it can be done in a loop.

# Create the column D
DT$D <- 0
# for every row in DT
for (i in 1:length(DT$A)) {
  if(i==1) {
    #using NA at first line
    DT[i,4] <- NA
  } else {
    #D = C + BPreviousRow
    DT[i,4] <- DT[i,3] + DT[(i-1), 2]   
  }
}

Using a for, you can even use the previous value of the row of this new column DT[(i-1), 4]

Diplex answered 29/1, 2020 at 19:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.