data.table efficient recycling V2
This is a follow-up to this question: data.table efficient recycling.

The difference here is that the number of future years is not necessarily the same for each line.

I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year.

This can lead to something like this:

library(data.table)
dt <- data.table(1:500000, 500000:1, rpois(500000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
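
On a tiny deterministic table (toy values instead of the rpois draw above), this recycling produces one row per (original row, year) pair; a small sketch of the shape of the result:

```r
library(data.table)

# toy table: V3 holds the number of future years for each line
dt <- data.table(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))

# recycle each row V3 times, adding a year counter per original row
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt)]

nrow(dt2)  # 6, i.e. sum(dt$V3)
dt2$year   # 1 2 1 1 2 3
```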

But I often have to deal with millions of lines, and far more columns than in this toy example, and the time increases quickly. Try this:

library(data.table)
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]

My question is: is there a more efficient way to achieve this?

Thanks for any help !

Hord answered 5/12, 2019 at 14:26 Comment(6)
How much RAM do you have to work with? Already with 500000 I get a dt2 of 2 GB - Building
I don't know exactly, but I have a pretty good computer - Hord
I'm still stuck with this issue - Hord
@Hord it would be useful if you provided feedback on the existing answer: does it solve the problem but is not efficient enough? - Oregon
@Oregon yes, you're right! I'll do it in a while (as a comment on the given answer) - Hord
@Oregon but with your answer there's no more need for me to comment on the first one - Hord
Here is a slightly improved version of the other answer:

  • passing non-default arguments to unlist (recursive=FALSE, use.names=FALSE)
  • rep.int rather than rep
  • seq_len rather than :
  • setDT instead of data.table()
  • even better with the sequence function suggested by @Cole
  • and a further minor improvement with the internal vecseq

Together these changes seem to make a difference.
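For instance, sequence() is essentially the unlist(lapply(..., seq_len)) idiom compiled in C; a quick base-R check of the equivalence (toy vector, no data.table needed):

```r
n <- c(3L, 1L, 2L)

# the lapply idiom used in f2 below
a <- unlist(lapply(n, seq_len), recursive = FALSE, use.names = FALSE)

# the compiled shortcut used in f3 and f4
b <- sequence(n)

identical(a, b)  # TRUE; both are 1 2 3 1 1 2
```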

Timings...

library(data.table)
f0 = function(dt) {
  dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
}
f1 = function(dt) {
  dt2 <- data.table(
    rep(dt$V1, dt$V3),
    rep(dt$V2, dt$V3),
    rep(dt$V3, dt$V3),
    unlist(lapply(dt$V3, function(x){1:x}))
  )
  dt2
}
f2 = function(dt) {
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = unlist(lapply(dt$V3, seq_len), recursive=FALSE, use.names=FALSE)
  )
  setDT(dt2)
  dt2
}
f3 = function(dt) {
  ## even better with sequence function suggested by @Cole
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = sequence(dt$V3)
  )
  setDT(dt2)
  dt2
}
f4 = function(dt) {
  dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]
}
f5 = function(dt) {
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = data.table:::vecseq(rep.int(1L,length(dt$V3)), dt$V3, NULL)
  )
  setDT(dt2)
  dt2
}

On a "big" data

dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
system.time(f0(dt))
#   user  system elapsed 
# 22.100  18.914  40.449 
system.time(f1(dt))
#   user  system elapsed 
# 35.866  15.607  51.475 
system.time(f2(dt))
#   user  system elapsed 
# 22.922   6.839  29.760 
system.time(f3(dt))
#   user  system elapsed 
#  6.509   6.723  13.233 
system.time(f4(dt))
#   user  system elapsed 
# 12.140  14.114  26.254 
system.time(f5(dt))
#   user  system elapsed 
#  6.448   4.057  10.506 

Anyway, you should try to improve the processes you are running on the expanded dataset, because maybe you don't have to expand it in the first place.

For example, the frollmean function has an adaptive argument which makes it possible to calculate a rolling mean over a variable-length window; normally one would need to expand the data first to compute that. V3 in your data looks a lot like the window length for an adaptive moving average.
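A minimal sketch of that idea (toy numbers; with adaptive = TRUE, frollmean takes a vector of window lengths, one per observation, and the window is right-aligned):

```r
library(data.table)

x <- c(1, 3, 5, 7, 9)
w <- c(1L, 2L, 2L, 3L, 3L)  # per-row window lengths, playing the role of V3

# rolling mean over a window whose length varies per row,
# computed without expanding the data first
frollmean(x, w, adaptive = TRUE)
# [1] 1 2 4 5 7
```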

Oregon answered 26/11, 2020 at 15:0 Comment(6)
Thanks for this answer. About your comment on expanding: I need to expand, because I'm making predictions year by year - Hord
I've tested your way on my real data: a very, very good improvement (I had to use rep instead of rep.int, because I've got many kinds of columns). Even so, the more columns there are, the longer the code gets to write (and it makes the code horrible to read), but I think it's possible to make a more general function from it. I'll give others a chance to beat your way, and if no one does better, I will reward your answer. Many thanks! - Hord
Moving the lapply to C/C++ would definitely help, but it would require writing compiled code. - Oregon
Use sequence(V3). It is pretty much the unlist(lapply(...)) code compiled in C: dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]. I also agree with @Oregon: try to see if you can do what you need on the data instead of expanding it all in the first place. - Ithnan
Great find: stat.ethz.ch/R-manual/R-devel/library/base/html/sequence.html - Oregon
@Ithnan I realized we have an internal function for that :) Interesting that despite having to materialize rep(1L, n), and then loop over that vector when creating the sequence, it is still faster than base R's sequence. - Oregon
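The general function mentioned in the comments could be sketched like this (expand_years is a hypothetical helper name; rep() rather than rep.int() is used so that factor and Date columns keep their attributes, as noted above):

```r
library(data.table)

# hypothetical helper: repeat every column of dt by the counts in times_col,
# then add a per-row year counter
expand_years <- function(dt, times_col = "V3") {
  n <- dt[[times_col]]
  out <- lapply(dt, rep, times = n)  # rep() preserves factor/Date attributes
  out$year <- sequence(n)
  setDT(out)
  out[]
}

dt <- data.table(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))
dt2 <- expand_years(dt)
nrow(dt2)  # 6, with columns V1, V2, V3, year
```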
This is a faster implementation, but still slow due to the lapply loop:

dt2 <- data.table(
  rep(dt$V1, dt$V3),
  rep(dt$V2, dt$V3),
  rep(dt$V3, dt$V3),
  unlist(lapply(dt$V3, function(x){1:x}))
)

I hope this is of some help!

Christan answered 5/12, 2019 at 14:45 Comment(2)
Thanks for your answer, but the issue is that the V3 values are given (I use rpois only to have values in the table). So you have to assume that dt is given, with the three columns, and then find a way of duplicating each row the number of times given by V3 - Hord
@Hord - I edited my answer based on your comment. I know it's still long to process, but it is faster than your implementation. The real key would be to improve the line with the lapply; if you find a way to vectorize it without having to loop, it would be much more efficient. I'm a little busy today, but I'll look it over this weekend if no one gives a better answer by then. - Christan
Try this:

dt2 <- dt[dt[, rep(1:nrow(dt), V3)], ]
dt2[, year := dt[, sequence(V3)]]
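The same row-index trick works in base R too; a small deterministic sketch (no data.table needed) showing what the duplication does:

```r
df <- data.frame(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))

# duplicate each row V3 times via an index vector, then add the year counter
idx <- rep.int(seq_len(nrow(df)), df$V3)
df2 <- df[idx, ]
df2$year <- sequence(df$V3)

nrow(df2)  # 6
df2$year   # 1 2 1 1 2 3
```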
Khz answered 30/11, 2020 at 18:6 Comment(1)
Thanks for your answer. The tests I've made showed that it is not better than jangorecki's answer - Hord

© 2022 - 2024 — McMap. All rights reserved.