data.table efficient recycling V2
This is a follow-up to this question: data.table efficient recycling.

The difference here is that the number of future years is not necessarily the same for each line.

I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year.

This can lead to something like this:

library(data.table)
dt <- data.table(1:500000, 500000:1, rpois(500000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
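
On a tiny deterministic table (toy values instead of the rpois draw above), this recycling produces one row per (original row, year) pair; a small sketch of the shape of the result:

```r
library(data.table)

# toy table: V3 holds the number of future years for each line
dt <- data.table(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))

# recycle each row V3 times, adding a year counter per original row
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt)]

nrow(dt2)  # 6, i.e. sum(dt$V3)
dt2$year   # 1 2 1 1 2 3
```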

But I often have to deal with millions of lines, and far more columns than in this toy example, and the time increases quickly. Try this:

library(data.table)
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
dt2 <- dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]

My question is: is there a more efficient way to achieve this?

Thanks for any help !

Hord answered 5/12, 2019 at 14:26 Comment(6)
How much RAM do you have to work with? Already with 500000 I get a dt2 of 2 GB - Building
I don't know exactly, but I have a pretty good computer - Hord
I'm still stuck with this issue - Hord
@Hord it would be useful if you provided feedback on the existing answer: does it solve the problem but is not efficient enough? - Oregon
@Oregon yes, you're right! I'll do it in a while (as a comment on the given answer) - Hord
@Oregon but with your answer there's no more need for me to comment on the first one - Hord
Here is a slightly improved version of the other answer:

  • passing non-default arguments to unlist (recursive=FALSE, use.names=FALSE)
  • rep.int rather than rep
  • seq_len rather than :
  • setDT instead of data.table()
  • even better with the sequence function suggested by @Cole
  • and a further minor improvement with the internal vecseq

Together these changes seem to make a difference.
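For instance, sequence() is essentially the unlist(lapply(..., seq_len)) idiom compiled in C; a quick base-R check of the equivalence (toy vector, no data.table needed):

```r
n <- c(3L, 1L, 2L)

# the lapply idiom used in f2 below
a <- unlist(lapply(n, seq_len), recursive = FALSE, use.names = FALSE)

# the compiled shortcut used in f3 and f4
b <- sequence(n)

identical(a, b)  # TRUE; both are 1 2 3 1 1 2
```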

Timings...

library(data.table)
f0 = function(dt) {
  dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
}
f1 = function(dt) {
  dt2 <- data.table(
    rep(dt$V1, dt$V3),
    rep(dt$V2, dt$V3),
    rep(dt$V3, dt$V3),
    unlist(lapply(dt$V3, function(x){1:x}))
  )
  dt2
}
f2 = function(dt) {
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = unlist(lapply(dt$V3, seq_len), recursive=FALSE, use.names=FALSE)
  )
  setDT(dt2)
  dt2
}
f3 = function(dt) {
  ## even better with sequence function suggested by @Cole
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = sequence(dt$V3)
  )
  setDT(dt2)
  dt2
}
f4 = function(dt) {
  dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]
}
f5 = function(dt) {
  dt2 = list(
    V1 = rep.int(dt$V1, dt$V3),
    V2 = rep.int(dt$V2, dt$V3),
    V3 = rep.int(dt$V3, dt$V3),
    year = data.table:::vecseq(rep.int(1L,length(dt$V3)), dt$V3, NULL)
  )
  setDT(dt2)
  dt2
}

On a "big" data

dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
system.time(f0(dt))
#   user  system elapsed 
# 22.100  18.914  40.449 
system.time(f1(dt))
#   user  system elapsed 
# 35.866  15.607  51.475 
system.time(f2(dt))
#   user  system elapsed 
# 22.922   6.839  29.760 
system.time(f3(dt))
#   user  system elapsed 
#  6.509   6.723  13.233 
system.time(f4(dt))
#   user  system elapsed 
# 12.140  14.114  26.254 
system.time(f5(dt))
#   user  system elapsed 
#  6.448   4.057  10.506 

Anyway, you should try to improve the processes you are running on the expanded dataset, because maybe you don't have to expand it in the first place.

For example, the frollmean function has an adaptive argument which makes it possible to calculate a rolling mean over a variable-length window; normally one would need to expand the data first to compute that. V3 in your data looks a lot like the window length for an adaptive moving average.
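A minimal sketch of that idea (toy numbers; with adaptive = TRUE, frollmean takes a vector of window lengths, one per observation, and the window is right-aligned):

```r
library(data.table)

x <- c(1, 3, 5, 7, 9)
w <- c(1L, 2L, 2L, 3L, 3L)  # per-row window lengths, playing the role of V3

# rolling mean over a window whose length varies per row,
# computed without expanding the data first
frollmean(x, w, adaptive = TRUE)
# [1] 1 2 4 5 7
```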

Oregon answered 26/11, 2020 at 15:0 Comment(6)
Thanks for this answer. About your comment on expanding: I need to expand, because I'm making predictions year by year - Hord
I've tested your way on my real data: a very, very good improvement (I had to use rep instead of rep.int, because I've got many kinds of columns). Even so, the more columns there are, the longer the code gets to write (and it makes the code horrible to read), but I think it's possible to make a more general function from it. I'll give others a chance to beat your way, and if no one does better, I will reward your answer. Many thanks! - Hord
Moving the lapply to C/C++ would definitely help, but it would require writing compiled code. - Oregon
Use sequence(V3). It is pretty much the unlist(lapply(...)) code compiled in C: dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]. I also agree with @Oregon: try to see if you can do what you need on the data instead of expanding it all in the first place. - Ithnan
Great find: stat.ethz.ch/R-manual/R-devel/library/base/html/sequence.html - Oregon
@Ithnan I realized we have an internal function for that :) Interesting that despite having to materialize rep(1L, n), and then loop over that vector when creating the sequence, it is still faster than base R's sequence. - Oregon
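The general function mentioned in the comments could be sketched like this (expand_years is a hypothetical helper name; rep() rather than rep.int() is used so that factor and Date columns keep their attributes, as noted above):

```r
library(data.table)

# hypothetical helper: repeat every column of dt by the counts in times_col,
# then add a per-row year counter
expand_years <- function(dt, times_col = "V3") {
  n <- dt[[times_col]]
  out <- lapply(dt, rep, times = n)  # rep() preserves factor/Date attributes
  out$year <- sequence(n)
  setDT(out)
  out[]
}

dt <- data.table(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))
dt2 <- expand_years(dt)
nrow(dt2)  # 6, with columns V1, V2, V3, year
```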
This is a faster implementation, but still slow due to the lapply loop:

dt2 <- data.table(
  rep(dt$V1, dt$V3),
  rep(dt$V2, dt$V3),
  rep(dt$V3, dt$V3),
  unlist(lapply(dt$V3, function(x){1:x}))
)

I hope this is of some help!

Christan answered 5/12, 2019 at 14:45 Comment(2)
Thanks for your answer, but the issue is that the V3 values are given (I use rpois only to have values in the table). So you have to assume that dt is given, with the three columns, and then find a way of duplicating each row the number of times given by V3 - Hord
@Hord - I edited my answer based on your comment. I know it's still long to process, but it is faster than your implementation. The real key would be to improve the line with the lapply; if you find a way to vectorize it without having to loop, it would be much more efficient. I'm a little busy today, but I'll look it over this weekend if no one gives a better answer by then. - Christan
Try this:

dt2 <- dt[dt[, rep(1:nrow(dt), V3)], ]
dt2[, year := dt[, sequence(V3)]]
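The same row-index trick works in base R too; a small deterministic sketch (no data.table needed) showing what the duplication does:

```r
df <- data.frame(V1 = 1:3, V2 = 3:1, V3 = c(2L, 1L, 3L))

# duplicate each row V3 times via an index vector, then add the year counter
idx <- rep.int(seq_len(nrow(df)), df$V3)
df2 <- df[idx, ]
df2$year <- sequence(df$V3)

nrow(df2)  # 6
df2$year   # 1 2 1 1 2 3
```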
Khz answered 30/11, 2020 at 18:6 Comment(1)
Thanks for your answer. The tests I've made showed that it is not better than jangorecki's answer - Hord

© 2022 - 2024 — McMap. All rights reserved.