Here is slightly improved version of the other answer.
- using non-default values to
unlist
rep.int
rather than rep
seq_len
rather than :
setDT
instead of data.table()
- even better with
sequence
function suggested by @Cole
- and further minor improvement with internal
vecseq
Together it seems to make difference.
Timings...
library(data.table)
f0 = function(dt) {
dt[, c(.SD, .(year = 1:V3)), by = 1:nrow(dt) ]
}
f1 = function(dt) {
dt2 <- data.table(
rep(dt$V1, dt$V3),
rep(dt$V2, dt$V3),
rep(dt$V3, dt$V3),
unlist(lapply(dt$V3, function(x){1:x}))
)
dt2
}
f2 = function(dt) {
dt2 = list(
V1 = rep.int(dt$V1, dt$V3),
V2 = rep.int(dt$V2, dt$V3),
V3 = rep.int(dt$V3, dt$V3),
year = unlist(lapply(dt$V3, seq_len), recursive=FALSE, use.names=FALSE)
)
setDT(dt2)
dt2
}
f3 = function(dt) {
## even better with sequence function suggested by @Cole
dt2 = list(
V1 = rep.int(dt$V1, dt$V3),
V2 = rep.int(dt$V2, dt$V3),
V3 = rep.int(dt$V3, dt$V3),
year = sequence(dt$V3)
)
setDT(dt2)
dt2
}
f4 = function(dt) {
dt[, c(lapply(.SD, rep.int, V3), year = .(sequence(V3)))]
}
f5 = function(dt) {
dt2 = list(
V1 = rep.int(dt$V1, dt$V3),
V2 = rep.int(dt$V2, dt$V3),
V3 = rep.int(dt$V3, dt$V3),
year = data.table:::vecseq(rep.int(1L,length(dt$V3)), dt$V3, NULL)
)
setDT(dt2)
dt2
}
On a "big" data
dt <- data.table(1:5000000, 5000000:1, rpois(5000000, 240))
system.time(f0(dt))
# user system elapsed
# 22.100 18.914 40.449
system.time(f1(dt))
# user system elapsed
# 35.866 15.607 51.475
system.time(f2(dt))
# user system elapsed
# 22.922 6.839 29.760
system.time(f3(dt))
# user system elapsed
# 6.509 6.723 13.233
system.time(f4(dt))
# user system elapsed
# 12.140 14.114 26.254
system.time(f5(dt))
# user system elapsed
# 6.448 4.057 10.506
Anyway, you should try to improve your processes that you are running on expanded dataset because maybe you don't have to expand that in the first place.
For example, in frollmean
function there is an argument adaptive
which makes it possible to calculate rolling mean on a variable length window, where normally to compute that one would need to expand data first.
V3
in your data reminds a lot a length of a window for adaptive moving average.
500000
I getdt2
is 2Gb – Building