data.table efficient recycling
I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year.

This can lead to something like this:

library(data.table)
dt <- data.table(cbind(1:500000, 500000:1))
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]

But I often have to deal with millions of rows, and far more columns than in this toy example, and the run time grows quickly. Try this:

library(data.table)
dt <- data.table(cbind(1:50000000, 50000000:1))
dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]

My question is: is there a more efficient way to achieve this?

Thanks for any help!

EDIT: the accepted answer was the most complete (so far) for this formulation of the problem, but I realised that my issue is a bit trickier. I will ask another question to show it: data.table efficient recycling V2

Belldame answered 3/12, 2019 at 14:51 Comment(3)
Have you tried rep on the sequence of rows?Need
Could it be an XY problem? What is your final application?Submission
Thank you Akrun, but I can't see how to take advantage of your advice.Belldame
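
The rep suggestion above might look something like this (my own sketch, not code from the thread): replicate each row index, then recycle the year column.

library(data.table)
dt <- data.table(cbind(1:500000, 500000:1))

# Each original row index is repeated 10 times, then year cycles through 1:10,
# giving the same row-major ordering as the by = 1:nrow(dt) approach.
idx <- rep(seq_len(nrow(dt)), each = 10)
dt2 <- dt[idx][, year := rep(1:10, times = nrow(dt))]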

I would think of this problem as a cross join. There is no built-in way to do cross joins between two data tables (the CJ function works on vectors), but from the discussion on this issue, this function works well:

CJDT <- function(...) {
    Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
}

Using your large example, this works for me:

years = data.table(year = 1:10, key = "year")
setkey(dt)
dt3 = CJDT(dt, years)

Your method, by contrast, takes longer and then runs out of memory.
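
For scale, a back-of-envelope check (my own arithmetic, assuming the integer columns that cbind(1:50000000, 50000000:1) produces):

# 5e7 rows x 10 years x 3 integer columns x 4 bytes each, in GiB:
5e7 * 10 * 3 * 4 / 2^30
#> [1] 5.587935

So the result alone needs roughly 5.6 GiB before counting any intermediate copies, which is why approaches that materialise an extra grouping column run out of memory first.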

Mcconnell answered 3/12, 2019 at 15:7 Comment(0)

As mentioned in the comments, I suspect the premise of the question may be questionable. In any case, here is a slightly faster alternative:

setkey(dt)
dt[CJ(V1, year = 1:10)]
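
If you would rather not key dt first, the same join can be written with an explicit on= (this is the form used in the benchmark answer below):

dt2 <- dt[CJ(V1, year = 1:10), on = "V1"]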

Benchmarking:

dt <- data.table(cbind(1:50000000, 50000000:1))
microbenchmark::microbenchmark(
  op = dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ],
  sb = {setkey(dt); dt[CJ(V1, year = 1:10)]},
  gr = {setkey(dt); CJDT(dt, data.table(year = 1:10, key = "year"))},
  times = 1
)

Unit: seconds

 expr       min        lq      mean    median        uq       max neval
   op 171.67241 171.67241 171.67241 171.67241 171.67241 171.67241     1
   sb 136.00782 136.00782 136.00782 136.00782 136.00782 136.00782     1
   gr  45.14151  45.14151  45.14151  45.14151  45.14151  45.14151     1
Submission answered 3/12, 2019 at 15:9 Comment(8)
Thanks for benchmarking. Especially for big problems like this, I've started to prefer bench::mark to microbenchmark as it also keeps track of memory usage. If you look at the Github link in my answer, there are other proposals that are a little faster on large data, but use ~50% more memory.Mcconnell
You could also setkey before the benchmark---probably even OP's solution would benefit from it.Mcconnell
"the premise of the question might be questionable" ... what do you mean ?Belldame
@Belldame is it really necessary to this enormous data.table for your final output?Submission
For me, yes it is, of course. But anyway, your answer is interesting and I will study it.Belldame
@Gregor thanks for feedback. I tried to run again but bench::mark() ran out of memory.Submission
@Gregor bench does not track memory usage properly; it is better not to look at that measure at all, as it is misleading.Ibeam
@Ibeam thanks---good to know. Do you have a reference where I could read more about that?Mcconnell

I'm benchmarking the solutions given so far against my own (which simply uses lapply and rbindlist). I couldn't run the entire task because I ran out of memory, so I chose a smaller dt:

library(data.table)

dt <- data.table(cbind(1:5000000, 5000000:1))

original <- function() {
  dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt) ]
  dt2
}

sb <- function() {
  dt2 <- dt[CJ(V1, year = 1:10), on = "V1"]
}

gregor <- function() {
  CJDT <- function(...) {
    Reduce(function(DT1, DT2) cbind(DT1, DT2[rep(1:.N, each=nrow(DT1))]), list(...))
  }
  years = data.table(year = 1:10, key = "year")
  setkey(dt)
  dt3 = CJDT(dt, years)
  dt3
}

bindlist <- function() {
  dt3 <- rbindlist(lapply(1:10, function(x) {
    dt$year <- x
    dt
  }))
  # dt3 <- setcolorder(dt3, c("nrow", "V1", "V2", "year")) # to get exactly same dt
  # dt3 <- dt3[order(nrow)]
  dt3
}
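
A side note on bindlist: dt$year <- x dispatches to data.table's $<- method, which copies the whole table on every one of the 10 iterations. A variant that makes the copy explicit and adds the column by reference would be (my sketch, not benchmarked here):

bindlist2 <- function() {
  # one explicit copy per year, then := adds the column without another copy
  rbindlist(lapply(1:10, function(x) copy(dt)[, year := x]))
}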

Benchmark

library(bench)
res <- mark(
  original = original(),
  sb = sb(),
  gregor = gregor(),
  bindlist = bindlist(),
  iterations = 1,
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
res
#> # A tibble: 4 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 original      5.88s    5.88s     0.170    1.72GB   16.0  
#> 2 sb            1.76s    1.76s     0.570    1.73GB    0.570
#> 3 gregor        1.87s    1.87s     0.536  972.86MB    0    
#> 4 bindlist   558.69ms 558.69ms     1.79     1.12GB    0

summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 original   10.5   10.5       1         1.81      Inf
#> 2 sb          3.14   3.14      3.35      1.82      Inf
#> 3 gregor      3.34   3.34      3.15      1         NaN
#> 4 bindlist    1      1        10.5       1.18      NaN

Created on 2019-12-03 by the reprex package (v0.3.0)

Now the results are not exactly the same (see the commented code in my solution for a fix), but they are equivalent to what you are trying to do. My lapply plus rbindlist solution is, surprisingly, the fastest by a factor of more than 3. This might change on the full task, but I doubt it.
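
To check the equivalence programmatically, something along these lines should work (an untested sketch; it recreates the nrow grouping column that by = 1:nrow(dt) adds, then compares after sorting):

a <- original()
b <- bindlist()
b[, nrow := rep(seq_len(nrow(dt)), times = 10)]  # row id within each year block
setcolorder(b, names(a))
all.equal(a[order(nrow, year)], b[order(nrow, year)])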

Emergence answered 3/12, 2019 at 15:47 Comment(0)
