data.table performance using .SD with by

I want to filter a large data.table by group. I can use .SD or .I, and while I personally find the former much easier to read, the latter is tremendously faster and uses far less memory (even when I supply .SDcols).

To some extent it is clear to me why: for .I we just need a vector per group, while for .SD we need a whole data.table. But I thought that by providing a meaningful .SDcols argument I could speed things up and save some memory.

However, the benchmarks show that the .SD approach is about 60 times slower and eats up 300 times more memory. Granted, a 4-column .SD data.table will need more than 4 times the size of a vector. But 60 times slower and 300 times more memory? Could somebody explain why the .SD approach eats up so much memory and is therefore so much slower? Is there a way to speed up the .SD approach, or is the only option to fall back to .I?

Data Setup

library(data.table)
## data set up

nr <- 1e6       # number of rows
nc <- 100       # number of value columns
grp_perc <- .8  # ~80% as many distinct ids as rows, so groups are small
DT <- data.table(ids = sample(paste0("id", 
                                     seq(1, round(grp_perc * nr, 0))),
                              nr, TRUE))
cols <- paste("col", seq(1, nc), sep = "_")
DT[, (cols) := replicate(nc, sample(nr), simplify = FALSE)]

Benchmarks

results <- bench::mark(.I = DT[DT[, .(row_id = .I[which.min(col_1)]), 
                                  by = ids]$row_id, c("ids", cols[1:3]), with = FALSE],
                       .SD = DT[, .SD[which.min(col_1)], 
                                by = ids, .SDcols = cols[1:3]], 
                       iterations = 1, filter_gc = FALSE)

summary(results)
# A tibble: 2 x 13
  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result         memory         time   gc       
  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>         <list>         <list> <list>   
1 .I           2.64s   2.64s   0.378      34.4MB    0         1     0      2.64s <df[,4] [571,~ <df[,3] [1,41~ <bch:~ <tibble ~
2 .SD          2.73m   2.73m   0.00612     9.1GB    0.342     1    56      2.73m <df[,4] [571,~ <df[,3] [2,40~ <bch:~ <tibble ~
Nitza asked 24/4, 2020 at 15:55 Comment(2)
My understanding is that the .SD approach is slower because it results in a method dispatch of [.data.table for every group. See this GitHub issue and this SO answer.Holmes
Very insightful. Thanks, this explains the issue very well, and for similar issues I will look more at the verbose mode to learn about the internals of data.table.Nitza
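
As the comments suggest, data.table's verbose mode shows what happens to j internally. A minimal sketch (run on a small subset, since the unoptimized call is slow on 1e6 rows; the exact messages vary by data.table version):

## small subset so the per-group evaluation finishes quickly
small <- DT[1:1e4]
## the verbose output reports whether j was internally optimized or is
## being evaluated as a plain R expression once per group
small[, .SD[which.min(col_1)], by = ids, .SDcols = cols[1:3], verbose = TRUE]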

Here's an approach that is faster than .I for this particular example. Note that it also changes the row order, which may not be desirable for you.

DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]]

As @Ian Campbell mentions, this is a known GitHub issue. The good news is that data.table internally optimizes certain j expressions, one of which is .SD[1L]: the subsetting is then done entirely in C, which makes it very fast.
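
One way to see the effect of this optimization (a sketch; datatable.optimize is a documented option that controls how aggressively j is rewritten internally, see ?datatable.optimize) is to switch it off and compare timings on a subset:

options(datatable.optimize = 0L)   # disable internal j optimization
system.time(DT[1:1e5][order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]])
options(datatable.optimize = Inf)  # restore the default (full optimization)
system.time(DT[1:1e5][order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]])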

Here are the benchmarks, which include @sindri_baldur's solution but drop your original .SD attempt - I didn't want to wait 3 minutes :).

# A tibble: 3 x 13
  expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 .I          4.54s  4.54s    0.220       30MB    0.880     1     4      4.54s
2 self_join  11.32s 11.32s    0.0883    76.3MB    0         1     0     11.32s
3 use_order   3.55s  3.55s    0.282     58.3MB    0         1     0      3.55s

## show that it's equal but re-ordered:
all.equal(DT[DT[, .(row_id = .I[which.min(col_1)]), 
                by = ids]$row_id, c("ids", cols[1:3]), with = FALSE][order(col_1)],
          DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]])

## [1] TRUE
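
If you need the result in a deterministic order afterwards, one option (a sketch; setorder sorts a data.table by reference) is to re-sort by ids, which is cheap compared to the grouping itself:

res <- DT[order(col_1), .SD[1L], by = ids, .SDcols = cols[1:3]]
setorder(res, ids)  # sort the result by ids in place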
Boggess answered 25/4, 2020 at 10:44 Comment(2)
Beautiful! However, you need a fairly deep understanding of the internals of data.table to see why .SD[which.min(col_1)] is so much slower than .SD[1L], but the link posted by @Ian Campbell makes that quite clear.Nitza
And I think your concern about the order is unnecessary, since I am aggregating the data anyway - so the original order is meaningless for the aggregated table.Nitza

Here is a faster way that still uses .SD.

DT[DT[, .(col_1 = min(col_1)), by = ids], 
   on = .(ids, col_1), 
   .SD, .SDcols = c("ids", cols[1:3])]
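
Note that this join keeps every row tied for the per-group minimum; with this data col_1 is a permutation of 1:nr, so ties cannot occur. For data where ties are possible, a sketch using the join's mult argument to keep only the first match per group:

DT[DT[, .(col_1 = min(col_1)), by = ids], 
   on = .(ids, col_1), 
   .SD, .SDcols = c("ids", cols[1:3]),
   mult = "first"]  # at most one row per group even if the minimum is tied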
Joris answered 24/4, 2020 at 16:15 Comment(2)
Not an expert, but experience tells me it's best to keep things simple and rely on optimised functions when using by.Joris
Thanks +1. I am far too little accustomed to using (self) joins and should definitely use them more often.Nitza
