I want to filter a large `data.table` by group. I can use `.SD` or `.I`, and while I personally find the former much easier to read, the latter is tremendously faster and uses far less memory (even though I restrict `.SD` with `.SDcols`).

To some extent it is clear to me why: for `.I` we only need one index vector per group, while for `.SD` a whole `data.table` has to be materialised for each group. But I thought that by providing a meaningful `.SDcols` argument I could speed things up and save some memory. However, the benchmarks below show that the `.SD` approach is about 60 times slower and allocates about 300 times more memory. Granted, a 4-column `.SD` data.table will need more than 4 times the size of a single vector, but 60 times slower and 300 times more memory? Could somebody enlighten me why the `.SD` approach eats up so much memory and is therefore so much slower? Is there a way to speed up the `.SD` approach, or is the only option to fall back to `.I`?
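For clarity, here is a minimal toy example (my own, not part of the benchmark) of the two idioms I mean; both keep the row with the minimum `val` within each group, and should return the same rows:

```r
library(data.table)

# Tiny illustrative table (hypothetical names `grp`, `val`, `x`)
dt <- data.table(grp = c("a", "a", "b", "b"),
                 val = c(2L, 1L, 4L, 3L),
                 x   = 1:4)

# .SD idiom: subsets a per-group data.table, then picks one row of it
res_sd <- dt[, .SD[which.min(val)], by = grp]

# .I idiom: collects global row indices per group, one subset at the end
res_i <- dt[dt[, .I[which.min(val)], by = grp]$V1]

# Same rows either way (column order may differ in general,
# here `grp` happens to lead in both results)
fsetequal(res_sd, res_i)
```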
Data Setup
library(data.table)
## data set up
nr <- 1e6
nc <- 100
grp_perc <- .8
DT <- data.table(ids = sample(paste0("id",
seq(1, round(grp_perc * nr, 0))),
nr, TRUE))
cols <- paste("col", seq(1, nc), sep = "_")
DT[, (cols) := replicate(nc, sample(nr), simplify = FALSE)]
Benchmarks
results <- bench::mark(.I = DT[DT[, .(row_id = .I[which.min(col_1)]),
by = ids]$row_id, c("ids", cols[1:3]), with = FALSE],
.SD = DT[, .SD[which.min(col_1)],
by = ids, .SDcols = cols[1:3]],
iterations = 1, filter_gc = FALSE)
summary(results)
# A tibble: 2 x 13
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result         memory         time   gc
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>         <list>         <list> <list>
# 1 .I           2.64s   2.64s   0.378      34.4MB    0         1     0      2.64s <df[,4] [571,~ <df[,3] [1,41~ <bch:~ <tibble ~
# 2 .SD          2.73m   2.73m   0.00612     9.1GB    0.342     1    56      2.73m <df[,4] [571,~ <df[,3] [2,40~ <bch:~ <tibble ~
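One workaround I have considered (my own sketch, not a confirmed fix) is to avoid materialising `.SD` entirely: compute the per-group minimum with `min()`, which data.table's GForce optimisation evaluates internally without building a per-group subset, and then join the result back to pick the matching rows. `mult = "first"` mimics `which.min()`'s tie-breaking by taking the first matching row per group:

```r
library(data.table)

# Same setup variables as above are assumed (DT, cols)
# 1. GForce-optimised aggregation: no per-group .SD allocation
min_per_grp <- DT[, .(col_1 = min(col_1)), by = ids]

# 2. Join back on (ids, col_1); first match per group, like which.min()
res_join <- DT[min_per_grp, on = .(ids, col_1), mult = "first"
               ][, c("ids", cols[1:3]), with = FALSE]
```

I have not benchmarked this variant as carefully, so I cannot say how close it gets to the `.I` approach, but it at least sidesteps the per-group `data.table` allocation.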