As simple as it may be, I don't know of a data.table solution to select the first n rows of each group in a data.table. Can you please help me out?
As an alternative:
dt[, .SD[1:3], cyl]
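A self-contained version on the mtcars-based dt used elsewhere in this thread (a minimal sketch; note that, unlike head(.SD, 3), .SD[1:3] returns NA-padded rows for any group that has fewer than 3 rows):

library(data.table)
dt <- data.table(mtcars)

# first three rows of each cyl group
dt[, .SD[1:3], cyl]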
When you look at speed on the example dataset, the head
method is on par with the .I
method of @eddi. Comparing with the microbenchmark
package:
library(microbenchmark)

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl],
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")
results in:
Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a
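For reference, the .I variant first builds the global row numbers of the rows to keep per group and then subsets the whole table once. A small sketch of the intermediate step on the same mtcars-based dt:

library(data.table)
dt <- data.table(mtcars)

idx <- dt[, .I[1:3], cyl]   # two columns: cyl and V1, the global row numbers of the first 3 rows per group
dt[idx$V1]                  # one subset of the whole table; same result as dt[dt[, .I[1:3], cyl]$V1]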
However, data.table
is specifically designed for large datasets. So, running this comparison again:
# create a 30-million-row dataset
largeDT <- dt[, .SD[sample(.N, 1e7, replace = TRUE)], cyl]

# run the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl],
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")
results in:
Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a
Now the .I
method is clearly the fastest one.
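As a quick sanity check on what this benchmark runs over (not part of the original answer): the sampling draws 1e7 rows within each of the three cyl groups, so largeDT holds 30 million rows in total:

largeDT[, .N, by = cyl]   # 1e7 rows per cyl value
nrow(largeDT)             # 30,000,000 rows overall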
Update 2016-02-12:
With the most recent development version of the data.table package, the .I
method still wins. Whether the .SD
method or the head()
method is faster seems to depend on the size of the dataset. Now the benchmark gives:
Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a
However, with a somewhat smaller (but still quite big) dataset, the odds change:
largeDT2 <- dt[, .SD[sample(.N, 1e6, replace = TRUE)], cyl]
The benchmark is now slightly in favor of the head method over the .SD method:
Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a
[…] .SD […] have recently been optimized… – Casing
[…] rowid […]). So, it depends on how recent "recently" is ;-) – Nutritious
[…] cyl […] – Somnifacient
[…] .SD faster than head faster than .I. @Nutritious, do you consider updating your answer? :) – Penitent
[…] data.table is using (this benchmark is run with 1 thread). It's on my todo-list to enable openmp on my machine so I can test with more threads. – Nutritious
.I wins for me, but otherwise, .SD and head are faster. – Penitent
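Regarding the single-thread remark in the comment above, the number of threads data.table uses can be checked and pinned explicitly; a small sketch (getDTthreads() and setDTthreads() ship with current data.table releases):

library(data.table)
getDTthreads()    # how many threads data.table currently uses
setDTthreads(1)   # pin to one thread for a reproducible benchmark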
We can use head with .SD:
library(data.table)
dt <- data.table(mtcars)
> dt[, head(.SD, 3), by = "cyl"]
   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
1:   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
2:   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
4:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
5:   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
6:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
7:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
9:   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
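Since the question asks for the first n rows rather than a fixed three, the count can of course be held in a variable; a small sketch (the name n is just illustrative):

n <- 3                                           # rows to keep per group
dt[, head(.SD, n), by = "cyl"]                   # head() also copes with groups shorter than n
dt[dt[, .I[seq_len(min(n, .N))], by = "cyl"]$V1] # index-based equivalent that also handles short groups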
head.data.table (getAnywhere("head.data.table")) is just calling a (slightly more robust) version of @Jaap's answer. – Casing
[…] dt[rowid(cyl) < 3], that is up to almost 100x faster. – Mowry
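For completeness, a minimal sketch of the rowid() approach from the comment above: rowid(cyl) is the within-group row number, so the comparison keeps the leading rows of each cyl group (<= 3 is used here to match the first-three-rows examples elsewhere in the thread):

library(data.table)
dt <- data.table(mtcars)

# keep the first three rows of each cyl group
dt[rowid(cyl) <= 3]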