data.table - select first n rows within group [duplicate]
As simple as it sounds, I don't know of a data.table solution for selecting the first n rows within each group of a data.table. Can you please help me out?
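For reference, here is a minimal illustration of what I mean, using mtcars as example data; the base-R split/head combination below is just to show the desired result, not a data.table approach:

```r
library(data.table)

dt <- as.data.table(mtcars)

# desired: for each value of cyl, keep only the first 3 rows;
# the base-R equivalent would be something like:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), head, 3))
```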

Nival answered 12/1, 2016 at 20:20 Comment(3)
this is very close to: #10111116Casing
The question that this closed question duplicates includes an answer using dt[rowid(cyl) < 3], which is up to almost 100x faster.Mowry
rowid is also a lot more readable.Cuthbertson

As an alternative:

dt[, .SD[1:3], cyl]
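One caveat to be aware of (separate from speed): unlike head(.SD, 3), indexing with .SD[1:3] pads any group that has fewer than 3 rows with NA rows, because data.table indexing past the end of a table returns NA rows. All cyl groups in mtcars happen to have at least 3 rows, so a tiny made-up table illustrates the difference:

```r
library(data.table)

# toy table: group "a" has two rows, group "b" has one
small <- data.table(g = c("a", "a", "b"), x = 1:3)

small[, .SD[1:3], g]     # group "b" is padded with two NA rows
small[, head(.SD, 3), g] # group "b" keeps just its single row
```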

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl], 
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 
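For readers unfamiliar with the .I idiom benchmarked above: the inner call collects the global row numbers of the first three rows per group (as a column named V1, the default name for an unnamed j result), and the outer call subsets those rows in a single vectorized step. Spelled out on the same example data:

```r
library(data.table)

dt <- as.data.table(mtcars)

# inner call: .I holds the row numbers of dt; .I[1:3] takes the first three per cyl group
idx <- dt[, .I[1:3], by = cyl]$V1

# outer call: fetch all selected rows at once, avoiding a per-group subset of .SD
dt[idx]
```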

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30 million row dataset (1e7 rows sampled within each of the 3 cyl groups)
largeDT <- dt[,.SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl], 
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

Now the .I method is clearly the fastest one.
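As the comments on the question note, rowid() (available in data.table 1.9.8 and later) offers yet another fast and arguably more readable option; a sketch (using <= 3 to keep the first three rows per group):

```r
library(data.table)

dt <- as.data.table(mtcars)

# rowid(cyl) numbers the rows within each cyl group in order of appearance,
# so this keeps the first 3 rows of every group without any grouping in j
dt[rowid(cyl) <= 3]
```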


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

However with a somewhat smaller dataset (but still quite big), the odds change:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 
Nutritious answered 12/1, 2016 at 20:33 Comment(8)
Was this done with 1.9.7? I know some operations with .SD have recently been optimized...Casing
@Casing Yes, but I'm using a relatively old version of 1.9.7 (shortly after the introduction of rowid). So, it depends on how recent recently is ;-)Nutritious
I think the relevant commits were around the beginning of NovemberCasing
performance may depend on the cardinality of cylSomnifacient
I no longer get the same results; I find .SD faster than head, which is faster than .I. @Nutritious, would you consider updating your answer? :)Penitent
@SamuelAllain I still get more or less the same result. However, this might be due to the number of threads data.table is using (this benchmark is run with 1 thread). It's on my to-do list to enable OpenMP on my machine so I can test with more threads.Nutritious
@Nutritious thanks for testing again. I tested with setDTthreads(1) but still didn't reproduce your results. I think @Somnifacient is right: the results highly depend on the cardinality of the grouping variable. When the cardinality is under 6, .I wins for me, but otherwise .SD and head are faster.Penitent
I ran a simulation with varying cardinality; it indeed affects the ranking of the methods. I cannot post my graph on this topic, unfortunately.Penitent

We can use head() together with .SD:

library(data.table)

dt <- data.table(mtcars)

> dt[, head(.SD, 3), by = "cyl"]

   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
1:   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
2:   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
4:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
5:   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
6:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
7:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
9:   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
Nival answered 12/1, 2016 at 20:20 Comment(1)
note that actually head.data.table (getAnywhere("head.data.table")) is just calling a (slightly more robust) version of @Jaap's answer.Casing

© 2022 - 2024 — McMap. All rights reserved.