How to extract the first n rows per group?
Asked Answered
V

4

41

I have a data.table dt. This data.table is sorted first by column date (my grouping variable), then by column age:

library(data.table)
setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age"
> dt
         date age     name
1: 2000-01-01   3   Andrew
2: 2000-01-01   4      Ben
3: 2000-01-01   5  Charlie
4: 2000-01-02   6     Adam
5: 2000-01-02   7      Bob
6: 2000-01-02   8 Campbell

My question is: I am wondering if it's possible to extract the first 2 rows for each unique date? Or phrased more generally:

How to extract the first n rows within each group?

In this example, the result in dt.f would be:

> dt.f = ???????? # function of dt to extract the first 2 rows per unique date
> dt.f
         date age   name
1: 2000-01-01   3 Andrew
2: 2000-01-01   4    Ben
3: 2000-01-02   6   Adam
4: 2000-01-02   7    Bob

p.s. Here is the code to create the aforementioned data.table:

install.packages("data.table")
library(data.table)
date <- c("2000-01-01","2000-01-01","2000-01-01",
    "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
dt <- data.table(date, age, name)
setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age"
Virtuosity answered 1/5, 2013 at 20:8 Comment(0)
R
62

yep, just use .SD and index it as needed.

  DT[, .SD[1:2], by=date]

           date age   name
  1: 2000-01-01   3 Andrew
  2: 2000-01-01   4    Ben
  3: 2000-01-02   6   Adam
  4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

Use this instead, for speed:

  DT[DT[, .I[1:2], by = date]$V1]

  # using a slightly larger data set
  > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
  Unit: milliseconds
      expr       min        lq    median        uq      max neval
   SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
    IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200
Razzia answered 1/5, 2013 at 20:28 Comment(6)
This is the right answer for readability, but if speed is a concern dt[dt[, .I[1:2], by = date]$V1] is much fasterMisreckon
@Ricardo Saporta This is the perfect answer, it works just beautifully. Thanks for introducing me to microbenchmark(), that's a nice way to checking out the most efficient code.Virtuosity
@Misreckon Thank you for adding this result, its pretty amazing that it speeds it up by an order of magnitude.Virtuosity
@Gravitas the speed-up is highly dependent on how many dates you have in your dt the more you have the larger the speed difference will be (at 500 dates I get a 100x speed-up on my pc)Misreckon
@Gravitas, no sweat! Microbenchmark is pretty great. I learned about it myself on SO.Razzia
Even from readability perspective, the answer from eddi is neat.Infirmity
B
6

Using rowid:

dt[rowid(date) < 3]
#          date age   name
# 1: 2000-01-01   3 Andrew
# 2: 2000-01-01   4    Ben
# 3: 2000-01-02   6   Adam
# 4: 2000-01-02   7    Bob

On larger data, rowid is faster than the by + .SD and by + .I alternatives

DT = data.table(
                date = rep(1:1e5, each = 10),
                age = runif(1e6),
                name = sample(letters, 1e6, replace = TRUE))

system.time({r_I = DT[, .SD[1:2], by=date]})
#    user  system elapsed 
#   14.54    0.56   15.04 

system.time({r_SD = DT[DT[, .I[1:2], by = date]$V1]})
#    user  system elapsed 
#    0.15    0.00    0.15 

system.time({r_rowid = DT[rowid(date) < 3]})
#    user  system elapsed 
#    0.01    0.00    0.02 

all.equal(r_I, r_SD)
# [1] TRUE

all.equal(r_I, r_rowid)
# [1] TRUE
Bertine answered 10/8, 2022 at 21:17 Comment(0)
P
2

Probably not the fastest method, but it provides some flexibility if you don't use keyed variables and need some more flexibility. By changing the selected Row.ID the number of first objects can be adjusted as needed.

dt[, .( age
        , name
        , Row.ID = rank(age)
        )
   , by = list(date)][Row.ID %in% (1:2), .(date
                                           , age
                                           , name
                                           )]
Polyclinic answered 21/6, 2018 at 7:58 Comment(0)
L
0

Here an extended solution. Goal is to remove first row (or others) from filtered data from a data.table in 1 line of code.

I try to explain here with example:

mpg2 <- data.table(mpg)


mpg2



     # manufacturer  model displ year cyl      trans drv cty hwy fl   class
  # 1:         audi     a4   1.8 1999   4   auto(l5)   f  18  29  p compact
  # 2:         audi     a4   1.8 1999   4 manual(m5)   f  21  29  p compact
  # 3:         audi     a4   2.0 2008   4 manual(m6)   f  20  31  p compact
  # 4:         audi     a4   2.0 2008   4   auto(av)   f  21  30  p compact
  # 5:         audi     a4   2.8 1999   6   auto(l5)   f  16  26  p compact
 # ---                                                                     
# 230:   volkswagen passat   2.0 2008   4   auto(s6)   f  19  28  p midsize
# 231:   volkswagen passat   2.0 2008   4 manual(m6)   f  21  29  p midsize
# 232:   volkswagen passat   2.8 1999   6   auto(l5)   f  16  26  p midsize
# 233:   volkswagen passat   2.8 1999   6 manual(m5)   f  18  26  p midsize
# 234:   volkswagen passat   3.6 2008   6   auto(s6)   f  17  26  p midsize


# I wanna first filter the table with year is 1999 and that it is an manual


mpg2[year == "1999" & grepl("manual", trans)]




    # manufacturer               model displ year cyl      trans drv cty hwy fl      class
 # 1:         audi                  a4   1.8 1999   4 manual(m5)   f  21  29  p    compact
 # 2:         audi                  a4   2.8 1999   6 manual(m5)   f  18  26  p    compact
 # 3:         audi          a4 quattro   1.8 1999   4 manual(m5)   4  18  26  p    compact
 # 4:         audi          a4 quattro   2.8 1999   6 manual(m5)   4  17  25  p    compact
 # 5:    chevrolet            corvette   5.7 1999   8 manual(m6)   r  16  26  p    2seater
 # 6:        dodge   dakota pickup 4wd   3.9 1999   6 manual(m5)   4  14  17  r     pickup
 # 7:        dodge   dakota pickup 4wd   5.2 1999   8 manual(m5)   4  11  17  r     pickup
 # 8:        dodge ram 1500 pickup 4wd   5.2 1999   8 manual(m5)   4  11  16  r     pickup
 # 9:         ford        explorer 4wd   4.0 1999   6 manual(m5)   4  15  19  r        suv
# 10:         ford     f150 pickup 4wd   4.2 1999   6 manual(m5)   4  14  17  r     pickup
# 11:         ford     f150 pickup 4wd   4.6 1999   8 manual(m5)   4  13  16  r     pickup
# 12:         ford             mustang   3.8 1999   6 manual(m5)   r  18  26  r subcompact
# 13:         ford             mustang   4.6 1999   8 manual(m5)   r  15  22  r subcompact
# 14:        honda               civic   1.6 1999   4 manual(m5)   f  28  33  r subcompact
# 15:        honda               civic   1.6 1999   4 manual(m5)   f  25  32  r subcompact
# 16:        honda               civic   1.6 1999   4 manual(m5)   f  23  29  p subcompact
# 17:      hyundai              sonata   2.4 1999   4 manual(m5)   f  18  27  r    midsize
# 18:      hyundai              sonata   2.5 1999   6 manual(m5)   f  18  26  r    midsize
# 19:      hyundai             tiburon   2.0 1999   4 manual(m5)   f  19  29  r subcompact
# 20:       nissan              altima   2.4 1999   4 manual(m5)   f  21  29  r    compact
# 21:       nissan              maxima   3.0 1999   6 manual(m5)   f  19  25  r    midsize
# 22:       nissan      pathfinder 4wd   3.3 1999   6 manual(m5)   4  15  17  r        suv
# 23:       subaru        forester awd   2.5 1999   4 manual(m5)   4  18  25  r        suv
# 24:       subaru         impreza awd   2.2 1999   4 manual(m5)   4  19  26  r subcompact
# 25:       subaru         impreza awd   2.5 1999   4 manual(m5)   4  19  26  r subcompact
# 26:       toyota         4runner 4wd   2.7 1999   4 manual(m5)   4  15  20  r        suv
# 27:       toyota         4runner 4wd   3.4 1999   6 manual(m5)   4  15  17  r        suv
# 28:       toyota               camry   2.2 1999   4 manual(m5)   f  21  29  r    midsize
# 29:       toyota               camry   3.0 1999   6 manual(m5)   f  18  26  r    midsize
# 30:       toyota        camry solara   2.2 1999   4 manual(m5)   f  21  29  r    compact
# 31:       toyota        camry solara   3.0 1999   6 manual(m5)   f  18  26  r    compact
# 32:       toyota             corolla   1.8 1999   4 manual(m5)   f  26  35  r    compact
# 33:       toyota   toyota tacoma 4wd   2.7 1999   4 manual(m5)   4  15  20  r     pickup
# 34:       toyota   toyota tacoma 4wd   3.4 1999   6 manual(m5)   4  15  17  r     pickup
# 35:   volkswagen                 gti   2.0 1999   4 manual(m5)   f  21  29  r    compact
# 36:   volkswagen                 gti   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 37:   volkswagen               jetta   1.9 1999   4 manual(m5)   f  33  44  d    compact
# 38:   volkswagen               jetta   2.0 1999   4 manual(m5)   f  21  29  r    compact
# 39:   volkswagen               jetta   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 40:   volkswagen          new beetle   1.9 1999   4 manual(m5)   f  35  44  d subcompact
# 41:   volkswagen          new beetle   2.0 1999   4 manual(m5)   f  21  29  r subcompact
# 42:   volkswagen              passat   1.8 1999   4 manual(m5)   f  21  29  p    midsize
# 43:   volkswagen              passat   2.8 1999   6 manual(m5)   f  18  26  p    midsize
    # manufacturer               model displ year cyl      trans drv cty hwy fl      class


# maybe I wanna order it


mpg2[year == "1999" & grepl("manual", trans)][order(model, -displ)]




    # manufacturer               model displ year cyl      trans drv cty hwy fl      class
 # 1:       toyota         4runner 4wd   3.4 1999   6 manual(m5)   4  15  17  r        suv
 # 2:       toyota         4runner 4wd   2.7 1999   4 manual(m5)   4  15  20  r        suv
 # 3:         audi                  a4   2.8 1999   6 manual(m5)   f  18  26  p    compact
 # 4:         audi                  a4   1.8 1999   4 manual(m5)   f  21  29  p    compact
 # 5:         audi          a4 quattro   2.8 1999   6 manual(m5)   4  17  25  p    compact
 # 6:         audi          a4 quattro   1.8 1999   4 manual(m5)   4  18  26  p    compact
 # 7:       nissan              altima   2.4 1999   4 manual(m5)   f  21  29  r    compact
 # 8:       toyota               camry   3.0 1999   6 manual(m5)   f  18  26  r    midsize
 # 9:       toyota               camry   2.2 1999   4 manual(m5)   f  21  29  r    midsize
# 10:       toyota        camry solara   3.0 1999   6 manual(m5)   f  18  26  r    compact
# 11:       toyota        camry solara   2.2 1999   4 manual(m5)   f  21  29  r    compact
# 12:        honda               civic   1.6 1999   4 manual(m5)   f  28  33  r subcompact
# 13:        honda               civic   1.6 1999   4 manual(m5)   f  25  32  r subcompact
# 14:        honda               civic   1.6 1999   4 manual(m5)   f  23  29  p subcompact
# 15:       toyota             corolla   1.8 1999   4 manual(m5)   f  26  35  r    compact
# 16:    chevrolet            corvette   5.7 1999   8 manual(m6)   r  16  26  p    2seater
# 17:        dodge   dakota pickup 4wd   5.2 1999   8 manual(m5)   4  11  17  r     pickup
# 18:        dodge   dakota pickup 4wd   3.9 1999   6 manual(m5)   4  14  17  r     pickup
# 19:         ford        explorer 4wd   4.0 1999   6 manual(m5)   4  15  19  r        suv
# 20:         ford     f150 pickup 4wd   4.6 1999   8 manual(m5)   4  13  16  r     pickup
# 21:         ford     f150 pickup 4wd   4.2 1999   6 manual(m5)   4  14  17  r     pickup
# 22:       subaru        forester awd   2.5 1999   4 manual(m5)   4  18  25  r        suv
# 23:   volkswagen                 gti   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 24:   volkswagen                 gti   2.0 1999   4 manual(m5)   f  21  29  r    compact
# 25:       subaru         impreza awd   2.5 1999   4 manual(m5)   4  19  26  r subcompact
# 26:       subaru         impreza awd   2.2 1999   4 manual(m5)   4  19  26  r subcompact
# 27:   volkswagen               jetta   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 28:   volkswagen               jetta   2.0 1999   4 manual(m5)   f  21  29  r    compact
# 29:   volkswagen               jetta   1.9 1999   4 manual(m5)   f  33  44  d    compact
# 30:       nissan              maxima   3.0 1999   6 manual(m5)   f  19  25  r    midsize
# 31:         ford             mustang   4.6 1999   8 manual(m5)   r  15  22  r subcompact
# 32:         ford             mustang   3.8 1999   6 manual(m5)   r  18  26  r subcompact
# 33:   volkswagen          new beetle   2.0 1999   4 manual(m5)   f  21  29  r subcompact
# 34:   volkswagen          new beetle   1.9 1999   4 manual(m5)   f  35  44  d subcompact
# 35:   volkswagen              passat   2.8 1999   6 manual(m5)   f  18  26  p    midsize
# 36:   volkswagen              passat   1.8 1999   4 manual(m5)   f  21  29  p    midsize
# 37:       nissan      pathfinder 4wd   3.3 1999   6 manual(m5)   4  15  17  r        suv
# 38:        dodge ram 1500 pickup 4wd   5.2 1999   8 manual(m5)   4  11  16  r     pickup
# 39:      hyundai              sonata   2.5 1999   6 manual(m5)   f  18  26  r    midsize
# 40:      hyundai              sonata   2.4 1999   4 manual(m5)   f  18  27  r    midsize
# 41:      hyundai             tiburon   2.0 1999   4 manual(m5)   f  19  29  r subcompact
# 42:       toyota   toyota tacoma 4wd   3.4 1999   6 manual(m5)   4  15  17  r     pickup
# 43:       toyota   toyota tacoma 4wd   2.7 1999   4 manual(m5)   4  15  20  r     pickup
    # manufacturer               model displ year cyl      trans drv cty hwy fl      class


# My wish would be to extract the model from 1999 with the highest displ (I could use max, but I will extract the first row)


mpg2[year == "1999" & grepl("manual", trans)][order(model, -displ), .SD[1], model]




                  # model manufacturer displ year cyl      trans drv cty hwy fl      class
 # 1:         4runner 4wd       toyota   3.4 1999   6 manual(m5)   4  15  17  r        suv
 # 2:                  a4         audi   2.8 1999   6 manual(m5)   f  18  26  p    compact
 # 3:          a4 quattro         audi   2.8 1999   6 manual(m5)   4  17  25  p    compact
 # 4:              altima       nissan   2.4 1999   4 manual(m5)   f  21  29  r    compact
 # 5:               camry       toyota   3.0 1999   6 manual(m5)   f  18  26  r    midsize
 # 6:        camry solara       toyota   3.0 1999   6 manual(m5)   f  18  26  r    compact
 # 7:               civic        honda   1.6 1999   4 manual(m5)   f  28  33  r subcompact
 # 8:             corolla       toyota   1.8 1999   4 manual(m5)   f  26  35  r    compact
 # 9:            corvette    chevrolet   5.7 1999   8 manual(m6)   r  16  26  p    2seater
# 10:   dakota pickup 4wd        dodge   5.2 1999   8 manual(m5)   4  11  17  r     pickup
# 11:        explorer 4wd         ford   4.0 1999   6 manual(m5)   4  15  19  r        suv
# 12:     f150 pickup 4wd         ford   4.6 1999   8 manual(m5)   4  13  16  r     pickup
# 13:        forester awd       subaru   2.5 1999   4 manual(m5)   4  18  25  r        suv
# 14:                 gti   volkswagen   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 15:         impreza awd       subaru   2.5 1999   4 manual(m5)   4  19  26  r subcompact
# 16:               jetta   volkswagen   2.8 1999   6 manual(m5)   f  17  24  r    compact
# 17:              maxima       nissan   3.0 1999   6 manual(m5)   f  19  25  r    midsize
# 18:             mustang         ford   4.6 1999   8 manual(m5)   r  15  22  r subcompact
# 19:          new beetle   volkswagen   2.0 1999   4 manual(m5)   f  21  29  r subcompact
# 20:              passat   volkswagen   2.8 1999   6 manual(m5)   f  18  26  p    midsize
# 21:      pathfinder 4wd       nissan   3.3 1999   6 manual(m5)   4  15  17  r        suv
# 22: ram 1500 pickup 4wd        dodge   5.2 1999   8 manual(m5)   4  11  16  r     pickup
# 23:              sonata      hyundai   2.5 1999   6 manual(m5)   f  18  26  r    midsize
# 24:             tiburon      hyundai   2.0 1999   4 manual(m5)   f  19  29  r subcompact
# 25:   toyota tacoma 4wd       toyota   3.4 1999   6 manual(m5)   4  15  17  r     pickup
                  # model manufacturer displ year cyl      trans drv cty hwy fl      class


# my goal now is actually from the orginal table to remove those rows, we will you then .I which will return the frow ID from those ones. However it should be a bit differently written. All of this in 1 line of code:
# mpg2[year == "1999" & grepl("manual", trans)] is same as mpg2[mpg2[, .I[year == "1999" & grepl("manual", trans)]]]
# mpg2[year == "1999" & grepl("manual", trans)] is same as mpg2[mpg2[, .I[year == "1999" & grepl("manual", trans)], model]$V1]

# goal is to first get row ID (V1) of filtered data and keep our important variable for future steps


mpg2[, .I[year == "1999" & grepl("manual", trans)], .(model, displ)]



                  # model displ  V1
 # 1:                  a4   1.8   2
 # 2:                  a4   2.8   6
 # 3:          a4 quattro   1.8   8
 # 4:          a4 quattro   2.8  13
 # 5:            corvette   5.7  24
 # 6:   dakota pickup 4wd   3.9  52
 # 7:   dakota pickup 4wd   5.2  56
 # 8: ram 1500 pickup 4wd   5.2  72
 # 9:        explorer 4wd   4.0  79
# 10:     f150 pickup 4wd   4.2  85
# 11:     f150 pickup 4wd   4.6  86
# 12:             mustang   3.8  91
# 13:             mustang   4.6  96
# 14:               civic   1.6 100
# 15:               civic   1.6 102
# 16:               civic   1.6 103
# 17:              sonata   2.4 110
# 18:              sonata   2.5 114
# 19:             tiburon   2.0 117
# 20:              altima   2.4 142
# 21:              maxima   3.0 149
# 22:      pathfinder 4wd   3.3 152
# 23:        forester awd   2.5 160
# 24:         impreza awd   2.2 167
# 25:         impreza awd   2.5 168
# 26:         4runner 4wd   2.7 174
# 27:         4runner 4wd   3.4 177
# 28:               camry   2.2 180
# 29:               camry   3.0 185
# 30:        camry solara   2.2 188
# 31:        camry solara   3.0 192
# 32:             corolla   1.8 196
# 33:   toyota tacoma 4wd   2.7 201
# 34:   toyota tacoma 4wd   3.4 204
# 35:                 gti   2.0 208
# 36:                 gti   2.8 212
# 37:               jetta   1.9 213
# 38:               jetta   2.0 214
# 39:               jetta   2.8 221
# 40:          new beetle   1.9 222
# 41:          new beetle   2.0 224
# 42:              passat   1.8 228
# 43:              passat   2.8 233
                  # model displ  V1


# then order and get the first row


mpg2[, .I[year == "1999" & grepl("manual", trans)], .(model, displ)][order(model, -displ), .SD[1], model]


                  # model displ  V1
 # 1:         4runner 4wd   3.4 177
 # 2:                  a4   2.8   6
 # 3:          a4 quattro   2.8  13
 # 4:              altima   2.4 142
 # 5:               camry   3.0 185
 # 6:        camry solara   3.0 192
 # 7:               civic   1.6 100
 # 8:             corolla   1.8 196
 # 9:            corvette   5.7  24
# 10:   dakota pickup 4wd   5.2  56
# 11:        explorer 4wd   4.0  79
# 12:     f150 pickup 4wd   4.6  86
# 13:        forester awd   2.5 160
# 14:                 gti   2.8 212
# 15:         impreza awd   2.5 168
# 16:               jetta   2.8 221
# 17:              maxima   3.0 149
# 18:             mustang   4.6  96
# 19:          new beetle   2.0 224
# 20:              passat   2.8 233
# 21:      pathfinder 4wd   3.3 152
# 22: ram 1500 pickup 4wd   5.2  72
# 23:              sonata   2.5 114
# 24:             tiburon   2.0 117
# 25:   toyota tacoma 4wd   3.4 204
                  # model displ  V1


# V1 is the row ID from original table, we have then just to remove it from it 


mpg2[!mpg2[, .I[year == "1999" & grepl("manual", trans)], .(model, displ)][order(model, -displ), .SD[1], model]$V1]


     # manufacturer  model displ year cyl      trans drv cty hwy fl   class
  # 1:         audi     a4   1.8 1999   4   auto(l5)   f  18  29  p compact
  # 2:         audi     a4   1.8 1999   4 manual(m5)   f  21  29  p compact
  # 3:         audi     a4   2.0 2008   4 manual(m6)   f  20  31  p compact
  # 4:         audi     a4   2.0 2008   4   auto(av)   f  21  30  p compact
  # 5:         audi     a4   2.8 1999   6   auto(l5)   f  16  26  p compact
 # ---                                                                     
# 205:   volkswagen passat   1.8 1999   4   auto(l5)   f  18  29  p midsize
# 206:   volkswagen passat   2.0 2008   4   auto(s6)   f  19  28  p midsize
# 207:   volkswagen passat   2.0 2008   4 manual(m6)   f  21  29  p midsize
# 208:   volkswagen passat   2.8 1999   6   auto(l5)   f  16  26  p midsize
# 209:   volkswagen passat   3.6 2008   6   auto(s6)   f  17  26  p midsize
Liquefacient answered 20/1, 2021 at 15:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.