Extract row corresponding to minimum value of a variable by group
Asked Answered
I

8

118

I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.

Here is a sample data set:

> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42

data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
        "B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
        104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company", 
        "Employees"), class = "data.frame", row.names = c(NA, -8L))

Calculate min by group is easy, using aggregate:

> aggregate(Employees ~ State, data, function(x) min(x))
  State Employees
1    AK        24
2    RI        19

...or data.table:

> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19

But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?

Independency answered 5/6, 2014 at 21:45 Comment(0)
J
97

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19

Slighly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.

See also Subset rows corresponding to max value by group using data.table.

Jolly answered 5/6, 2014 at 21:50 Comment(1)
How to get it if the number should return the minimum but based on the number of frequency that number has been used?Chlorothiazide
V
89

Here a dplyr solution ( Note that I am not a regular user ):

library(dplyr)    
data %>% 
    group_by(State) %>% 
    slice(which.min(Employees))

As jazzurro notes in the comments, as of dplyr version 1.0.0, there is also now a built-in function slice_min:

data %>% 
   group_by(State) %>% 
   slice_min(order_by = Employees)
Votaw answered 5/6, 2014 at 22:8 Comment(2)
With dplyr 1.0.0, you can achieve the same in the following way: group_by(data, State) %>% slice_min(order_by = Employees).Janycejanyte
dplyr 1.1.0 added the experimental argument by: data |> slice_min(Employees, by = State)Amorete
J
48

As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the uniques per State

Either using data.table

library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Alternatively, we could also first order and then subset .SD. Both of those operations were optimized in the resent data.table versions and order is seemingly triggers data.table:::forderv, while .SD[1L] triggers Gforce

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Or dplyr

library(dplyr)
data %>% 
  arrange(Employees) %>% 
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24

Another interesting idea borrowed from @Khashaas awesome answer (with a small modification in form of mult = "first" in order to handle multiple matches) is to first find minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.tables gmin function (which skips the evaluation overhead) and the binary join feature

tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = "first"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19

Some benchmarks

library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)

microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
               "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
               "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
               "(data.table) self join (on):" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = "first"]},
               "(data.table) self join (setkey):" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = "first"]},
               "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
               "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))


# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h
Jinnyjinrikisha answered 24/1, 2017 at 20:27 Comment(2)
Nicely done. Just for completeness / curiosity, I provided a correct plyr solution. Feel free to include in benchmark... I don't expect it will stand up to data.table though.Traffic
@Traffic Updated.Jinnyjinrikisha
R
9

The base function by is often useful for working with block data in data.frames. For example

by(data, data$State, function(x) x[which.min(x$Employees), ] )

It does return the data in a list, but you can collapse that with

do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))
Roentgenotherapy answered 5/6, 2014 at 21:55 Comment(0)
O
5

In base you can use ave to get min per group and compare this with Employees and get a logical vector to subset the data.frame.

data[data$Employees == ave(data$Employees, data$State, FUN=min),]
#  State Company Employees
#4    AK       D        24
#5    RI       E        19

Or compare it already in the function.

data[as.logical(ave(data$Employees, data$State, FUN=function(x) x==min(x))),]
#data[ave(data$Employees, data$State, FUN=function(x) x==min(x))==1,] #Variant
#  State Company Employees
#4    AK       D        24
#5    RI       E        19
Overload answered 18/5, 2020 at 7:7 Comment(0)
T
3

Corrected plyr solution:

ddply(df, .(State), function(x) x[which.min(x$Employees),])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19

thanks to @joel.wilson

Traffic answered 5/6, 2014 at 21:45 Comment(0)
P
2

Using collapse

library(collapse)
library(magrittr)
data %>% 
  fgroup_by(State) %>% 
  fsummarise(Employees = fmin(Employees))
Plaint answered 16/3, 2021 at 20:55 Comment(0)
M
0

Another data.table solution:

DT[, E_min := min(Employees), by = State][Employees == E_min]

Pretty straightforward and among the fastest. Below I rerun David Arenburg's benchmarking with this and the other fastest data.table solutions.

library(data.table)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N * 10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
DT3 <- copy(DT)

microbenchmark(
    "(data.table) min column: " = DT3[, E_min := min(Employees), by = State][Employees == E_min],
    "(data.table) .I[which.min]: " = DT[DT[, .I[which.min(Employees)], by = State]$V1],
    "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
    "(data.table) self join (setkey):" = {
        tmp <- DT2[, .(Employees = min(Employees)), by = State]
        setkey(tmp, State, Employees)
        setkey(DT2, State, Employees)
        DT2[tmp, mult = "first"]
    }
)

                             expr      min       lq      mean   median        uq      max neval
        (data.table) min column:  44.30078 52.17932  68.31826 58.65887  76.89786 184.0207   100
     (data.table) .I[which.min]:  20.34116 26.31244  39.36874 34.01958  42.65439 124.9204   100
      (data.table) order/unique:  70.07820 80.20577 109.71235 95.25586 114.87695 514.4456   100
 (data.table) self join (setkey): 13.48105 16.06614  22.58310 17.35083  22.31206 161.9103   100

This solution is most useful if you want just a few columns returned among many, e.g. [Employees == E_min, ..columns_to_keep], in which case it can be even faster.

Mime answered 11/2, 2022 at 5:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.