How to select the rows with maximum values in each group with dplyr? [duplicate]
Asked Answered
G

6

163

I would like to select a row with maximum value in each group with dplyr.

Firstly I generate some random data to show my question

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row.

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value),])

In dplyr, I am using this code to get the maximum value, but not the rows with maximum value (Column C in this case).

library(dplyr)
df %>% group_by(A, B) %>%
    summarise(max = max(value))

How could I achieve this? Thanks for any suggestion.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1        
[4] tools_3.1.0        
Golfer answered 16/6, 2014 at 6:0 Comment(0)
R
189

Try this:

result <- df %>% 
             group_by(A, B) %>%
             filter(value == max(value)) %>%
             arrange(A,B,C)

Seems to work:

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value),])
)
#[1] TRUE

As pointed out in the comments, slice may be preferred here as per @RoyalITS' answer below if you strictly only want 1 row per group. This answer will return multiple rows if there are multiple with an identical maximum value.

Reluctivity answered 16/6, 2014 at 6:10 Comment(2)
The results are identical in this case because there are no duplicated maximum values present. Otherwise, the filter approach would return all maximum values (rows) per group while the OP's ddply approach with which.max would only return one maximum (the first) per group. To replicate that behavior, another option is to use slice(which.max(value)) in dplyr.Finedraw
According with @Finedraw comment the same result may be obtained using filter(row_number()==1). Remenber to arrange the DF before groupingSniffle
B
109
df %>% group_by(A,B) %>% slice(which.max(value))
Barony answered 24/2, 2016 at 16:40 Comment(1)
slice_max(.data, order_by, ..., n, prop, with_ties = TRUE) also now a good option...Stipel
A
86

You can use top_n

df %>% group_by(A, B) %>% top_n(n=1)

This will rank by the last column (value) and return the top n=1 rows.

Currently, you can't change the this default without causing an error (See https://github.com/hadley/dplyr/issues/426)

Allaround answered 16/6, 2014 at 6:14 Comment(2)
note that ties will cause duplicates.Roil
if your value is not the last column you can use the wt parameter of top_n df %>% group_by(A,B) %>% top_n(n=1, wt = value)Pitre
I
19

This more verbose solution provides greater control on what happens in case of duplicate maximum value (in this example, it will take one of the corresponding rows randomly)

library(dplyr)
df %>% group_by(A, B) %>%
  mutate(the_rank  = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>% select(-the_rank)
Ilan answered 18/7, 2016 at 7:59 Comment(1)
I think this is a great method, thanks for sharing.Barracoon
E
2

More generally, I think you might want to get "top" of the rows that are sorted within a given group.

For the case of where a single value is max'd out, you have essentially sorted by only one column. However, it's often useful to hierarchically sort by multiple columns (for example: a date column and a time-of-day column).

# Answering the question of getting row with max "value".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in descending order by "value" column.
  arrange( desc(value) ) %>% 
  # Pick the top 1 value
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()

# Answering an extension of the question of 
# getting row with the max value of the lowest "C".
df %>% 
  # Within each grouping of A and B values.
  group_by( A, B) %>% 
  # Sort rows in ascending order by C, and then within that by 
  # descending order by "value" column.
  arrange( C, desc(value) ) %>% 
  # Pick the one top row based on the sort
  slice(1) %>% 
  # Remember to ungroup in case you want to do further work without grouping.
  ungroup()
Eckenrode answered 16/1, 2019 at 19:6 Comment(2)
When you arrange( desc(value) ) then value are grouped? I would like to have there say sum of C.Tarnopol
@PrzemyslawRemin, I'm not sure I fully understand where the "sum of C" is or how you mean to use it. In dplyr it's nice to separate your steps. You could df %>% group_by( A, B) %>% mutate( s = sum(C) ) which will put the sum of C within each group as a (repeated) value s within each group (each row gets the same sum value within a group). And then you could ungroup and use one of the other methods described to filter max of C, such as ... %>% ungroup() %>% slice(which.max(C)).Eckenrode
J
1

For me, it helped to count the number of values per group. Copy the count table into a new object. Then filter for the max of the group based on the first grouping characteristic. For example:

count_table  <- df %>%
                group_by(A, B) %>%
                count() %>%
                arrange(A, desc(n))

count_table %>% 
    group_by(A) %>%
    filter(n == max(n))

or

count_table %>% 
    group_by(A) %>%
    top_n(1, n)
Juniorjuniority answered 1/2, 2019 at 14:39 Comment(1)
If you run this on the data in the question, I think you'll see it's answering the wrong question. The question in the post is about how to find the rows that have the maximum value (the number in the value column). This answer ignores the value column to find the most common B value for each A.Argumentative

© 2022 - 2024 — McMap. All rights reserved.