Using dplyr window functions to calculate percentiles
Asked Answered
K

11

66

I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.

Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:

library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

# old dplyr solution 
mtcars %>% group_by(cyl) %>% 
  do(data.frame(p=p, stats=quantile(.$mpg, probs=p), 
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
#       the format in which I'd like to see it, but are not critical

Is there a way I can do this more cleanly with dplyr using some of the summary functions (n_tiles, percent_rank, etc.)? By cleanly, I mean without the "do" statement.

Thank you

Kelby answered 27/5, 2015 at 16:38 Comment(1)
I should add that this code also uses the "tidyr" package, which is where the "spread" function comes fromKelby
B
100

In dplyr 1.0, summarise can return multiple values, allowing the following:

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>%  
  summarise(quantile = scales::percent(c(0.25, 0.5, 0.75)),
            mpg = quantile(mpg, c(0.25, 0.5, 0.75)))

Or, you can avoid a separate line to name the quantiles by going with enframe:

mtcars %>% 
  group_by(cyl) %>%  
  summarise(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))
    cyl quantile   mpg
  <dbl> <chr>    <dbl>
1     4 25%       22.8
2     4 50%       26  
3     4 75%       30.4
4     6 25%       18.6
5     6 50%       19.7
6     6 75%       21  
7     8 25%       14.4
8     8 50%       15.2
9     8 75%       16.2

NOTE: As of dplyr 1.1.0, returning multiple rows per group with summarise is deprecated. Instead, use reframe, as in:

mtcars %>% 
 group_by(cyl) %>%  
 reframe(enframe(quantile(mpg, c(0.25, 0.5, 0.75)), "quantile", "mpg"))

Answer for previous versions of dplyr

library(tidyverse)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(x=list(enframe(quantile(mpg, probs=c(0.25,0.5,0.75)), "quantiles", "mpg"))) %>% 
  unnest(x)
    cyl quantiles   mpg
1     4       25% 22.80
2     4       50% 26.00
3     4       75% 30.40
4     6       25% 18.65
5     6       50% 19.70
6     6       75% 21.00
7     8       25% 14.40
8     8       50% 15.20
9     8       75% 16.25

This can be turned into a more general function using tidyeval:

q_by_group = function(data, value.col, ..., probs=seq(0,1,0.25)) {

  groups=enquos(...)
  
  data %>% 
    group_by(!!!groups) %>% 
    summarise(x = list(enframe(quantile({{value.col}}, probs=probs), "quantiles", "mpg"))) %>% 
    unnest(x)
}

q_by_group(mtcars, mpg)
q_by_group(mtcars, mpg, cyl)
q_by_group(mtcars, mpg, cyl, vs, probs=c(0.5,0.75))
q_by_group(iris, Petal.Width, Species)
Burschenschaft answered 27/5, 2015 at 17:54 Comment(7)
Thanks -- this is the answer I was looking for, which is that you can do it, but not in a seamless way with a single call to quantile (and that it is an open issue in dplyr development).Kelby
what if i want to summarise_all ?Mexicali
This is so incredibly obscure, but I love it.Malpractice
@OmarAbdEl-Naser e.g., use summarise_all(.funs = function(x) list(enframe(quantile(x, probs = c(0.25,0.5,0.75), na.rm = TRUE)))) %>% unnestMalpractice
@Burschenschaft How can you create a new variable in the same dataset with quantile? The downside of using it in summarize is that it collapses your dataset, when I usually want to calculate percentiles and simultaneously create a new variable while maintaining my dataset instead of collapsing. Is there an easier way than having to join it back to the original dataset?Urinalysis
For example, if you want to create a column with, say, deciles, you could do: mtcars %>% mutate(mpg.decile = cut(mpg, breaks=quantile(mpg, probs=seq(0,1,0.1)), labels=10:1, include.lowest=TRUE)).Burschenschaft
If you want the percentile for each value in a column, you can use the ecdf function (empirical cumulative distribution function): mtcars %>% mutate(mpg.pctile = ecdf(mpg)(mpg)).Burschenschaft
S
45

If you're up for using purrr::map, you can do it like this!

library(tidyverse)

mtcars %>%
  tbl_df() %>%
  nest(-cyl) %>%
  mutate(Quantiles = map(data, ~ quantile(.$mpg)),
         Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>% 
  unnest(Quantiles)

#> # A tibble: 15 x 3
#>      cyl key   value
#>    <dbl> <chr> <dbl>
#>  1     6 0%     17.8
#>  2     6 25%    18.6
#>  3     6 50%    19.7
#>  4     6 75%    21  
#>  5     6 100%   21.4
#>  6     4 0%     21.4
#>  7     4 25%    22.8
#>  8     4 50%    26  
#>  9     4 75%    30.4
#> 10     4 100%   33.9
#> 11     8 0%     10.4
#> 12     8 25%    14.4
#> 13     8 50%    15.2
#> 14     8 75%    16.2
#> 15     8 100%   19.2

Created on 2018-11-10 by the reprex package (v0.2.1)

One nice thing about this approach is the output is tidy, one observation per row.

Sholom answered 24/3, 2017 at 17:32 Comment(5)
Thanks, I think this is the cleanest approach.Haimes
The only thing I'd add is a "spread" at the end to make things tabular for presentation purposes, i.e. %>% spread(names,x) Sequin
Trying this now, I get stopped with 'tidy.numeric' is deprecated.Contactor
Thanks @doconnor. I've updated my answer to not use broom anymore.Sholom
It's been bugging me that the mutate portion couldn't be done in one line with built-in tidyverse functionality, but I just realized theenframe function turns this into a one-liner: mutate(Quantiles = map(data, ~ enframe(quantile(.$mpg), "quantile"))).Burschenschaft
P
18

This is a dplyr approach that uses the tidy() function of the broom package, unfortunately it still requires do(), but it is a lot simpler.

library(dplyr)
library(broom)

mtcars %>%
    group_by(cyl) %>%
    do( tidy(t(quantile(.$mpg))) )

which gives:

    cyl   X0.  X25.  X50.  X75. X100.
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1     4  21.4 22.80  26.0 30.40  33.9
2     6  17.8 18.65  19.7 21.00  21.4
3     8  10.4 14.40  15.2 16.25  19.2

Note the use of t() since the broom package does not have a method for named numerics.

This is based on my earlier answer for summary() here.

Plasma answered 7/5, 2016 at 13:57 Comment(2)
If you also want to change column names you could even use tidy::spread() instead of t() and stringr::str_c(): mtcars %>% group_by(cyl) %>% do(tidy(quantile(.$mpg))) %>% mutate(names = stringr::str_c("Q", names)) %>% tidyr::spread(names, x). It is more verbose but gives you some freedom in adjustments.Macule
tidy has been deprecated in favor of tibble::as_tibble()Penland
P
12

Not sure how to avoid do() in dplyr, but you can do this with c() and as.list() with data.table in a pretty straightforward manner:

require(data.table) 
as.data.table(mtcars)[, c(as.list(quantile(mpg, probs=p)), 
                        avg=mean(mpg), n=.N), by=cyl]
#    cyl   25%  50%   75%      avg  n
# 1:   6 18.65 19.7 21.00 19.74286  7
# 2:   4 22.80 26.0 30.40 26.66364 11
# 3:   8 14.40 15.2 16.25 15.10000 14

Replace by with keyby if you want them ordered by cyl column.

Pahoehoe answered 27/5, 2015 at 16:49 Comment(2)
Good. I was aware of the as.list method in [.data.table and I tried it in dplyr but failed.Cohesive
This is a nice solution -- I wish I could use it for my particular project but can't for reasons unrelated to the answer itselfKelby
P
8

Answered many diffrent ways. dplyr distinct made the difference for what I wanted to do..

mtcars %>%
   select(cyl, mpg) %>%
   group_by(cyl) %>%
   mutate( qnt_0   = quantile(mpg, probs= 0),
           qnt_25  = quantile(mpg, probs= 0.25),
           qnt_50  = quantile(mpg, probs= 0.5),
           qnt_75  = quantile(mpg, probs= 0.75),
           qnt_100 = quantile(mpg, probs= 1),
              mean = mean(mpg),
                sd = sd(mpg)
          ) %>%
   distinct(qnt_0 ,qnt_25 ,qnt_50 ,qnt_75 ,qnt_100 ,mean ,sd)

renders

# A tibble: 3 x 8
# Groups:   cyl [3]
  qnt_0 qnt_25 qnt_50 qnt_75 qnt_100  mean    sd   cyl
  <dbl>  <dbl>  <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl>
1  17.8   18.6   19.7   21      21.4  19.7  1.45     6
2  21.4   22.8   26     30.4    33.9  26.7  4.51     4
3  10.4   14.4   15.2   16.2    19.2  15.1  2.56     8
Pitre answered 16/5, 2019 at 0:49 Comment(2)
Is there a reason to do mutate() then distinct() instead of summarize()?Twain
The reason for the "distinct()" was to distill only one raw per "cyl". There are always more than one way to slice an orange. I'd probably use summarize today.Pitre
P
6

This solution uses dplyr and tidyr only, lets you specify your quantiles in the dplyr chain, and takes advantage of tidyr::crossing() to "stack" multiple copies of the dataset prior to grouping and summarising.

diamonds %>%  # Initial data
  tidyr::crossing(pctile = 0:4/4) %>%  # Specify quantiles; crossing() is like expand.grid()
  dplyr::group_by(cut, pctile) %>%  # Indicate your grouping var, plus your quantile var
  dplyr::summarise(quantile_value = quantile(price, unique(pctile))) %>%  # unique() is needed
  dplyr::mutate(pctile = sprintf("%1.0f%%", pctile*100))  # Optional prettification

Result:

# A tibble: 25 x 3
# Groups:   cut [5]
         cut pctile quantile_value
       <ord>  <chr>          <dbl>
 1      Fair     0%         337.00
 2      Fair    25%        2050.25
 3      Fair    50%        3282.00
 4      Fair    75%        5205.50
 5      Fair   100%       18574.00
 6      Good     0%         327.00
 7      Good    25%        1145.00
 8      Good    50%        3050.50
 9      Good    75%        5028.00
10      Good   100%       18788.00
11 Very Good     0%         336.00
12 Very Good    25%         912.00
13 Very Good    50%        2648.00
14 Very Good    75%        5372.75
15 Very Good   100%       18818.00
16   Premium     0%         326.00
17   Premium    25%        1046.00
18   Premium    50%        3185.00
19   Premium    75%        6296.00
20   Premium   100%       18823.00
21     Ideal     0%         326.00
22     Ideal    25%         878.00
23     Ideal    50%        1810.00
24     Ideal    75%        4678.50
25     Ideal   100%       18806.00

The unique() is necessary to let dplyr::summarise() know that you only want one value per group.

Proterozoic answered 1/3, 2018 at 18:40 Comment(0)
B
4

Here is a solution using a combination of dplyr, purrr, and rlang:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

p_names <- paste0(p*100, "%")
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

# dplyr/purrr/rlang solution 
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), funs(!!!p_funs))
#> # A tibble: 3 x 4
#>     cyl `25%` `50%` `75%`
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  26    30.4
#> 2     6  18.6  19.7  21  
#> 3     8  14.4  15.2  16.2


#Especially useful if you want to summarize more variables
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, drat), funs(!!!p_funs))
#> # A tibble: 3 x 7
#>     cyl `mpg_25%` `drat_25%` `mpg_50%` `drat_50%` `mpg_75%` `drat_75%`
#>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
#> 1     4      22.8       3.81      26         4.08      30.4       4.16
#> 2     6      18.6       3.35      19.7       3.9       21         3.91
#> 3     8      14.4       3.07      15.2       3.12      16.2       3.22

Created on 2018-10-01 by the reprex package (v0.2.0).

Edit (2019-04-17):

As of dplyr 0.8.0, the funs function has been deprecated in favor of using list to pass the desired functions into scoped dplyr functions. As a result of this, the implementation above gets slightly more straightfoward. We no longer need to worry about unquoting the functions with the !!!. Please see the below reprex:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.5.2
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(purrr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25,.5,.75)

p_names <- paste0(p*100, "%")
p_funs <- map(p, ~partial(quantile, probs = .x, na.rm = TRUE)) %>% 
  set_names(nm = p_names)

# dplyr/purrr/rlang solution 
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg), p_funs)
#> # A tibble: 3 x 4
#>     cyl `25%` `50%` `75%`
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  26    30.4
#> 2     6  18.6  19.7  21  
#> 3     8  14.4  15.2  16.2


#Especially useful if you want to summarize more variables
mtcars %>% 
  group_by(cyl) %>% 
  summarize_at(vars(mpg, drat), p_funs)
#> # A tibble: 3 x 7
#>     cyl `mpg_25%` `drat_25%` `mpg_50%` `drat_50%` `mpg_75%` `drat_75%`
#>   <dbl>     <dbl>      <dbl>     <dbl>      <dbl>     <dbl>      <dbl>
#> 1     4      22.8       3.81      26         4.08      30.4       4.16
#> 2     6      18.6       3.35      19.7       3.9       21         3.91
#> 3     8      14.4       3.07      15.2       3.12      16.2       3.22

Created on 2019-04-17 by the reprex package (v0.2.0).

Brilliant answered 1/10, 2018 at 16:2 Comment(3)
that's very helpful. Don't know why this one did not have any upvotes yet.Malpractice
Wrapping the three lines into one function makes it a bit neater, using p_funs<-function() {etc}. One needs to use !!!p_funs() in the funs call in this caseMalpractice
With the new version of dplyr the funs function is soft-deprecated and now you actually only need to call p_funs within summarize_at. Please see my edit aboveBrilliant
J
2

Yet another way to accomplish this, with unnest_wider/longer

    mtcars %>%
       group_by(cyl) %>%
       summarise(quants = list(quantile(mpg, probs = c(.01, .1, .25, .5, .75, .90,.99)))) %>%
       unnest_wider(quants)

And if you wanted to do it for multiple variables, you could gather before the grouping:

mtcars %>%
   gather(key = 'metric', value = 'value', -cyl) %>%
   group_by(cyl, metric) %>%
   summarise(quants = list(quantile(value, probs = c(.01, .1, .25, .5, .75, .90,.99)))) %>%
  unnest_wider(quants)

Jacquline answered 20/8, 2020 at 14:17 Comment(0)
D
1

Here's a fairly readable solution that uses dplyr and purrr to return quantiles in a tidy format:

Code

library(dplyr)
library(purrr)

mtcars %>% 
    group_by(cyl) %>% 
    do({x <- .$mpg
        map_dfr(.x = c(.25, .5, .75),
                .f = ~ data_frame(Quantile = .x,
                                  Value = quantile(x, probs = .x)))
       })

Result

# A tibble: 9 x 3
# Groups:   cyl [3]
    cyl Quantile Value
  <dbl>    <dbl> <dbl>
1     4     0.25 22.80
2     4     0.50 26.00
3     4     0.75 30.40
4     6     0.25 18.65
5     6     0.50 19.70
6     6     0.75 21.00
7     8     0.25 14.40
8     8     0.50 15.20
9     8     0.75 16.25
Dissepiment answered 20/11, 2017 at 20:12 Comment(0)
M
0

do() is in fact the correct idiom, since it’s designed for group-wise transformations. Think of it as an lapply() that maps over groups of a data frame. (For such a specialized function, a generic name like “do” is not ideal. But it’s probably too late to change it.)

Morally, within each cyl group, you want to apply quantile() to the mpg column:

library(dplyr)

p <- c(.2, .5, .75)

mtcars %>% 
  group_by(cyl) %>%
  do(quantile(.$mpg, p))

#> Error: Results 1, 2, 3 must be data frames, not numeric

Except that doesn’t work because quantile() doesn’t return a data frame; you must convert its output, explicitly. Since this alteration amounts to wrapping quantile() with a data frame, you can use the gestalt function composition operator %>>>%:

library(gestalt)
library(tibble)

quantile_tbl <- quantile %>>>% enframe("quantile")

mtcars %>% 
  group_by(cyl) %>%
  do(quantile_tbl(.$mpg, p))

#> # A tibble: 9 x 3
#> # Groups:   cyl [3]
#>     cyl quantile value
#>   <dbl> <chr>    <dbl>
#> 1     4 20%       22.8
#> 2     4 50%       26  
#> 3     4 75%       30.4
#> 4     6 20%       18.3
#> 5     6 50%       19.7
#> 6     6 75%       21  
#> 7     8 20%       13.9
#> 8     8 50%       15.2
#> 9     8 75%       16.2
Mixie answered 26/10, 2018 at 9:36 Comment(0)
M
0

You can use q_summarise() from my package timeplyr.

It's both tidy-based (using data-masking rules) and very fast as it uses collapse and data.table under the hood.

# To install, uncomment the below line
# remotes::install_github("NicChr/timeplyr")

library(tidyverse)
library(timeplyr)

mtcars %>%
  q_summarise(mpg, .by = cyl, probs = p)
#>    cyl   p25  p50   p75
#> 1:   4 22.80 26.0 30.40
#> 2:   6 18.65 19.7 21.00
#> 3:   8 14.40 15.2 16.25

mtcars %>%
  q_summarise(mpg, .by = cyl, probs = p, pivot = "long")
#>    cyl .quantile   mpg
#> 1:   4       p25 22.80
#> 2:   4       p50 26.00
#> 3:   4       p75 30.40
#> 4:   6       p25 18.65
#> 5:   6       p50 19.70
#> 6:   6       p75 21.00
#> 7:   8       p25 14.40
#> 8:   8       p50 15.20
#> 9:   8       p75 16.25

# Comparison when there are lots of groups

df <- tibble(g = sample.int(10^4, replace = TRUE),
                             x = rnorm(10^4))

bench::mark(timeplyr = q_summarise(df, x, .by = g,
                        pivot = "long", probs = seq(0, 1, 0.25)),
            dplyr = q_by_group(df, x, g, probs = seq(0, 1, 0.25)),
            check = FALSE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 timeplyr     27.7ms   31.4ms    29.7      2.06MB     5.95
#> 2 dplyr          1.5s     1.5s     0.665    5.33MB     5.99

Created on 2023-07-10 with reprex v2.0.2

Madelynmademoiselle answered 10/7, 2023 at 9:50 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.