Relative frequencies / proportions with dplyr

Asked 4/7, 2014 at 14:31 Answered 24/12, 2023 at 11:5

244

Suppose I want to calculate the proportion of different values within each group. For example, using the mtcars data, how do I calculate the relative frequency of number of gears by am (automatic/manual) in one go with dplyr?

library(dplyr)
data(mtcars)
mtcars <- tbl_df(mtcars)

# count frequency
mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n())

# am gear  n
#  0    3 15 
#  0    4  4 
#  1    4  8  
#  1    5  5

What I would like to achieve:

am gear  n rel.freq
 0    3 15      0.7894737
 0    4  4      0.2105263
 1    4  8      0.6153846
 1    5  5      0.3846154

Denbrook answered 4/7, 2014 at 14:31 Comment(3)

Are those percentages the actual numbers you want? Where are they coming from, algebraically? Ah, 79% is 15/(15+4), 21% is 4/(15+4) and then for am==1 62% is 8/(8+5) etc. Got it. – Hamite 4/7, 2014 at 14:35

@Hamite Yes, those are the number I want and Frank is correct, they sum to 100% by the am variable (79+21) and (62+38).. – Denbrook 4/7, 2014 at 14:41

This really seems to be looking for a native dplyr implementation of prop.table()/sweep(). Also, in other questions some people are asking for the option to include zero-counts for variables or variable-interactions – Linolinocut 26/4, 2016 at 21:25

440

Try this:

mtcars %>%
  group_by(am, gear) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))

#   am gear  n      freq
# 1  0    3 15 0.7894737
# 2  0    4  4 0.2105263
# 3  1    4  8 0.6153846
# 4  1    5  5 0.3846154

From the dplyr vignette:

When you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll-up a dataset.

Thus, after the summarise, the last grouping variable specified in group_by, 'gear', is peeled off. In the mutate step, the data is grouped by the remaining grouping variable(s), here 'am'. You may check grouping in each step with groups.

The outcome of the peeling is of course dependent of the order of the grouping variables in the group_by call. You may wish to do a subsequent group_by(am), to make your code more explicit.

For rounding and prettification, please refer to the nice answer by @Tyler Rinker.

Vinni answered 4/7, 2014 at 14:42 Comment(3)

I just discovered that solution too, but I don't know why sum(n) works over the am group and not the gear group too... – Hamite 4/7, 2014 at 14:44

See the vignette: "When you group by multiple variables, each summary peels off one level of the grouping." – Vinni 4/7, 2014 at 14:44

could also replace the summarise(n = n()) line with tally() since it is a wrapper for that very statement. – Immolate 24/9, 2021 at 12:6

You can use count() function, which has however a different behaviour depending on the version of dplyr:

dplyr 0.7.1: returns an ungrouped table: you need to group again by am
dplyr < 0.7.1: returns a grouped table, so no need to group again, although you might want to ungroup() for later manipulations

dplyr 0.7.1

mtcars %>%
  count(am, gear) %>%
  group_by(am) %>%
  mutate(freq = n / sum(n))

dplyr < 0.7.1

mtcars %>%
  count(am, gear) %>%
  mutate(freq = n / sum(n))

This results into a grouped table, if you want to use it for further analysis, it might be useful to remove the grouped attribute with ungroup().

Tonytonya answered 8/8, 2016 at 18:31 Comment(1)

This seems an invalid answer on dplyr 0.7.1. It does the frequency calculation overall on "gear", instead of within each level of "am". – Thant 19/7, 2017 at 14:16

@Henrik's is better for usability as this will make the column character and no longer numeric but matches what you asked for...

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = paste0(round(100 * n/sum(n), 0), "%"))

##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

EDIT Because Spacedman asked for it :-)

as.rel_freq <- function(x, rel_freq_col = "rel.freq", ...) {
    class(x) <- c("rel_freq", class(x))
    attributes(x)[["rel_freq_col"]] <- rel_freq_col
    x
}

print.rel_freq <- function(x, ...) {
    freq_col <- attributes(x)[["rel_freq_col"]]
    x[[freq_col]] <- paste0(round(100 * x[[freq_col]], 0), "%")   
    class(x) <- class(x)[!class(x)%in% "rel_freq"]
    print(x)
}

mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq = n/sum(n)) %>%
  as.rel_freq()

## Source: local data frame [4 x 4]
## Groups: am
## 
##   am gear  n rel.freq
## 1  0    3 15      79%
## 2  0    4  4      21%
## 3  1    4  8      62%
## 4  1    5  5      38%

Kiva answered 4/7, 2014 at 14:44 Comment(2)

You could always create an S3 "percentage" class with a format method that adds a percent sign... #overkill – Hamite 4/7, 2014 at 14:49

Implementing this might be interesting too: #13483930 – Hamite 4/7, 2014 at 15:13

Despite the many answers, one more approach which uses prop.table in combination with 'dplyr' or 'data.table'.

Since 'dplyr' v. >= 1.1.0 we can use the .by argument in mutate:

library(dplyr)

mtcars %>%
  count(am, gear) %>% 
  mutate(freq = prop.table(n), .by = am)

#>   am gear  n      freq
#> 1  0    3 15 0.7894737
#> 2  0    4  4 0.2105263
#> 3  1    4  8 0.6153846
#> 4  1    5  5 0.3846154

Before 'dplyr' v. < 1.1.0 one approach would be:

mtcars %>%
  group_by(am, gear) %>% 
  tally() %>% 
  mutate(freq = prop.table(n))

#> # A tibble: 4 × 4
#> # Groups:   am [2]
#>      am  gear     n  freq
#>   <dbl> <dbl> <int> <dbl>
#> 1     0     3    15 0.789
#> 2     0     4     4 0.211
#> 3     1     4     8 0.615
#> 4     1     5     5 0.385

With 'data.table' we can do:

library(data.table)
cars_dt <- as.data.table(mtcars)
cars_dt[, .(n = .N), keyby = .(am, gear)][, freq := prop.table(n), by = "am"][]

#>    am gear  n      freq
#> 1:  0    3 15 0.7894737
#> 2:  0    4  4 0.2105263
#> 3:  1    4  8 0.6153846
#> 4:  1    5  5 0.3846154

^{Created on 2022-10-22 with reprex v2.0.2}

Walkerwalkietalkie answered 30/7, 2019 at 19:26 Comment(0)

For the sake of completeness of this popular question, since version 1.0.0 of dplyr, parameter .groups controls the grouping structure of the summarise function after group_by summarise help.

With .groups = "drop_last", summarise drops the last level of grouping. This was the only result obtained before version 1.0.0.

library(dplyr)
library(scales)

original <- mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n()) %>%
  mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)

original
#> # A tibble: 4 x 4
#> # Groups:   am [2]
#>      am  gear     n rel.freq
#>   <dbl> <dbl> <int> <chr>   
#> 1     0     3    15 78.9%   
#> 2     0     4     4 21.1%   
#> 3     1     4     8 61.5%   
#> 4     1     5     5 38.5%

new_drop_last <- mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n(), .groups = "drop_last") %>%
  mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))

dplyr::all_equal(original, new_drop_last)
#> [1] TRUE

With .groups = "drop", all levels of grouping are dropped. The result is turned into an independent tibble with no trace of the previous group_by

# .groups = "drop"
new_drop <- mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n(), .groups = "drop") %>%
  mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))

new_drop
#> # A tibble: 4 x 4
#>      am  gear     n rel.freq
#>   <dbl> <dbl> <int> <chr>   
#> 1     0     3    15 46.9%   
#> 2     0     4     4 12.5%   
#> 3     1     4     8 25.0%   
#> 4     1     5     5 15.6%

If .groups = "keep", same grouping structure as .data (mtcars, in this case). summarise does not peel off any variable used in the group_by.

Finally, with .groups = "rowwise", each row is it's own group. It is equivalent to "keep" in this situation

# .groups = "keep"
new_keep <- mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n(), .groups = "keep") %>%
  mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))

new_keep
#> # A tibble: 4 x 4
#> # Groups:   am, gear [4]
#>      am  gear     n rel.freq
#>   <dbl> <dbl> <int> <chr>   
#> 1     0     3    15 100.0%  
#> 2     0     4     4 100.0%  
#> 3     1     4     8 100.0%  
#> 4     1     5     5 100.0%

# .groups = "rowwise"
new_rowwise <- mtcars %>%
  group_by (am, gear) %>%
  summarise (n=n(), .groups = "rowwise") %>%
  mutate(rel.freq =  scales::percent(n/sum(n), accuracy = 0.1))

dplyr::all_equal(new_keep, new_rowwise)
#> [1] TRUE

Another point that can be of interest is that sometimes, after applying group_by and summarise, a summary line can help.

# create a subtotal line to help readability
subtotal_am <- mtcars %>%
  group_by (am) %>% 
  summarise (n=n()) %>%
  mutate(gear = NA, rel.freq = 1)
#> `summarise()` ungrouping output (override with `.groups` argument)

mtcars %>% group_by (am, gear) %>%
  summarise (n=n()) %>% 
  mutate(rel.freq = n/sum(n)) %>%
  bind_rows(subtotal_am) %>%
  arrange(am, gear) %>%
  mutate(rel.freq =  scales::percent(rel.freq, accuracy = 0.1))
#> `summarise()` regrouping output by 'am' (override with `.groups` argument)
#> # A tibble: 6 x 4
#> # Groups:   am [2]
#>      am  gear     n rel.freq
#>   <dbl> <dbl> <int> <chr>   
#> 1     0     3    15 78.9%   
#> 2     0     4     4 21.1%   
#> 3     0    NA    19 100.0%  
#> 4     1     4     8 61.5%   
#> 5     1     5     5 38.5%   
#> 6     1    NA    13 100.0%

^{Created on 2020-11-09 by the reprex package (v0.3.0)}

Hope you find this answer useful.

Paulie answered 9/11, 2020 at 21:39 Comment(0)

I wrote a small function for this repeating task:

count_pct <- function(df) {
  return(
    df %>%
      tally %>% 
      mutate(n_pct = 100*n/sum(n))
  )
}

I can then use it like:

mtcars %>% 
  group_by(cyl) %>% 
  count_pct

It returns:

# A tibble: 3 x 3
    cyl     n n_pct
  <dbl> <int> <dbl>
1     4    11  34.4
2     6     7  21.9
3     8    14  43.8

Cental answered 18/3, 2019 at 10:52 Comment(0)

Here is a general function implementing Henrik's solution on dplyr 0.7.1.

freq_table <- function(x, 
                       group_var, 
                       prop_var) {
  group_var <- enquo(group_var)
  prop_var  <- enquo(prop_var)
  x %>% 
    group_by(!!group_var, !!prop_var) %>% 
    summarise(n = n()) %>% 
    mutate(freq = n /sum(n)) %>% 
    ungroup
}

Thant answered 19/7, 2017 at 14:18 Comment(0)

Also, try add_count() (to get around pesky group_by .groups).

mtcars %>% 
  count(am, gear) %>% 
  add_count(am, wt = n, name = "nn") %>% 
  mutate(proportion = n / nn)

Fauman answered 4/3, 2021 at 9:59 Comment(0)

Here is a base R answer using aggregate and ave :

df1 <- with(mtcars, aggregate(list(n = mpg), list(am = am, gear = gear), length))
df1$prop <- with(df1, n/ave(n, am, FUN = sum))
#Also with prop.table
#df1$prop <- with(df1, ave(n, am, FUN = prop.table))
df1

#  am gear  n      prop
#1  0    3 15 0.7894737
#2  0    4  4 0.2105263
#3  1    4  8 0.6153846
#4  1    5  5 0.3846154

We can also use prop.table but the output displays differently.

prop.table(table(mtcars$am, mtcars$gear), 1)
   
#            3         4         5
#  0 0.7894737 0.2105263 0.0000000
#  1 0.0000000 0.6153846 0.3846154

Bedfordshire answered 19/9, 2020 at 7:57 Comment(0)

This answer is based upon Matifou's answer.

First I modified it to ensure that I don't get the freq column returned as a scientific notation column by using the scipen option.

Then I multiple the answer by 100 to get a percent rather than decimal to make the freq column easier to read as a percentage.

getOption("scipen") 
options("scipen"=10) 
mtcars %>%
count(am, gear) %>% 
mutate(freq = (n / sum(n)) * 100)

Enterpriser answered 22/12, 2016 at 11:28 Comment(0)

Here's simple and idiomatic solution that respects prior grouping (computing proportions within each group and doesn't make you write any variable names twice (a common source of bugs!)

props = function(data, ...) {
  data %>%
  count(...) %>%
  mutate(prop = n / sum(n), .keep="unused", by=...)
}
mtcars %>% group_by(cyl,vs) %>% prop(gear, carb)

prop sums to one within each original group.

If you're not familiar with the ... syntax. It's incredibly useful for writing reusable functions.

Liquorice answered 24/12, 2023 at 11:5 Comment(1)

That's right, I use = for assignment, and as far as I'm concerned anyone who doesn't is a luddite :P – Liquorice 24/12, 2023 at 11:6

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags