Dplyr function to compute average, n, sd and standard error

I find myself writing this bit of code all the time to produce standard errors for group means (which I then use for plotting confidence intervals).

It would be nice to write my own function to do this in one line of code, though. I have read dplyr's vignette on non-standard evaluation (NSE) and this blog post as well. I get it somewhat, but I'm too much of a noob to figure this out on my own. Can anyone help out?

library(dplyr)

var1 <- sample(c('red', 'green'), size = 10, replace = TRUE)
var2 <- rnorm(10, mean = 5, sd = 1)
df <- data.frame(var1, var2)

df %>% 
  group_by(var1) %>% 
  summarize(avg = mean(var2), n = n(), sd = sd(var2), se = sd / sqrt(n))
Schlegel answered 30/5, 2017 at 15:32 Comment(2)
Can you show what you have tried? Where did you get stuck? Have a look at some of the questions in the [nse] tag.Kerianne
Well, I was playing around with this code from the blog post: mean_mpg = function(data, ..., x) { data %>% group_by_(.dots = lazyeval::lazy_dots(...)) %>% summarize(mean_mpg = ~mean(x)) }; mtcars %>% mean_mpg(cyl, gear, mpg). It returned the error "Not a Vector".Schlegel
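For reference, here is a minimal sketch of the standard-evaluation (lazyeval) pattern that comment was reaching for, which is what the CRAN dplyr of the time supported. The function name my_fun_se and passing the numeric column as a string are illustrative choices, not from the question:

library(dplyr)

# Sketch only: pre-dplyr-0.7 "underscore" verbs plus lazyeval.
# The numeric column is passed as a string; grouping columns go through ... as bare names.
my_fun_se <- function(x, num_var, ...) {
  num_var <- as.name(num_var)
  x %>%
    group_by_(.dots = lazyeval::lazy_dots(...)) %>%
    summarize_(.dots = list(
      avg = lazyeval::interp(~ mean(v), v = num_var),
      n   = ~ n(),
      sd  = lazyeval::interp(~ sd(v), v = num_var),
      se  = ~ sd / sqrt(n)
    ))
}

df %>% my_fun_se("var2", var1)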

You can use enquo() to capture the variables named in your function call and !! to unquote them inside the dplyr verbs:

my_fun <- function(x, cat_var, num_var){
  cat_var <- enquo(cat_var)
  num_var <- enquo(num_var)

  x %>%
    group_by(!!cat_var) %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

which gives you:

> my_fun(df, var1, var2)
# A tibble: 2 x 5
    var1      avg     n        sd        se
  <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green 4.873617     7 0.7515280 0.2840509
2    red 5.337151     3 0.1383129 0.0798550

and that matches the output of your example:

> df %>% 
+   group_by(var1) %>% 
+   summarize(avg=mean(var2), n=n(), sd=sd(var2), se=sd/sqrt(n))
# A tibble: 2 x 5
    var1      avg     n        sd        se
  <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green 4.873617     7 0.7515280 0.2840509
2    red 5.337151     3 0.1383129 0.0798550

EDIT:

The OP has asked about removing the group_by statement from the function, to allow grouping by more than one variable. There are two ways to go about this IMO. First, you could simply remove the group_by statement and pipe a grouped data frame into the function. That method would look like this:

my_fun <- function(x, num_var){
  num_var <- enquo(num_var)

  x %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

df %>%
  group_by(var1) %>%
  my_fun(var2)

Another way to go about this is to use ... and quos() so the function can capture multiple arguments for the group_by statement. That would look like this:

#first, build the new dataframe
var1<-sample(c('red', 'green'), size=10, replace=T)
var2<-rnorm(10, mean=5, sd=1)
var3 <- sample(c("A", "B"), size = 10, replace = TRUE)
df<-data.frame(var1, var2, var3)

# using the first version `my_fun`, it would look like this
df %>%
  group_by(var1, var3) %>%
  my_fun(var2)

# A tibble: 4 x 6
# Groups:   var1 [?]
    var1   var3      avg     n        sd        se
  <fctr> <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green      A 5.248095     1       NaN       NaN
2  green      B 5.589881     2 0.7252621 0.5128378
3    red      A 5.364265     2 0.5748759 0.4064986
4    red      B 4.908226     5 1.1437186 0.5114865

# Now doing it with a new function `my_fun2`
my_fun2 <- function(x, num_var, ...){
  group_var <- quos(...)
  num_var <- enquo(num_var)

  x %>%
    group_by(!!!group_var) %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

df %>%
  my_fun2(var2, var1, var3)

# A tibble: 4 x 6
# Groups:   var1 [?]
    var1   var3      avg     n        sd        se
  <fctr> <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green      A 5.248095     1       NaN       NaN
2  green      B 5.589881     2 0.7252621 0.5128378
3    red      A 5.364265     2 0.5748759 0.4064986
4    red      B 4.908226     5 1.1437186 0.5114865
Frenzy answered 30/5, 2017 at 18:28 Comment(4)
You should probably note that this only works in the dev version of dplyr, not the current CRAN version, which OP is most likely using.Kerianne
I'm finally returning to this; I had forgotten I had asked this. But is it possible to not include the categorical grouping variables in the function? Sometimes I group by one, sometimes by two grouping variables. I'd like to keep that flexibility outside the custom function. But I don't know if that is possible.Schlegel
I have added an edit that will let you do this in 2 different waysFrenzy
This is great, and I have been using this, but I feel like a function like this should be in some package somewhere. Does anyone know if this is somewhere in a tidyverse-friendly package?Schlegel

library(dplyr)

sum_stats <- function(df, ..., by, col_names = "{.col}_{.fn}") {
  df |>
    summarize(across(c(...), list(mean = mean, n = length, sd = sd, 
                                  se = ~ sd(.) / sqrt(length(.))),
                     .names = col_names),
              .by = {{by}})
}

Usage

# single variable summary stats (no grouping variable)
mtcars |>
  sum_stats(mpg, col_names = "{.fn}")
#       mean  n       sd       se
# 1 20.09062 32 6.026948 1.065424
# multiple variables using tidy-select syntax (no grouping variable)
mtcars |>
  sum_stats(disp:hp)
#   disp_mean disp_n  disp_sd  disp_se  hp_mean hp_n    hp_sd    hp_se
# 1  230.7219     32 123.9387 21.90947 146.6875   32 68.56287 12.12032
# multiple variables and multiple grouping variables
mtcars |>
  sum_stats(mpg, disp:hp, by = c(cyl, vs))
#   cyl vs mpg_mean mpg_n    mpg_sd    mpg_se disp_mean disp_n   disp_sd   disp_se  hp_mean hp_n    hp_sd     hp_se
# 1   6  0 20.56667     3 0.7505553 0.4333333    155.00      3  8.660254  5.000000 131.6667    3 37.52777 21.666667
# 2   4  1 26.73000    10 4.7481107 1.5014845    103.62     10 27.824641  8.798924  81.8000   10 21.87236  6.916647
# 3   6  1 19.12500     4 1.6317169 0.8158584    204.55      4 44.742634 22.371317 115.2500    4  9.17878  4.589390
# 4   8  0 15.10000    14 2.5600481 0.6842016    353.10     14 67.771324 18.112648 209.2143   14 50.97689 13.624146
# 5   4  0 26.00000     1        NA        NA    120.30      1        NA        NA  91.0000    1       NA        NA
  1. Using across() and a list of functions reduces typing for multiple variables and allows the use of the .names argument.
  2. The NSE setup has been reduced by the embrace operator {{ }}, which combines enquo() and !! (see the sketch after this list).
  3. The ellipsis (...) lets you pass multiple variables to summarize and supports tidy-select syntax.
  4. I added a col_names argument, which gives you more control over the output column names. It takes a glue-style syntax with {.col} representing the column name and {.fn} representing the function name; see the .names argument of ?across for more details.
  5. I put the named arguments after ... to force named-argument pairs in the function call (e.g. by = cyl), which I think is clearer, but you can move these around.
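
As a minimal sketch of point 2 (the function name my_fun3 is mine, and this assumes dplyr >= 1.0), the first answer's helper can be rewritten with the embrace operator instead of enquo() and !!:

library(dplyr)

# {{ }} ("embracing") forwards a function argument directly into a data-masking verb,
# replacing the enquo()/!! pair used in the earlier answer.
my_fun3 <- function(x, cat_var, num_var) {
  x |>
    group_by({{ cat_var }}) |>
    summarize(avg = mean({{ num_var }}), n = n(),
              sd = sd({{ num_var }}), se = sd / sqrt(n))
}

df |> my_fun3(var1, var2)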
Applicant answered 18/7, 2024 at 18:8 Comment(0)
