Dplyr function to compute average, n, sd and standard error

I find myself writing this bit of code all the time to produce standard errors for group means (which I then use for plotting confidence intervals).

It would be nice to write my own function to do this in one line of code, though. I have read dplyr's vignette on non-standard evaluation (NSE) and this blog post as well. I get it somewhat, but I'm too much of a noob to figure this out on my own. Can anyone help out?

library(dplyr)

var1 <- sample(c('red', 'green'), size = 10, replace = TRUE)
var2 <- rnorm(10, mean = 5, sd = 1)
df <- data.frame(var1, var2)

df %>% 
  group_by(var1) %>% 
  summarize(avg = mean(var2), n = n(), sd = sd(var2), se = sd / sqrt(n))
Schlegel answered 30/5, 2017 at 15:32 Comment(2)
Can you show what you have tried? Where did you get stuck? Have a look at some of the questions in the [nse] tag.Kerianne
Well, I was playing around with this code from the blog post: mean_mpg = function(data, ..., x) { data %>% group_by_(.dots = lazyeval::lazy_dots(...)) %>% summarize(mean_mpg = ~mean(x)) }; mtcars %>% mean_mpg(cyl, gear, mpg). It returned the error "Not a Vector".Schlegel
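For reference, here is a minimal sketch of the standard-evaluation (lazyeval) pattern that comment was reaching for, which is what the CRAN dplyr of the time supported. The function name my_fun_se and passing the numeric column as a string are illustrative choices, not from the question:

library(dplyr)

# Sketch only: pre-dplyr-0.7 "underscore" verbs plus lazyeval.
# The numeric column is passed as a string; grouping columns go through ... as bare names.
my_fun_se <- function(x, num_var, ...) {
  num_var <- as.name(num_var)
  x %>%
    group_by_(.dots = lazyeval::lazy_dots(...)) %>%
    summarize_(.dots = list(
      avg = lazyeval::interp(~ mean(v), v = num_var),
      n   = ~ n(),
      sd  = lazyeval::interp(~ sd(v), v = num_var),
      se  = ~ sd / sqrt(n)
    ))
}

df %>% my_fun_se("var2", var1)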

You can use enquo() to capture the variables named in your function call and !! to unquote them inside the dplyr verbs:

my_fun <- function(x, cat_var, num_var){
  cat_var <- enquo(cat_var)
  num_var <- enquo(num_var)

  x %>%
    group_by(!!cat_var) %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

which gives you:

> my_fun(df, var1, var2)
# A tibble: 2 x 5
    var1      avg     n        sd        se
  <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green 4.873617     7 0.7515280 0.2840509
2    red 5.337151     3 0.1383129 0.0798550

and that matches the output of your example:

> df %>% 
+   group_by(var1) %>% 
+   summarize(avg=mean(var2), n=n(), sd=sd(var2), se=sd/sqrt(n))
# A tibble: 2 x 5
    var1      avg     n        sd        se
  <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green 4.873617     7 0.7515280 0.2840509
2    red 5.337151     3 0.1383129 0.0798550

EDIT:

The OP has asked about removing the group_by statement from the function, to allow grouping by more than one variable. There are two ways to go about this IMO. First, you could simply remove the group_by statement and pipe a grouped data frame into the function. That method would look like this:

my_fun <- function(x, num_var){
  num_var <- enquo(num_var)

  x %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

df %>%
  group_by(var1) %>%
  my_fun(var2)

Another way to go about this is to use ... and quos() so the function can capture multiple arguments for the group_by statement. That would look like this:

#first, build the new dataframe
var1<-sample(c('red', 'green'), size=10, replace=T)
var2<-rnorm(10, mean=5, sd=1)
var3 <- sample(c("A", "B"), size = 10, replace = TRUE)
df<-data.frame(var1, var2, var3)

# using the first version `my_fun`, it would look like this
df %>%
  group_by(var1, var3) %>%
  my_fun(var2)

# A tibble: 4 x 6
# Groups:   var1 [?]
    var1   var3      avg     n        sd        se
  <fctr> <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green      A 5.248095     1       NaN       NaN
2  green      B 5.589881     2 0.7252621 0.5128378
3    red      A 5.364265     2 0.5748759 0.4064986
4    red      B 4.908226     5 1.1437186 0.5114865

# Now doing it with a new function `my_fun2`
my_fun2 <- function(x, num_var, ...){
  group_var <- quos(...)
  num_var <- enquo(num_var)

  x %>%
    group_by(!!!group_var) %>%
    summarize(avg = mean(!!num_var), n = n(), 
              sd = sd(!!num_var), se = sd/sqrt(n))
}

df %>%
  my_fun2(var2, var1, var3)

# A tibble: 4 x 6
# Groups:   var1 [?]
    var1   var3      avg     n        sd        se
  <fctr> <fctr>    <dbl> <int>     <dbl>     <dbl>
1  green      A 5.248095     1       NaN       NaN
2  green      B 5.589881     2 0.7252621 0.5128378
3    red      A 5.364265     2 0.5748759 0.4064986
4    red      B 4.908226     5 1.1437186 0.5114865
Frenzy answered 30/5, 2017 at 18:28 Comment(4)
You should probably note that this only works in the dev version of dplyr, not the current CRAN version, which OP is most likely using.Kerianne
I'm finally returning to this; I had forgotten I had asked this. But is it possible to not include the categorical grouping variables in the function? Sometimes I group by one, sometimes by two grouping variables. I'd like to keep that flexibility outside the custom function. But I don't know if that is possible.Schlegel
I have added an edit that will let you do this in 2 different waysFrenzy
This is great, and I have been using this, but I feel like a function like this should be in some package somewhere. Does anyone know if this is somewhere in a tidyverse-friendly package?Schlegel

library(dplyr)

sum_stats <- function(df, ..., by, col_names = "{.col}_{.fn}") {
  df |>
    summarize(across(c(...), list(mean = mean, n = length, sd = sd, 
                                  se = ~ sd(.) / sqrt(length(.))),
                     .names = col_names),
              .by = {{by}})
}

Usage

# single variable summary stats (no grouping variable)
mtcars |>
  sum_stats(mpg, col_names = "{.fn}")
#       mean  n       sd       se
# 1 20.09062 32 6.026948 1.065424
# multiple variables using tidy-select syntax (no grouping variable)
mtcars |>
  sum_stats(disp:hp)
#   disp_mean disp_n  disp_sd  disp_se  hp_mean hp_n    hp_sd    hp_se
# 1  230.7219     32 123.9387 21.90947 146.6875   32 68.56287 12.12032
# multiple variables and multiple grouping variables
mtcars |>
  sum_stats(mpg, disp:hp, by = c(cyl, vs))
#   cyl vs mpg_mean mpg_n    mpg_sd    mpg_se disp_mean disp_n   disp_sd   disp_se  hp_mean hp_n    hp_sd     hp_se
# 1   6  0 20.56667     3 0.7505553 0.4333333    155.00      3  8.660254  5.000000 131.6667    3 37.52777 21.666667
# 2   4  1 26.73000    10 4.7481107 1.5014845    103.62     10 27.824641  8.798924  81.8000   10 21.87236  6.916647
# 3   6  1 19.12500     4 1.6317169 0.8158584    204.55      4 44.742634 22.371317 115.2500    4  9.17878  4.589390
# 4   8  0 15.10000    14 2.5600481 0.6842016    353.10     14 67.771324 18.112648 209.2143   14 50.97689 13.624146
# 5   4  0 26.00000     1        NA        NA    120.30      1        NA        NA  91.0000    1       NA        NA
  1. Using across() and a list of functions reduces typing for multiple variables and allows the use of the .names argument.
  2. The NSE setup has been reduced by the embrace operator {{ }}, which combines enquo() and !! (see the sketch after this list).
  3. The ellipsis (...) lets you pass multiple variables to summarize and supports tidy-select syntax.
  4. I added a col_names argument, which gives you more control over the output column names. It takes a glue-style syntax with {.col} representing the column name and {.fn} representing the function name; see the .names argument of ?across for more details.
  5. I put the named arguments after ... to force named-argument pairs in the function call (e.g. by = cyl), which I think is clearer, but you can move these around.
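
As a minimal sketch of point 2 (the function name my_fun3 is mine, and this assumes dplyr >= 1.0), the first answer's helper can be rewritten with the embrace operator instead of enquo() and !!:

library(dplyr)

# {{ }} ("embracing") forwards a function argument directly into a data-masking verb,
# replacing the enquo()/!! pair used in the earlier answer.
my_fun3 <- function(x, cat_var, num_var) {
  x |>
    group_by({{ cat_var }}) |>
    summarize(avg = mean({{ num_var }}), n = n(),
              sd = sd({{ num_var }}), se = sd / sqrt(n))
}

df |> my_fun3(var1, var2)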
Applicant answered 18/7, 2024 at 18:8 Comment(0)
