dplyr function with optional grouping only when argument provided

Asked 16/10, 2018 at 12:34 Answered 18/6, 2024 at 11:31

r dplyr metaprogramming tidyverse quasiquotes

I need to write a dplyr function that creates a customised area plot. So here's my attempt.

area_plot <- function(data, what, by){
  by <- ensym(by)
  what <- ensym(what)

  data %>% 
    filter(!is.na(!!by)) %>% 
    group_by(date, !!by) %>% 
    summarise(!!what := sum(!!what, na.rm = TRUE)) %>% 
    complete(date, !!by, fill = rlang::list2(!!what := 0)) %>% 
    ggplot(aes(date, !!what, fill = !!by)) +
    geom_area(position = 'stack') +
    scale_x_date(breaks = '1 month', date_labels = '%Y-%m', expand = c(.01, .01)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = .4)) +
    labs(fill = '')
}

But I've been wondering whether there is a default value for the by argument that would produce a geom_area plot for all groups together. I know I can use an if statement to prepare the data before passing it to ggplot2, with something like this inside the function:

if (by != 'default') {
    data <- data %>% 
    filter(!is.na(!!by)) %>% 
    group_by(date, !!by) %>% 
    summarise(!!what := sum(!!what, na.rm = TRUE)) %>% 
    complete(date, !!by, fill = rlang::list2(!!what := 0))}

ggplot(data, aes(date, !!what, fill = !!by)) +
geom_area(position = 'stack') +
scale_x_date(breaks = '1 month', date_labels = '%Y-%m', expand = c(.01, .01)) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = .4)) +
labs(fill = '')

But I wonder if there's a neat trick: some value (e.g. a constant) passed to group_by that would make summarise preserve the original structure (so basically do nothing) despite being called. The behaviour would be similar to supplying a constant to an aesthetic in ggplot2.

Please see the sample of the data attached. group is an optional grouping variable.

structure(list(date = structure(c(17052, 17654, 17111, 17402, 
17090, 17765, 17181, 17301, 17496, 17051, 16980, 17155, 17599, 
16986, 17607, 17620, 17328, 17085, 17666, 17759, 17238, 16975, 
17242, 17322, 17625, 17598, 17124, 17648, 17675, 17613, 17044, 
16984, 16968, 17421, 17152, 17148, 17418, 17017, 17655, 17148, 
16981, 17644, 17149, 17090, 17548, 17474, 17564, 17530, 17237, 
17679, 17166, 17470, 17427, 17306, 17677, 17600, 17458, 17697, 
17602, 16990, 17111, 17150, 17561, 17406, 17135, 17181, 17014, 
17419, 17273, 17416, 17101, 17367, 17170, 17015, 17386, 17444, 
17507, 17592, 17058, 17292, 16966, 17756, 17239, 17479, 17260, 
17477, 16989, 17032, 17219, 17430, 17696, 17487, 17578, 17759, 
17269, 17634, 17279, 17478, 17222, 17296), class = "Date"), count = c(2, 
4, 2, 3, 6, 1, 4, 8, 1, 5, 1, 5, 1, 1, 2, 6, 3, 5, 2, 7, 3, 4, 
1, 3, 4, 2, 4, 1, 2, 3, 16, 1, 5, 4, 3, 4, 4, 6, 1, 3, 3, 1, 
3, 10, 5, 1, 4, 2, 2, 4, 5, 26, 4, 9, 3, 1, 3, 1, 4, 1, 2, 3, 
1, 13, 3, 1, 3, 1, 1, 3, 1, 3, 3, 4, 1, 2, 2, 3, 1, 9, 3, 1, 
2, 1, 4, 2, 1, 2, 4, 3, 2, 3, 1, 6, 5, 1, 2, 2, 3, 4), group = c("NON-FOOD", 
NA, NA, NA, NA, "MIX", NA, NA, "MIX", NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, "FOOD", NA, "FOOD", NA, NA, "MIX", 
NA, NA, NA, "FOOD", "FOOD", NA, NA, NA, NA, "FOOD", NA, NA, "FOOD", 
NA, NA, NA, "FOOD", NA, NA, NA, NA, NA, NA, NA, NA, "MIX", NA, 
NA, "FOOD", NA, "FOOD", NA, NA, "FOOD", NA, "FOOD", NA, NA, "NON-FOOD", 
NA, NA, "MIX", "NON-FOOD", NA, NA, NA, NA, NA, NA, "IMAGE", NA, 
"FOOD", NA, NA, NA, "FOOD", NA, "FOOD", NA, NA, NA, NA, NA, NA, 
NA, NA, "FOOD", "FOOD", NA, NA, NA)), row.names = c(73008L, 535553L, 
122359L, 321655L, 105632L, 646925L, 172409L, 256204L, 394666L, 
72385L, 20180L, 156162L, 478525L, 91409L, 485397L, 501386L, 277336L, 
100902L, 549629L, 640676L, 209400L, 16603L, 224543L, 272638L, 
505291L, 475497L, 131845L, 529041L, 558295L, 491746L, 67156L, 
23499L, 11150L, 334454L, 154958L, 150674L, 333348L, 45599L, 536064L, 
150673L, 20668L, 524095L, 151809L, 105713L, 433853L, 375687L, 
445626L, 420587L, 208594L, 562514L, 162403L, 372594L, 338509L, 
259784L, 560356L, 480072L, 361471L, 579474L, 481262L, 26469L, 
122119L, 152537L, 443426L, 325045L, 140531L, 171908L, 43547L, 
333968L, 237152L, 332106L, 114754L, 298081L, 164923L, 43577L, 
311250L, 350267L, 404348L, 470188L, 78329L, 250086L, 9486L, 638289L, 
209638L, 379370L, 227299L, 377487L, 26333L, 55058L, 195261L, 
340666L, 578515L, 387600L, 457752L, 640729L, 235389L, 514348L, 
240303L, 378836L, 197409L, 252746L), class = "data.frame")

Surreptitious asked 16/10, 2018 at 12:34 Comment(2)

You can take an argument for your grouping variable and default it to NULL. Then check whether that argument is null: if it isn't, group by that variable, and if it is null, skip the grouping step. – Caravansary 16/10, 2018 at 12:55

@camille, thanks for your suggestion. Could you elaborate on how that differs from setting the default to NA and checking with is.na afterwards? – Surreptitious 16/10, 2018 at 13:1

Here's one way to do the first few steps of your function (I didn't go into all the ggplot stuff, just how you could approach grouping). In general, to set a default "do nothing" action, such as default to not grouping, you'll use argument = NULL in your function--you can look around at other functions' doc pages to see how this is done. Here's an SO post on the difference between NA and NULL.
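To make the NULL-versus-NA point concrete, here is a minimal sketch. has_by is a hypothetical helper written for illustration only (it is not part of rlang or dplyr); it just reports whether a by argument was supplied.

```r
library(rlang)

# Hypothetical helper: TRUE when a `by` column was supplied, FALSE otherwise.
has_by <- function(by = NULL) {
  # enquo() captures the argument without evaluating it,
  # so a bare column name never has to exist as an object.
  !quo_is_null(enquo(by))
}

has_by()        # nothing supplied, the default NULL is detected
has_by(group)   # a bare column name counts as supplied
has_by(NA)      # NA is an ordinary value, so it also counts as supplied

# NULL has length zero while NA has length one,
# which is why NULL is the conventional "nothing here" default.
length(NULL)
length(NA)
```

Checking is.na(by) directly would instead force R to evaluate the bare column name, which typically fails with an "object not found" error because the column only exists inside the data frame.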

I'm not super adept at working with quosures, but I've built a few functions and often rely on some rlang/tidyselect helper functions, such as rlang::quo_is_null that I'm using here. Someone else may be able to rewrite this without helpers.

First to see the behavior you're looking for, as the grouped or not grouped summaries:

library(tidyverse)

# grouped
df %>%
  filter(!is.na(group)) %>%
  group_by(group) %>%
  summarise(count = sum(count, na.rm = TRUE))
#> # A tibble: 4 x 2
#>   group    count
#>   <chr>    <dbl>
#> 1 FOOD        34
#> 2 IMAGE        1
#> 3 MIX          8
#> 4 NON-FOOD     6

# not grouped
df %>%
  # add in if you want to filter ungrouped data
  summarise(count = sum(count, na.rm = TRUE))
#>   count
#> 1   347

Then in the function, I create what_var as the quosure version of what (rlang experts, feel free to correct me on this terminology). I generally add _var to names to keep track of which is the original argument and which has already been enquoed. To check whether by was supplied, create a quosure of by and test whether that quosure is null. If it's not null, i.e. if some column name was supplied for by, filter and group by that quosure. If it is null, just pass along the original data frame; in the else branch I assign the data to a new variable so the later steps don't operate on the original data frame directly. Then, regardless of whether the data is grouped, summarise what.

to_group_or_not_to_group <- function(data, what, by = NULL) {
  what_var <- enquo(what)

  if(!rlang::quo_is_null(enquo(by))) {
    by_var <- enquo(by)

    grouped_or_not <- data %>%
      filter(!is.na(!!by_var)) %>%
      group_by(!!by_var)
  } else {
    grouped_or_not <- data
  }

  grouped_or_not %>%
    summarise(!!quo_name(what_var) := sum(!!what_var, na.rm = TRUE))

}

Verify that you got your targeted results. With a grouping variable:

df %>%
  to_group_or_not_to_group(what = count, by = group)
#> # A tibble: 4 x 2
#>   group    count
#>   <chr>    <dbl>
#> 1 FOOD        34
#> 2 IMAGE        1
#> 3 MIX          8
#> 4 NON-FOOD     6

Supplying NULL as the (absence of) grouping variable:

df %>%
  to_group_or_not_to_group(what = count, by = NULL)
#>   count
#> 1   347

Without a grouping variable, falling back on the default by = NULL:

df %>%
  to_group_or_not_to_group(what = count)
#>   count
#> 1   347

Created on 2018-10-16 by the reprex package (v0.2.1)
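As a rough sketch of how the same idea might carry over to the question's data preparation, where the data is always grouped by date and only optionally by a second column: prep_area_data is a hypothetical name, the date column comes from the question's sample data, and the "{{ what }}" naming syntax and the .groups argument assume dplyr 1.0.0 or later with a recent rlang. The complete() and ggplot steps are left out.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: sums `what` per date, and per `by` only when `by` is supplied.
prep_area_data <- function(data, what, by = NULL) {
  by_quo <- rlang::enquo(by)

  if (rlang::quo_is_null(by_quo)) {
    # No grouping column supplied: aggregate over everything within each date.
    grouped <- data %>%
      group_by(date)
  } else {
    # Grouping column supplied: drop rows where it is NA, then group by both.
    grouped <- data %>%
      filter(!is.na(!!by_quo)) %>%
      group_by(date, !!by_quo)
  }

  grouped %>%
    summarise("{{ what }}" := sum({{ what }}, na.rm = TRUE), .groups = "drop")
}

# Small made-up data frame with the same column names as the question's sample.
toy <- data.frame(
  date  = as.Date(c("2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02")),
  group = c("FOOD", NA, "FOOD", "MIX"),
  count = c(1, 2, 3, 4)
)

by_group <- prep_area_data(toy, what = count, by = group)  # one row per date and group
overall  <- prep_area_data(toy, what = count)              # one row per date
```

The result can then be passed on to the plotting code, with the fill aesthetic added only in the grouped case.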

Caravansary answered 16/10, 2018 at 13:28 Comment(0)

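For a one-line solution, you could combine across() and any_of() with as_label() (and enquo() if you use it inside a function). When by is left as NULL, as_label() returns the string "NULL", which any_of() does not find among the columns, so no grouping is applied: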

library(dplyr, warn.conflicts = FALSE)

group_maybe <- function(df, by=NULL){
  df %>% 
    group_by(across(any_of(as_label(enquo(by)))))
}

group_maybe(iris, by = Species)
#> # A tibble: 150 × 5
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> # ℹ 148 more rows

group_maybe(iris)
#> # A tibble: 150 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> # ℹ 148 more rows

Created on 2024-06-18 with reprex v2.1.0
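A short usage sketch, assuming dplyr 1.0.0 or later, shows how group_maybe from above might be followed by summarise. It also illustrates one trade-off worth knowing: because any_of() ignores names that are not columns, a misspelled column name silently produces an ungrouped result instead of an error. The misspelling Speceis below is deliberate.

```r
library(dplyr, warn.conflicts = FALSE)

# Same helper as in the answer above, repeated so the sketch is self-contained.
group_maybe <- function(df, by = NULL) {
  df %>%
    group_by(across(any_of(as_label(enquo(by)))))
}

# Grouped: one summary row per species.
by_species <- iris %>%
  group_maybe(by = Species) %>%
  summarise(mean_petal = mean(Petal.Length))

# Not grouped: a single summary row for the whole data set.
overall <- iris %>%
  group_maybe() %>%
  summarise(mean_petal = mean(Petal.Length))

# Deliberate typo: any_of() finds no such column, so this is silently ungrouped too.
typo <- iris %>%
  group_maybe(by = Speceis) %>%
  summarise(mean_petal = mean(Petal.Length))

nrow(by_species)
nrow(overall)
nrow(typo)
```

If a typo should raise an error instead, the explicit quo_is_null() check from the earlier answer is the safer choice.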

Sotelo answered 18/6, 2024 at 11:31 Comment(0)
