dplyr tidyr – How to generate case_when with dynamic conditions?

Is there a way to dynamically/programmatically generate case_when conditions in dplyr with different column names and/or different numbers of conditions? I have an interactive script that I'm trying to convert into a function. There's a lot of repeated code in the case_when statements and I'm wondering if it can be automated somehow without my needing to write everything from scratch again and again.

Here's a dummy dataset:

library(dplyr)

test_df = tibble(low_A=c(5, 15, NA),
                 low_TOT=c(NA, 10, NA),
                 low_B=c(20, 25, 30),
                 high_A=c(NA, NA, 10),
                 high_TOT=c(NA, 40, NA),
                 high_B=c(60, 20, NA))

expected_df = tibble(low_A=c(5, 15, NA),
                     low_TOT=c(NA, 10, NA),
                     low_B=c(20, 25, 30),
                     ans_low=c(5, 10, 30),
                     high_A=c(NA, NA, 10),
                     high_TOT=c(NA, 40, NA),
                     high_B=c(60, 20, NA),
                     ans_high=c(60, 40, 10))

> expected_df
# A tibble: 3 x 8
  low_A low_TOT low_B ans_low high_A high_TOT high_B ans_high
  <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
1     5      NA    20       5     NA       NA     60       60
2    15      10    25      10     NA       40     20       40
3    NA      NA    30      30     10       NA     NA       10

The logic I want is: if the *_TOT column has a value, use it; if not, fall back to the *_A column, and then to the *_B column. Note that I intentionally didn't put *_TOT as the first column of a group. I could just use coalesce() in that case, but I want a general solution irrespective of column order.

Of course, all of this is easy to do with a couple of case_when statements. My issues are that:

  1. I'm trying to make this a general function, so I don't want to rely on interactive-style tidy evaluation.
  2. I have a whole bunch of columns like this, all ending with one of _TOT, _A, or _B but with different prefixes (e.g., low_TOT, low_A, low_B, high_TOT, high_A, high_B, ...), and I don't want to rewrite a bunch of case_when calls again and again.

What I have right now looks like this (where I'm writing a case_when for each prefix):

my_function = function(df) {
    df %>% mutate(
          # If a total low doesn't exist, use A (if it exists) or B (if it exists)
          "ans_low" := case_when(
            !is.na(.data[["low_TOT"]]) ~ .data[["low_TOT"]],
            !is.na(.data[["low_A"]]) ~ .data[["low_A"]],
            !is.na(.data[["low_B"]]) ~ .data[["low_B"]]
          ),

          # If a total high doesn't exist, use A (if it exists) or B (if it exists)
          "ans_high" := case_when(
            !is.na(.data[["high_TOT"]]) ~ .data[["high_TOT"]],
            !is.na(.data[["high_A"]]) ~ .data[["high_A"]],
            !is.na(.data[["high_B"]]) ~ .data[["high_B"]]
          )

          # Plus a whole bunch of similar case_when calls...
    )
}

What I'd ideally like is a way to dynamically generate the case_when conditions so that I'm not writing a new case_when each time, by exploiting the fact that:

  1. All three conditions have the same general form and the same structure for the variable names, just with a different prefix (high_, low_, etc.).
  2. Every condition is a formula of the form !is.na(.data[[ . ]]) ~ .data[[ . ]], where the dot (.) stands for the dynamically generated column name.

What I'd like is something like:

my_function = function(df) {
    df %>% mutate(
          "ans_low" := some_func(prefix = "low"),
          "ans_high" := some_func(prefix = "high")
    )
}

I tried creating my own case_when generator to replace the standard case_when as shown below, but I'm getting an error. I'm guessing that's because .data doesn't really work outside of the tidyverse functions?

some_func = function(prefix) {
  case_when(
    !is.na(.data[[ sprintf("%s_TOT", prefix) ]]) ~ .data[[ sprintf("%s_TOT", prefix) ]],
    !is.na(.data[[ sprintf("%s_A", prefix) ]]) ~ .data[[ sprintf("%s_A", prefix) ]],
    !is.na(.data[[ sprintf("%s_B", prefix) ]]) ~ .data[[ sprintf("%s_B", prefix) ]]
  )
}

Something else I'm curious about is making an even more general case_when generator. In the examples thus far, it's only the names (prefix) of the columns that are changing. What if I wanted to

  1. change the number and names of the suffixes (e.g., high_W, high_X, high_Y, high_Z, low_W, low_X, low_Y, low_Z, ...) and so make a character vector of suffixes an argument of some_func
  2. change the form of the formula. Right now, it's of the form !is.na(.data[[ . ]]) ~ .data[[ . ]] for all the conditions, but what if I wanted to make this an argument of some_func? For example, !is.na(.data[[ . ]]) ~ sprintf("%s is missing", .)

I'd be happy with just getting it to work with different prefixes, but it'd be very cool to understand how I could achieve something even more general, with arbitrary (but common) suffixes and arbitrary formulae, so that I can call some_func(prefix, suffixes, formula).

Trici asked 22/7, 2021 at 18:11 Comment(6)
Please show a small reproducible example. – Denise
It's easier to help you if you include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. If you are just trying to get the first non-NA value, a function like coalesce() is probably more appropriate. – Tyburn
See now. I added a simple data set as an example and rewrote the question to be clearer and shorter. It's still a little long because, really, I'm asking 3 questions about the increasing levels of generality I'd like (and to see if that's possible, to begin with). – Trici
A coalesce() could be a potential answer, but I'm more interested in dynamically generating conditions (is.na is just the particular example here, and coalesce also requires a specific column order). I'm really trying to understand how to program with dplyr better and achieve higher levels of abstraction/generality. – Trici
I also just tried coalesce() with a prior reordering of the columns, but it gives the same major issue: I have to write a whole bunch of coalesce statements now. I want to exploit the common prefix of the groups of columns so I don't have to write 10 different case_when or coalesce statements. – Trici
Why not split up the columns by prefix, using the results of gsub("_(TOT|[A-Z]+)$", "", ...) on colnames() to determine how they're split up? This accounts for indefinitely many column suffixes: *_TOT, *_A, *_B, *_C, ..., *_Z, *_AA, *_AB, ..., and so forth. Then for each of those splits ("low_" and "high_"), sort their colnames() by suffix, given by str_extract("_(TOT|[A-Z]+)$"); you'll obviously have to reorder "_TOT" as coming first. Then mutate(paste0("ans_", prefix) = coalesce(everything())), and cbind() or bind_cols() all results back together. – Evenhanded

Here is a custom case_when function that you can call with purrr::reduce and a vector of the string parts of your variable names (in the example, c("low", "high")):

library(dplyr)
library(purrr)

my_case_when <- function(df, x) {
  
  mutate(df,
         "ans_{x}" := case_when(
           !is.na(!! sym(paste0(x, "_TOT"))) ~ !! sym(paste0(x, "_TOT")),
           !is.na(!! sym(paste0(x, "_A"))) ~ !! sym(paste0(x, "_A")),
           !is.na(!! sym(paste0(x, "_B"))) ~ !! sym(paste0(x, "_B"))
           )
  )
}

test_df %>% 
  reduce(c("low", "high"), my_case_when, .init = .)

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

I also have a package on GitHub, {dplyover}, which is made for this kind of case. For your example with more than two variables, I would use dplyover::over together with a special syntax to evaluate strings as variable names. We can further use dplyover::cut_names("_TOT") to extract the string parts of the variable names that come before or after "_TOT" (in the example, "low" and "high").

We can either use case_when:

library(dplyr)
library(dplyover) # https://github.com/TimTeaFan/dplyover

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ case_when(
                  !is.na(.("{.x}_TOT")) ~ .("{.x}_TOT"),
                  !is.na(.("{.x}_A")) ~ .("{.x}_A"),
                  !is.na(.("{.x}_B")) ~ .("{.x}_B")
                  )),
              .names = "{fn}_{x}")
         )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Or, somewhat easier, coalesce:

test_df %>% 
  mutate(over(cut_names("_TOT"),
              list(ans = ~ coalesce(.("{.x}_TOT"),
                                    .("{.x}_A"),
                                    .("{.x}_B"))),
              .names = "{fn}_{x}")
  )

#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10

Created on 2021-07-22 by the reprex package (v0.3.0)

Abase answered 22/7, 2021 at 20:19 Comment(3)
Color me intrigued by dplyover! How might one do this for arbitrarily many suffixes? Consider: _TOT, _A, _B, ..., _Z, _AA, _AB, ..., and so forth; defined by the regex _(TOT|[A-Z]+)$. – Evenhanded
@Greg: We can use regex in the selection functions cut_names and extract_names. However, within the case_when function above, we would need to hard-code all suffixes, at least if we use over. There are also over2 and over2x, which allow a .y argument, but in the end it depends on how the case_when function should look. – Abase
@Greg, I just found a solution (with some edits of my own) that allows for arbitrary suffixes. Look up my solution below. – Trici

Updated solution: I think this solution, based solely on base R, may help you.

fn <- function(data) {
  
  do.call(cbind, lapply(unique(gsub("([[:alpha:]]+)_.*", "\\1", names(data))), function(x) {
    # Pull the three columns that share this prefix, with the _TOT column first
    tmp <- data[paste0(x, c("_TOT", "_A", "_B"))]
    # Fill NAs left to right: take _TOT where present, otherwise _A, otherwise _B
    tmp[[paste(x, "ans", sep = "_")]] <- Reduce(function(a, b) {
      i <- which(is.na(a))
      a[i] <- b[i]
      a
    }, tmp)
    tmp
  }))
}

fn(test_df)

   high_TOT high_A high_B high_ans low_TOT low_A low_B low_ans
1       NA     NA     60       60      NA     5    20       5
2       40     NA     20       40      10    15    25      10
3       NA     10     NA       10      NA    NA    30      30
Paule answered 22/7, 2021 at 20:15 Comment(0)

At the risk of not answering the question, I think the easiest way to approach this is to just reshape and use coalesce(). Your data structure requires two pivots either way (I think) but this requires no careful thinking about what prefixes are present.

library(tidyverse)

test_df <- tibble(
  low_A = c(5, 15, NA),
  low_TOT = c(NA, 10, NA),
  low_B = c(20, 25, 30),
  high_A = c(NA, NA, 10),
  high_TOT = c(NA, 40, NA),
  high_B = c(60, 20, NA)
)

test_df %>%
  rowid_to_column() %>%
  pivot_longer(cols = -rowid, names_to = c("prefix", "suffix"), names_sep = "_") %>%
  pivot_wider(names_from = suffix, values_from = value) %>%
  mutate(ans = coalesce(TOT, A, B)) %>%
  pivot_longer(cols = c(-rowid, -prefix), names_to = "suffix") %>%
  pivot_wider(names_from = c(prefix, suffix), names_sep = "_", values_from = value)
#> # A tibble: 3 x 9
#>   rowid low_A low_TOT low_B low_ans high_A high_TOT high_B high_ans
#>   <int> <dbl>   <dbl> <dbl>   <dbl>  <dbl>    <dbl>  <dbl>    <dbl>
#> 1     1     5      NA    20       5     NA       NA     60       60
#> 2     2    15      10    25      10     NA       40     20       40
#> 3     3    NA      NA    30      30     10       NA     NA       10

Note also that case_when itself does no data masking, so just not using mutate simplifies your some_func a lot. You already got an answer using !!sym inside mutate, so here is a version that illustrates a simpler way. I prefer not to use tidyeval unless it's necessary (for example, because I want to stay inside a mutate chain), and here it's not really needed.

some_func <- function(df, prefix) {
  ans <- str_c(prefix, "_ans")
  TOT <- df[[str_c(prefix, "_TOT")]]
  A <- df[[str_c(prefix, "_A")]]
  B <- df[[str_c(prefix, "_B")]]
  
  df[[ans]] <- case_when(
    !is.na(TOT) ~ TOT,
    !is.na(A) ~ A,
    !is.na(B) ~ B
  )
  df
}

reduce(c("low", "high"), some_func, .init = test_df)
#> # A tibble: 3 x 8
#>   low_A low_TOT low_B high_A high_TOT high_B low_ans high_ans
#>   <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10
Alfy answered 22/7, 2021 at 20:58 Comment(3)
My instinct too was to "just reshape and use coalesce()". I am curious, though: how might we generalize to arbitrarily many alphabetical suffixes _A, _B, ..., _Z, _AA, _AB, and so forth? And for prefixes that might themselves contain _, as with another_prefix_A? Perhaps by pivoting the column names into a name column, and then by splitting each name into (1) the substring matching the regex _(TOT|[A-Z]+)$, and (2) the substring of everything that came before. – Evenhanded
To handle arbitrary suffixes, I'd probably want to order the list of suffixes and splice them into coalesce. For complex prefixes, pivot_longer supports names_pattern, so you can use a regex to select the groups you want (e.g., (^.*)_([^_]+$) would, I think, make the suffix everything after the last _ and the prefix everything before that underscore). – Alfy
I really liked your base R solution (way more than the multiple pivots). Unfortunately, yours doesn't really solve the problem of generating case_when conditions dynamically, so I had to choose TimTeaFan's as the accepted answer. But if I had to do this again, I'd definitely use your base R solution, which is much easier to grasp. – Trici

Thanks for all your answers, folks! Calum You's answer specifically made me realise that sticking to the tidyverse all the time isn't necessarily best, and that sometimes base R has a better, simpler, and more elegant solution.

Thanks to a ton of searching and this excellent post by noahm on the RStudio community, I was also able to come up with a solution of my own that does what I was looking for:

library(tidyverse)
library(rlang)
library(glue)

make_expr = function(prefix, suffix) {
  rlang::parse_expr(glue::glue('!is.na(.data[[\"{prefix}_{suffix}\"]]) ~ .data[[\"{prefix}_{suffix}\"]]'))
}

make_conds = function(prefixes, suffixes){
  map2(prefixes, suffixes, make_expr)
}

ans_df = test_df %>%  
    mutate(
        "ans_low" := case_when(
            !!! make_conds( prefixes=c("low"), suffixes=c("TOT", "A", "B") ) 
        ),
        "ans_high" := case_when(
            !!! make_conds( prefixes=c("high"), suffixes=c("TOT", "A", "B") ) 
        )
    )

# The result is the same as the expected solution
> all_equal(ans_df, expected_df)
[1] TRUE

I've also checked that this works inside of a function (which was another important consideration for me).

One benefit of this solution is that the suffixes are not hard-coded, which achieves at least the first level of generality I was looking for.

I imagine some string manipulation with replacements could also allow for generality in the structure of the formulae. Ultimately, general formulae would require some kind of string templating, and with this structure you can just feed such a template straight into glue().
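
A minimal sketch of what that might look like (illustrative only: the helper names and the template argument are not part of the solution above; it simply re-parameterises the make_expr/make_conds pair, assuming the same libraries are loaded):

# Illustrative only: the condition itself becomes a glue template argument
make_expr_tpl = function(prefix, suffix, template) {
  # glue() fills {prefix} and {suffix} from this function's arguments
  rlang::parse_expr(glue::glue(template))
}

make_conds_tpl = function(prefixes, suffixes, template) {
  purrr::map2(prefixes, suffixes, make_expr_tpl, template = template)
}

# The same formula structure as before, but now supplied as a string template
cond_tpl = '!is.na(.data[["{prefix}_{suffix}"]]) ~ .data[["{prefix}_{suffix}"]]'

test_df %>%
  mutate(
    "ans_low"  := case_when(!!! make_conds_tpl("low",  c("TOT", "A", "B"), cond_tpl)),
    "ans_high" := case_when(!!! make_conds_tpl("high", c("TOT", "A", "B"), cond_tpl))
  )

Swapping in a different template (like the sprintf example in the question) would then change the form of every condition at once, with the usual caveat that all case_when branches must return the same type.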

Trici answered 22/7, 2021 at 22:23 Comment(0)

This does not generate any case_when, but you can create the two new columns as follows. Of course, this could also be a function with test_df, ans_order, and ans_groups as arguments (a sketch of such a wrapper follows the output below).

ans_order <- c('TOT', 'A', 'B')
ans_groups <- c('low', 'high')

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) do.call(dplyr::coalesce, test_df[x]))

test_df
#>   low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
#> 1     5      NA    20     NA       NA     60       5       60
#> 2    15      10    25     NA       40     20      10       40
#> 3    NA      NA    30     10       NA     NA      30       10
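
As noted above, the same idea can be wrapped into a function; a minimal sketch (the function name and argument names are only placeholders, mirroring the objects above):

add_ans_cols <- function(df, ans_groups, ans_order) {
  # Build the column names per group (e.g. low_TOT, low_A, low_B) and coalesce them
  df[paste0('ans_', ans_groups)] <- 
    apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
          function(x) do.call(dplyr::coalesce, df[x]))
  df
}

add_ans_cols(test_df, ans_groups = c('low', 'high'), ans_order = c('TOT', 'A', 'B'))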

If you'd rather not use any packages, another option is

test_df[paste0('ans_', ans_groups)] <- 
  apply(outer(ans_groups, ans_order, paste, sep = '_'), 1, 
        function(x) Reduce(function(x, y) ifelse(is.na(x), y, x), test_df[x]))
Kutz answered 22/7, 2021 at 22:49 Comment(0)

Though the answer has already been accepted, I feel this can be done (even for any number of column sets) in dplyr alone, without the need to write a custom function as in the earlier answers.

test_df %>%
  mutate(across(ends_with('_TOT'), ~ coalesce(., 
                                              get(gsub('_TOT', '_A', cur_column())), 
                                              get(gsub('_TOT', '_B', cur_column()))
                                              ),
                .names = "ans_{gsub('_TOT', '', .col)}"))

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10

A complete base R approach

Reduce(function(.x, .y) {
  xx <- .x[paste0(.y, c('_TOT', '_A', '_B'))]
  .x[[paste0('ans_',.y)]] <- apply(xx, 1, \(.z) head(na.omit(.z), 1))
  .x
}, unique(gsub('([_]*)_.*', '\\1', names(test_df))),
init = test_df)

# A tibble: 3 x 8
  low_A low_TOT low_B high_A high_TOT high_B ans_low ans_high
  <dbl>   <dbl> <dbl>  <dbl>    <dbl>  <dbl>   <dbl>    <dbl>
1     5      NA    20     NA       NA     60       5       60
2    15      10    25     NA       40     20      10       40
3    NA      NA    30     10       NA     NA      30       10
Cask answered 23/7, 2021 at 5:32 Comment(0)
