Is there a way to dynamically/programmatically generate case_when
conditions in dplyr
with different column names and/or different numbers of conditions? I have an interactive script that I'm trying to convert into a function. There's a lot of repeated code in the case_when
statements and I'm wondering if it can be automated somehow without my needing to write everything from scratch again and again.
Here's a dummy dataset:
test_df = tibble(low_A=c(5, 15, NA),
low_TOT=c(NA, 10, NA),
low_B=c(20, 25, 30),
high_A=c(NA, NA, 10),
high_TOT=c(NA, 40, NA),
high_B=c(60, 20, NA))
expected_df = tibble(low_A=c(5, 15, NA),
low_TOT=c(NA, 10, NA),
low_B=c(20, 25, 30),
ans_low=c(5, 10, 30),
high_A=c(NA, NA, 10),
high_TOT=c(NA, 40, NA),
high_B=c(60, 20, NA),
ans_high=c(60, 40, 10))
> expected_df
# A tibble: 3 x 8
low_A low_TOT low_B ans_low high_A high_TOT high_B ans_high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5 NA 20 5 NA NA 60 60
2 15 10 25 10 NA 40 20 40
3 NA NA 30 30 10 NA NA 10
The logic I want is that if the ._TOT
column has a value, use that. If not, then try column ._A
, and if not, then column ._B
. Note that I intentionally didn't put ._TOT
as the first column for a group. I could just use coalesce() in that case, but I want a general solution irrespective of column order.
Of course, all of this is easy to do with a couple of case_when
statements. My issues are that:
- I'm trying to make a general function and so don't want interactive/tidy evaluation.
- I have a whole bunch of columns like this. All ending with one of
_TOT, _A, _B
but with different prefixes (e.g.,low_TOT, low_A, low_B, high_TOT, high_A, high_B,.....
and I don't want to rewrite a bunch ofcase_when
functions again and again.
What I have right now looks like this (where I'm writing a case_when
for each prefix):
def my_function = function(df) {
df %>% mutate(
# If a total low doesn't exist, use A (if exists) or B (if exists)
"ans_low" := case_when(
!is.na(.data[["low_TOT"]]) ~ .data[["low_TOT"]],
!is.na(.data[["low_A"]]) ~ .data[["low_A"]],
!is.na(.data[["low_B"]]) ~ .data[["low_B"]],
),
# If a total high doesn't exist, use A (if exists) or B (if exists)
"ans_high" := case_when(
!is.na(.data[["high_TOT"]]) ~ .data[["high_TOT"]],
!is.na(.data[["high_A"]]) ~ .data[["high_R"]],
!is.na(.data[["high_B"]]) ~ .data[["high_B"]],
# Plus a whole bunch of similar case_when functions...
}
And what I'd like is to ideally get a way to dynamically generate case_when
functions with different conditions so that I'm not writing a new case_when
each time by exploiting the fact that:
- All the three conditions have the same general form, and the same structure for the variable names, but with a different prefix (
high_
,low_
, etc.). - They have the same formula of the form
!is.na( .data[[ . ]]) ~ .data[[ . ]]
, where the dot(.
) is the dynamically generated name of the column.
What I'd like is something like:
def my_function = function(df) {
df %>% mutate(
"ans_low" := some_func(prefix="Low"),
"ans_high" := some_func(prefix="High")
}
I tried creating my own case_when
generator to replace the standard case_when
as shown below, but I'm getting an error. I'm guessing that's because .data
doesn't really work outside of the tidyverse functions?
some_func = function(prefix) {
case_when(
!is.na(.data[[ sprintf("%s_TOT", prefix) ]]) ~ .data[[ sprintf("%s_TOT", prefix) ]],
!is.na(.data[[ sprintf("%s_A", prefix) ]]) ~ .data[[ sprintf("%s_A", prefix) ]],
!is.na(.data[[ sprintf("%s_B", prefix) ]]) ~ .data[[ sprintf("%s_B", prefix) ]]
)
}
Something else I'm curious about is making an even more general case_when
generator. In the examples thus far, it's only the names (prefix) of the columns that are changing. What if I wanted to
- change the number and names of suffixes (e.g.,
high_W, high_X, high_Y, high_Z, low_W, low_X, low_Y, low_Z, .......
) and so make a character vector of suffixes an argument ofsome_func
- change the form of the formula. Right now, it's of the form
!is.na(.data[[ . ]]) ~ .data[[ . ]]
for all the conditions, but what if I wanted to make this an argument ofsome_func
? For example,!is.na(.data[[ . ]]) ~ sprintf("%s is missing", .)
I'd be happy with just getting it to work with different prefixes but it'd be very cool to understand how I could achieve something even more general with arbitrary (but common) suffixes and arbitrary formulae such that I can do some_func(prefix, suffixes, formula)
.
coalesce()
is probably more appropriate. – Tyburncoalesce()
could be a potential answer, but I'm more interested in dynamically generating conditions (is.na
is just the particular example here andcoalesce
also requires a specific column order). I'm really trying to understand how to program with dplyr better and achieve higher levels of abstraction/generality. – Tricicoalese()
with a prior reordering of the columns but it gives the same major issue: I have to write a whole bunch ofcoalesce
statements now. I want to exploit the common prefix of the groups of columns so I don't have to write 10 differentcase_when
orcoalese
statements. – Tricigsub("_(TOT|[A-Z]+)$", "", ...)
oncolnames()
to determine how they're split up? This accounts for indefinitely many column suffixes:*_TOT
,*_A
,*_B
,*_C
, ...,*_Z
,*_AA
,*_AB
, ..., and so forth. Then for each of those splits ("low_"
and"high_"
), sort theircolnames()
by suffix, given bystr_extract("_(TOT|[A-Z]+)$")
; you'll obviously have to reorder"_TOT"
as coming first. Thenmutate(paste0("ans_", prefix) = coalesce(everything()))
, andcbind()
orbind_cols()
all results back together. – Evenhanded