To streamline data wrangling, I am writing a wrapper function consisting of several "verb functions" that process the data. Each one performs one task on the data. However, not all tasks apply to every dataset that passes through this process, and for certain data I might want to switch off some "verb functions" and skip them.
I'm trying to understand whether there's a conventional/canonical way to build such a workflow within a wrapper function in R. Importantly, the approach should be efficient in both performance and code concision.
Example
As part of data wrangling, I want to carry out several steps:
- Clean up column headers (using janitor::clean_names()).
- Recode values in the data, such that TRUE and FALSE are replaced with 1 and 0 (using gsub()).
- Recode string values to lowercase (using tolower()).
- Pivot wider based on a specific id column (using tidyr::pivot_wider()).
- Drop rows with NA values (using tidyr::drop_na()).
Toy data
library(stringi)
library(tidyr)
set.seed(2021)
# simulate data
df <-
  data.frame(id = 1:20,
             isMale = rep(c("true", "false"), times = 10),
             WEIGHT = sample(50:100, 20),
             hash_Numb = stri_rand_strings(20, 5)) %>%
  cbind(., score = sample(200:800, size = 20))
# sprinkle NAs randomly
df[c("isMale", "WEIGHT", "hash_Numb", "score")] <-
  lapply(df[c("isMale", "WEIGHT", "hash_Numb", "score")], function(x) {
    x[sample(seq_along(x), 0.25 * length(x))] <- NA
    x
  })
df <-
  df %>%
  tidyr::expand_grid(., Condition = c("A","B"))
df
#> # A tibble: 40 x 6
#> id isMale WEIGHT hash_Numb score Condition
#> <int> <chr> <int> <chr> <int> <chr>
#> 1 1 <NA> 56 EvRAq NA A
#> 2 1 <NA> 56 EvRAq NA B
#> 3 2 false 87 <NA> 322 A
#> 4 2 false 87 <NA> 322 B
#> 5 3 true 95 13pXe 492 A
#> 6 3 true 95 13pXe 492 B
#> 7 4 <NA> 88 4WMBS 626 A
#> 8 4 <NA> 88 4WMBS 626 B
#> 9 5 true NA Nrl1W 396 A
#> 10 5 true NA Nrl1W 396 B
#> # ... with 30 more rows
Created on 2021-03-03 by the reprex package (v0.3.0)
The data shows test scores of 20 people who took a test under two conditions. For each person we also know the gender (isMale), the weight in kilograms (WEIGHT), and a unique hash (hash_Numb).
Data cleanup and wrangling
Before this data is sent to analysis, it needs to be cleaned up, according to a certain chain of steps, which I laid out above.
library(janitor)
library(dplyr)
# helper function
convert_true_false_to_1_0 <- function(x) {
  first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
  gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}
# chain of steps
df %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()
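To make the recoding step easier to reason about on its own, here is a small check of the helper on a plain character vector. The helper is copied from the code above; the input vector is made up for illustration.

```r
# Helper copied verbatim from the question.
convert_true_false_to_1_0 <- function(x) {
  first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
  gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}

# Illustrative input: mixed case, a non-matching string, and a missing value.
example_values <- c("true", "FALSE", "maybe", NA)

# Whole-string, case-insensitive matches are replaced; everything else,
# including NA, passes through unchanged. The result is a character vector.
recoded <- convert_true_false_to_1_0(example_values)
recoded
# returns c("1", "0", "maybe", NA)
```

Note that the result is always character, which is why every column ends up as <chr> once the helper is applied across the whole data frame.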
My Question: How can I pack this process into a wrapper that allows some steps to be flexibly switched off?
One idea I have in mind is to use a %>% pipe with conditionals, such as:
my_wrangling_wrapper <- function(dat,
                                 clean_names = TRUE,
                                 convert_tf_to_1_0 = TRUE,
                                 convert_to_lower = TRUE,
                                 pivot_widr = TRUE,
                                 drp_na = TRUE){
  dat %>%
    {if (clean_names) janitor::clean_names(.) else .} %>%
    {if (convert_tf_to_1_0) mutate(., across(everything(), convert_true_false_to_1_0)) else .} %>%
    {if (convert_to_lower) mutate(., across(everything(), tolower)) else .} %>%
    {if (pivot_widr) pivot_wider(., names_from = condition, values_from = score) else .} %>%
    {if (drp_na) drop_na(.) else .}
}
This way, all steps run by default unless turned off:
- Use-case #1 -- Default run:
> my_wrangling_wrapper(dat = df)
## # A tibble: 6 x 6
## id is_male weight hash_numb a b
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 3 1 95 13pxe 492 492
## 2 9 1 54 hgzxp 519 519
## 3 12 0 72 vwetc 446 446
## 4 15 1 52 qadxc 501 501
## 5 17 1 71 g42vg 756 756
## 6 18 0 80 qiejd 712 712
- Use-case #2 -- Don't convert true/false to 1/0 and don't drop NAs:
> my_wrangling_wrapper(dat = df, convert_tf_to_1_0 = FALSE, drp_na = FALSE)
## # A tibble: 20 x 6
## id is_male weight hash_numb a b
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 NA 56 evraq NA NA
## 2 2 false 87 NA 322 322
## 3 3 true 95 13pxe 492 492
## 4 4 NA 88 4wmbs 626 626
## 5 5 true NA nrl1w 396 396
## 6 6 false NA 4oq74 386 386
## 7 7 true NA gg23f NA NA
## 8 8 false 94 NA NA NA
## 9 9 true 54 hgzxp 519 519
## 10 10 false 97 NA 371 371
## 11 11 true 90 NA 768 768
## 12 12 false 72 vwetc 446 446
## 13 13 NA NA jkhjh 338 338
## 14 14 false NA 0swem 778 778
## 15 15 true 52 qadxc 501 501
## 16 16 false 75 NA 219 219
## 17 17 true 71 g42vg 756 756
## 18 18 false 80 qiejd 712 712
## 19 19 NA 68 tadad NA NA
## 20 20 NA 53 iyw3o NA NA
My problem
Although the solution I came up with does work, I've learned that relying on the pipe operator within functions is not advised, because it slows down the process (see reference). Also, since %>% is not part of base R, there has to be a way to achieve the same "tweakable wrapping" functionality without the pipe. So I wonder: is there a conventional way to write a wrapper function whose components can be switched off individually, while remaining performance-efficient overall?
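For comparison, one pipe-free shape is to reassign dat inside plain if statements. This is only a sketch, not an established convention: the name my_wrangling_wrapper_nopipe is hypothetical, the flags and helper are copied from the code above, and nothing here measures whether it is actually faster than the piped version.

```r
library(janitor)
library(dplyr)
library(tidyr)

# Helper copied from the question.
convert_true_false_to_1_0 <- function(x) {
  first_pass <- gsub("^(?:TRUE)$", 1, x, ignore.case = TRUE)
  gsub("^(?:FALSE)$", 0, first_pass, ignore.case = TRUE)
}

# Hypothetical pipe-free variant: each step overwrites `dat` only when its flag is TRUE.
my_wrangling_wrapper_nopipe <- function(dat,
                                        clean_names = TRUE,
                                        convert_tf_to_1_0 = TRUE,
                                        convert_to_lower = TRUE,
                                        pivot_widr = TRUE,
                                        drp_na = TRUE) {
  if (clean_names) {
    dat <- janitor::clean_names(dat)
  }
  if (convert_tf_to_1_0) {
    dat <- dplyr::mutate(dat, dplyr::across(dplyr::everything(), convert_true_false_to_1_0))
  }
  if (convert_to_lower) {
    dat <- dplyr::mutate(dat, dplyr::across(dplyr::everything(), tolower))
  }
  if (pivot_widr) {
    # Assumes the column names have already been cleaned to `condition` and `score`.
    dat <- tidyr::pivot_wider(dat, names_from = condition, values_from = score)
  }
  if (drp_na) {
    dat <- tidyr::drop_na(dat)
  }
  dat
}

# Small made-up input with the same column style as the toy data.
small <- data.frame(id = c(1, 1, 2, 2),
                    isMale = c("true", "true", "false", "false"),
                    Condition = c("A", "B", "A", "B"),
                    score = c(10, 10, NA, NA))
result <- my_wrangling_wrapper_nopipe(small)
```

One caveat that applies to the piped version as well: with clean_names = FALSE the pivot step still refers to condition, while the raw column is named Condition, so those two flags are not fully independent as written.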
{It's worth mentioning that I've asked a similar question about building a wrapper for ggplot, turning geoms off as desired. The answer was great but not applicable to the current question.}
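Another shape that might fit the wrangling case is to keep the steps in a named list of functions and pass the names of the steps to skip. The sketch below uses only base R; the step functions, run_steps(), and the skip argument are hypothetical stand-ins for illustration, not the steps from my real pipeline.

```r
# Hypothetical stand-in steps, each taking and returning a data frame.
steps <- list(
  lower_names = function(d) {
    names(d) <- tolower(names(d))
    d
  },
  lower_values = function(d) {
    # Lowercase character columns only; other columns are left untouched.
    d[] <- lapply(d, function(x) if (is.character(x)) tolower(x) else x)
    d
  },
  drop_na = function(d) {
    d[stats::complete.cases(d), , drop = FALSE]
  }
)

# Apply every step whose name is not listed in `skip`, in list order.
run_steps <- function(dat, steps, skip = character()) {
  unknown <- setdiff(skip, names(steps))
  if (length(unknown) > 0) {
    stop("unknown step(s): ", paste(unknown, collapse = ", "))
  }
  active <- steps[setdiff(names(steps), skip)]
  # Reduce threads the data frame through the active steps one after another.
  Reduce(function(d, step) step(d), active, dat)
}

toy <- data.frame(ID = 1:3, Name = c("Ann", NA, "BOB"), stringsAsFactors = FALSE)
all_steps <- run_steps(toy, steps)                    # every step runs
keep_na   <- run_steps(toy, steps, skip = "drop_na")  # NA rows are kept
```

The trade-off compared with one logical flag per step is that adding a step means adding a list element rather than a new argument, at the cost of step names being checked at run time instead of appearing in the function signature.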
my_wrangling_wrapper(dat = df): I get different results when I redo your code. – Untune