I'm trying to automatically calculate the mean score per row for several groups of columns. For example, each group of columns could hold the items of one scale. The columns are named systematically (scale_itemnumber).
The dummy data frame below has items from three different scales. (Not every item of every scale is necessarily included; here, VAR_3 is missing.)
library(tidyverse)
set.seed(123)
df <- tibble(G_1   = sample(1:5, size = 10000, replace = TRUE),
             G_2   = sample(1:5, size = 10000, replace = TRUE),
             G_3   = sample(1:5, size = 10000, replace = TRUE),
             MOT_1 = sample(1:5, size = 10000, replace = TRUE),
             MOT_2 = sample(1:5, size = 10000, replace = TRUE),
             MOT_3 = sample(1:5, size = 10000, replace = TRUE),
             VAR_1 = sample(1:5, size = 10000, replace = TRUE),
             VAR_2 = sample(1:5, size = 10000, replace = TRUE),
             VAR_4 = sample(1:5, size = 10000, replace = TRUE))
What I want is one extra column per construct (with dynamic names such as mean_G, mean_MOT, mean_VAR) holding the row mean of that construct's set of columns:
# A tibble: 6 x 12
    G_1   G_2   G_3 MOT_1 MOT_2 MOT_3 VAR_1 VAR_2 VAR_4 mean_G mean_MOT mean_VAR
  <int> <int> <int> <int> <int> <int> <int> <int> <int>  <dbl>    <dbl>    <dbl>
1     3     3     1     1     1     1     1     5     4   2.33     1        3.33
2     3     5     3     3     2     1     4     3     5   3.67     2        4
3     2     5     4     5     3     2     4     1     1   3.67     3.33     2
4     2     5     4     4     4     1     2     5     4   3.67     3        3.67
5     3     4     2     1     4     5     2     2     3   3        3.33     2.33
6     5     3     4     4     3     4     1     1     4   4        3.67     2
I have a working approach using rowwise() and c_across() in combination with purrr, but it is far slower than doing it manually (mutate() plus rowMeans()). The real data frame has many more scales with many more items, so I would rather not hard-code every mean column and list each item by hand (especially as the exact selection of items may vary per data frame).
# functional but slow approach
# get the unique variable prefixes (everything before the last underscore)
var_names <- str_extract(names(df), "^.*(?=(_))") %>%
  unique()
# use map and c_across to calculate the means rowwise per variable group
df_functional <-
  df %>%
  bind_cols(
    map_dfc(.x = var_names,
            .f = ~ .y %>%
              rowwise() %>%
              transmute(!!str_c("mean_", .x) := mean(c_across(starts_with(.x)))),
            .y = .))
# manual approach
df_manual <- df %>% mutate(mean_G   = rowMeans(select(., G_1, G_2, G_3)),
                           mean_MOT = rowMeans(select(., MOT_1, MOT_2, MOT_3)),
                           mean_VAR = rowMeans(select(., VAR_1, VAR_2, VAR_4)))
The results are identical, but the dynamic/functional approach is slower by more than three orders of magnitude (see the benchmark below), and I expect the gap to matter even more for data frames with many more columns/groups. How can I speed this up while keeping the flexibility of the dynamic approach?
> identical(df_manual, df_functional)
[1] TRUE
# Benchmark (using the microbenchmark package)
benchmark
Unit: milliseconds
       expr        min         lq        mean     median         uq        max neval
 functional 37198.3569 38592.6855 48313.00156 52936.3254 55349.0561 59831.0141   100
     manual    16.0662    18.0139    27.53403    19.9085    22.9384   138.5401   100
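One direction that might keep the dynamic naming while avoiding rowwise() is to split the columns by prefix and call the vectorised rowMeans() once per group. The following is only a minimal sketch under assumptions, not a benchmarked solution: df_small, prefixes, scale_means and df_dynamic are hypothetical names introduced here, and the data frame is a small stand-in that follows the same scale_itemnumber naming scheme.

```r
library(dplyr)
library(purrr)
library(stringr)

# Hypothetical small stand-in for the real data (same naming scheme, fewer rows).
set.seed(123)
df_small <- tibble(G_1   = sample(1:5, size = 20, replace = TRUE),
                   G_2   = sample(1:5, size = 20, replace = TRUE),
                   G_3   = sample(1:5, size = 20, replace = TRUE),
                   MOT_1 = sample(1:5, size = 20, replace = TRUE),
                   MOT_2 = sample(1:5, size = 20, replace = TRUE),
                   VAR_1 = sample(1:5, size = 20, replace = TRUE),
                   VAR_4 = sample(1:5, size = 20, replace = TRUE))

# Prefix of every column (everything before the last underscore).
prefixes <- str_extract(names(df_small), "^.*(?=(_))")

# Split the columns into one data frame per prefix (keeping the original
# order of the prefixes), then take vectorised row means per group.
scale_means <- split.default(df_small,
                             factor(prefixes, levels = unique(prefixes))) %>%
  map(rowMeans) %>%
  setNames(paste0("mean_", names(.))) %>%
  as_tibble()

# Append the mean_* columns to the original columns.
df_dynamic <- bind_cols(df_small, scale_means)
```

Because rowMeans() operates on whole column blocks rather than row by row, this should behave much more like the manual approach in terms of speed, though that is an expectation and has not been measured here.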
rename_with and using... map(rowMeans) %>% setNames(paste0("mean_", names(.))) instead. Anyway +1. Much faster than my approach. – Garratt