Mutate across multiple columns to create new variable sets
Asked Answered
S

3

11

I have a panel dataset at the country and year level, and I'd like to create a two new variables based on existing ones.

year country var1 var2 var3 var 4 mean_var1 relmean_var1
1910 GER 1 4 10 6 3 0.333
1911 GER 2 3 11 7 1.5 1.3333
1910 FRA 5 6 8 9 3 1.66667
1911 FRA 1 4 10 9 1.5 .66667

What I'd like to do is create two new variables set : (1) a variable set of the average for each year (across countries) and (2) a variable set of the country value relative to the year-average. For example, for var1(1) would yield mean_var1 and (2) relmean_var1 and I'd want these for all the other variables. In total, there are over 1000 variables in the dataset, but I would only apply this function to about 6.

I have code that works for the first part, but I'd like to combine it as efficiently as possible with the second.

library(dplyr)
library(purrr)
df<- df%>% 
            group_by(year) %>%
            mutate_at(.funs = list(mean = ~mean(.)), .vars = c("var1", "var1", "var1", "var4"))

This code yields new variables called var1_mean (I would prefer mean_var1: how do I change this name?)

For the second step, I've tried:

df <- df %>%
map2_dfr(.x = d.test %>%
            select(var1, var2),
          .y = d.test %>%
            select(var1_mean, var2_mean), 
          ~ .x / .y) %>%
   setNames(c("relmean_var1", "relmean_var2"))

and I get errors

""Error in select(., var1, var2) : object 'd.test' not found."

. (I got this set up from this question)

I also tried:

 map2(var1, var1_mean, ~ df[[.x]] / df[[.y]]) %>% 
   set_names(cols) %>% 
   bind_cols(df, .)

And got

"Error in map2(var1, var1_mean, ~df[[.x]]/df[[.y]]) : object 'var1' not found

What's the best way to combine these two goals? Ideally with the naming scheme mean_var1 for (1) and relmean_var1 for (2)

Edit: input dataframe should look like this:

data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L,    "GER",    1L,    4L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    4L,   10L,     9L
)

output dataframe should look like this (for all variables, just showing var1 as an example, but should be the same format for var2 through var4):

datanew  <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4, ~mean_var1 , ~relmean_var1
  1910L,    "GER",    1L,    4L,   10L,     6L,     3L,        .3333L,
  1911L,    "GER",    2L,    3L,   11L,     7L,     1.5L,     1.3333L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,     3L,       1.6667L,
  1911L,    "FRA",    1L,    4L,   10L,     9L      1.5L,      .6667L,
)
Southworth answered 24/11, 2021 at 13:9 Comment(8)
Could you clarify what your final 2 data.frames would look like? Could you edit your post and include the final data sets would be in the end, given your example data? That would be very helpful.Kilkenny
@Kilkenny I added example columns for var1. Let me know if it makes sense. Basically, mean_var1 = mean(var1), by(year) so it's the same across all countries for a given year. Then the relmean_var1 is the country's value relative to the overall average (var1/mean_var1), so it takes a different value for every country-year. This calculation would be done for all the variables.Southworth
Pierre, your first table looks good for the input. @Kilkenny is asking that you please do a second similar table for the expected dataset(s). There's a few ways we can interpret your request, but displaying the desired output would nail it down.Legalize
Also, for future requests, define your input dataset with code. See @danloo's response below for one way to do it.Legalize
Hi @wibeasley, thank you for your response. I will use danloos format in the future. The final dataset should look like the first one with 8 additional columns for mean_var1,..., mean_var4,....; relmean_var1, relmean_var4. Let me see if I can post it as two dataframes if that would be more clearSouthworth
@Legalize let me know if the edits at the bottom are the appropriate format!Southworth
@PierreRoubaix, that looks great. Thank you for helping me understand. One nitpick: does datanew$var2 have wrong values in the first & last row?Legalize
@Legalize yes, you're right. I'd changed the original values to make the math more clear when I added the desired second database. I fixed it so that it's clear for others who may want to reference this thread in the future. Thanks for your help!Southworth
K
20

This might be easier in long format, but here's an option you can pursue as wide data.

Using the latest version of dplyr you can mutate across and include .names argument to define how your want your new columns to look.

library(tidyverse)

my_col <- c("var1", "var2", "var3", "var4")

df %>%
  group_by(year) %>%
  mutate(across(my_col, mean, .names = "mean_{col}")) %>%
  mutate(across(my_col, .names = "relmean_{col}") / across(paste0("mean_", my_col)))

Output

   year country  var1  var2  var3  var4 mean_var1 mean_var2 mean_var3 mean_var4 relmean_var1 relmean_var2 relmean_var3 relmean_var4
  <int> <chr>   <int> <int> <int> <int>     <dbl>     <dbl>     <dbl>     <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
1  1910 GER         1     4    10     6       3         5         9         7.5        0.333        0.8          1.11         0.8  
2  1911 GER         2     3    11     7       1.5       3.5      10.5       8          1.33         0.857        1.05         0.875
3  1910 FRA         5     6     8     9       3         5         9         7.5        1.67         1.2          0.889        1.2  
4  1911 FRA         1     4    10     9       1.5       3.5      10.5       8          0.667        1.14         0.952        1.12
Kilkenny answered 24/11, 2021 at 19:23 Comment(2)
That's pretty slick & compact using dplyr::across(). It looks like you could include both year & country breakdowns by repeating the last three lines and piping to them (with only minor tweaks to group_by and the new variable names).Legalize
Yes, this does exactly what I wanted in such a nice and concise way. Thank you @Ben!Southworth
T
4
library(tidyverse)

data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L,    "GER",    1L,    2L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    3L,   10L,     9L
)

data_long <-
  data %>%
  pivot_longer(-c(year, country))

data_long
#> # A tibble: 16 x 4
#>     year country name  value
#>    <int> <chr>   <chr> <int>
#>  1  1910 GER     var1      1
#>  2  1910 GER     var2      2
#>  3  1910 GER     var3     10
#>  4  1910 GER     var.4     6
#>  5  1911 GER     var1      2
#>  6  1911 GER     var2      3
#>  7  1911 GER     var3     11
#>  8  1911 GER     var.4     7
#>  9  1910 FRA     var1      5
#> 10  1910 FRA     var2      6
#> 11  1910 FRA     var3      8
#> 12  1910 FRA     var.4     9
#> 13  1911 FRA     var1      1
#> 14  1911 FRA     var2      3
#> 15  1911 FRA     var3     10
#> 16  1911 FRA     var.4     9

means_country <-
  data_long %>%
  group_by(country) %>%
  summarise(mean_country_value = mean(value))

means_years <-
  data_long %>%
  group_by(year) %>%
  summarise(mean_year_value = mean(value))

data %>%
  left_join(means_country) %>%
  left_join(means_years)
#> Joining, by = "country"
#> Joining, by = "year"
#> # A tibble: 4 x 8
#>    year country  var1  var2  var3 var.4 mean_country_value mean_year_value
#>   <int> <chr>   <int> <int> <int> <int>              <dbl>           <dbl>
#> 1  1910 GER         1     2    10     6               5.25            5.88
#> 2  1911 GER         2     3    11     7               5.25            5.75
#> 3  1910 FRA         5     6     8     9               6.38            5.88
#> 4  1911 FRA         1     3    10     9               6.38            5.75

Created on 2021-11-24 by the reprex package (v2.0.1)

Teston answered 24/11, 2021 at 13:24 Comment(0)
L
1

Here's an extension of @danlooo's approach so the country-level mean and year-level mean are in the same dataset (if that's desired). The notable difference is using two mutates in the pipe chain, instead of two summarizes, then joining.

Consider if you really want it wide again. Usually it's easier to keep it long (eg, remove the final call to tidyr::pivot_wider()).

ds <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3,  ~var4,
  1910L,    "GER",    1L,    4L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    4L,   10L,     9L
)  

ds |> 
  dplyr::mutate(
    year = as.character(year)   # To help the pivot below
  ) |> 
  tidyr::pivot_longer(
    cols         = -c(year, country), 
    names_to     = "key",
    names_prefix = "^var"
  ) |> 
  dplyr::group_by(country, key) |> 
  dplyr::mutate(
    m_c   = mean(value),  # Mean for the Country (and variable)
    r_c   = value / m_c,  # Relative mean for the Country (and variable)
  ) |> 
  dplyr::ungroup() |> 
  dplyr::group_by(year, key) |> 
  dplyr::mutate(
    m_y   = mean(value),  # Mean for the Year (and variable)
    r_y   = value / m_y,  # Relative mean for the Year (and variable)
  ) |> 
  dplyr::ungroup() |> 
  dplyr::mutate(
    year  = as.integer(year)  # Return it to a number
  ) |> 
  tidyr::pivot_wider(
    id_cols = c(year, country),
    names_from = key,
    names_glue = "{.value}_{key}",
    values_from = c(value, m_c, r_c, m_y, r_y)
  )

Output (wide) I prefer longer descriptive variable names like @danlooo's, but I wanted everything to fit in the SO screen:

   year country value_1 value_2 value_3 value_4 m_c_1 m_c_2 m_c_3 m_c_4 r_c_1 r_c_2 r_c_3 r_c_4 m_y_1 m_y_2 m_y_3 m_y_4 r_y_1 r_y_2 r_y_3 r_y_4
  <int> <chr>     <int>   <int>   <int>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  1910 GER           1       4      10       6   1.5   3.5  10.5   6.5 0.667 1.14  0.952 0.923   3     5     9     7.5 0.333 0.8   1.11  0.8  
2  1911 GER           2       3      11       7   1.5   3.5  10.5   6.5 1.33  0.857 1.05  1.08    1.5   3.5  10.5   8   1.33  0.857 1.05  0.875
3  1910 FRA           5       6       8       9   3     5     9     9   1.67  1.2   0.889 1       3     5     9     7.5 1.67  1.2   0.889 1.2  
4  1911 FRA           1       4      10       9   3     5     9     9   0.333 0.8   1.11  1       1.5   3.5  10.5   8   0.667 1.14  0.952 1.12 

Output (long --without the final tidyr::pivot_wider())

# A tibble: 16 x 8
    year country key   value   m_c   r_c   m_y   r_y
   <int> <chr>   <chr> <int> <dbl> <dbl> <dbl> <dbl>
 1  1910 GER     1         1   1.5 0.667   3   0.333
 2  1910 GER     2         4   3.5 1.14    5   0.8  
 3  1910 GER     3        10  10.5 0.952   9   1.11 
 ...
15  1911 FRA     3        10   9   1.11   10.5 0.952
16  1911 FRA     4         9   9   1       8   1.12 
Legalize answered 24/11, 2021 at 19:55 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.