I have a panel dataset at the country and year level, and I'd like to create a two new variables based on existing ones.
year | country | var1 | var2 | var3 | var 4 | mean_var1 | relmean_var1 |
---|---|---|---|---|---|---|---|
1910 | GER | 1 | 4 | 10 | 6 | 3 | 0.333 |
1911 | GER | 2 | 3 | 11 | 7 | 1.5 | 1.3333 |
1910 | FRA | 5 | 6 | 8 | 9 | 3 | 1.66667 |
1911 | FRA | 1 | 4 | 10 | 9 | 1.5 | .66667 |
What I'd like to do is create two new variables set : (1) a variable set of the average for each year (across countries) and (2) a variable set of the country value relative to the year-average. For example, for var1(1) would yield mean_var1 and (2) relmean_var1 and I'd want these for all the other variables. In total, there are over 1000 variables in the dataset, but I would only apply this function to about 6.
I have code that works for the first part, but I'd like to combine it as efficiently as possible with the second.
library(dplyr)
library(purrr)
df<- df%>%
group_by(year) %>%
mutate_at(.funs = list(mean = ~mean(.)), .vars = c("var1", "var1", "var1", "var4"))
This code yields new variables called var1_mean (I would prefer mean_var1: how do I change this name?)
For the second step, I've tried:
df <- df %>%
map2_dfr(.x = d.test %>%
select(var1, var2),
.y = d.test %>%
select(var1_mean, var2_mean),
~ .x / .y) %>%
setNames(c("relmean_var1", "relmean_var2"))
and I get errors
""Error in select(., var1, var2) : object 'd.test' not found."
. (I got this set up from this question)
I also tried:
map2(var1, var1_mean, ~ df[[.x]] / df[[.y]]) %>%
set_names(cols) %>%
bind_cols(df, .)
And got
"Error in map2(var1, var1_mean, ~df[[.x]]/df[[.y]]) : object 'var1' not found
What's the best way to combine these two goals? Ideally with the naming scheme mean_var1 for (1) and relmean_var1 for (2)
Edit: input dataframe should look like this:
data <- tibble::tribble(
~year, ~country, ~var1, ~var2, ~var3, ~var.4,
1910L, "GER", 1L, 4L, 10L, 6L,
1911L, "GER", 2L, 3L, 11L, 7L,
1910L, "FRA", 5L, 6L, 8L, 9L,
1911L, "FRA", 1L, 4L, 10L, 9L
)
output dataframe should look like this (for all variables, just showing var1 as an example, but should be the same format for var2 through var4):
datanew <- tibble::tribble(
~year, ~country, ~var1, ~var2, ~var3, ~var.4, ~mean_var1 , ~relmean_var1
1910L, "GER", 1L, 4L, 10L, 6L, 3L, .3333L,
1911L, "GER", 2L, 3L, 11L, 7L, 1.5L, 1.3333L,
1910L, "FRA", 5L, 6L, 8L, 9L, 3L, 1.6667L,
1911L, "FRA", 1L, 4L, 10L, 9L 1.5L, .6667L,
)
datanew$var2
have wrong values in the first & last row? – Legalize