mutate_at does not create variable suffixes in some cases?

I have been playing with dplyr::mutate_at to create new variables by applying the same function to some of the columns. When I name my function in the .funs argument, the mutate call creates new columns with a suffix instead of replacing the existing ones, which is a cool option that I discovered in this thread.

library(dplyr)

df = data.frame(var1=1:2, var2=4:5, other=9)
df %>% mutate_at(vars(contains("var")), .funs=funs('sqrt'=sqrt))
####   var1 var2 other var1_sqrt var2_sqrt
#### 1    1    4     9  1.000000  2.000000
#### 2    2    5     9  1.414214  2.236068

However, I noticed that when the vars() selection matches only one column instead of several, the new column drops the original name: here it is named sqrt instead of other_sqrt:

df %>% mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt))
####   var1 var2 other sqrt
#### 1    1    4     9    3
#### 2    2    5     9    3

I would like to understand why this behaviour happens and how to avoid it, because I don't know in advance how many columns contains() will return.

EDIT: The newly created columns must keep the names of the original columns, with the suffix 'sqrt' added at the end.

Thanks

Umeh answered 4/2, 2018 at 22:49 Comment(4)
I think if you flip the perspective, naming the new column sqrt in the second case makes sense. In the first case, however, it cannot give several new columns the same name, so it is forced to use the original column names as prefixes... – Intervene
@Intervene Oooh, that's clever, thanks, it might be the reason; I was taking a side effect for the main behaviour. That would answer the why... but I still don't know how to control the behaviour so that it always names the columns the same way. – Umeh
I didn't find a direct, elegant solution after a short search, but I fully understand your wish for consistency. – Intervene
@Intervene Thanks. Well, anyway, you helped me figure out what's happening; I'll upvote it if you post your comment as an answer. – Umeh
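
The naming rule suggested in these comments can be sketched in base R. This is only an illustration of the behaviour observed in the question, not dplyr's actual implementation, and name_new_cols is a hypothetical helper introduced here for the example:

```r
# Hypothetical helper mimicking how the new columns appear to be named
# when a single named function is passed to mutate_at().
name_new_cols <- function(cols, fn_name) {
  if (length(cols) == 1) {
    # One selected column: the function name alone is already unique
    fn_name
  } else {
    # Several columns: prefix with the column name to keep names unique
    paste(cols, fn_name, sep = "_")
  }
}

name_new_cols(c("var1", "var2"), "sqrt")  # "var1_sqrt" "var2_sqrt"
name_new_cols("other", "sqrt")            # "sqrt"
```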

Here is another idea. We can add setNames(sub("^sqrt$", "other_sqrt", names(.))) after the mutate_at call to replace the column name sqrt with other_sqrt. The pattern ^sqrt$ only matches the derived column sqrt, which exists only when a single column contains other, as demonstrated in Example 1. If more than one column contains other, as in Example 2, setNames does not change any column names.

library(dplyr)

# Example 1
df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)

df %>% 
  mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
  setNames(sub("^sqrt$", "other_sqrt", names(.)))
#   var1 var2 other other_sqrt
# 1    1    4     9          3
# 2    2    5     9          3

# Example 2
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)

df2 %>% 
  mutate_at(vars(contains("other")), funs("sqrt" = sqrt(.))) %>%
  setNames(sub("^sqrt$", "other_sqrt", names(.)))
#   var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1    1    4      9     16           3           4
# 2    2    5      9     16           3           4

Or we can design a function to check how many columns contain the string other before manipulating the data frame.

mutate_sqrt <- function(df, string){
  # Names of the columns matching `string`
  string_col <- grep(string, names(df), value = TRUE)
  df2 <- df %>% mutate_at(vars(contains(string)), funs("sqrt" = sqrt(.)))
  # With a single match the new column is named just "sqrt", so rename it
  if (length(string_col) == 1){
    df2 <- df2 %>% setNames(sub("^sqrt$", paste(string_col, "sqrt", sep = "_"), names(.)))
  }
  return(df2)
}

mutate_sqrt(df, "other")
#   var1 var2 other other_sqrt
# 1    1    4     9          3
# 2    2    5     9          3

mutate_sqrt(df2, "other")
#   var1 var2 other1 other2 other1_sqrt other2_sqrt
# 1    1    4      9     16           3           4
# 2    2    5      9     16           3           4 
Guiscard answered 5/2, 2018 at 1:23 Comment(5)
If that is what you want, the solution you provided will also not work. – Guiscard
Not sure what you are talking about. My solution works the same as yours. Since it only checks the column names, while your solution creates an entire extra column named "other_fake", mine would be more efficient in terms of performance. I do not understand your last comment. If the data frame looks like data.frame(var1=1, some_other_var = 9), you can do data.frame(var1=1, some_other_var = 9) %>% mutate_sqrt("some_other_var"). – Guiscard
OK, an example: let's assume that df is some random weather data collected from the web. I don't control the exact names of the variables, but I know for sure that temperature variables always contain the string "temp" somewhere. I want to automatically find those variables and duplicate them while applying the sqrt function, adding the suffix "sqrt" to the newly created columns, based on their previous names. Is it clearer like this? :-) Your function returns the right output when several columns match, but with just one column it does not return the right column name. – Umeh
In your example, a single mutate_at will always add _sqrt to all variables unless there is only one column called temp. Both your solution and my solution would be able to add _sqrt to that column. I don't understand why you think your solution works but not mine. – Guiscard
@Umeh I see. Please see my update. I updated the mutate_sqrt function. It should work now. – Guiscard
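
For the case discussed in these comments (an unknown number of matching columns), one possible alternative is to build the new names explicitly rather than renaming afterwards. The following is a minimal base R sketch, not part of the answer above; add_sqrt_cols is a hypothetical helper, and it assumes a literal, case-sensitive match on the column names:

```r
# Hypothetical helper: always appends "_sqrt", however many columns match.
add_sqrt_cols <- function(df, string) {
  # Literal match on column names (fixed = TRUE disables regex interpretation)
  cols <- grep(string, names(df), value = TRUE, fixed = TRUE)
  if (length(cols) == 0) return(df)  # nothing matched: return input unchanged
  # Name the new columns explicitly instead of relying on automatic naming
  df[paste0(cols, "_sqrt")] <- lapply(df[cols], sqrt)
  df
}

df  <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)
df2 <- data.frame(var1 = 1:2, var2 = 4:5, other1 = 9, other2 = 16)

names(add_sqrt_cols(df, "other"))
# "var1" "var2" "other" "other_sqrt"
names(add_sqrt_cols(df2, "other"))
# "var1" "var2" "other1" "other2" "other1_sqrt" "other2_sqrt"
```

Because the suffix is pasted onto every matched name, the one-column and several-column cases go through the same code path. Note that contains() ignores case by default, whereas this sketch does not.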

I just figured out a (not so clean) way to do it: I add an extra dummy variable to the dataset, with a name that ensures it will be selected so that we never fall into the one-variable case, and after the calculation I remove the two dummy columns (other_fake and other_fake_sqrt), like this:

df %>% mutate(other_fake=NA) %>% 
  mutate_at(vars(contains("other")), .funs=funs('sqrt'=sqrt)) %>% 
  select(-contains("other_fake"))
####   var1 var2 other other_sqrt
#### 1    1    4     9          3
#### 2    2    5     9          3
Umeh answered 4/2, 2018 at 23:23 Comment(0)
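
For readers on a more recent dplyr, the same goal can probably be reached without a workaround. The sketch below assumes dplyr 1.0.0 or later is installed, where across() accepts a .names argument that sets the output names explicitly; it is offered as one possible approach and is not taken from the answers above:

```r
library(dplyr)

df <- data.frame(var1 = 1:2, var2 = 4:5, other = 9)

# "{.col}" is replaced by each selected column's name, so the suffix is
# applied the same way whether one or several columns are selected.
res <- df %>%
  mutate(across(contains("other"), sqrt, .names = "{.col}_sqrt"))

names(res)
# "var1" "var2" "other" "other_sqrt"
```

The original column is kept and the new one is added alongside it, which matches the requirement stated in the question's EDIT.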