row-wise operations, select helpers and the mutate function in dplyr

I will use the following data set to illustrate my questions:

my_df <- data.frame(
    a = 1:10,
    b = 10:1
)
colnames(my_df) <- c("a", "b")

Part 1

I use the mutate() function to create two new variables in my data set and I would like to compute the row means of the two new columns inside the same mutate() call. However, I would really like to be able to use the select() helpers such as starts_with(), ends_with() or contains().

My first try:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I understand why there is an error - the select() function is not given any .data argument. So I change the code in...

... my second try by adding "." inside the select() function:

my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(., ends_with("2")))
    )
    a  b a_2 b_2 mean
1   1 10   1 100  NaN
2   2  9   4  81  NaN
3   3  8   9  64  NaN
4   4  7  16  49  NaN
5   5  6  25  36  NaN
6   6  5  36  25  NaN
7   7  4  49  16  NaN
8   8  3  64   9  NaN
9   9  2  81   4  NaN
10 10  1 100   1  NaN

The new problem after the second try is that the mean column does not contain the mean of a_2 and b_2 as expected, but contains NaNs only. After studying the code a bit, I understood the second problem. The added "." in the select() function refers to the original my_df data frame, which does not have the a_2 and b_2 columns. So it makes sense that NaNs are produced because I am asking R to compute the means of nonexistent values.
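To make that diagnosis concrete, here is a small sketch that reproduces what select() sees at that point. The names no_match and row_means are just illustrative; the check runs select() on the original my_df, which is what "." stands for inside the first mutate() call.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

# "." inside the first mutate() is still the data frame that was piped in,
# so no column name ends with "2" yet and the selection is empty.
no_match <- select(my_df, ends_with("2"))
ncol(no_match)

# rowMeans() over zero columns has nothing to average in any row,
# which is where the NaN values come from.
row_means <- rowMeans(no_match)
all(is.nan(row_means))
```

So the NaNs are not a bug in rowMeans(); they are the result of averaging an empty selection for each of the 10 rows.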

I then tried to use dplyr functions such as current_vars() to see if it would make a difference:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        mean = rowMeans(select(current_vars(), ends_with("2")))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: Variable context not set.

However, this is obviously NOT the way to use this function. The solution is to simply add a second mutate() function:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    mutate(mean = rowMeans(select(., ends_with("2"))))
    a  b a_2 b_2 mean
1   1 10   1 100 50.5
2   2  9   4  81 42.5
3   3  8   9  64 36.5
4   4  7  16  49 32.5
5   5  6  25  36 30.5
6   6  5  36  25 30.5
7   7  4  49  16 32.5
8   8  3  64   9 36.5
9   9  2  81   4 42.5
10 10  1 100   1 50.5

Question 1: Is there any way to perform this task in the same mutate() call? Using a second mutate() function is not really an issue anyway; however, I am curious to know if there exists a way to refer to currently existing variables. The mutate() function allows for the usage of variables as soon as they are created inside the same mutate() call; however, this becomes problematic when functions are nested as shown in my example above.
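For later readers: a minimal sketch of a single-call version, assuming dplyr >= 1.0.0 (which did not exist when this question was asked). It relies on across() being evaluated against the current state of the data inside mutate(), so it should also see columns created earlier in the same call. The name res is only for illustration.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

res <- my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2,
        # across() without a function returns the selected columns as a
        # data frame, including a_2 and b_2 created just above
        mean = rowMeans(across(ends_with("2")))
    )

res
```

As far as I know, dplyr 1.1.0 and later offer pick(ends_with("2")) as the preferred spelling for selecting columns without applying a function, so treat the across() form as one possible option rather than the canonical one.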

Part 2

I also realize that using rowMeans() works in my solution; however, it is not really a dplyr-way of doing things especially because I need to use select() inside it. So, I decided to use the rowwise() and mean() functions instead. But once again, I would like to use one of the select() helpers for that and not have to list all variables in a c() function. I tried:

 my_df %>%
    mutate(
        a_2 = a^2,
        b_2 = b^2
    ) %>%
    rowwise() %>%
    mutate(
        mean = mean(ends_with("2"))
    )
Error in mutate_impl(.data, dots) : 
  Evaluation error: No tidyselect variables were registered.

I suspect that the error in the code is due to the fact that ends_with() is not inside select(), but I am showing this to ask whether there is a way to list the variables I want without having to specify them individually.

Thank you for your time.

Boarfish answered 20/1, 2018 at 6:31 Comment(5)
Your question in #2 baffles me. my_df %>% mutate(a_2 = a^2, b_2 = b^2) %>% rowwise()%>% select(. , ends_with("2")) is the object that you want to run means() upon, but this will never work because rowMeans() is designed to work horizontally while means() is not. - Murrain
@InfiniteFlashChess What do you mean "for #1, I'm referencing"? Also, with regards to question #2, what package does the means() function belong to? And yes, I specified in the question that I am trying to compute horizontal means. This is why I used rowMeans() in the first part and a combination of rowwise() and mean() in the second part. - Boarfish
Well, the point is that the function mean() won't operate the way you intend it to. I was "referencing #1" because it seemed worthy of a bounty. Likely, we'll need Hadley (or someone very proficient here) to answer it :) - Murrain
@InfiniteFlashChess I understand that. The input to the mean function is a numeric vector. It is actually possible to combine rowwise() and mean(); however, you need to manually specify column names in a c() function. I was just wondering if there existed a way to use one of the select helpers to perform the same task. - Boarfish
SavedByJESUS, would definitely consider bountying Problem #1 and have someone attempt to answer it (I am interested in performing #1 properly as well!) - Murrain

Fortunately, since dplyr 1.0.0 there is a dplyr way to do exactly what you were looking for: c_across(). This is helpful because it extends the solution to other functions that, unlike mean() with rowMeans(), may not have a dedicated row-wise counterpart.

Try this:

my_df %>%
  mutate(
    a_2 = a^2,
    b_2 = b^2
  ) %>%
  rowwise() %>%
  mutate(mean = mean(c_across(ends_with("2"))))
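As a sanity check, the sketch below compares the c_across() result with the rowMeans() numbers from the question. The ungroup() at the end is my own addition, not part of the answer: it drops the row-wise grouping so that later verbs operate on whole columns again. The names res and expected are illustrative.

```r
library(dplyr)

my_df <- data.frame(a = 1:10, b = 10:1)

res <- my_df %>%
  mutate(a_2 = a^2, b_2 = b^2) %>%
  rowwise() %>%
  mutate(mean = mean(c_across(ends_with("2")))) %>%
  ungroup()  # drop the row-wise grouping once the per-row work is done

# Reference values computed with base R only
expected <- rowMeans(my_df[c("a", "b")]^2)

all.equal(res$mean, unname(expected))
```

On large data frames rowwise() can be noticeably slower than a vectorised rowMeans(), so which one to prefer probably depends on whether the function you need has a vectorised row-wise form.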
Anastomose answered 1/2, 2022 at 20:46 Comment(0)

A bit late, but here is a solution to problem 1, for reference.

If you had to do it without pipes, you would write:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp2 = select(tmp1, ends_with("2"))
tmp3 = rowMeans(tmp2)
tmp4 = mutate(tmp1, m=tmp3)

Or, with fewer intermediate steps:

tmp1 = mutate(my_df, a_2 = a^2, b_2 = b^2)
tmp4 = mutate(tmp1, m=rowMeans(select(tmp1, ends_with("2"))) )

Note that computing tmp4 requires using tmp1 twice. So in the piped version you also need to reference . explicitly a second time (as usual, the first reference is implicit, as the first argument to mutate()):

my_df %>%
  mutate(a_2 = a^2, b_2 = b^2) %>%
  mutate(mean = rowMeans(select(., ends_with("2"))) )

For problem #2: avoiding the call to rowMeans() is trickier, and maybe not even desirable.

Vasileior answered 18/6, 2018 at 19:14 Comment(0)
