dplyr rowwise sum and other functions like max
Asked Answered
M

3

11

If I wanted to sum over some variables in a data-frame using dplyr, I could do:

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> select(iris, starts_with('Petal')) %>% rowSums()
  [1] 1.6 1.6 1.5 1.7 1.6 2.1 1.7 1.7 1.6 1.6 1.7 1.8 1.5 1.2 1.4 1.9 1.7 1.7 2.0 1.8 1.9 1.9 1.2 2.2 2.1 1.8 2.0 1.7 1.6 1.8 1.8 1.9 1.6 1.6 1.7 1.4
 [37] 1.5 1.5 1.5 1.7 1.6 1.6 1.5 2.2 2.3 1.7 1.8 1.6 1.7 1.6 6.1 6.0 6.4 5.3 6.1 5.8 6.3 4.3 5.9 5.3 4.5 5.7 5.0 6.1 4.9 5.8 6.0 5.1 6.0 5.0 6.6 5.3
 [73] 6.4 5.9 5.6 5.8 6.2 6.7 6.0 4.5 4.9 4.7 5.1 6.7 6.0 6.1 6.2 5.7 5.4 5.3 5.6 6.0 5.2 4.3 5.5 5.4 5.5 5.6 4.1 5.4 8.5 7.0 8.0 7.4 8.0 8.7 6.2 8.1
[109] 7.6 8.6 7.1 7.2 7.6 7.0 7.5 7.6 7.3 8.9 9.2 6.5 8.0 6.9 8.7 6.7 7.8 7.8 6.6 6.7 7.7 7.4 8.0 8.4 7.8 6.6 7.0 8.4 8.0 7.3 6.6 7.5 8.0 7.4 7.0 8.2
[145] 8.2 7.5 6.9 7.2 7.7 6.9

That's fine, but I would have thought rowwise accomplishes the same thing, but it doesn't,

> select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
[1] 743.6

What I particularly want to do is select a set of columns, and create a new variable each value of which is the maximum value of each row of the selected columns. For example, if I selected the "Petal" columns, by maximum values would be 1.4, 1.4, 1.3 and so on.

I could do it like this:

> select(iris, starts_with('Petal')) %>% apply(1, max)

and that's fine. But I'm just curious as to why the rowwise approach doesn't work. I realize I am using rowwise incorrectly, I'm just not sure why it is wrong.

Monohydric answered 21/3, 2018 at 0:27 Comment(0)
M
10

In short: you are expecting the "sum" function to be aware of dplyr data structures like a data frame grouped by row. sum is not aware of it so it just takes the sum of the whole data.frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see you were constructing something called a tibble. Then the rowwise call adds additional information on this object and specifies that it should be grouped row-wise.

However only the functions aware of this grouping like summarize and mutate can work like intended. Base R functions like sum are not aware of these objects and treat them as any standard data.frames. And the standard approach for sum() is to sum the entire data frame.

Using mutate works:

select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
   Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1         1.40       0.200  1.60
 2         1.40       0.200  1.60
 3         1.30       0.200  1.50
 ...
Miaow answered 21/3, 2018 at 1:0 Comment(0)
A
12

The problem is that the entire data frame is passed as dot despite the rowwise. To handle this use do which will interpret dot as meaning just the current row. One further problem is that the dot within do will represent the row as a list so convert it appropriately.

library(dplyr)

iris %>%
  slice(1:6) %>%
  select(starts_with('Petal')) %>% 
  rowwise() %>%
  do( (.) %>% as.data.frame %>% mutate(sum = sum(.)) ) %>%
  ungroup

giving:

# A tibble: 6 x 3
  Petal.Length Petal.Width   sum
*        <dbl>       <dbl> <dbl>
1         1.40       0.200  1.60
2         1.40       0.200  1.60
3         1.30       0.200  1.50
4         1.50       0.200  1.70
5         1.40       0.200  1.60
6         1.70       0.400  2.10

dplyr 1.0 - added later

Since this was asked dplyr 1.0 was released and it has cur_data() which can be used to simplify the above eliminating the need for do. cur_data() within a rowwise block refers only to the current row.

iris %>%
  slice(1:6) %>%
  select(starts_with('Petal')) %>% 
  rowwise() %>%
  mutate(sum = sum(cur_data())) %>%
  ungroup
Alton answered 21/3, 2018 at 0:55 Comment(0)
M
10

In short: you are expecting the "sum" function to be aware of dplyr data structures like a data frame grouped by row. sum is not aware of it so it just takes the sum of the whole data.frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see you were constructing something called a tibble. Then the rowwise call adds additional information on this object and specifies that it should be grouped row-wise.

However only the functions aware of this grouping like summarize and mutate can work like intended. Base R functions like sum are not aware of these objects and treat them as any standard data.frames. And the standard approach for sum() is to sum the entire data frame.

Using mutate works:

select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
   Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1         1.40       0.200  1.60
 2         1.40       0.200  1.60
 3         1.30       0.200  1.50
 ...
Miaow answered 21/3, 2018 at 1:0 Comment(0)
P
8

You can skip the use of select if you use c_across to select the variables you want to sum:

iris %>% 
  rowwise() %>% 
  mutate(sum = sum(c_across(starts_with("Petal"))), .keep = "used") %>% 
  ungroup()

Output

If you want keep all the columns in your data frame then remove the .keep argument.

 Petal.Length Petal.Width   sum
          <dbl>       <dbl> <dbl>
 1          1.4         0.2   1.6
 2          1.4         0.2   1.6
 3          1.3         0.2   1.5
 4          1.5         0.2   1.7
 5          1.4         0.2   1.6
 6          1.7         0.4   2.1
 7          1.4         0.3   1.7
 8          1.5         0.2   1.7
 9          1.4         0.2   1.6
10          1.5         0.1   1.6
# ... with 140 more rows

Similarly, with max:

iris %>% 
    rowwise() %>% 
    mutate(max = max(c_across(starts_with("Petal"))), .keep = "used") %>% 
    ungroup()

Note

If a row-wise aggregation function already exists it very likely much faster than using rowwise. For example, to get row sums, the row-wise aggregation function rowSums is available in base R and can be implemented like so with across not c_across:

# dplyr 1.1.0 use pick instead of across
iris %>% 
  mutate(sum = rowSums(across(starts_with("Petal"))), .keep = "used")
Paraph answered 30/10, 2021 at 1:1 Comment(2)
Is c_across() the same as across() ?Tater
@Tater Good question -- no, it is not. See my SO answer for more detail.Paraph

© 2022 - 2024 — McMap. All rights reserved.