Applying a function to every row of a table using dplyr?
Asked Answered
Z

8

148

When working with plyr I often found it useful to use adply for scalar functions that I have to apply to each and every row.

e.g.

data(iris)
library(plyr)
head(
     adply(iris, 1, transform , Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     5.1
2          4.9         3.0          1.4         0.2  setosa     4.9
3          4.7         3.2          1.3         0.2  setosa     4.7
4          4.6         3.1          1.5         0.2  setosa     4.6
5          5.0         3.6          1.4         0.2  setosa     5.0
6          5.4         3.9          1.7         0.4  setosa     5.4

Now I'm using dplyr more, I'm wondering if there is a tidy/natural way to do this? As this is NOT what I want:

library(dplyr)
head(
     mutate(iris, Max.Len= max(Sepal.Length,Petal.Length))
    )
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
1          5.1         3.5          1.4         0.2  setosa     7.9
2          4.9         3.0          1.4         0.2  setosa     7.9
3          4.7         3.2          1.3         0.2  setosa     7.9
4          4.6         3.1          1.5         0.2  setosa     7.9
5          5.0         3.6          1.4         0.2  setosa     7.9
6          5.4         3.9          1.7         0.4  setosa     7.9
Zamora answered 16/2, 2014 at 23:21 Comment(7)
i recently asked if there was an equivalent of mdply in dplyr, and hadley suggested that they might be brewing something based on do. I guess it would also work here.Gilmagilman
Eventually dplyr will have something like rowwise() which would group by each individual rowUnbreathed
@Unbreathed thx, shouldn't it just behave like adply when you don't use a grouping though? as its closely integrated function is called group_by NOT split_byZamora
@StephenHenderson no, because you also need some way to operate on the table as a whole.Unbreathed
Related post: Call apply-like function on each row of dataframe with multiple arguments from each rowRoughspoken
I know this is a really old question, but iris %>% mutate(Max.Len=pmax(Sepal.Length,Sepal.Width) ) works alsoSpiegeleisen
@Spiegeleisen Yes but that method doesn't generalise. There is no psum, pmean or pmedian for instance.Zamora
L
229

As of dplyr 0.2 (I think) rowwise() is implemented, so the answer to this problem becomes:

iris %>% 
  rowwise() %>% 
  mutate(Max.Len= max(Sepal.Length,Petal.Length))

Non rowwise alternative

Five years (!) later this answer still gets a lot of traffic. Since it was given, rowwise is increasingly not recommended, although lots of people seem to find it intuitive. Do yourself a favour and go through Jenny Bryan's Row-oriented workflows in R with the tidyverse material to get a good handle on this topic.

The most straightforward way I have found is based on one of Hadley's examples using pmap:

iris %>% 
  mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))

Using this approach, you can give an arbitrary number of arguments to the function (.f) inside pmap.

pmap is a good conceptual approach because it reflects the fact that when you're doing row wise operations you're actually working with tuples from a list of vectors (the columns in a dataframe).

Lost answered 14/7, 2014 at 0:20 Comment(18)
I've changed this (from the above) to the ideal answer as I think this is the intended usage.Zamora
is it possible to add the values of a dynamically formed datatframe? So in this data frame the column names are not known. I am able to add if column names are known.Afloat
#28807766 just found the answer. In this they are using correlation instead of sum. But same concept.Afloat
If it does not work, make sure you are actually using dplyr::mutate not plyr::mutate - drove me nutsMil
Thanks YAK, this bit me too. If you include both plyr and dplyr packages, you are almost certainly using the wrong mutate unless you explicitly provide scope dplyr::mutate.Dumuzi
Be cautious using this approach as mutate() can slow down dramatically given large numbers of groups. Groups in the order of 10^5 can mean mutation takes minutes instead of microseconds. See this gist.Upswing
If you are reading this for the future: consider whether your problem can be vectorised first... For example here, there is already a vectorised max function called pmax, so using rowwise is doubly inefficent: more to type and more time to process. See mnel's answer below for the general case when no vectorised function readily exists.Capping
It does not work with the output of the function to be apply by row is something more complex like a list. What is suggested to use in that case?Lidia
@ErickChacon I recommend asking that as a separate questionLost
Thanks so much - I've been banging my head against this issue for a while. It does seem odd that a) this isnt the default behaviour for something built for data frames? b) why would it be slower - shouldn't rowwise mean less processing?Akmolinsk
This function is currently in "questioning" stage of lifecycle. tidyverse.org/lifecycle/#questioningImportunacy
you do not need rowwise to mutate(Max.Len= max(Sepal.Length,Petal.Length)), and it will also have the side effect of grouping and outputting a tibble, instead of a data.frame.Snell
@Snell have you compared the output with and without? It makes a big different to what max returnsLost
@Lost Hmm.. I'm really not liking the purrr::pmap_dbl method. It's a real conceptual leap. Wouldn't want to teach that in software carpentry ;)Zamora
@StephenHenderson agreed it's a challenge. I've now really come around to it conceptually, but it depends on an understanding of what's actually happening. rowwise feels very natural to new users in my experienceLost
I'm sure this answer will continue to get lots of traffic...It's worth mentioning, as of mid-2020, rowwise() is now accepted (and recommended) in dplyr 1.0.0, and will likely stick around. Hadley was convinced of it's utility and intuitiveness over the purrr solutions. See the official documentationCounterproductive
How to give the names of the variables in a tidyselect way for the pmap approach?Mettah
"although lots of people seem to find it intuitive" <-- It's R's mission to do the opposite.Nickelous
D
24

The idiomatic approach will be to create an appropriately vectorised function.

R provide pmax which is suitable here, however it also provides Vectorize as a wrapper for mapply to allow you to create a vectorised arbitrary version of an arbitrary function.

library(dplyr)
# use base R pmax (vectorized in C)
iris %>% mutate(max.len = pmax(Sepal.Length, Petal.Length))
# use vectorize to create your own function
# for example, a horribly inefficient get first non-Na value function
# a version that is not vectorized
coalesce <- function(a,b) {r <- c(a[1],b[1]); r[!is.na(r)][1]}
# a vectorized version
Coalesce <- Vectorize(coalesce, vectorize.args = c('a','b'))
# some example data
df <- data.frame(a = c(1:5,NA,7:10), b = c(1:3,NA,NA,6,NA,10:8))
df %>% mutate(ab =Coalesce(a,b))

Note that implementing the vectorization in C / C++ will be faster, but there isn't a magicPony package that will write the function for you.

Demesne answered 17/2, 2014 at 0:13 Comment(2)
thx, this is a great answer, is excellent general R style -idiomatic as you say, but I don't think its really addressing my question whether there is a dplyr way... as it would be simpler without dplyr e.g. with(df, Coalesce(a,b)) Perhaps, that's a kind of answer though - don't use dplyr for that?Zamora
Have to admit I double checked that there isn't a magicPony package. Too badCanaan
J
22

You need to group by row:

iris %>% group_by(1:n()) %>% mutate(Max.Len= max(Sepal.Length,Petal.Length))

This is what the 1 did in adply.

Japonica answered 16/2, 2014 at 23:29 Comment(5)
It seems like there should be a simpler or "nicer" syntax.Zamora
@StephenHenderson, there may be, I'm not a dplyr expert. Hopefully someone else will come along with something better. Note I cleaned it up a bit with 1:n().Japonica
I suspect you are right, but I sort of feel like the default behaviour with no grouping should be like the group_by(1:n()) behaviour. If no-one has any other ideas in the morning I'll tick yours ;)Zamora
Also, note that this is somewhat in contravention of documentation for n: "This function is implemented special for each data source and can only be used from within summarise.", though it seems to work.Japonica
Can you refer to Sepal.Length and Petal.Length by their index number in some way? If you have lots of variables did would be handy. Like ... Max.len = max( [c(1,3)] ) ?Astor
H
19

Update 2017-08-03

After writing this, Hadley changed some stuff again. The functions that used to be in purrr are now in a new mixed package called purrrlyr, described as:

purrrlyr contains some functions that lie at the intersection of purrr and dplyr. They have been removed from purrr in order to make the package lighter and because they have been replaced by other solutions in the tidyverse.

So, you will need to install + load that package to make the code below work.

Original post

Hadley frequently changes his mind about what we should use, but I think we are supposed to switch to the functions in purrr to get the by row functionality. At least, they offer the same functionality and have almost the same interface as adply from plyr.

There are two related functions, by_row and invoke_rows. My understanding is that you use by_row when you want to loop over rows and add the results to the data.frame. invoke_rows is used when you loop over rows of a data.frame and pass each col as an argument to a function. We will only use the first.

Examples

library(tidyverse)

iris %>% 
  by_row(..f = function(this_row) {
    browser()
  })

This lets us see the internals (so we can see what we are doing), which is the same as doing it with adply.

Called from: ..f(.d[[i]], ...)
Browse[1]> this_row
# A tibble: 1 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl>  <fctr>
1          5.1         3.5          1.4         0.2  setosa
Browse[1]> Q

By default, by_row adds a list column based on the output:

iris %>% 
  by_row(..f = function(this_row) {
      this_row[1:4] %>% unlist %>% mean
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species      .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>    <list>
1           5.1         3.5          1.4         0.2  setosa <dbl [1]>
2           4.9         3.0          1.4         0.2  setosa <dbl [1]>
3           4.7         3.2          1.3         0.2  setosa <dbl [1]>
4           4.6         3.1          1.5         0.2  setosa <dbl [1]>
5           5.0         3.6          1.4         0.2  setosa <dbl [1]>
6           5.4         3.9          1.7         0.4  setosa <dbl [1]>
7           4.6         3.4          1.4         0.3  setosa <dbl [1]>
8           5.0         3.4          1.5         0.2  setosa <dbl [1]>
9           4.4         2.9          1.4         0.2  setosa <dbl [1]>
10          4.9         3.1          1.5         0.1  setosa <dbl [1]>
# ... with 140 more rows

if instead we return a data.frame, we get a list with data.frames:

iris %>% 
  by_row( ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

gives:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species                 .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr>               <list>
1           5.1         3.5          1.4         0.2  setosa <data.frame [1 × 2]>
2           4.9         3.0          1.4         0.2  setosa <data.frame [1 × 2]>
3           4.7         3.2          1.3         0.2  setosa <data.frame [1 × 2]>
4           4.6         3.1          1.5         0.2  setosa <data.frame [1 × 2]>
5           5.0         3.6          1.4         0.2  setosa <data.frame [1 × 2]>
6           5.4         3.9          1.7         0.4  setosa <data.frame [1 × 2]>
7           4.6         3.4          1.4         0.3  setosa <data.frame [1 × 2]>
8           5.0         3.4          1.5         0.2  setosa <data.frame [1 × 2]>
9           4.4         2.9          1.4         0.2  setosa <data.frame [1 × 2]>
10          4.9         3.1          1.5         0.1  setosa <data.frame [1 × 2]>
# ... with 140 more rows

How we add the output of the function is controlled by the .collate param. There's three options: list, rows, cols. When our output has length 1, it doesn't matter whether we use rows or cols.

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    this_row[1:4] %>% unlist %>% mean
  })

both produce:

# A tibble: 150 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .out
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <dbl>
1           5.1         3.5          1.4         0.2  setosa 2.550
2           4.9         3.0          1.4         0.2  setosa 2.375
3           4.7         3.2          1.3         0.2  setosa 2.350
4           4.6         3.1          1.5         0.2  setosa 2.350
5           5.0         3.6          1.4         0.2  setosa 2.550
6           5.4         3.9          1.7         0.4  setosa 2.850
7           4.6         3.4          1.4         0.3  setosa 2.425
8           5.0         3.4          1.5         0.2  setosa 2.525
9           4.4         2.9          1.4         0.2  setosa 2.225
10          4.9         3.1          1.5         0.1  setosa 2.400
# ... with 140 more rows

If we output a data.frame with 1 row, it matters only slightly which we use:

iris %>% 
  by_row(.collate = "cols", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
      )
  })

iris %>% 
  by_row(.collate = "rows", ..f = function(this_row) {
    data.frame(
      new_col_mean = this_row[1:4] %>% unlist %>% mean,
      new_col_median = this_row[1:4] %>% unlist %>% median
    )
  })

both give:

# A tibble: 150 × 8
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  .row new_col_mean new_col_median
          <dbl>       <dbl>        <dbl>       <dbl>  <fctr> <int>        <dbl>          <dbl>
1           5.1         3.5          1.4         0.2  setosa     1        2.550           2.45
2           4.9         3.0          1.4         0.2  setosa     2        2.375           2.20
3           4.7         3.2          1.3         0.2  setosa     3        2.350           2.25
4           4.6         3.1          1.5         0.2  setosa     4        2.350           2.30
5           5.0         3.6          1.4         0.2  setosa     5        2.550           2.50
6           5.4         3.9          1.7         0.4  setosa     6        2.850           2.80
7           4.6         3.4          1.4         0.3  setosa     7        2.425           2.40
8           5.0         3.4          1.5         0.2  setosa     8        2.525           2.45
9           4.4         2.9          1.4         0.2  setosa     9        2.225           2.15
10          4.9         3.1          1.5         0.1  setosa    10        2.400           2.30
# ... with 140 more rows

except that the second has the column called .row and the first does not.

Finally, if our output is longer than length 1 either as a vector or as a data.frame with rows, then it matters whether we use rows or cols for .collate:

mtcars[1:2] %>% by_row(function(x) 1:5)
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "rows")
mtcars[1:2] %>% by_row(function(x) 1:5, .collate = "cols")

produces, respectively:

# A tibble: 32 × 3
     mpg   cyl      .out
   <dbl> <dbl>    <list>
1   21.0     6 <int [5]>
2   21.0     6 <int [5]>
3   22.8     4 <int [5]>
4   21.4     6 <int [5]>
5   18.7     8 <int [5]>
6   18.1     6 <int [5]>
7   14.3     8 <int [5]>
8   24.4     4 <int [5]>
9   22.8     4 <int [5]>
10  19.2     6 <int [5]>
# ... with 22 more rows

# A tibble: 160 × 4
     mpg   cyl  .row  .out
   <dbl> <dbl> <int> <int>
1     21     6     1     1
2     21     6     1     2
3     21     6     1     3
4     21     6     1     4
5     21     6     1     5
6     21     6     2     1
7     21     6     2     2
8     21     6     2     3
9     21     6     2     4
10    21     6     2     5
# ... with 150 more rows

# A tibble: 32 × 7
     mpg   cyl .out1 .out2 .out3 .out4 .out5
   <dbl> <dbl> <int> <int> <int> <int> <int>
1   21.0     6     1     2     3     4     5
2   21.0     6     1     2     3     4     5
3   22.8     4     1     2     3     4     5
4   21.4     6     1     2     3     4     5
5   18.7     8     1     2     3     4     5
6   18.1     6     1     2     3     4     5
7   14.3     8     1     2     3     4     5
8   24.4     4     1     2     3     4     5
9   22.8     4     1     2     3     4     5
10  19.2     6     1     2     3     4     5
# ... with 22 more rows

So, bottom line. If you want the adply(.margins = 1, ...) functionality, you can use by_row.

Henceforth answered 22/5, 2017 at 21:26 Comment(2)
by_row is deprecated, calling it says to "use a combination of: tidyr::nest(); dplyr::mutate(); purrr::map()" github.com/hadley/purrrlyr/blob/…Sibelle
That's a lot of r's.Importunacy
S
14

Extending BrodieG's answer,

If the function returns more than one row, then instead of mutate(), do() must be used. Then to combine it back together, use rbind_all() from the dplyr package.

In dplyr version dplyr_0.1.2, using 1:n() in the group_by() clause doesn't work for me. Hopefully Hadley will implement rowwise() soon.

iris %>%
    group_by(1:nrow(iris)) %>%
    do(do_fn) %>%
    rbind_all()

Testing the performance,

library(plyr)    # plyr_1.8.4.9000
library(dplyr)   # dplyr_0.8.0.9000
library(purrr)   # purrr_0.2.99.9000
library(microbenchmark)

d1_count <- 1000
d2_count <- 10

d1 <- data.frame(a=runif(d1_count))

do_fn <- function(row){data.frame(a=row$a, b=runif(d2_count))}
do_fn2 <- function(a){data.frame(a=a, b=runif(d2_count))}

op <- microbenchmark(
        plyr_version = plyr::adply(d1, 1, do_fn),
        dplyr_version = d1 %>%
            dplyr::group_by(1:nrow(d1)) %>%
            dplyr::do(do_fn(.)) %>%
            dplyr::bind_rows(),
        purrr_version = d1 %>% purrr::pmap_dfr(do_fn2),
        times=50)

it has the following results:

Unit: milliseconds
          expr       min        lq      mean    median        uq       max neval
  plyr_version 1227.2589 1275.1363 1317.3431 1293.5759 1314.4266 1616.5449    50
 dplyr_version  977.3025 1012.6340 1035.9436 1025.6267 1040.5882 1449.0978    50
 purrr_version  609.5790  629.7565  643.8498  644.2505  656.1959  686.8128    50

This shows that the new purrr version is the fastest

Sibelle answered 25/2, 2014 at 14:45 Comment(0)
S
2

In addition to the great answer provided by @alexwhan, please keep in mind that you need to use ungroup() to avoid side effects. This is because rowwise() is a grouping operation.

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length))

will give you:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1
 2          4.9         3            1.4         0.2 setosa      4.9
 3          4.7         3.2          1.3         0.2 setosa      4.7
 4          4.6         3.1          1.5         0.2 setosa      4.6
 5          5           3.6          1.4         0.2 setosa      5  
 6          5.4         3.9          1.7         0.4 setosa      5.4
 7          4.6         3.4          1.4         0.3 setosa      4.6
 8          5           3.4          1.5         0.2 setosa      5  
 9          4.4         2.9          1.4         0.2 setosa      4.4
10          4.9         3.1          1.5         0.1 setosa      4.9

Now let's assume that you need to continue with the dplyr pipe to add a lead to Max.Len:

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
    mutate(Lead.Max.Len = lead(Max.Len))

This will produce:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len Lead.Max.Len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>        <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1           NA
 2          4.9         3            1.4         0.2 setosa      4.9           NA
 3          4.7         3.2          1.3         0.2 setosa      4.7           NA
 4          4.6         3.1          1.5         0.2 setosa      4.6           NA
 5          5           3.6          1.4         0.2 setosa      5             NA
 6          5.4         3.9          1.7         0.4 setosa      5.4           NA
 7          4.6         3.4          1.4         0.3 setosa      4.6           NA
 8          5           3.4          1.5         0.2 setosa      5             NA
 9          4.4         2.9          1.4         0.2 setosa      4.4           NA
10          4.9         3.1          1.5         0.1 setosa      4.9           NA

NA's are produced as a side effect. This can be corrected with ungroup():

iris %>%
    rowwise() %>%
    mutate(Max.Len = max(Sepal.Length, Petal.Length)) %>%
    ungroup() %>%
    mutate(Lead.Max.Len = lead(Max.Len))

This will produce the desired output:

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Max.Len lead.max.len
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>        <dbl>
 1          5.1         3.5          1.4         0.2 setosa      5.1          4.9
 2          4.9         3            1.4         0.2 setosa      4.9          4.7
 3          4.7         3.2          1.3         0.2 setosa      4.7          4.6
 4          4.6         3.1          1.5         0.2 setosa      4.6          5  
 5          5           3.6          1.4         0.2 setosa      5            5.4
 6          5.4         3.9          1.7         0.4 setosa      5.4          4.6
 7          4.6         3.4          1.4         0.3 setosa      4.6          5  
 8          5           3.4          1.5         0.2 setosa      5            4.4
 9          4.4         2.9          1.4         0.2 setosa      4.4          4.9
10          4.9         3.1          1.5         0.1 setosa      4.9          5.4
Sickle answered 20/9, 2020 at 11:12 Comment(0)
C
1

Something like this?

iris$Max.Len <- pmax(iris$Sepal.Length, iris$Petal.Length)
Cysticercus answered 16/2, 2014 at 23:24 Comment(6)
Yes thx, that's a very specific answer. But my example and question are trying to tease out if there is a general dplyr solution for any scalar function.Zamora
In general, functions should be vectorized -- if it is a wacky function, you might write wacky.function <- function(col.1, col.2){...}, and then iris.wacky <- wacky.function(iris$Sepal.Length, iris$Petal.Length).Cysticercus
Often they should I guess, but I think when you are using something like dplyr or plyr or say data.table you should try to use their idioms so that your code doesn't become a difficult to share mix of styles. Hence the question.Zamora
The first line of the plyr documentation is "plyr is a set of tools that solves a common set of problems: you need to break a big problem down into manageable pieces, operate on each pieces and then put all the pieces back together." This seems like a very different problem for which elementary column operations are the best tool. This also might explain why there's no "natural" plyr/dplyr command for doing this.Cysticercus
Uh OK.. but plyr does have a natural way to do this adply (see above).Zamora
To butcher a famous quote: "If all you have is a plyr you will end up using it for a hammer and a screwdriver too"Escobar
A
0

Just for completeness I am going to change the code of this user from the forgotten answer (and maybe the best answer) of the question: Sum across multiple columns. And apply it to your problem:

iris %>%
  mutate(max = select(.,c('Sepal.Length','Petal.Length')) %>% 
  apply(1, max, na.rm=TRUE))

The Result is expected. Accepted answer said that rowwise is increasingly not recommended, and apply is base R. Uou don't need to import an extra package like purrr.

You can use apply() function with max, min, sum, median, mean. So it's very handy and simple.

Alanis answered 23/1, 2022 at 15:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.