R: apply vs do.call

Asked 6/6, 2018 at 9:37 Answered 6/6, 2018 at 11:51

I just read the profile of @David Arenburg and found a bunch of useful tips on developing good R programming skills and habits, and one especially struck me. I have always thought that the apply functions in R were the cornerstone of working with data frames, but he writes:

If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).

Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more

Could anybody explain this to me? Why are apply functions frowned upon?

Shannon answered 6/6, 2018 at 9:37 Comment(4)

I'm actually talking specifically about apply, not the whole *apply family. The main issue with apply is that it converts the whole data set to a matrix, which messes up the data (a matrix, unlike a data frame, can't store columns of different classes) and hence yields unexpected results. So when operating over columns, it is better to use the rest of the *apply family, such as lapply or sapply. On the other hand, because R is a vectorized language, apply with a margin of 1 will be very slow (regardless of the matrix issue), so I suggest using vectorized alternatives instead. – Edmead 6/6, 2018 at 11:39

Aha, I see, thank you very much for clearing that up! – Shannon 6/6, 2018 at 11:50

Also, this is a useful read about the *apply family. – Edmead 6/6, 2018 at 12:39

Great! Thanks again :) – Shannon 6/6, 2018 at 12:49

apply(DF, 1, f) converts each row of DF to a vector and then passes that vector to f. If DF is a mix of strings and numbers, each row is converted to a character vector before being passed to f, so that, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains only numeric elements. The row has been converted to a character vector, and you can't sum character strings. On the other hand, apply(iris[-5], 1, sum) will work the same as rowSums(iris[-5]).
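The coercion described above can be checked directly. This is a minimal sketch on the built-in iris data; the names m, res, num_only, via_apply and via_rowsums are only illustrative.

```r
# apply() first calls as.matrix() on the data frame; with a factor column
# present, the whole matrix becomes character.
m <- as.matrix(iris)
typeof(m)                      # "character"

# Summing a character row fails, so this call raises an error.
res <- try(apply(iris, 1, function(x) sum(x[-5])), silent = TRUE)
inherits(res, "try-error")     # TRUE

# Dropping the factor column first keeps the matrix numeric.
num_only    <- iris[-5]
via_apply   <- apply(num_only, 1, sum)
via_rowsums <- rowSums(num_only)
all.equal(unname(via_apply), unname(via_rowsums))   # TRUE
```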
If f produces a vector, the result is a matrix rather than another data frame; also, the result is the transpose of what you might expect. This
```
apply(BOD, 1, identity)
```
gives the following rather than giving BOD back:
```
       [,1] [,2] [,3] [,4] [,5] [,6]
Time    1.0  2.0    3    4  5.0  7.0
demand  8.3 10.3   19   16 15.6 19.8
```
Many years ago Hadley Wickham posted iapply, which is idempotent in the sense that iapply(mat, 1, identity) returns mat, rather than t(mat), where mat is a matrix. More recently, with his plyr package, one can write:
```
library(plyr)
ddply(BOD, 1, identity)
```
and get BOD back as a data frame.

On the other hand, apply(BOD, 1, sum) gives the same result as rowSums(BOD), and apply(BOD, 1, f) can be useful when f produces a scalar and has no vectorized counterpart of the kind that rowSums provides for sum. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; it is ugly, but it works.
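A minimal sketch of that manual transpose, using the built-in BOD data frame. The doubling function and the names raw and out are hypothetical, chosen only to have a function that returns a vector per row.

```r
# Hypothetical row function that returns a vector: double every value.
raw <- apply(BOD, 1, function(x) x * 2)  # one column per row of BOD
out <- t(raw)                            # transpose back: one row per row of BOD

dim(raw)   # 2 6
dim(out)   # 6 2

# Still a matrix; convert explicitly if a data frame is needed.
out_df <- as.data.frame(out)
```

For a simple operation like this one, BOD * 2 does the same job without apply; the pattern only earns its keep when no vectorized form exists.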

Wrestle answered 6/6, 2018 at 11:29 Comment(0)

I think what the author means is that you should use pre-built, vectorized functions where you can (because it is easier) and avoid apply (because it is in principle a for loop and takes longer):

```
library(microbenchmark)

d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))

# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!

# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))

# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))

# --------------
# Unit: microseconds
#                                  expr     min    lq     mean  median      uq     max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
#
#                                  expr    min      lq    mean median       uq     max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
```
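One caveat worth checking: in this benchmark, do.call("round", list(d, digits = 1)) is just another way of writing round(d, digits = 1), so the timing gap probably has little to do with do.call itself. A small sketch, with a data frame d built as in the answer (the seed and the result names are assumptions added for illustration):

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

via_docall <- do.call("round", list(d, digits = 1))  # builds and runs round(d, digits = 1)
via_direct <- round(d, digits = 1)                   # the same call written directly
via_lapply <- data.frame(lapply(d, round, 1))        # column-by-column version

identical(via_docall, via_direct)   # TRUE
all.equal(via_docall, via_lapply)   # TRUE
```

All three produce the same rounded data frame, so any comparison between them measures overhead, not a different algorithm.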

Keeping answered 6/6, 2018 at 9:58 Comment(5)

so all apply functions are essentially loops? lapply, sapply etc? – Shannon 6/6, 2018 at 10:1

How does this answer do.call part? – Otic 6/6, 2018 at 10:3

Edited. (.. and regarding for-loop; according to burns-stat.com/pages/Tutor/R_inferno.pdf using apply's is loop hiding) – Keeping 6/6, 2018 at 10:18

Can you add the outputs of microbenchmark to your answer? – Pilothouse 6/6, 2018 at 10:34

@Erosennin - yes apply family are loops. Consider reading this question by @DavidArunberg. – Aureole 6/6, 2018 at 11:45

It is related to how R stores matrices and data frames*. As you may know, a data.frame is a list of vectors; that is, each column in the data.frame is a vector. Because R is a vectorized language, it is preferable to operate on whole vectors, and that is the reason apply with a margin of 1 is frowned upon: by doing so you will not be working on vectors; rather, you will be spanning across different vectors on each iteration.

As far as I know, using apply with a margin of 2 is not much different from using do.call, although the latter might allow somewhat more flexible usage.

*This information should be somewhere in the manuals.
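To illustrate the list-of-vectors point and where do.call fits in, here is a minimal sketch (the data frame d and the result names are illustrative). do.call passes each column as a separate argument to a vectorized function, whereas apply with a margin of 1 first builds a matrix and then loops over its rows.

```r
d <- data.frame(a = c(3, 1, 5), b = c(2, 4, 0))

is.list(d)   # TRUE: a data frame is a list of column vectors

# do.call(pmin, d) is equivalent to pmin(a = d$a, b = d$b):
# one vectorized call over whole columns.
row_min_vec <- do.call(pmin, d)

# apply() coerces d to a matrix, then calls min() once per row.
row_min_loop <- apply(d, 1, min)

row_min_vec                                    # 2 1 0
all.equal(row_min_vec, unname(row_min_loop))   # TRUE
```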

Bibbye answered 6/6, 2018 at 11:51 Comment(0)
