R: apply vs do.call
S

3

7

I just read the profile of @David Arenburg and found a bunch of useful tips on developing good R programming skills and habits. One in particular struck me. I have always thought that the apply functions in R were the cornerstone of working with data frames, but he writes:

If you are working with data.frames, forget there is a function called apply- whatever you do - don't use it. Especially with a margin of 1 (the only good usecase for this function is to operate over matrix columns- margin of 2).

Some good alternatives: ?do.call, ?pmax/pmin, ?max.col, ?rowSums/rowMeans/etc, the awesome matrixStats packages (for matrices), ?rowsum and many more

Could anybody explain this to me? Why are apply functions frowned upon?

Shannon answered 6/6, 2018 at 9:37 Comment(4)
I'm actually talking specifically about apply, not the whole *apply family. The main issue with apply is that it converts the whole data frame to a matrix. A matrix, unlike a data frame, can't store columns of different classes, so mixed data gets coerced to a single type and the results can be unexpected. Hence, when operating over columns, it is better to use the rest of the *apply family, such as lapply or sapply. On the other hand, because R is a vectorized language, apply with a margin of 1 will be very slow (regardless of the matrix issue), so I suggest using vectorized alternatives instead. - Edmead
Aha, I see, thank you very much for clearing that up! - Shannon
Also, this is a useful read about the *apply family. - Edmead
Great! Thanks again :) - Shannon
W
5
  • apply(DF, 1, f) first converts DF to a matrix and then passes each row of that matrix to f as a vector. If DF contains a mix of strings (or factors) and numbers, the matrix, and therefore every row, is character. So, for example, apply(iris, 1, function(x) sum(x[-5])) will not work even though the row iris[i, -5] contains only numeric elements: each row has already been converted to a character vector, and you can't sum character strings. On the other hand, apply(iris[-5], 1, sum) will work and gives the same result as rowSums(iris[-5]).

  • If f produces a vector, the result is a matrix rather than another data frame; also, the result is the transpose of what you might expect. This

    apply(BOD, 1, identity)
    

    gives the following rather than giving BOD back:

           [,1] [,2] [,3] [,4] [,5] [,6]
    Time    1.0  2.0    3    4  5.0  7.0
    demand  8.3 10.3   19   16 15.6 19.8
    

    Many years ago Hadley Wickham posted iapply, which is idempotent in the sense that iapply(mat, 1, identity) returns mat rather than t(mat), where mat is a matrix. More recently, with his plyr package, one can write:

    library(plyr)
    ddply(BOD, 1, identity)
    

    and get BOD back as a data frame.

On the other hand, apply(BOD, 1, sum) will give the same result as rowSums(BOD), and apply(BOD, 1, f) might be useful when f produces a scalar and has no vectorized counterpart of the kind that rowSums provides for sum. Also, if f produces a vector and you don't mind a matrix result, you can transpose the output of apply yourself; although ugly, it would work.
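
The points above can be checked with a minimal sketch. It assumes only base R and the built-in iris and BOD datasets; the variable names (m, res, row_totals, out, back) are made up for illustration.

```r
# apply() calls as.matrix() on the data frame first; because iris has a
# factor column, the whole matrix becomes character.
m <- as.matrix(iris)
is.character(m)

# Summing a row therefore fails, even after dropping the Species element.
res <- tryCatch(apply(iris, 1, function(x) sum(x[-5])),
                error = function(e) "error")

# Dropping the non-numeric column before calling apply() avoids the coercion.
row_totals <- apply(iris[-5], 1, sum)
all.equal(unname(row_totals), unname(rowSums(iris[-5])))

# A vector-valued f yields one column per input row, i.e. the transpose.
out <- apply(BOD, 1, identity)
dim(out)                       # dimensions are swapped relative to dim(BOD)

# Transposing manually restores the original layout (as a plain data frame).
back <- as.data.frame(t(out))
```

Note that back is rebuilt from a matrix, so any extra attributes carried by the original data frame are not restored.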

Wrestle answered 6/6, 2018 at 11:29 Comment(0)
K
2

I think what the author means is that you should use pre-built, vectorized functions where you can (because it is easier) and avoid apply (because it is in principle a for loop and takes longer):

library(microbenchmark)

d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# bad - loop
microbenchmark(apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2]))

# good - vectorized but same result
microbenchmark(pmin(d[[1]], d[[2]])) # use double brackets!

# edited:
# -------
# bad: lapply
microbenchmark(data.frame(lapply(d, round, 1)))

# good: do.call faster than lapply
microbenchmark(do.call("round", list(d, digits = 1)))

# --------------
# Unit: microseconds
#                                  expr     min    lq     mean  median      uq     max neval
# do.call("round", list(d, digits = 1)) 104.422 107.1 148.3419 134.767 184.524 332.009   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
#
#                                  expr    min      lq    mean median       uq     max neval
# do.call("round", list(d, digits = 1)) 96.389 97.5055 113.075 98.175 105.5375 730.954   100
#                            expr     min       lq     mean  median      uq      max neval
# data.frame(lapply(d, round, 1)) 235.619 243.2055 298.5042 252.353 276.004 1550.265   100
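
As a sanity check on the comparisons above, here is a small sketch (not part of the benchmark) suggesting that each "good" variant returns the same values as its "bad" counterpart. The seed and the variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
d <- data.frame(a = rnorm(10, 10, 1),
                b = rnorm(10, 200, 1))

# Row-wise minimum: the apply() loop and the vectorized pmin() agree.
loop_min <- apply(d, 1, function(x) if (x[1] < x[2]) x[1] else x[2])
vec_min  <- pmin(d[[1]], d[[2]])
all.equal(unname(loop_min), vec_min)

# do.call() only constructs the call round(d, digits = 1); the rounding is
# done by the data frame method that round() dispatches to.
via_do_call <- do.call("round", list(d, digits = 1))
direct      <- round(d, 1)
identical(via_do_call, direct)

# The lapply() version rounds the same values column by column.
via_lapply <- data.frame(lapply(d, round, 1))
all.equal(via_lapply, direct)
```

So in this example do.call is not doing anything special; it is simply another way of writing round(d, 1), which already operates on the whole data frame at once.
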
Keeping answered 6/6, 2018 at 9:58 Comment(5)
So all apply functions are essentially loops? lapply, sapply, etc.? - Shannon
How does this answer the do.call part? - Otic
Edited. (And regarding the for loop: according to burns-stat.com/pages/Tutor/R_inferno.pdf, using the apply functions is loop hiding.) - Keeping
Can you add the outputs of microbenchmark to your answer? - Pilothouse
@Erosennin - yes, the apply family functions are loops. Consider reading this question by @DavidArenburg. - Aureole
B
1

It is related to how R stores matrices and data frames*. As you may know, a data.frame is a list of vectors; that is, each column in the data.frame is a vector. Because R is a vectorized language, it is preferable to operate on whole vectors, and that is the reason apply with a margin of 1 is frowned upon: by doing so you will not be working on vectors; rather, you will be spanning across different vectors on each iteration.

As far as I know, using apply with margin 1 is not much different than using do.call. Although the latter might allow some more usage flexibility.

*This information should be somewhere in the manuals.
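
A rough sketch of the storage argument, using a hypothetical toy data frame (df and the result names are made up for illustration):

```r
df <- data.frame(x = c(1, 5, 3), y = c(4, 2, 6))  # hypothetical toy data

is.list(df)    # a data frame is a list ...
length(df)     # ... with one element per column

# Margin 2 hands the function one whole column vector at a time.
col_max <- apply(df, 2, max)

# Margin 1 hands the function one row at a time, assembled from one value
# of each column.
row_max <- apply(df, 1, max)

# do.call() passes the columns as separate arguments, so this is equivalent
# to pmax(df$x, df$y) and works on whole vectors.
row_max_vec <- do.call(pmax, df)
```

Here row_max and row_max_vec hold the same values, but only the second is computed by a single vectorized call.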

Bibbye answered 6/6, 2018 at 11:51 Comment(0)
