Why use purrr::map instead of lapply?

5

244

Is there any reason why I should use

map(<list-like-object>, function(x) <do stuff>)

instead of

lapply(<list-like-object>, function(x) <do stuff>)

The output should be the same, and the benchmarks I ran suggest that lapply is slightly faster (as it should be, since map needs to evaluate all the non-standard-evaluation input).

So is there any reason why for such simple cases I should actually consider switching to purrr::map? I am not asking here about one's likes or dislikes about the syntax, other functionalities provided by purrr etc., but strictly about comparison of purrr::map with lapply assuming using the standard evaluation, i.e. map(<list-like-object>, function(x) <do stuff>). Is there any advantage that purrr::map has in terms of performance, exception handling etc.? The comments below suggest that it does not, but maybe someone could elaborate a little bit more?

Kickapoo answered 14/7, 2017 at 10:45 Comment(17)
For simple use cases, indeed, better to stick with base R and avoid dependencies. If you already load the tidyverse, though, you may benefit from the pipe %>% and the anonymous function syntax ~ .x + 1. - Anachronism
Plus, purrr provides a range of functions, such as map_int, map_dbl, map_lgl, map2 etc., that extend the functionality beyond lapply while keeping a consistent syntax. - Capstan
I agree with both of you; all the functionality you mention is great and is the reason I use purrr, but I'm interested in the simple case and wonder if there is any advantage (maybe e.g. better exception handling?). - Kickapoo
Unfortunately I can't read C code. Maybe the answer lies in the comparison between github.com/tidyverse/purrr/blob/… and github.com/wch/r-source/blob/… - Anachronism
This is pretty much a question of style. You should know what the base R functions do, though, because all this tidyverse stuff is just a shell on top of it. At some point, that shell will break. - Conductive
I only found one test in purrr/tests/ comparing map() and lapply() outputs: test_that("map forces arguments in same way as base R", { f_map <- map(1:2, function(i) function(x) x + i) ; f_base <- lapply(1:2, function(i) function(x) x + i) ; expect_equal(f_map[[1]](0), f_base[[1]](0)) ; expect_equal(f_map[[2]](0), f_base[[2]](0)) }) and interestingly, it fails when I copy-paste-and-run it. Does it have to do with evaluation rules? github.com/tidyverse/purrr/blob/master/tests/testthat/… - Anachronism
The ~{} shortcut lambda (with or without the {}) seals the deal for me for plain purrr::map(). The type enforcement of the purrr::map_…() functions is handy and less obtuse than vapply(). purrr::map_df() is a super expensive function, but it also simplifies code. There's absolutely nothing wrong with sticking with base R [lsv]apply(), though. - Gasify
@Capstan please look up vapply, mapply and friends. Just because you don't know how to do it doesn't mean it doesn't exist in base R. Nothing against purrr::map, but it's JAF: Just Another Function. - Hightest
@Aurèle You need the latest version of purrr. Seems to be a bug fix. - Achondroplasia
@F.Privé I updated to 0.2.2.9000, now the test passes. Thank you. - Anachronism
Admittedly, I wrote my answer before reading your post very carefully. My answer highlights stuff you probably already know, but in terms of pure performance, lapply is a bit faster. I think it's just about what you're more comfortable with... - Tuttifrutti
Thank you for the question; this is the kind of thing I have also looked at. I have been using R for more than 10 years and definitely don't and won't use purrr. My point is the following: the tidyverse is fabulous for analyses, interactive work and reports, not for programming. If you are at the point of having to use lapply or map, then you are programming and may one day end up creating a package. Then the fewer dependencies the better. Plus: I sometimes see people using map with quite obscure syntax. And now that I see the performance tests: if you are used to the apply family, stick to it. - Jihad
Tim you wrote: "I am not asking here about one's likes or dislikes about the syntax, other functionalities provided by purrr etc., but strictly about comparison of purrr::map with lapply assuming using the standard evaluation" and the answer you accepted is the one that goes over exactly what you said you didn't want people to go over. - Archdeacon
@CarlosCinelli true, but this answer, like the other answers, states that there is no difference and gives the most comprehensive review of the subject. - Kickapoo
To Whom It May Concern: this question has been put on hold as primarily opinion-based and re-opened four times in a row already. Not that it bothers me, but please notice that such a voting pattern does not seem to lead anywhere... - Kickapoo
@Kickapoo you could rephrase the question as "Can I safely replace any lapply call with a map call and expect my code not to break?". This removes the issue of some people interpreting your question as opinion-based and will still get you the right answers (if I got your question right)... - Upholster
@Upholster thanks, but it seems the issue has settled, and the shorter title is easier to read, so I'll keep it. - Kickapoo
K
312

If the only function you're using from purrr is map(), then no, the advantages are not substantial. As Rich Pauloo points out, the main advantage of map() is the helpers which allow you to write compact code for common special cases:

  • ~ . + 1 is equivalent to function(x) x + 1 (and \(x) x + 1 in R-4.1 and newer)

  • list("x", 1) is equivalent to function(x) x[["x"]][[1]]. These helpers are a bit more general than [[ - see ?pluck for details. For data rectangling, the .default argument is particularly helpful.
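
As a rough illustration of these two helpers (a sketch with made-up data; `recs` is a hypothetical list, not something from the original answer):

```r
library(purrr)

# Hypothetical records; the third one has no "x" element.
recs <- list(list(x = 1:3), list(x = 4:6), list(y = 0))

# Formula shortcut: same result as the explicit anonymous function.
same_as_lapply <- identical(
  map(1:3, ~ . + 1),
  lapply(1:3, function(x) x + 1)
)

# List extractor: take element "x", then its first value.
# .default supplies a value when the element does not exist.
first_x <- map_int(recs, list("x", 1), .default = NA_integer_)
first_x
```

The `.default` value is what makes this convenient for ragged data: the record without an `x` yields `NA` instead of stopping the whole call.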

But most of the time you're not using a single *apply()/map() function, you're using a bunch of them, and the advantage of purrr is much greater consistency between the functions. For example:

  • The first argument to lapply() is the data; the first argument to mapply() is the function. The first argument to all map functions is always the data.

  • With vapply(), sapply(), and mapply() you can choose to suppress names on the output with USE.NAMES = FALSE; but lapply() doesn't have that argument.

  • There's no consistent way to pass constant arguments on to the mapper function. Most functions use ... but mapply() uses MoreArgs (which you'd expect to be called MORE.ARGS), and Map(), Filter() and Reduce() expect you to create a new anonymous function. In map functions, constant arguments always come after the function name.

  • Almost every purrr function is type stable: you can predict the output type exclusively from the function name. This is not true for sapply() or mapply(). Yes, there is vapply(); but there's no equivalent for mapply().
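
A small sketch of the argument-order and type-stability points above (the helper `add3` is hypothetical and only serves the illustration):

```r
library(purrr)

# Hypothetical helper with one constant argument, z.
add3 <- function(x, y, z = 0L) x + y + z

# Base R: the function comes first for mapply(), constants go in MoreArgs.
base_res <- mapply(add3, 1:3, 4:6, MoreArgs = list(z = 10L))

# purrr: data first, constant arguments after the function.
purrr_res <- map2_int(1:3, 4:6, add3, z = 10L)

identical(base_res, purrr_res)

# Type stability: sapply() changes its return type with the input,
# whereas map_int() returns an integer vector even for empty input.
empty_base  <- sapply(list(), length)
empty_purrr <- map_int(list(), length)
class(empty_base)
class(empty_purrr)
```

The empty-input case is where the difference tends to bite in practice: code written against `sapply()` may receive a list where it expected a vector.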

You may think that all of these minor distinctions are not important (just as some people think that there's no advantage to stringr over base R regular expressions), but in my experience they cause unnecessary friction when programming (the differing argument orders always used to trip me up), and they make functional programming techniques harder to learn because as well as the big ideas, you also have to learn a bunch of incidental details.

Purrr also fills in some handy map variants that are absent from base R:

  • modify() preserves the type of the data using [[<- to modify "in place". In conjunction with the _if variant this allows for (IMO beautiful) code like modify_if(df, is.factor, as.character)

  • map2() allows you to map simultaneously over x and y. This makes it easier to express ideas like map2(models, datasets, predict)

  • imap() allows you to map simultaneously over x and its indices (either names or positions). This makes it easy to (e.g.) load all csv files in a directory, adding a filename column to each.

    # the file pattern must be passed as `pattern`; dir()'s first argument is the path
    dir(pattern = "\\.csv$") %>%
      set_names() %>%
      map(read.csv) %>%
      imap(~ transform(.x, filename = .y))
    
  • walk() returns its input invisibly, and is useful when you're calling a function for its side effects (e.g. writing files to disk).
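
For readers who want to try these variants without any files on disk, here is a minimal sketch using small made-up inputs (all object names are hypothetical):

```r
library(purrr)

# modify_if(): convert only the factor columns; the result is still a data frame.
df <- data.frame(a = factor(c("u", "v")), b = 1:2)
df2 <- modify_if(df, is.factor, as.character)
str(df2)

# map2(): iterate over two inputs in parallel.
products <- map2_dbl(c(1, 2), c(10, 20), function(x, y) x * y)

# imap(): .x is the element, .y is its name (or position if unnamed).
labelled <- imap(list(a = 1, b = 2), ~ paste(.y, .x))

# walk(): called for its side effect, returns its input invisibly.
out <- walk(1:2, function(i) message("processing ", i))
identical(out, 1:2)
```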

Not to mention the other helpers like safely() and partial().
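
Since the question mentions exception handling, a brief sketch of those two helpers may be useful (the functions `add` and `add10` are made up for the example):

```r
library(purrr)

# safely() wraps a function so that each call returns list(result, error)
# instead of aborting the whole map at the first failure.
safe_log <- safely(log)
res <- map(list(1, "a"), safe_log)

res[[1]]$result                   # the successful call keeps its result
is.null(res[[2]]$result)          # the failing call has no result ...
conditionMessage(res[[2]]$error)  # ... but carries the error object

# partial() pre-fills some arguments of a function.
add <- function(x, y) x + y
add10 <- partial(add, y = 10)
sums <- map_dbl(1:3, add10)
sums
```

The same effect can be had in base R by wrapping the function body in tryCatch(); safely() mostly saves the boilerplate.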

Personally, I find that when I use purrr, I can write functional code with less friction and greater ease; it decreases the gap between thinking up an idea and implementing it. But your mileage may vary; there's no need to use purrr unless it actually helps you.

Microbenchmarks

Yes, map() is slightly slower than lapply(). But the cost of using map() or lapply() is driven by what you're mapping, not the overhead of performing the loop. The microbenchmark below suggests that the cost of map() compared to lapply() is around 40 ns per element, which seems unlikely to materially impact most R code.

library(purrr)
n <- 1e4
x <- 1:n
f <- function(x) NULL

mb <- microbenchmark::microbenchmark(
  lapply = lapply(x, f),
  map = map(x, f)
)
summary(mb, unit = "ns")$median / n
#> [1] 490.343 546.880
Khabarovsk answered 5/11, 2017 at 15:41 Comment(5)
Did you mean to use transform() in that example? As in base R transform(), or am I missing something? transform() gives you filename as a factor, which generates warnings when you (naturally) want to bind rows together. mutate() gives me the character column of filenames I want. Is there a reason not to use it there?Levania
Yes, better to use mutate(), I just wanted a simple example with no other deps.Khabarovsk
Shouldn't type-specificity show up somewhere in this answer? map_* is what got me loading purrr in many scripts. It helped me with some 'control flow' aspects of my code (stopifnot(is.data.frame(x))).Sacring
ggplot and data.table are great, but do we really need a new package for every single function in R?Torque
fwiw you can easily return a list while suppressing names in base R: sapply(1:10, function(x){x}, simplify=FALSE, USE.NAMES=FALSE)Kitti
T
83

Comparing purrr and lapply boils down to convenience and speed.


1. purrr::map is syntactically more convenient than lapply

Extract the second element of each element of a list:

map(list, 2)  

which, as @F. Privé pointed out, is the same as:

map(list, function(x) x[[2]])

with lapply

lapply(list, 2) # doesn't work

we need to pass an anonymous function...

lapply(list, function(x) x[[2]])  # now it works

...or as @RichScriven pointed out, we pass [[ as an argument into lapply

lapply(list, `[[`, 2)  # syntactically a bit simpler

So if you find yourself applying functions to many lists using lapply, and tire of either defining a custom function or writing an anonymous function, convenience is one reason to favor purrr.
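
One behavioural difference worth knowing about, as far as I can tell: the numeric shortcut is not only shorter, it also treats missing elements differently. A sketch with a hypothetical list `lst`:

```r
library(purrr)

# Hypothetical input; the last element has only one entry.
lst <- list(1:3, 4:6, 7L)

# Where every element has a second entry, the spellings agree.
ok <- lst[1:2]
agree_1 <- identical(map(ok, 2), lapply(ok, `[[`, 2))
agree_2 <- identical(map(ok, 2), lapply(ok, function(x) x[[2]]))

# Where an entry is missing, the map() shortcut yields NULL,
# while `[[` raises a "subscript out of bounds" error.
third <- map(lst, 2)[[3]]
base_try <- try(lapply(lst, `[[`, 2), silent = TRUE)

is.null(third)
inherits(base_try, "try-error")
```

So the shortcut and the explicit `[[` are interchangeable only when every element is known to have the requested entry.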

2. Type-specific map functions simplify many lines of code

  • map_chr()
  • map_lgl()
  • map_int()
  • map_dbl()
  • map_df()

Each of these type-specific map functions returns a vector, rather than the lists returned by map() and lapply(). If you're dealing with nested lists of vectors, you can use these type-specific map functions to pull out the vectors directly, and coerce vectors directly into int, dbl, chr vectors. The base R version would look something like as.numeric(sapply(...)), as.character(sapply(...)), etc.

The map_<type> functions also have the useful quality that if they cannot return an atomic vector of the indicated type, they fail. This is useful when defining strict control flow, where you want a function to fail if it [somehow] generates the wrong object type.
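
A minimal sketch of that failure behaviour, using a made-up list `mixed` with one bad element, and showing the base R vapply() equivalent alongside:

```r
library(purrr)

# Hypothetical input: two numbers and one stray string.
mixed <- list(1, 2, "three")

# sapply() silently falls back to a character vector;
# as.numeric() then turns the bad value into NA (with a warning).
loose <- suppressWarnings(as.numeric(sapply(mixed, identity)))
loose

# map_dbl() and vapply() both refuse to return the wrong type.
strict_purrr <- try(map_dbl(mixed, identity), silent = TRUE)
strict_base  <- try(vapply(mixed, identity, numeric(1)), silent = TRUE)

inherits(strict_purrr, "try-error")
inherits(strict_base, "try-error")
```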

3. Convenience aside, lapply is [slightly] faster than map

Using purrr's convenience functions, as @F. Privé pointed out, slows down processing a bit. Let's race each of the 4 cases I presented above.

# devtools::install_github("jennybc/repurrrsive")
library(repurrrsive)
library(purrr)
library(microbenchmark)
library(ggplot2)

mbm <- microbenchmark(
  lapply       = lapply(got_chars[1:4], function(x) x[[2]]),
  lapply_2     = lapply(got_chars[1:4], `[[`, 2),
  map_shortcut = map(got_chars[1:4], 2),
  map          = map(got_chars[1:4], function(x) x[[2]]),
  times        = 100
)
autoplot(mbm)

[Figure: autoplot(mbm) output comparing the four benchmarked expressions]

And the winner is....

lapply(list, `[[`, 2)

In sum, if raw speed is what you're after: base::lapply (although it's not that much faster)

For simple syntax and expressibility: purrr::map


This excellent purrr tutorial highlights the convenience of not having to explicitly write out anonymous functions when using purrr, and the benefits of type-specific map functions.

Tuttifrutti answered 1/9, 2017 at 6:31 Comment(4)
Note that if you use function(x) x[[2]] instead of just 2, it would be less slow. All this extra time is due to checks that lapply doesn't do. - Achondroplasia
You don't "need" anonymous functions. [[ is a function. You can do lapply(list, "[[", 3). - Antichlor
@RichScriven that makes sense. That does simplify the syntax for using lapply over purrr. - Tuttifrutti
as.numeric(sapply(...)) is a weird thing to do. Use vapply: vapply(..., FUN.VALUE = numeric(1)). That's the base R way to return a vector from an apply function, and it also enforces the type (it throws an error if your function doesn't return the correct one). This also results in better performance than sapply/lapply as the entire vector can be allocated at once. The only additional advantage of map_type is that it's a little bit more beginner-friendly. - Mauser
A
55

If we do not consider aspects of taste (otherwise this question should be closed) or syntax consistency, style etc, the answer is no, there’s no special reason to use map instead of lapply or other variants of the apply family, such as the stricter vapply.

PS: To those people gratuitously downvoting, just remember the OP wrote:

I am not asking here about one's likes or dislikes about the syntax, other functionalities provided by purrr etc., but strictly about comparison of purrr::map with lapply assuming using the standard evaluation

If you do not consider syntax nor other functionalities of purrr, there's no special reason to use map. I use purrr myself and I'm fine with Hadley's answer, but it ironically goes over the very things the OP stated upfront he was not asking.
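
For the plain `function(x)` case the question asks about, a quick sanity check along these lines can confirm that the two are interchangeable for your own inputs. This is only a sketch on a few small examples, not a proof:

```r
library(purrr)

x <- list(a = 1:3, b = letters)

# Named list: names are kept by both.
same_list <- identical(map(x, length), lapply(x, length))

# Data frame: both iterate over columns.
same_df <- identical(map(mtcars, class), lapply(mtcars, class))

# Unnamed character vector: neither adds names.
same_chr <- identical(
  map(c("u", "v"), function(s) toupper(s)),
  lapply(c("u", "v"), function(s) toupper(s))
)

c(same_list, same_df, same_chr)
```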

Archdeacon answered 31/7, 2017 at 22:47 Comment(1)
I got here asking if there are some differences I should worry about between lapply and map or I can use them more or less interchangeably. You are the one who answered my question, thanks :) My use case is related to scripts from a colleague full of map where I want to use either future.apply::future_lapply (which I know) or furrr::future_map (which I don't know). Now I know I can safely replace one with the other, that's it. Thanks again!Upholster
L
6

tl;dr

I am not asking about one's likes or dislikes about syntax or other functionalities provided by purrr.

Choose the tool that matches your use case, and maximizes your productivity. For production code that prioritizes speed use *apply, for code that requires small memory footprint use map. Based on ergonomics, map is likely preferable for most users and most one-off tasks.

Convenience

Update, October 2021: since both the accepted answer and the 2nd most voted post mention syntax convenience:

R versions 4.1.0 and higher support the shorthand anonymous function syntax \(x) and the native pipe |>. To check your R version, use version[['version.string']].

library(purrr)
library(repurrrsive)
lapply(got_chars[1:2], `[[`, 2) |>
  lapply(\(.) . + 1)
#> [[1]]
#> [1] 1023
#> 
#> [[2]]
#> [1] 1053
map(got_chars[1:2], 2) %>%
  map(~ . + 1)
#> [[1]]
#> [1] 1023
#> 
#> [[2]]
#> [1] 1053

Syntax for the purrr approach generally is shorter to type if your task involves more than 2 manipulations of list-like objects.

nchar(
"lapply(x, fun, y) |>
      lapply(\\(.) . + 1)")
#> [1] 45
nchar(
"library(purrr)
map(x, fun) %>%
  map(~ . + 1)")
#> [1] 45

Considering that a person might write tens or hundreds of thousands of these calls in their career, this difference in syntax length can add up to typing one or two novels (an average novel being about 80,000 letters), assuming the code is typed by hand. Also consider your typing speed (~65 words per minute?), your typing accuracy (do you often mistype certain characters, such as \, " or <?) and your recall of function arguments; then you can make a fair comparison of your productivity using one style, or a combination of the two.

Another consideration might be your target audience. Personally I found explaining how purrr::map works harder than lapply precisely because of its concise syntax.

1 |>
  lapply(\(.z) .z + 1)
#> [[1]]
#> [1] 2

1 %>%
  map(~ .z+ 1)
#> Error in .f(.x[[i]], ...) : object '.z' not found

but,
1 %>%
  map(~ .+ 1)
#> [[1]]
#> [1] 2
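
The error above arises because a purrr formula lambda only exposes fixed argument names (`.` or `.x` for the first argument, `.y` for the second), whereas `\(.z)` or `function(.z)` lets you pick any name. A short sketch:

```r
library(purrr)

# In a formula lambda, `.` and `.x` both refer to the current element.
dot   <- map(1, ~ . + 1)
dot_x <- map(1, ~ .x + 1)

# With two inputs, `.x` and `.y` are the paired elements.
pair_sums <- map2_dbl(c(1, 2), c(3, 4), ~ .x + .y)

# An ordinary function can name its argument freely, so `.z` works here.
named <- map(1, function(.z) .z + 1)

identical(dot, dot_x)
identical(dot, named)
```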

Speed

Often when dealing with list-like objects, multiple operations are performed. This adds a nuance to the claim that the overhead of purrr is insignificant in most code: it can become noticeable when dealing with large lists and chained operations.

got_large <- rep(got_chars, 1e4) # 300 000 elements, 1.3 GB in memory
bench::mark(
  base = {
    lapply(got_large, `[[`, 2) |>
      lapply(\(.) . * 1e5) |>
      lapply(\(.) . / 1e5) |>
      lapply(\(.) as.character(.))
  },
  purrr = {
    map(got_large, 2) %>%
      map(~ . * 1e5) %>%
      map(~ . / 1e5) %>%
      map(~ as.character(.))
  }, iterations = 100
)[c(1, 3, 4, 5, 7, 8, 9)]

# A tibble: 2 x 7
  expression   median `itr/sec` mem_alloc n_itr  n_gc total_time
  <bch:expr> <bch:tm>     <dbl> <bch:byt> <int> <dbl>   <bch:tm>
1 base          1.19s     0.807    9.17MB   100   301      2.06m
2 purrr         2.67s     0.363    9.15MB   100   919      4.59m

The gap widens the more operations are performed. If you are writing code that is used routinely by other users, or that packages depend on, speed might be a significant factor in your choice between base R and purrr. Notice that purrr has a slightly lower memory footprint.

There is, however, a counterargument: if you want speed, go to a lower-level language.

Lyophilic answered 25/10, 2021 at 9:40 Comment(0)
T
2

I think people have hit most of the points here, but I want to mention that the speedup from the user's perspective in using lapply() becomes much more significant when you upgrade to mclapply() from the parallel package, particularly if you're not using R on Windows (mclapply() relies on forking, which to my knowledge doesn't work on Windows and likely never will). The mclapply() syntax is identical to lapply(), so if you write your code using lapply() from the beginning, you won't need to change anything aside from typing "mc" at the beginning of the function call and providing the number of cores to use (mc.cores). This may be important if you're using lapply() to break a job up into parallelizable chunks; the speedup factor compared with lapply() will be, at best, approximately the number of processor cores being used. On the right server or cluster, that can easily turn hours into minutes.
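
A minimal sketch of that swap, assuming a made-up worker function `slow_square` and two cores; on Windows it falls back to a single core, since forking is unavailable there:

```r
library(parallel)

# Hypothetical slow task standing in for real work.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# mclapply() only accepts mc.cores > 1 on platforms that support forking.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial <- lapply(1:8, slow_square)
forked <- mclapply(1:8, slow_square, mc.cores = n_cores)

# Same results, in the same order.
identical(serial, forked)
```

The call signature is otherwise unchanged, which is the point being made above; how much faster it runs depends on the task and on the overhead of forking.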

Timekeeper answered 16/12, 2022 at 1:14 Comment(1)
JFYI: just as lapply() has a counterpart in mclapply(), purrr::map() and other purrr functions have furrr counterparts for processing data in parallel, e.g. furrr::future_map() (just add future_ to the name of the purrr function). - Jollanta
