Why does wrapping this loop in a function speed it up by 8x? [duplicate]

I'm trying to gain a deeper understanding of loops vs. *apply functions in R. Here, I did an experiment where I compute the first 10,000 triangular numbers in 3 different ways.

  1. unwrapped: a simple for loop.
  2. wrapped: the exact same loop as before, but wrapped in a function.
  3. vapply: using vapply and an anonymous function.

The results surprised me in two different ways.

  1. Why is wrapped 8x faster than unwrapped?! My intuition is that, since wrapped actually does more work (defining a function and then calling it), it should have been slower.
  2. Why are they both so much faster than vapply? I would have expected vapply to perform some kind of optimization that makes it at least as fast as the loops.
microbenchmark::microbenchmark(
  unwrapped = {
    x <- numeric(10000)
    for (i in 1:10000) {
      x[i] <- i * (i + 1) / 2
    }
    x
  },
  wrapped = {
    tri_nums <- function(n) {
      x <- numeric(n)
      for (i in 1:n) {
        x[i] <- i * (i + 1) / 2
      }
      x
    }
    tri_nums(10000)
  },
  vapply = vapply(1:10000, \(i) i * (i + 1) / 2, numeric(1)),
  check = 'equal'
)
#> Unit: microseconds
#>       expr      min       lq     mean    median       uq       max neval
#>  unwrapped 2652.487 3006.888 3445.896 3150.7555 3832.094  7029.949   100
#>    wrapped  398.534  414.010  455.333  439.7445  469.307   656.074   100
#>     vapply 4942.000 5154.639 5937.333 5453.2880 5969.760 13730.718   100

Created on 2023-01-04 with reprex v2.0.2

Threnode answered 4/1, 2023 at 22:38 Comment(3)
It's actually even better if you don't redefine the function on each pass. – Disfeature
Related (for the first comparison): Speed difference of loop Inside vs Outside Function; Why is running R code inside a function faster? – Spinneret
@Henrik, good find, this felt like it had to be a dupe but I didn't see those two. Thank you. For the record, I do add a bit more context than those two answers ... ¯\_(ツ)_/¯ – Disfeature

It's byte-compiling your function.

We can confirm that just-in-time (JIT) compilation is enabled with:

compiler::enableJIT(-1)
# [1] 3                        # <--- this is the previous JIT level

where a negative argument returns the current level without changing it, and a value of 3 means the highest JIT compilation level. I'm not certain exactly what each level does, but we can run a simple test to compare them. (See ?enableJIT for more info.)

compiler::enableJIT(0)
# [1] 3
tri_nums <- function(n) {
  x <- numeric(n)
  for (i in 1:n) {
    x[i] <- i * (i + 1) / 2
  }
  x
}
bench::mark(
  unwrapped = {
    x <- numeric(10000)
    for (i in 1:10000) {
      x[i] <- i * (i + 1) / 2
    }
    x
  },
  JIT0 = tri_nums(10000),
  vapply = vapply(1:10000, \(i) i * (i + 1) / 2, numeric(1))
)
# # A tibble: 3 × 13
#   expression      min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc total…³ result memory     time       gc      
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list> <list>     <list>     <list>  
# 1 unwrapped    8.21ms    8.7ms      113.   78.2KB    7.07    48     3   424ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 2 JIT0         7.26ms   7.72ms      128.   78.2KB    9.84    52     4   407ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 3 vapply       5.97ms    6.5ms      152.   78.2KB    9.51    64     4   421ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable names ¹​mem_alloc, ²​`gc/sec`, ³​total_time

(I can't benchmark all three levels in a single bench::mark() call, since I believe the JIT check happens both when the function is defined and when it is called. I'm really not qualified to speak to this level of R internals, so ... please correct me and/or add amplifying information.)

Doing this again for levels 1-3 and copy/pasting the relevant bench::mark rows, we see:

# # A tibble: 6 × 13
#   expression      min   median `itr/sec` mem_al…¹ gc/se…² n_itr  n_gc total…³ result memory     time       gc      
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:by>   <dbl> <int> <dbl> <bch:t> <list> <list>     <list>     <list>  
# 1 unwrapped    8.21ms    8.7ms      113.   78.2KB    7.07    48     3   424ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 2 JIT0         7.26ms   7.72ms      128.   78.2KB    9.84    52     4   407ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 3 JIT1        419.6µs  502.5µs     1923.  108.7KB    0      962     0   500ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 4 JIT2        413.4µs  494.3µs     1971.  108.7KB    0      986     0   500ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 5 JIT3        426.7µs  498.3µs     1981.  108.7KB    0      991     0   500ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# 6 vapply       5.97ms    6.5ms      152.   78.2KB    9.51    64     4   421ms <dbl>  <Rprofmem> <bench_tm> <tibble>
# # … with abbreviated variable names ¹​mem_alloc, ²​`gc/sec`, ³​total_time

showing that the vast majority of the gains come from the first level of byte-compiling (not too surprising given the simplicity of this function).
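For comparison, here is a quick sketch of my own (not part of the original benchmarks): since the loop body is pure arithmetic, the whole computation can also be done in a single vectorized expression, which avoids per-iteration interpreter overhead entirely, compiled or not.

```r
# A sketch (my addition, not from the benchmarks above): fully vectorized
# arithmetic computes all 10,000 triangular numbers in one expression.
n <- 10000
i <- seq_len(n)
x_vec <- i * (i + 1) / 2   # vectorized: one multiply, one add, one divide
head(x_vec)                # 1 3 6 10 15 21
```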


Note: for anybody actually running some of this code, you might want to ensure you end up back at the default level of 3:

compiler::enableJIT(3)
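As a related sketch (my addition): compiler::cmpfun() byte-compiles a closure explicitly, so you can get the compiled version without relying on the JIT at all.

```r
# Sketch: byte-compile the closure explicitly with compiler::cmpfun(),
# so the speed-up no longer depends on the JIT level.
tri_nums <- function(n) {
  x <- numeric(n)
  for (i in 1:n) x[i] <- i * (i + 1) / 2
  x
}
tri_nums_bc <- compiler::cmpfun(tri_nums)   # explicitly byte-compiled copy
identical(tri_nums(100), tri_nums_bc(100))  # TRUE: same results either way
```

With JIT left at the default level this makes little practical difference (the closure gets compiled on first call anyway), but with enableJIT(0) it reproduces the wrapped-vs-unwrapped gap without defining the function inside the benchmark.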
Disfeature answered 4/1, 2023 at 22:56 Comment(5)
This still leaves me with a mystery about vapply, but I guess that ought to be a separate question. – Threnode
Call tri_nums on the console and note that it ends with something like <bytecode: 0x0000021e9d5ec1f8>, suggesting that it was byte-compiled. – Disfeature
vapply is about as fast as the apply-family of functions can be ... but ever since R 3, for loops can be faster given basic CS principles. It's been a while, but every now and then I still see a question here on SO along the lines of "my prof told me for loops are stupid slow, so I have to convert this simple process to a more-complicated *apply thingy". – Disfeature
It's interesting that, in my experimentation at least, vapply doesn't get the byte-compiling boost: wrapping the vapply call in a function makes it slower. – Threnode
My guess is that byte-compiling does so well here because the operations you're doing are completely mathematical and so very well optimizable. Start throwing in custom logic and/or other things, and it is likely to narrow the margin between compiled and not. – Disfeature

© 2022 - 2025 — McMap. All rights reserved.