Scheme: why is Internal Definition faster than External Definition?

I tried running the program below:

(define (odd-internal x)
  (define (even x)
    (if (zero? x)
        #t
        (odd-internal (sub1 x))))
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (odd-external x)
  (if (zero? x)
      #f
      (even (sub1 x))))

(define (even x)
  (if (zero? x)
      #t
      (odd-external (sub1 x))))

(begin (display "Using internal definition\n")
       (time (odd-internal 40000000)))

(begin (display "Using external definition\n")
       (time (odd-external 40000000)))

This is the result in Racket:

Using internal definition
cpu time: 166 real time: 165 gc time: 0
#f
Using external definition
cpu time: 196 real time: 196 gc time: 0
#f

There you can see that using the internal definition is quite a bit faster. I've tried running it on Chez Scheme and the result is similar. Why is that?

Gyrose asked 7/10, 2018 at 13:26 Comment(0)

I was amazed that there was a difference at all, so following the comments on Lexi's answer I split the two versions into their own files, internal.rkt and external.rkt, then compiled and decompiled them like this:

raco make external.rkt
raco decompile compiled/external_rkt.zo
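# and the same for the internal version (assuming the analogous file layout):
raco make internal.rkt
raco decompile compiled/internal_rkt.zo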

This goes one step further than looking at the fully expanded program in the macro stepper. The output is not very human readable, so I have prettified it while keeping the most important parts intact:

(define (odd-external x1)
  (if (zero? x1)
      '#f
      (let ((x2 (sub1 x1)))
        (if (zero? x2)
            '#t
            (let ((x3 (sub1 x2)))
              (if (zero? x3)
                  '#f
                  (let ((x4 (sub1 x3)))
                    (if (zero? x4)
                        '#t
                        (let ((x5 (sub1 x4)))
                          (if (zero? x5) '#f (even (sub1 x5))))))))))))

(define (even x1)
  (if (zero? x1)
       '#t
       (let ((x2 (sub1 x1)))
         (if (zero? x2)
           '#f
           (let ((x3 (sub1 x2)))
             (if (zero? x3)
               '#t
               (let ((x4 (sub1 x3)))
                 (if (zero? x4)
                   '#f
                   (let ((x5 (sub1 x4)))
                     (if (zero? x5)
                       '#t
                       (let ((x6 (sub1 x5)))
                         (if (zero? x6)
                           '#f
                           (let ((x7 (sub1 x6)))
                             (if (zero? x7)
                               '#t
                               (odd-external (sub1 x7))))))))))))))))

Nothing special here. The compiler unrolls the loop a number of times and constant folds. Notice that we still have mutual recursion and that the unrolling is 5 deep in odd-external and 7 deep in even. The constant argument was folded too: my call had been replaced with (even 399999995), so the compiler had run the code for 5 steps and then given up. The interesting thing is the internal version:

(define (odd-internal x1)
  (if (zero? x1)
      '#f
      (let ((x2 (sub1 x1)))
        (if (zero? x2)
            '#t
            (let ((x3 (sub1 x2)))
              (if (zero? x3)
                  '#f
                  (let ((x4 (sub1 x3)))
                    (if (zero? x4)
                        '#t
                        (let ((x5 (sub1 x4)))
                          (if (zero? x5)
                              '#f
                              (let ((x6 (sub1 x5)))
                                (if (zero? x6)
                                    '#t
                                    (let ((x7 (sub1 x6)))
                                      (if (zero? x7)
                                          '#f
                                          (let ((x8 (sub1 x7)))
                                            (if (zero? x8)
                                                '#t
                                                (odd-internal
                                                 (sub1 x8))))))))))))))))))

It is no longer mutual recursion, since the function calls itself after 8 steps. Each round does 8 decrements, while the external version alternates between 7 and 5. After two rounds the internal version has done 16 decrements while the external one has done 12. The initial call in the internal version had been folded to (odd-internal '399999992), so the compiler did 8 steps before giving up.

I guess the code inside the functions at the decompiler level is open-coded, and the work done at each step is very cheap, which makes the number of function calls the main cost and the reason for the roughly 25% speed difference. After all, the external version needs two calls per 12 decrements while the internal one needs one call per 8, so the internal version makes about 25% fewer calls, which coincides with the difference in computing time. This is speculation based on observation, so it would be interesting to have a comment from Lexi on this.
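As a back-of-the-envelope check of that guess (my own arithmetic, not part of the decompiled output), the unrolled loops above make one self-call per 8 decrements in the internal version versus two mutual calls per 12 decrements in the external one:

(define n 400000000)
(define internal-calls (quotient n 8))             ; one self-call per 8 decrements
(define external-calls (* 2 (quotient n 12)))      ; two mutual calls per 12 decrements
(exact->inexact (/ internal-calls external-calls)) ; => 0.75, i.e. 25% fewer calls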

Antisyphilitic answered 7/10, 2018 at 22:5 Comment(2)
Saying that you reduce the number of function applications by using let is incorrect. let is a function application itself. – Blau
@Blau let is by the Scheme definition a function application, but an implementation is free to make it a primitive as long as the functionality stays the same. Since this is beyond the fully expanded program, in some core Racket language, it is safe to say that this let is not a function application. Every function has lost its sugar and looks like (define-values (binding) (lambda args source-loc flags free-variables body ...)) unless it is inlined. – Antisyphilitic

Your numbers are too small to be meaningful. The difference between 166 ms and 196 ms is, in absolute terms, tiny. Who knows what other factors could be influencing that? VM warmup time, differences in memory allocation, or any number of other things could easily cause a discrepancy of that size. To be sure, you should make the numbers much bigger.

On my machine, running Racket v7.0, I increased the arguments from 40000000 to 1000000000 and ran the program. The results were 2.361 s for the internal definition case and 2.212 s for the external definition case. Given the sorts of factors listed above, that difference is too small to be meaningful.
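If you do want to go a step beyond a single (time ...) call, one simple option is to repeat each measurement several times in the same process and look at the spread. Below is a minimal sketch using Racket's time-apply (the measure helper, the iteration count, and the argument are arbitrary choices of mine, and this still does not address the deeper problems described next):

(define (measure name thunk iterations)
  (for ([i (in-range iterations)])
    ;; time-apply returns the results plus cpu, real, and gc times in ms
    (let-values ([(results cpu real gc) (time-apply thunk '())])
      (printf "~a run ~a: cpu ~a ms, real ~a ms, gc ~a ms\n" name i cpu real gc))))

(measure "internal" (lambda () (odd-internal 1000000000)) 10)
(measure "external" (lambda () (odd-external 1000000000)) 10)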

Benchmarking is hard, and benchmarking languages that run on VMs and are JIT compiled is harder. Even if you account for warmup and GC, run lots of iterations and take the averages, and generally try to do things right, the results you get could still be nearly meaningless, as the 2017 OOPSLA paper Virtual Machine Warmup Blows Hot and Cold explains:

Virtual Machines (VMs) with Just-In-Time (JIT) compilers are traditionally thought to execute programs in two phases: the initial warmup phase determines which parts of a program would most benefit from dynamic compilation, before JIT compiling those parts into machine code; subsequently the program is said to be at a steady state of peak performance. Measurement methodologies almost always discard data collected during the warmup phase such that reported measurements focus entirely on peak performance. We introduce a fully automated statistical approach, based on changepoint analysis, which allows us to determine if a program has reached a steady state and, if so, whether that represents peak performance or not. Using this, we show that even when run in the most controlled of circumstances, small, deterministic, widely studied microbenchmarks often fail to reach a steady state of peak performance on a variety of common VMs. Repeating our experiment on 3 different machines, we found that at most 43.5% of ⟨VM, benchmark⟩ pairs consistently reach a steady state of peak performance.

Emphasis mine. Make sure you’re measuring what you think you’re measuring.

Cagliostro answered 7/10, 2018 at 16:57 Comment(5)
I did something similar: I added a zero to get the numbers into the seconds range and used a random choice of which version runs first. Running the program a lot of times gave a pretty consistent 25% speedup for the internal version. I remember reading that the JIT has a mutual-recursion optimization, and I think perhaps that makes the internal version faster, since it knows the binding will never change and perhaps constant folds it? – Antisyphilitic
@Antisyphilitic Do you get the same results if you break the code into two separate programs and run them many times independently? – Cagliostro
Yes. And I get the same time results, ±25 ms: internal 1532 ms and external 2030 ms. Is there a way to see the b-tree/bytecode representation? – Antisyphilitic
This answer is incorrect. It's true that benchmarking can be affected by many things, but that doesn't mean there is nothing here. In fact, the two versions compile to rather different code. – Blau
I found out how to look at the compiler output and posted my own answer. Somehow the compiler does a different number of unrolling/constant-folding rounds for the two versions. – Antisyphilitic

First off, this will depend on your implementation, since nested defines can be implemented in more than one way. On my Chez Scheme 9.5 setup I get a rather consistent 25% faster run time when I use odd-internal.

Now, for the why: this happens because nested defines (i.e. internal defines) are wildly different from top-level defines.

When you use define at the top level, you are adding a new record to the free-variables table. Whenever you evaluate a variable that is not bound by any enclosing lambda, it is looked up in that free-variables (hash) table. This lookup is very efficient, but it's still slower than fetching a bound variable. So when you calculate (odd-external 40000000), you fetch even and odd-external from that table about 40 million times; even with caching and other cool stuff, this is still work to be done.

In contrast, nested defines create bound variables. One way they are implemented is as nested lambda/let/letrec expressions. That way, odd-internal would be transformed into[1]:

(define (odd-internal x)
  (let ((even (lambda (x)
                (if (zero? x)
                    #t
                    (odd-internal (sub1 x))))))
    (if (zero? x)
        #f
        (even (sub1 x)))))

(Which is a simplification of what Chez Scheme actually does.)
Now, every time you apply odd-internal it is still a free variable, so it is hashed and looked up in the free-variables table. However, when you apply even, you just grab it from the environment (which can cost as little as a single memory dereference, even without cool tricks).
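One way to see why the implementation has to keep that free-variable indirection at all (my own illustration; run it at a top-level REPL, since a module would not let you redefine even): a later top-level redefinition of even must be seen by odd-external, while the even inside odd-internal can never change:

(odd-external 5)   ; => #t
(odd-internal 5)   ; => #t

(define (even x) (error 'even "replaced!"))

(odd-external 5)   ; raises the error: the free variable even is looked
                   ; up again on every call
(odd-internal 5)   ; => #t, unaffected: its even is a fixed local binding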

A fun experiment would be to define both odd and even as bound variables, so all 40 million variable fetches would benefit from quick bound-variable fetch times. I saw a 16% improvement on top of the original 25%. Here's the code:

(define (odd-quick x)
  (define (odd x) (if (zero? x) #f (even (sub1 x))))
  (define (even x) (if (zero? x) #t (odd (sub1 x))))
  (odd x))
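For example, one could time all three in the same session (a minimal sketch; absolute numbers will of course depend on your machine and Scheme implementation):

(display "internal: ") (time (odd-internal 40000000))
(display "external: ") (time (odd-external 40000000))
(display "quick:    ") (time (odd-quick 40000000))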

[1] let is syntactic sugar for a lambda application, so you can read that code as:

(define (odd-internal x)
  ((lambda (even)
     (if (zero? x)
         #f
         (even (sub1 x))))
   (lambda (x)
     (if (zero? x)
         #t
         (odd-internal (sub1 x))))))
Blau answered 7/10, 2018 at 21:31 Comment(3)
Your 'odd-quick' version is about as fast as the 'odd-external' version on my machine. Are you sure that you can consistently reproduce the result? – Gyrose
Yes, I tested this on Chez Scheme 9.5, JDoodle (Racket) and Repl.it (Scheme). However, as I noted at the top of my answer, this is implementation-specific, since R6RS and its derived languages (the ones I know, at least) don't dictate how you should implement nested defines. One could write a compiler that handles nested defines just as badly as free variables. What do you mean by "about as fast"? How did you test this? – Blau
I just tested with time. Using Racket v7.0, I always get the result that odd-internal is the fastest, then odd-quick, and finally odd-external. – Gyrose