On improving Haskell's performance compared to C in fibonacci micro-benchmark

I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that fib takes 100% of the time and allocation, which is no surprise. I also took some heap profiles, but I don't know what they imply:

> ./fib 40 +RTS -hc

(heap profile by cost centre, image omitted)

> ./fib 40 +RTS -hd

(heap profile by closure description, image omitted)

So my question: is there anything I can do on my side to make this Haskell program's performance closer to C's, or is this just the way GHC works, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that ghc -O3 was faster than ghc -O3 -fllvm -optlo-O3 in this case. But -optlo-block-placement made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this was that both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that, and wanted to learn more to make sure I had already done everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s
Jipijapa answered 16/7, 2011 at 8:39 Comment(5)
gcc -O3 probably turns it into a non-recursive loop. I don't know if GHC can do the same.Kisumu
Oh really? Fib is not tail recursive. It's not even linearly recursive. How could it be turned into a loop?Jipijapa
@Po: See https://mcmap.net/q/183660/-why-are-compilers-so-stupid/… for an example doing it with a (not tail-recursive either) factorial, same principle. You wouldn't be the first to underestimate C compilers. You can make gcc output assembly with -S to check whether it does optimize it (I'd check myself, but I don't have easy access to a Unix machine ATM).Kisumu
@delnan: I see. That behavior with -O3 is pretty impressive for an imperative language compiler. But back to this program: how can this fib function possibly be turned into a non-recursive one? You have two recursive calls, and the final pending addition is inevitable. And the original author took care to supply the argument at runtime, so compilers couldn't perform constant propagation like gcc -O3 did in the link you gave.Jipijapa
@Po: The constant has nothing to do with the removal of recursion (go ahead, try it yourself - or just consider the fact that gcc didn't even bother to propagate it on -O2 but still generated the loop). And while the two recursive calls seem to make it harder, I'd just look and see instead of hypothesizing. Again, you wouldn't be the first to underestimate the compiler...Kisumu

The heap profile is not very interesting here, as GHC compiles fib into a function that operates solely on the stack. Just look at the profile: only 800 bytes are ever allocated, which is the small overhead of your main implementation.

As far as GHC's Core level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter, though. Let's take a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space, probably something C doesn't need, as it can let the operating system handle stack allocation. Haskell has user-level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary, though I'm not sure it makes a difference.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: everything seems to be working as it should, and it's unlikely you'll be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdown, though I'm not sure it can be blamed for the full time difference.

Pother answered 16/7, 2011 at 11:7 Comment(2)
How can this stack check be eliminated?Iiette
@is7s: It cannot be eliminated - it is an integral part of how programs compiled with GHC work. Note that most programs wouldn't have to do this stack check often, and the custom stack allows GHC to scale up to millions of user threads.Pother

This seems like a really feeble 'benchmark', as barsoap says. Suppose I compare the following, almost equally naive, programs:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = (\x y -> x + y + y) (fib (n-3)) (fib (n-2))

main = getArgs >>= print . fib . read . head

... and in the other corner ...

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  if (n < 3) return n;
  return (fib (n-3) + fib (n-2)) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Then the Glorious ghc crushes gcc, not too surprisingly, really:

$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main             ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141

real    0m0.013s
user    0m0.007s
sys 0m0.003s

$ gcc fib.c -o fibc
$ time ./fibc 40
165580141

real    0m1.495s
user    0m1.483s
sys 0m0.003s

Now, turning on optimizations, ghc picks up a little more speed:

$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141

real    0m0.007s
user    0m0.002s
sys 0m0.004s

But gcc finally gets a clue:

$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141

real    0m0.007s
user    0m0.004s
sys 0m0.002s

I think the explanation is GHC's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and GHC figures the programmer knows how to use a lambda.
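For instance (my own sketch of the same point, not code from the thread), the sharing the lambda provides can be made just as explicit with a where clause; GHC will not introduce this sharing by itself:

```haskell
module Main where

import System.Environment (getArgs)

-- Same sharing as the lambda version: each subcall is bound to a
-- name, so it is evaluated at most once per call of fib.
-- Valid because fib (n-1) + fib (n-2) = fib (n-3) + 2 * fib (n-2)
-- once the extra base case fib 2 = 2 is supplied.
fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = x + y + y
  where
    x = fib (n - 3)
    y = fib (n - 2)

main :: IO ()
main = getArgs >>= print . fib . read . head
```

Running this with argument 40 still prints 165580141, since the rewritten recurrence is equivalent to the original one.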

Intensifier answered 22/7, 2011 at 20:26 Comment(7)
Is this performance difference due to the use of a lambda expression?Iiette
I was making two emendations at once, making things less clear than they might be: first, I exposed that fib n = fib (n-3) + fib (n-2) + fib (n-2) (and thus supplied the extra base case) -- as in the C file. With ghc -O2 this is as fast as gcc -O3. Without optimization flags, ghc and gcc are both slow. If (second emendation) I add the lambda, then ghc without optimization flags is already fast. My thought was that it was basically this lambda that ghc -O2 and gcc -O3 see. That gcc can see this even in Po's file is a tiny bit of additional cleverness.Intensifier
Another way to put it is that, with the use of the lambda, even runhugs is faster than gcc compiling without optimization flags.Intensifier
But I don't get why or how does lambda addition result in this improvement.Iiette
@is7s: Because each subcall (fib n) is written only once and bound to either the x or y parameter. Each of these expressions then gets (lazily) evaluated only once. The second use of y naturally reuses the already-evaluated value.Jipijapa
@Po but this doesn't make any difference in the original naive fibonacci algorithm in your question, does it?Iiette
In Po's original clause otherwise = fib (n-1) + fib (n-2) nothing is (visibly) repeated, so there is no use for a lambda. If we rewrite fib (n-1) as fib (n-2) + fib (n-3), which it is (for n > 3), then we get otherwise = fib (n-2) + fib (n-3) + fib (n-2). Then we have a common subexpression fib (n-2) and can write otherwise = fib2 + fib3 + fib2 where fib2 = fib (n-2); fib3 = fib (n-3), or the explicit lambda I used. (Or we can write 2 * fib2 + fib3, of course.) To do any of this you need to add a third base case, fib 2 = 2.Intensifier

GHC compiles this fine. The next step is to micro-optimize the output from GHC's backend. Playing with various LLVM flags can help here.

To do this, use ghc-core to inspect the assembly, and try other flags to LLVM to see what you get.

Another approach would be to add a small amount of parallelism.
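The answer doesn't show code for this, but as a rough sketch (assuming the parallel package is installed; the cutoff of 25 is an arbitrary choice of mine), a par/pseq version of the same naive recursion might look like:

```haskell
module Main where

import Control.Parallel (par, pseq)  -- from the 'parallel' package
import System.Environment (getArgs)

-- Below the cutoff, recurse sequentially; above it, spark one
-- branch so both halves of the addition can run in parallel.
pfib :: Int -> Int
pfib n
  | n < 25    = fib n
  | otherwise = l `par` (r `pseq` (l + r))
  where
    l = pfib (n - 1)
    r = pfib (n - 2)

-- The sequential naive version from the question.
fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = getArgs >>= print . pfib . read . head
```

Compile with ghc -O2 -threaded and run with ./fib 40 +RTS -N to use multiple cores; the cutoff keeps the spark overhead from swamping the tiny leaf calls.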

Eisk answered 16/7, 2011 at 8:56 Comment(2)
Thanks Don :D. I played around with LLVM flags a bit. It looks like only optlo-block-placement made an observable difference (from ~1.5s to ~1.1s). Btw, I didn't notice that GHC's default backend (simply ghc -O3) was faster than ghc -O3 -fllvm -optlo-O3 (~1.2s, ~1.5s respectively) for this program on my machine.Jipijapa
You can also do more advanced llvm games: see donsbot.wordpress.com/2010/03/01/…Eisk

Try this:

fibs :: [Integer]
fibs = 0:1:zipWith (+) fibs (tail fibs)

fib n = fibs !! n

$ time ./fib 10000
  33644[...]6875
  ./fib 10000  0.03s user 0.01s system 89% cpu 0.039 total

(That's on a good ole Athlon64 3200+)

The version you are using computes, for every n, both fib (n-1) and fib (n-2) from scratch, so it has roughly exponential complexity. The version above is linear: each fib is only computed once. Despite what the non-Haskell programming hivemind seems to think, Haskell does not automatically memoise (which would, in general, be slower than good ole dynamic programming anyway).

There's an even faster version of Fibonacci (using maths tricks) on the Haskell Wiki.

Change the C version to be non-recursive, and my bet is that you'll see both Haskell and C having very similar performance. Tight loops are just easier to optimise.
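For reference (my own sketch, not code from the answer), a non-recursive Haskell version is exactly the kind of tight loop GHC optimises well:

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import System.Environment (getArgs)

-- Iterative Fibonacci with strict accumulators, using the same
-- indexing as the question (fib 0 = 1, fib 1 = 1).  GHC compiles
-- the worker 'go' to a tight loop over unboxed Ints.
fibIter :: Int -> Int
fibIter n = go n 1 1
  where
    go :: Int -> Int -> Int -> Int
    go 0 a _   = a
    go k !a !b = go (k - 1) b (a + b)

main :: IO ()
main = getArgs >>= print . fibIter . read . head
```

With this indexing, fibIter 40 prints the same 165580141 as the recursive versions above.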

Stoltz answered 16/7, 2011 at 11:4 Comment(6)
From the question: "compared the performance of various compilers on computing fibonaci numbers the naive way", also "I'm not asking for an asymptotically faster algorithm to compute fibs"Xerosere
...I understand the question, it just doesn't fit into my mind why one would want to compare naive implementations of micro-benchmarks. Comparing naive C IO and naive stream-fusion enabled Haskell IO, that I would understand. Also, that zipWith is naive, depending on which math book you have in your hand.Stoltz
The benchmark is to compare implementations of a certain recursive algorithm; the point is not to find the Fibonacci numbers.Xerosere
Thanks barsoap :). The reason I wanted to investigate further was because both C and OCaml were significantly faster than Haskell in this test (0.4s, 0.6s, and 1.4s on my machine, respectively), and I found it sort of hard to accept, so I wanted to learn more to make sure I already did everything I could.Jipijapa
@Stoltz The naive recursive version tests the implementation's ability to handle a large number of small function calls, something that an FP language implementation should be rather good at.Dippy
@Dippy The thing is, GHC code usually doesn't end up with a large number of small function calls, due to very aggressive inlining. The productive work/uninlinable call ratio here is very atypical for actual programs.Stoltz
