Ackermann very inefficient with Haskell/GHC

I tried computing Ackermann(4,1), and there is a big difference in performance between languages/compilers. Below are results on my Core i7 3820QM, 16 GB, Ubuntu 12.10 64-bit:

C: 1.6s, gcc -O3 (with gcc 4.7.2)

#include <stdio.h>

int ack(int m, int n) {
  if (m == 0) return n+1;
  if (n == 0) return ack(m-1, 1);
  return ack(m-1, ack(m, n-1));
}

int main() {
  printf("%d\n", ack(4,1));
  return 0;
}

OCaml: 3.6s, ocamlopt (with ocaml 3.12.1)

let rec ack = function
  | 0,n -> n+1
  | m,0 -> ack (m-1, 1)
  | m,n -> ack (m-1, ack (m, n-1))
in print_int (ack (4, 1))

Standard ML: 5.1s mlton -codegen c -cc-opt -O3 (with mlton 20100608)

fun ack 0 n = n+1
  | ack m 0 = ack (m-1) 1
  | ack m n = ack (m-1) (ack m (n-1));
print (Int.toString (ack 4 1));

Racket: 11.5s racket (with racket v5.3.3)

(require racket/unsafe/ops)

(define + unsafe-fx+)
(define - unsafe-fx-)
(define (ack m n)
  (cond
    [(zero? m) (+ n 1)]
    [(zero? n) (ack (- m 1) 1)]
    [else (ack (- m 1) (ack m (- n 1)))]))

(time (ack 4 1))

Haskell: did not finish; killed by the system after 22s, ghc -O2 (with ghc 7.4.2)

Haskell: 1.8s ajhc (with ajhc 0.8.0.4)

main = print $ ack 4 1
  where ack :: Int -> Int -> Int
        ack 0 n = n+1
        ack m 0 = ack (m-1) 1
        ack m n = ack (m-1) (ack m (n-1))

The Haskell version is the only one that fails to terminate properly because it takes too much memory. It freezes my machine and fills the swap space before getting killed. What can I do to improve it without heavily fuglifying the code?

EDIT: I appreciate some of the asymptotically smarter solutions, but they are not exactly what I am asking for. This is more about seeing whether the compiler handles certain patterns in a reasonably efficient way (stack, tail calls, unboxing, and so on) than about computing the Ackermann function.

EDIT 2: As pointed out by several responses, this seems to be a bug in recent versions of GHC. I tried the same code with AJHC and got much better performance.

Thank you very much :)

Aridatha answered 20/4, 2013 at 2:13 Comment(11)
For anyone who attempts this, the problem here is computational, not algorithmic. The non-memoized call to ack(4,1) results in a recursion of depth 65535, and a total of 2862984010 calls to ack.Bolanos
ghc-core shows Ints are correctly getting converted to unboxed Int#, so that's not the problem.Bolanos
@Bolanos Thanks. I blindly attempted this.Cormier
The same code in Frege runs in 8.3s, though the JVM needs to be given a bit of extra stack space. Hardware is i3 CPU M 350 @ 2.27GHzSuffuse
JHC seems to perform better with micro benchmarks like this: haskell.org/pipermail/glasgow-haskell-users/2013-April/…Confabulate
On my macbook with ghc-7.6 this program returns the correct answer; with ghc-HEAD it returns the same answer in the same absurd amount of time, differing only by using less memory. There is thus no genuine bug. The real scandal is the seemingly absurd timing each gives, which in fact follows from familiar facts about how ghc deals with things and what it chooses to optimize.Prominence
Please note that my username is actually not applicative and I think my issue was deleted along with his comments... In any case, it's not fair to compare strict vs non-strict runtimes. Please add strictness annotations if you want a fair comparison. This version has a very clear thunk leak.Gayelord
Hi alternative, the function uses Int, and GHC -O2 unboxes it into Int# already. So there's no need for explicit annotation. Thanks.Aridatha
This might be interesting for you #3242454Dumpy
I did mention mlton earlier, wondering why, if you were looking for scandal, you weren't surprised mlton took longer than ocamlopt, given its famous orientation to whole-program optimization, which is a quite different approach from e.g. ghc's. It would not occur to me actually to attack SML or mlton, which are in my view both divine things.Prominence
FWIW, on ghc 8.6.4 with either -kc2m or -ki2m it runs in 2.8s on a 3 GHz i5 on a Mac. Without those options it takes 10.5s, presumably because of the large stack required, and the small initial stack size and small stack chunk size in ghc, which are there to make creating threads as lightweight as possible.Playhouse

NB: The high memory usage issue is a bug in the GHC RTS, where, upon stack overflow and allocation of new stacks on the heap, it was not checked whether garbage collection was due. It has already been fixed in GHC HEAD.


I was able to get much better performance by CPS-converting ack:

module Main where

data P = P !Int !Int

main :: IO ()
main = print $ ack (P 4 1) id
  where
    ack :: P -> (Int -> Int) -> Int
    ack (P 0 n) k = k (n + 1)
    ack (P m 0) k = ack (P (m-1) 1) k
    ack (P m n) k = ack (P m (n-1)) (\a -> ack (P (m-1) a) k)

Your original function consumes all available memory on my machine, while this one runs in constant space.

$ time ./Test
65533
./Test  52,47s user 0,50s system 96% cpu 54,797 total

Ocaml is still faster, however:

$ time ./test
65533./test  7,97s user 0,05s system 94% cpu 8,475 total
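
For readers unfamiliar with the transformation, here is a minimal sketch (my addition, not part of the benchmark) of the same idea applied to factorial: the pending work moves from the call stack into an explicitly passed continuation.

```haskell
-- CPS conversion illustrated on factorial (illustrative sketch only).
-- In the direct version, the multiplication is pending on the stack;
-- in the CPS version, it is captured in the continuation instead.
factDirect :: Int -> Int
factDirect 0 = 1
factDirect n = n * factDirect (n - 1)

factCPS :: Int -> (Int -> Int) -> Int
factCPS 0 k = k 1
factCPS n k = factCPS (n - 1) (\r -> k (n * r))

main :: IO ()
main = print (factDirect 10 == factCPS 10 id)
```

The `ack` version above does the same thing: every recursive call is a tail call, with the outer `ack (m-1) a` deferred into the continuation.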

Edit: When compiled with JHC, your original program is about as fast as the Ocaml version:

$ time ./hs.out 
65533
./hs.out  5,31s user 0,03s system 96% cpu 5,515 total

Edit 2: Something else I've discovered: running your original program with a larger stack chunk size (+RTS -kc1M) makes it run in constant space. The CPS version is still a bit faster, though.

Edit 3: I managed to produce a version that runs nearly as fast as the Ocaml one by manually unrolling the main loop. However, it only works when run with +RTS -kc1M (Dan Doel has filed a bug about this behaviour):

{-# LANGUAGE CPP #-}
module Main where

data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !Int

ack0 :: Int -> Int
ack0 n = n + 1

#define C(a) a
#define CONCAT(a,b) C(a)C(b)

#define AckType(M) CONCAT(ack,M) :: Int -> Int

AckType(1)
AckType(2)
AckType(3)
AckType(4)

#define AckDecl(M,M1) \
CONCAT(ack,M) n = case n of { 0 -> CONCAT(ack,M1) 1 \
; 1 ->  CONCAT(ack,M1) (CONCAT(ack,M1) 1) \
; _ ->  CONCAT(ack,M1) (CONCAT(ack,M) (n-1)) }

AckDecl(1,0)
AckDecl(2,1)
AckDecl(3,2)
AckDecl(4,3)

ack :: P -> (Int -> Int) -> Int
ack (P m n) k = case m of
  0 -> k (ack0 n)
  1 -> k (ack1 n)
  2 -> k (ack2 n)
  3 -> k (ack3 n)
  4 -> k (ack4 n)
  _ -> case n of
    0 -> ack (P (m-1) 1) k
    1 -> ack (P (m-1) 1) (\a -> ack (P (m-1) a) k)
    _ -> ack (P m (n-1)) (\a -> ack (P (m-1) a) k)

main :: IO ()
main = print $ ack (P 4 1) id

Testing:

$ time ./Test +RTS -kc1M
65533
./Test +RTS -kc1M  6,30s user 0,04s system 97% cpu 6,516 total

Edit 4: Apparently, the space leak is fixed in GHC HEAD, so +RTS -kc1M won't be required in the future.

Annelieseannelise answered 20/4, 2013 at 3:0 Comment(2)
It's worth noting that the constant memory usage is achieved by uncurrying and using unboxed tuples (similar to -XUnboxedTuples) regardless of whether or not you CPS-convert (which does improve the running time significantly).Terbia
Judging from the CMM code, the stack size required to run the code is just a bit less than 1M (recursion depth 65535 according to @scvalex, 16 bytes for each call). Hence -kc1M takes no additional space, while -kc768K finishes, but with considerable space consumption.Triangulation

It seems that there is some kind of bug involved. What GHC version are you using?

With GHC 7, I get the same behavior as you do. The program consumes all available memory without producing any output.

However, if I compile it with GHC 6.12.1 just with ghc --make -O2 Ack.hs, it works perfectly. It computes the result in 10.8s on my computer, while the plain C version takes 7.8s.

I suggest you report this bug on the GHC web site.

Wilde answered 20/4, 2013 at 16:36 Comment(0)

This version uses some properties of the Ackermann function. It's not equivalent to the other versions, but it's fast:

ackermann :: Int -> Int -> Int
ackermann 0 n = n + 1
ackermann m 0 = ackermann (m - 1) 1
ackermann 1 n = n + 2
ackermann 2 n = 2 * n + 3
ackermann 3 n = 2 ^ (n + 3) - 3
ackermann m n = ackermann (m - 1) (ackermann m (n - 1))
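
As a quick sanity check (my addition, not part of the original answer), the closed forms for m = 1..3 can be compared against the naive definition on small inputs:

```haskell
-- Naive definition, used only to cross-check the closed forms below.
naive :: Int -> Int -> Int
naive 0 n = n + 1
naive m 0 = naive (m - 1) 1
naive m n = naive (m - 1) (naive m (n - 1))

-- The closed forms used in the answer for m = 1, 2, 3.
closed :: Int -> Int -> Int
closed 1 n = n + 2
closed 2 n = 2 * n + 3
closed 3 n = 2 ^ (n + 3) - 3
closed _ _ = error "closed forms only given for m = 1..3"

main :: IO ()
main = print (and [ naive m n == closed m n | m <- [1 .. 3], n <- [0 .. 5] ])
```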

Edit: And here is a version with memoization. We see that it's easy to memoize a function in Haskell; the only change is at the call site:

import Data.Function.Memoize

ackermann :: Integer -> Integer -> Integer
ackermann 0 n = n + 1
ackermann m 0 = ackermann (m - 1) 1
ackermann 1 n = n + 2
ackermann 2 n = 2 * n + 3
ackermann 3 n = 2 ^ (n + 3) - 3
ackermann m n = ackermann (m - 1) (ackermann m (n - 1))

main :: IO ()
main = print $ memoize2 ackermann 4 2
Lydialydian answered 20/4, 2013 at 10:1 Comment(0)

The following is an idiomatic version that takes advantage of Haskell's laziness and GHC's optimisation of constant top-level expressions.

acks :: [[Int]]
acks = [ [ case (m, n) of
                (0, _) -> n + 1
                (_, 0) -> acks !! (m - 1) !! 1
                (_, _) -> acks !! (m - 1) !! (acks !! m !! (n - 1))
         | n <- [0..] ]
       | m <- [0..] ]

main :: IO ()
main = print $ acks !! 4 !! 1

Here, we're lazily building a matrix for all the values of the Ackermann function. As a result, subsequent calls to acks will not recompute anything (i.e. evaluating acks !! 4 !! 1 again will not double the running time).

Although this is not the fastest solution, it looks a lot like the naïve implementation; it is very efficient in terms of memory use, and it recasts one of Haskell's weirder features (laziness) as a strength.
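
A small usage sketch of the sharing (my addition, assuming GHC's standard sharing of top-level constants): a second lookup of the same entry does not redo the recursive work, because the cell's thunk has already been forced.

```haskell
-- Lazy matrix of Ackermann values, as in the answer above.
acks :: [[Int]]
acks = [ [ case (m, n) of
             (0, _) -> n + 1
             (_, 0) -> acks !! (m - 1) !! 1
             _      -> acks !! (m - 1) !! (acks !! m !! (n - 1))
         | n <- [0 ..] ]
       | m <- [0 ..] ]

main :: IO ()
main = do
  print (acks !! 3 !! 3)  -- forces the cell: prints 61
  print (acks !! 3 !! 3)  -- the cell's value is shared, so nothing is recomputed
```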

Bolanos answered 20/4, 2013 at 3:10 Comment(7)
Runs in constant space, but is still much slower than Ocaml: ./Test 82,38s user 0,44s system 92% cpu 1:29,98 totalAnnelieseannelise
Yes. I'm personally happy it runs at all; the version from the question would run out of memory on my machine. I'll try speeding it up with arrays now.Bolanos
Ah, no. I need bounds for arrays.Bolanos
I think using a different algorithm for a simple benchmarking question like this one is kind of cheating...Keeley
Taking advantage of the fact Haskell can do something awesome that most other languages cannot is a bit cheating, I suppose, but if you don't, the benchmark ends up comparing apples and oranges. The benchmark isn't, "how much slower is C code when written in Haskell", right?Bolanos
@scvalex: Other languages can do memoization too. It's just that you can't use list comprehensions for that.Keeley
Your solution is nice. But the purpose of this question is not to look for an asymptotically smarter algorithm to compute ackermann. It's more about seeing whether the compiler handles certain basic things in a reasonably efficient way. The Haskell solution I present is naive, but it's written in a declarative, functional manner, and it's not obvious to me why it fails to work at all while the ones from other compilers eventually terminate with that same pattern.Aridatha

Writing the algorithm in Haskell in a way that looks similar to the C version does not give you the same algorithm, because the semantics of recursion are quite different.

Here is a version using the same mathematical algorithm, but where we represent calls to the Ackermann function symbolically using a data type. That way, we can control the semantics of the recursion more precisely.

When compiled with optimization, this version runs in constant memory, but it is slow - about 4.5 minutes in an environment similar to yours. But I'm sure it could be modified to be much faster. This is just to give the idea.

data Ack = Ack !Int

ack :: Int -> Int -> Int
ack m n = length . ackR $ Ack m : replicate n (Ack 0)
  where
    ackR n@(Ack 0 : _) = n
    ackR n             = ackR $ ack' n

    ack' [] = []
    ack' (Ack 0 : n) = Ack 0 : ack' n
    ack' [Ack m]     = [Ack (m-1), Ack 0]
    ack' (Ack m : n) = Ack (m-1) : ack' (Ack m : decr n)

    decr (Ack 0 : n) = n
    decr n           = decr $ ack' n
Misconstruction answered 23/4, 2013 at 18:9 Comment(2)
Huh. I changed the data to a newtype, thinking it would save memory, hence less garbage collection and less time. But the result was actually slower. Changing back to data.Misconstruction
One easy optimization is to combine long runs of Ack m for the same m into one, adding an extra parameter to Ack for the number of times that it is repeated. That cuts down the amount of garbage collection. This cut down my run time to about 1 minute.Misconstruction

I don't see that this is a bug at all; ghc just isn't taking advantage of the fact that it knows 4 and 1 are the only arguments the function will ever be called with. That is, to put it bluntly, it doesn't cheat. It also doesn't do constant math for you, so if you had written main = print $ ack (2+2) 1, it wouldn't have calculated that 2+2 = 4 until runtime. ghc has much more important things to think about. Help for the latter difficulty is available if you care for it: http://hackage.haskell.org/package/const-math-ghc-plugin.

So ghc is helped if you do a little math, e.g. this is at least a hundred times as fast as your C program with 4 and 1 as arguments. But try it with 4 & 2:

main = print $ ack 4 2 where

    ack :: Int -> Integer -> Integer
    ack 0 n = n + 1
    ack 1 n = n + 2
    ack 2 n = 2 * n + 3
    ack m 0 = ack (m-1) 1
    ack m n = ack (m-1) (ack m (n-1))

This will give the right answer, all ~20,000 digits, in under a tenth of a second, whereas gcc, with your algorithm, will take forever to give the wrong answer.

Prominence answered 20/4, 2013 at 21:4 Comment(1)
It is a bug. The runtime quickly eats up all memory and hangs the machine even if you set hard limits on the amount of heap and stack space it is allowed to use. Fortunately, this bug only occurs for very arcane functions like the Ackermann function, and anyway it is fixed in HEAD.Misconstruction

This performance issue (apart from the GHC RTS bug, obviously) seems to be fixed now on OS X 10.8 after the Apple Xcode update to 4.6.2. I can still reproduce it on Linux (though I have been testing with the GHC LLVM backend), but not any more on OS X. The Xcode update seems to have substantially changed the code GHC's backend generates for Ackermann (from what I remember from looking at object dumps pre-update). I could reproduce the performance issue on the Mac before the Xcode update; I don't have the numbers, but they were surely quite bad. So, it seems that the Xcode update improved GHC's code generation for Ackermann.

Now, both C and GHC versions are quite close. C code:

int ack(int m, int n) {
  if (m == 0) return n + 1;
  if (n == 0) return ack(m - 1, 1);
  return ack(m - 1, ack(m, n - 1));
}

Time to execute ack(4,1):

GCC 4.8.0: 2.94s
Clang 4.1: 4s

Haskell code:

ack :: Int -> Int -> Int
ack 0 n = n+1
ack m 0 = ack (m-1) 1
ack m n = ack (m-1) (ack m (n-1))

Time to execute ack 4 1 (with +RTS -kc1M):

GHC 7.6.1 Native: 3.191s
GHC 7.6.1 LLVM: 3.8s 

All were compiled with the -O2 flag (and -rtsopts for GHC, as a workaround for the RTS bug). It is quite a head scratcher, though. Updating Xcode seems to have made a big difference to the optimization of Ackermann in GHC.

Elga answered 27/4, 2013 at 3:5 Comment(0)
