Why is this prime test so slow?

This code was taken from the book "Haskell Road to Logic, Math and Programming". It implements the sieve of Eratosthenes algorithm and solves Project Euler Problem 10.

sieve :: [Integer] -> [Integer]
sieve (0 : xs) = sieve xs
sieve (n : xs) = n : sieve (mark xs 1 n)
  where
    mark :: [Integer] -> Integer -> Integer -> [Integer]
    mark (y:ys) k m | k == m = 0 : (mark ys 1 m)
                    | otherwise = y : (mark ys (k+1) m)

primes :: [Integer]
primes = sieve [2..]

-- Project Euler #10
main = print $ sum $ takeWhile (< 2000000) primes

Actually, it runs even slower than the naive prime test. Can someone explain this behaviour?

I suspect it has something to do with iterating each element in the list in the mark function.

Thanks.

Contravene answered 30/7, 2012 at 8:47 Comment(1)
Somewhat relevant (but not an answer) - The Genuine Sieve of Eratosthenes - there's already a direct link in a comment to the top answer, but this is via a Lambda The Ultimate page.Isotone

You are building up a quadratic number of unevaluated thunks with this algorithm. The algorithm relies so heavily on laziness that this is also the reason why it doesn't scale.

Let's walk through how it works, which hopefully should make the problem apparent. For simplicity, let's say that we want to print the elements of primes ad infinitum, i.e. we want to evaluate each cell in the list one after the other. primes is defined as:

primes :: [Integer]
primes = sieve [2..]

Since 2 isn't 0, the second equation of sieve applies, so 2 is added to the list of primes, and the rest of the list is an unevaluated thunk (I write tail instead of the pattern match n : xs in sieve for xs; tail isn't actually being called, and doesn't add any overhead in the code below; mark is actually the only thunked function):

primes = 2 : sieve (mark (tail [2..]) 1 2)

Now we want the second primes element. So, we walk through the code (exercise for the reader) and end up with:

primes = 2 : 3 : sieve (mark (tail (mark (tail [2..]) 1 2)) 1 3)

Same procedure again, we want to evaluate the next prime...

primes = 2 : 3 : 5 : sieve (mark (tail (tail (mark (tail (mark (tail [2..]) 1 2)) 1 3))) 1 5)

This is starting to look like LISP, but I digress... Starting to see the problem? For each element in the primes list, an increasingly large thunk, a stack of mark applications, has to be evaluated. In other words, for each element in the list, there has to be a check for whether that element is marked by any of the preceding primes, by evaluating each mark application in the stack. So, for n ~= 2000000, the Haskell runtime has to call functions resulting in a call stack with a depth of about ... I don't know, 137900 (let n = 2e6 in n / log n gives a lower bound)? Something like that. This is probably what causes the slow-down; maybe vacuum can tell you more (I don't have a computer with both Haskell and a GUI right now).

The reason why the sieve of Eratosthenes works in languages like C is that:

  1. You aren't using an infinite list.
  2. Because of (1), you can mark the whole array before continuing with the next n, resulting in no call stack overheads at all.
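
For contrast, here is a minimal sketch of that strategy in Haskell itself, using a strict, finite unboxed array (primesUpTo and isqrt are names made up for this sketch, not anything from the question; accumArray finishes all the marking before a single prime is read off):

import Data.Array.Unboxed (UArray, accumArray, assocs)

-- Finite, strict sieve: the whole "marked" table is built before the
-- result list is consumed, so no chains of suspended mark stages form.
primesUpTo :: Int -> [Int]
primesUpTo n = [i | (i, False) <- assocs marked]
  where
    marked :: UArray Int Bool
    marked = accumArray (\_ _ -> True) False (2, n)
               -- marking from composite i too is redundant but harmless
               [(m, ()) | i <- [2 .. isqrt n], m <- [i*i, i*i + i .. n]]
    isqrt = floor . sqrt . (fromIntegral :: Int -> Double)

With this, Project Euler #10 becomes print . sum . primesUpTo $ 1999999.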
Yarkand answered 30/7, 2012 at 10:2 Comment(13)
(For alternative algorithms, read this paper.)Dehydrogenate
... and this oneStuccowork
"The reason why the sieve of Eratosthenes works in languages like C" - fwiw you can do strict array programming in Haskell as well, so I suppose in this regard Haskell is "a language like C".Grayish
A = "(1) and (2) apply to languages like C", B = "(1) and (2) apply to languages like Haskell", P = "Sieve of Eratosthenes works". A → P; B → P. Does that mean that A = B?Yarkand
you seem to suggest that "an increasingly large thunk" of nested mark applications is growing there, but each mark works alone, separately, actually producing a stream of numbers, each getting its supply from its predecessor, precisely because each mark application is getting forced by pattern matching - creating an actual list in memory. In Haskell, storage is persistent. So there just isn't any such growing thunk there as you describe.Tweedy
Well, by "large thunk" I mean a thunk that depends on a lot of other thunks. As in "foldl (+) 0 xs will create a large thunk." It's the same sort of thunk dependency, although the in-memory implications are different as you say.Yarkand
Exactly my point. That kind of thunk built up by foldl is nowhere to be found in the OP code. (k+1) is forced immediately by the next invocation comparing k==m. I don't see any other thunks there. Each marking thunk is shallow.Tweedy
Exactly my point. The in-memory representation is different, but it's the same kind of thunk dependency as in the foldl situation. I.e., the mark stack doesn't run to gradually resolve a single value; instead, the first mark needs to produce an element which forces the next mark to traverse its input to produce an element etc. So, forcing an element out of the total stack implies the work of all the marks in the stack, instead of being able to e.g. have an escape condition three layers deep for overlapping pseudo-primes (so mark x 3 escapes before mark x 7 triggers when checking 21).Yarkand
no, it's not. the inner thunks in the foldl situation remain unforced. That situation is commonly referred to as "unevaluated thunk buildup", just like you have done here. but here, there are no unevaluated thunks. each is forced by its consumer, the receiver of its output. here they are not nested, but chained through the data streams - output of one is input to another. each forced, shallow. No nested thunks buildup whatsoever. here the implementational detail of using no-arg lambdas aka thunks is irrelevant. they could be realized as generators with mutable state just as well.Tweedy
and, no traversing is performed. each output element is immediately consumed. Each mark producer works in separation. There is no escaping in the postponed filters situation either, so it is not the main issue. The main issue is, there are too many of them. (well, with filters there is sieving involved, some numbers are rejected early - here there is no sieving - I mention that in my answer as well - but that is a secondary concern).Tweedy
(ok, by "escaping" you probably meant what I mean by "sieving" ...) perhaps I misread the question too broadly, as in "why is it so much slow?". In the narrow sense it indeed asks why the OP code is slower than Turners. That is indeed because of non-sieving (but also because each mark producer has to constantly update its internal state, - increment its counter - unlike nomult in Turner's sieve). There are still no "unevaluated thunks" IMO - precisely because each tail is forced by pattern matching. You mention it in the answer itself.Tweedy
I think you are over-analyzing both the problem and how I chose to describe it. I look at the vacuum analysis of the primes value, and I personally think my description matches what I see. If there's some detail that makes my description have a completely different meaning to you, I'm sorry, but it also doesn't really matter all that much, either; this is after all a SO question, so there are always other answers offering different perspectives; it's not the end of the world if one doesn't cover 100% of the answer :)Yarkand
"Does that mean that A = B?" no, it means that (1) and (2) in this answer miss the point: for a sieve to not be abysmally slow (i.e. to not be quadratic) it must stop sieving as soon as the head element of the input list, squared, exceeds the upper limit (here, 2000000). So here, sieve (x:xs) | x*x > 2000000 = (x:xs) must be added as the 2nd clause in sieve's definition. This stopping early can be done in Haskell as in C, and it can be done on lists just as it can on arrays, under lazy evaluation as well as under the strict one.Tweedy

It's not only the thunks that make it horrendously slow; that algorithm would also be very slow if implemented in C on a finite bit array.

sieve :: [Integer] -> [Integer]
sieve (0 : xs) = sieve xs
sieve (n : xs) = n : sieve (mark xs 1 n)
  where
    mark :: [Integer] -> Integer -> Integer -> [Integer]
    mark (y:ys) k m | k == m = 0 : (mark ys 1 m)
                    | otherwise = y : (mark ys (k+1) m)

For each prime p, this algorithm checks all numbers from p+1 to the limit whether they are multiples of p. It doesn't do so by dividing, as Turner's sieve does, but by comparing a counter to the prime. Now, comparing two numbers is much faster than dividing, but the price paid for that is that each number n is now inspected for each prime < n, and not only for the primes up to n's smallest prime factor.

The result is that the complexity of this algorithm is O(N^2 / log N) versus O( (N/log N)^2 ) for the Turner sieve (and O(N*log (log N)) for a real sieve of Eratosthenes).
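
A back-of-the-envelope count shows where the first two bounds come from, using pi(n) ~= n/log n for the number of primes up to n. Here, every number n up to N is inspected by one mark stage per prime below it, so the total work is about

sum [ pi(n) | n <- [2..N] ]  ~=  N^2 / (2 log N)        -- hence O(N^2 / log N)

whereas in Turner's sieve a composite is discarded by the filter of its smallest prime factor, so the dominant cost is the primes themselves, each of which passes through all the filters set up before it:

sum [ pi(p) | p <- primes up to N ]  ~=  pi(N)^2 / 2  ~=  (N / log N)^2 / 2   -- hence O((N/log N)^2)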

The stacking¹ of thunks mentioned by dflemstr exacerbates the problem², but even without that the algorithm would be worse than Turner's. I'm appalled and fascinated at the same time.


¹ "Nesting" may not be the correct word. Although each of the mark thunks is reachable only via the one above it, they don't reference anything from the enclosing thunk's scope.

² There is nothing quadratic in either size or depth of the thunks, though, and the thunks are fairly well-behaved. For the illustration, let's pretend mark were defined with reverse argument order. Then, when 7 is found to be prime, the situation is

sieve (mark 5 2 (mark 3 1 (mark 2 1 [7 .. ])))
~> sieve (mark 5 2 (mark 3 1 (7 : mark 2 2 [8 .. ])))
~> sieve (mark 5 2 (7 : mark 3 2 (mark 2 2 [8 .. ])))
~> sieve (7 : mark 5 3 (mark 3 2 (mark 2 2 [8 .. ])))
~> 7 : sieve (mark 7 1 (mark 5 3 (mark 3 2 (mark 2 2 [8 .. ]))))

and the next pattern-match by sieve forces the mark 7 1 thunk, which forces the mark 5 3 thunk, which forces the mark 3 2 thunk, which forces the mark 2 2 thunk, which forces the [8 .. ] thunk and replaces the head with 0, and wraps the tail in a mark 2 1 thunk. That bubbles up to the sieve, which discards the 0 and then forces the next stack of thunks.

So for every number from p_k + 1 to p_(k+1) (inclusive), the pattern-match in sieve forces a stack/chain of k thunks of the form mark p r. Each of these takes the (y:ys) received from the enclosed thunk ([y ..] for the innermost mark 2 r) and wraps the tail ys in a new thunk, either leaving y unchanged or replacing it with 0, thus building up a new stack/chain of thunks in what will be the tail of the list reaching sieve.

For each found prime, sieve adds another mark p r thunk on top, so at the end, when the first prime larger than 2000000 is found and takeWhile (< 2000000) finishes, there will be 148933 levels of thunks.

The stacking of thunks here doesn't influence the complexity, it just influences the constant factors. In the situation we're dealing with, a lazily generated infinite immutable list, there's not much one can do to reduce the time spent transferring control from one thunk to the next. If we were dealing with a finite mutable list or array that is not lazily generated, as one would in a language like C or Java, it would be much better to let each mark p finish its complete job (a simple for loop with less overhead than function calls and transfers of control) before examining the next number, so there would never be more than one marker active, and much less control-passing.
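
To illustrate that last point in Haskell terms, here is a sketch with a mutable unboxed array in ST (sieveST is a made-up name; this is the "let each marker finish its job" strategy, not the OP's code):

import Control.Monad (forM_, when)
import Control.Monad.ST (ST, runST)
import Data.Array.ST (STUArray, getAssocs, newArray, readArray, writeArray)

-- Each prime p completes its entire marking pass in a tight loop over
-- p*p, p*p+p, ... before the next candidate is even looked at, so only
-- one "marker" is ever active and no control passes between stages.
sieveST :: Int -> [Int]
sieveST n = runST $ do
    marked <- newArray (2, n) False :: ST s (STUArray s Int Bool)
    forM_ [2 .. floor (sqrt (fromIntegral n :: Double))] $ \p -> do
        composite <- readArray marked p
        when (not composite) $
            forM_ [p*p, p*p + p .. n] $ \m ->
                writeArray marked m True
    assocs' <- getAssocs marked
    return [i | (i, False) <- assocs']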

Plutocrat answered 30/7, 2012 at 12:9 Comment(11)
appalled? be really appalled: [n | n<-[2..], not $ elem n [j*k | j<-[1..n-1], k<-[1..n-1]]]. Which was actually proposed here on SO, though in Python.Tweedy
Will, I'm not appalled by the bad algorithm as such. The appalling thing is that the code is from a book - a supposedly good one - and apparently comes without a warning.Plutocrat
appalled and fascinated - I should've quoted you more fully. :) That algo I quoted also fascinates me, as it is also a true transcription of the definition, just like Turner sieve is. After all, the primes are indeed all integers above 1 which are non-composite. Haskell isn't "Maths" after all.Tweedy
Yes, that algorithm is amazing. It has the slight flaw that the lists should start at 2 to truly reflect the definition of a prime, but apart from that glitch it is, like Turner's sieve, a neat and concise declaration hiding a terrible complexity - only more so.Plutocrat
even with 1s it produces primes for some reason. :) but it is really something special, isn't it. A thing to behold.Tweedy
Since the upper bound is n-1, no product 1*k can be n, so it works. But for n > 2, there are primes in the list of j*k, although conceptually it should contain only composites. I agree about the specialty, it's the first cubic prime generator I have seen, awesome.Plutocrat
I don't know if letting each mark work to the end at its turn would change much. It'd still be O(n) total ops as it is now because mark counts by ones, instead of O(n/p). Whether working "by rows" or "by columns", it's the same table that gets scanned through, cell by cell.Tweedy
@Will If you have strict mutable data, letting each mark do its job in one go eliminates the creation of thunk { when demanded, do that } and the control passing, each would just be a loop with little overhead. The complexity would still be the same, but for(i = p+1, m = 1; i < N; ++i) { if (p == m) { .. }} is much less work than calling functions and creating thunks.Plutocrat
it depends how well thunk handling is optimized by the compiler. if it blindly creates new full-blown closures each time just to be discarded later, then what you're saying makes a lot of sense.Tweedy
I don't see how it could do anything less than write a structure containing the two numbers (prime and remainder) and the index next to process (the pointer to the next element in case of a linked list) plus an indication of which thunk to evaluate before if any. These structures could be reused, updating the appropriate members, but it's still more work than just looping. Not dramatically, I guess, but significant.Plutocrat
true. though such reified closures would indeed turn into kind of generators, so would be "pulled" rather than "evaluated". the chain of such structures could be then reified as an array, and processed in a loop. that loop could then get unrolled, structures compiled out into a bunch of global variables, etc.; optimizers would kick in, collapsing chains of assignments into mutating just one var, ... who knows where it would end. :)Tweedy

Ok, you are definitely right, it is slower than the naive implementation. I took this one from Wikipedia and compared it to your code in GHCi thus:

-- from Wikipedia
sieveW [] = [] 
sieveW (x:xs) = x : sieveW remaining 
  where 
    remaining = [y | y <- xs, y `mod` x /= 0]

-- your code
sieve :: [Integer] -> [Integer]
sieve (0 : xs) = sieve xs
sieve (n : xs) = n : sieve (mark xs 1 n)
  where
    mark :: [Integer] -> Integer -> Integer -> [Integer]
    mark (y:ys) k m | k == m = 0 : (mark ys 1 m)
                    | otherwise = y : (mark ys (k+1) m)

Running gives

[1 of 1] Compiling Main             ( prime.hs, interpreted )
Ok, modules loaded: Main.
*Main> :set +s
*Main> sum $ take 2000 (sieveW [2..])
16274627
(1.54 secs, 351594604 bytes)
*Main> sum $ take 2000 (sieve [2..])
16274627
(12.33 secs, 2903337856 bytes)

To try and understand what was happening and how exactly the mark code works, I tried expanding the code manually:

  sieve [2..]
= sieve (2 : [3..])
= 2 : sieve (mark [3..] 1 2)
= 2 : sieve (3 : (mark [4..] 2 2))
= 2 : 3 : sieve (mark (mark [4..] 2 2) 1 3)
= 2 : 3 : sieve (mark (0 : (mark [5..] 1 2)) 1 3)
= 2 : 3 : sieve (0 : (mark (mark [5..] 1 2) 2 3))
= 2 : 3 : sieve (mark (mark [5..] 1 2) 2 3)
= 2 : 3 : sieve (mark (5 : (mark [6..] 2 2)) 2 3)
= 2 : 3 : sieve (5 : (mark (mark [6..] 2 2) 3 3))
= 2 : 3 : 5 : sieve (mark (mark (mark [6..] 2 2) 3 3) 1 5)
= 2 : 3 : 5 : sieve (mark (mark (0 : (mark [7..] 1 2)) 3 3) 1 5)
= 2 : 3 : 5 : sieve (mark (0 : (mark (mark [7..] 1 2) 1 3)) 1 5)
= 2 : 3 : 5 : sieve (0 : (mark (mark (mark [7..] 1 2) 1 3) 2 5))
= 2 : 3 : 5 : sieve (mark (mark (mark [7..] 1 2) 1 3) 2 5)
= 2 : 3 : 5 : sieve (mark (mark (7 : (mark [8..] 2 2)) 1 3) 2 5)
= 2 : 3 : 5 : sieve (mark (7 : (mark (mark [8..] 2 2) 2 3)) 2 5)
= 2 : 3 : 5 : sieve (7 : (mark (mark (mark [8..] 2 2) 2 3) 3 5))
= 2 : 3 : 5 : 7 : sieve (mark (mark (mark (mark [8..] 2 2) 2 3) 3 5) 1 7)

The 7 is passed up through each mark stage unchanged (none of the counters reaches its prime when 7 arrives) and is emitted as the next prime, so the mechanism is clear. This code is merely creating a set of counters counting up to each prime, emitting the next prime at the correct moment and passing it up the list. This is equivalent to merely checking for division by each previous prime, as in the naive implementation, with the additional overhead of passing the 0s or primes between the thunks.
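
For reference, the "naive implementation" alluded to is trial division of each candidate by the earlier primes, something along these lines (a sketch; the question doesn't show its exact naive test):

-- trial division by primes up to sqrt n; note it stops early at each n,
-- which the counter-passing scheme above cannot do
primesNaive :: [Integer]
primesNaive = 2 : [n | n <- [3,5..]
                     , all (\p -> n `mod` p /= 0)
                           (takeWhile (\p -> p*p <= n) primesNaive)]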

There may be some further subtlety here that I am missing. There is a very detailed treatment of The Sieve of Eratosthenes in Haskell along with various optimisations here.

Astronautics answered 30/7, 2012 at 10:15 Comment(2)
I don't see that any of the mod calculations in sieveW are being memoised, partly because each y `mod` x is only asked for once (elements of xs are unique) and partly because even if xs contained duplicates, the previously calculated y `mod` x would not be accessible.Kirkland
Yes, you are right — should have thought that fully through. Will edit answer.Astronautics

short answer: the counting sieve is slower than Turner's (a.k.a. "naive") sieve because it emulates direct RAM access with sequential counting, which forces it to pass along the streams unsieved between the marking stages. This is ironic, because counting is what makes it a "genuine" sieve of Eratosthenes, as opposed to Turner's trial division sieve. Actually removing the multiples, as Turner's sieve does, would mess up the counting.

Both algorithms are extremely slow because they start up multiples elimination work too early from each found prime instead of its square, thus creating too many unneeded stream processing stages (whether filtering or marking) - O(n) of them, instead of just ~ 2*sqrt n/log n, in producing primes up to n in value. No checking for the multiples of 7 is required until 49 is seen in the input.

This answer explains how sieve can be seen as building a pipeline of stream-processing "transducers" behind itself, as it is working:

[2..] ==> sieve --> 2
[3..] ==> mark 1 2 ==> sieve --> 3
[4..] ==> mark 2 2 ==> mark 1 3 ==> sieve 
[5..] ==> mark 1 2 ==> mark 2 3 ==> sieve --> 5
[6..] ==> mark 2 2 ==> mark 3 3 ==> mark 1 5 ==> sieve 
[7..] ==> mark 1 2 ==> mark 1 3 ==> mark 2 5 ==> sieve --> 7
[8..] ==> mark 2 2 ==> mark 2 3 ==> mark 3 5 ==> mark 1 7 ==> sieve
[9..] ==> mark 1 2 ==> mark 3 3 ==> mark 4 5 ==> mark 2 7 ==> sieve
[10..]==> mark 2 2 ==> mark 1 3 ==> mark 5 5 ==> mark 3 7 ==> sieve
[11..]==> mark 1 2 ==> mark 2 3 ==> mark 1 5 ==> mark 4 7 ==> sieve --> 11

The Turner sieve uses nomult p = filter ((/=0).(`rem`p)) in place of mark _ p entries but otherwise looks the same:

[2..] ==> sieveT --> 2
[3..] ==> nomult 2 ==> sieveT --> 3
[4..] ==> nomult 2 ==> nomult 3 ==> sieveT 
[5..] ==> nomult 2 ==> nomult 3 ==> sieveT --> 5
[6..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT 
[7..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT --> 7
[8..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> nomult 7 ==> sieveT 

Each such transducer can be implemented as a closure frame (a.k.a. "thunk") or as a generator with mutable state; that is unimportant. The output of each such producer goes directly as input into its successor in a chain. There are no unevaluated thunks here; each is forced by its consumer to produce its next output.
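
To make the "generator with mutable state" reading concrete, one marking stage could just as well be written like this (a sketch; mkMark is a made-up name, and IO stands in for any mutable-state mechanism):

import Data.IORef

-- One marking stage as a stateful generator: each demand pulls one
-- number from the upstream source and either passes it through or
-- replaces it with 0 when the counter reaches p, then resets.
mkMark :: IO Integer -> Integer -> IO (IO Integer)
mkMark upstream p = do
    kRef <- newIORef 1
    return $ do
        y <- upstream
        k <- readIORef kRef
        if k == p
            then do writeIORef kRef 1
                    return 0
            else do writeIORef kRef (k + 1)
                    return y

Chaining one of these per found prime on top of a counter source reproduces exactly the pipelines drawn above.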

So, to answer your question,

I suspect it has something to do with iterating each element in the list in the mark function.

yes, exactly. They both run non-postponed schemes otherwise.


The code can thus be improved by postponing the start of stream-marking:

primes = 2:3:filter (>0) (sieve [5,7..] (tail primes) 9)

sieve (x:xs) ps@ ~(p:t) q 
   | x < q = x:sieve xs ps q
   | x==q  = sieve (mark xs 1 p) t (head t^2)
  where
    mark (y:ys) k p 
       | k == p    = 0 : (mark ys 1 p)      -- mark each p-th number in supply
       | otherwise = y : (mark ys (k+1) p)

It now runs at just above O(k^1.5), empirically, in k primes produced. But why count by ones when we can count by an increment? (Each 3rd odd number from 9 can be found by adding 6, again and again.) And then, instead of marking, we can weed out the numbers right away, getting ourselves a bona fide sieve of Eratosthenes (even if not the most efficient one):

primes = 2:3:sieve [5,7..] (tail primes) 9

sieve (x:xs) ps@ ~(p:t) q 
   | x < q = x:sieve xs ps q
   | x==q  = sieve (weedOut xs (q+2*p) (2*p)) t (head t^2)
  where
    weedOut i@(y:ys) m s 
       | y < m = y:weedOut ys m s
       | y==m  = weedOut ys (m+s) s
       | y > m = weedOut i (m+s) s

This runs at above O(k^1.2) in k primes produced, per quick-n-dirty testing compiled and loaded into GHCi, producing up to 100k - 150k primes, deteriorating to O(k^1.3) at about 0.5 million primes.
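
These empirical orders of growth are estimated by timing the run at two sizes and taking the slope on a log-log scale. A sketch of that arithmetic, with made-up names:

-- if t ~ c * k^a, then a = logBase (k2/k1) (t2/t1)
orderOfGrowth :: (Double, Double) -> (Double, Double) -> Double
orderOfGrowth (k1, t1) (k2, t2) = logBase (k2 / k1) (t2 / t1)

e.g. a doubling of k that takes 2.3x as long indicates about O(k^1.2), since logBase 2 2.3 ~= 1.2.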


So what kind of speedups are achieved by this? Comparing the OP code with the Turner sieve from Wikipedia,

primes = sieve [2..] :: [Int]
  where
    sieve (x:xs) = x : sieve [y | y <- xs, rem y x /= 0]

there was an 8x speedup of W/OP at 2k (i.e. producing 2000 primes). But at 4k it was a 15x speedup. The Turner sieve seems to run at about O(k^1.9 .. 2.3) empirical complexity in producing k = 1000 .. 6000 primes, and the counting sieve at O(k^2.3 .. 2.6) for the same range.

For the two versions here in this answer, v1/W was an additional 20x faster at 4k, and 43x at 8k. v2/v1 was 5.2x at 20k, 5.8x at 40k and 6.5x faster at producing 80,000 primes.

(for comparison, the priority-queue code by Melissa O'Neill runs at about O(k^1.2) empirical complexity, in k primes produced. It scales much better than the code here, of course).


Here is the definition of the sieve of Eratosthenes:

P = {3,5, ...} \ { {p*p, p*p + 2*p, ...} | p in P }

The key to the sieve of Eratosthenes' efficiency is the direct generation of multiples of primes, by counting in increments of (twice) the prime's value from each prime; and their direct elimination, made possible by the conflation of value and address, much as in integer sorting algorithms (possible only with mutable arrays). It is immaterial whether it must produce a pre-set number of primes or work indefinitely, because it can always work by segments.
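
On lists, a close runnable transcription of this definition (over all numbers rather than just the odds, so slightly different from the formula above) is Richard Bird's sieve from the epilogue of the O'Neill paper:

primesB :: [Integer]
primesB = 2 : ([3 ..] `minus` composites)
  where
    composites = union [multiples p | p <- primesB]
    multiples n = map (n *) [n ..]          -- starts at n*n, as in the definition
    (x:xs) `minus` (y:ys)
        | x < y     = x : (xs `minus` (y:ys))
        | x == y    = xs `minus` ys
        | otherwise = (x:xs) `minus` ys
    union = foldr merge []
      where
        -- emit the head before touching the rest, to keep the fold
        -- over the infinite list of lists productive
        merge (x:xs) ys = x : merge' xs ys
        merge' (x:xs) (y:ys)
            | x < y     = x : merge' xs (y:ys)
            | x == y    = x : merge' xs ys
            | otherwise = y : merge' (x:xs) ys

It still does no early stopping, but because each multiples stream starts at p*p, laziness never forces a stream before its first element is needed, avoiding the too-early start-up that makes the two counting schemes above so slow.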

Tweedy answered 31/7, 2012 at 3:13 Comment(9)
"The Turner sieve uses nomult p = filter ((/=0).(`rem`p)) ...". Having troubles with the markup in the answer body.Tweedy
Fixed that for you, double-backtick code-with-backticks double-backtick also works in question bodies.Plutocrat
@DanielFischer thanks a lot, I tried double backquotes around "rem" on the inside, not around the whole thing... If we're here already, what about that mythical unevaluated thunk buildup (re "increasingly large thunk ..." in dflemstr's answer - see also in comments). I say it does not exist here. The chain of dependent producers, each forced by its consumer, is not a nested thunk like foldl would make. Shouldn't we be able to use language in an unambiguous manner? You too refer to this in your answer. I think it is erroneous.Tweedy
Well, "stacked thunks" may be the better formulation. But you do have after the k-th prime thunks (switching arg order) mark p_k x (mark ... (mark 3 x (mark 2 x list))...), nested or stacked k layers deep. So every pattern match in sieve tells the outermost thunk "Hey, gimme a constructor", that passes the message down, then the (_:_) is passed up again, a couple of thunks replacing the head with 0 on the way (usually). You have a growing nest/stack of thunks (not quadratic in either size or depth, though - proportional to the number of primes found) that slows things down.Plutocrat
chained, not nested. Each existing in its own right in memory. It is forced through the storage, not from inside the function above it. It doesn't know about the function/thunk/producer/generator/transducer above it, all it knows is how to produce its next output. Same way, all it has is its input "(y:ys)" - storage - it doesn't know where it comes from, from another "mark" down the chain or whatever. There is no mark (mark (mark ... ))) at run-time. Each list-producing function mark is best seen as generator, IMO.Tweedy
As so often, I don't understand what you're trying to say. A mark list thunk, when evaluated, produces a result of the form (:) x' (mark xs) if list is of the form (x:xs), where x' is x or 0. It is only evaluated when the enclosing thunk needs to know whether its argument is y:ys or not. That happens only when that thunk is evaluated, its evaluation then produces (:) x'' (mark (mark xs)) [or, if you prefer (:) x'' thunk where thunk = mark thunk' where thunk' = mark xs ]. Each of these thunks is only reachable from the thunk above it, looks nested to me.Plutocrat
Unless there's a technical definition of nested thunks that means something else, if so, I'd appreciate a link. And yes, there is a mark (mark (mark ...)) at runtime. Even if you think of them as generators, each takes the output of the one below as input, so has a reference to that.Plutocrat
"nested" implies nested scope. f x = g where g = z x where z = p q where p = .... defines nested thunks. each referring to the enclosing environment. so I thought.Tweedy
With that definition, of course these are indeed not nested. Okay, will change wording to avoid misunderstandings.Plutocrat
