Short answer: the counting sieve is slower than Turner's (a.k.a. "naive") sieve because it emulates direct RAM access with sequential counting, which forces it to pass the streams along unsieved between the marking stages. This is ironic, because the counting is what makes it a "genuine" sieve of Eratosthenes, as opposed to Turner's trial-division sieve. Actually removing the multiples, as Turner's sieve does, would mess up the counting.
Both algorithms are extremely slow because they start the multiples-elimination work too early, from each found prime itself instead of from its square, thus creating far too many unneeded stream-processing stages (whether filtering or marking): O(n) of them, instead of just ~ 2*sqrt(n)/log(n), in producing primes up to n in value. No checking for the multiples of 7 is required until 49 is seen in the input.
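For concreteness, here is a minimal sketch of such a non-postponed counting sieve. The OP's exact code is not reproduced in this answer, so the names primesC, sieveC and markC are illustrative; markC replaces every p-th element of its input with 0:

-- a minimal sketch of a non-postponed counting sieve; names are
-- illustrative, not the OP's. 0 stands for a crossed-off number.
primesC :: [Int]
primesC = sieveC [2..]

sieveC :: [Int] -> [Int]
sieveC (x:xs)
  | x > 0     = x : sieveC (markC xs 1 x)  -- x is prime: add a marking stage
  | otherwise =     sieveC xs              -- 0: a crossed-off composite

markC :: [Int] -> Int -> Int -> [Int]      -- replace every p-th element by 0
markC (y:ys) k p
  | k == p    = 0 : markC ys 1     p
  | otherwise = y : markC ys (k+1) p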
This answer explains how such a sieve (sieveC above; written sieve and mark in the diagrams below) can be seen as building a pipeline of stream-processing "transducers" behind itself, as it is working:
[2..] ==> sieve --> 2
[3..] ==> mark 1 2 ==> sieve --> 3
[4..] ==> mark 2 2 ==> mark 1 3 ==> sieve
[5..] ==> mark 1 2 ==> mark 2 3 ==> sieve --> 5
[6..] ==> mark 2 2 ==> mark 3 3 ==> mark 1 5 ==> sieve
[7..] ==> mark 1 2 ==> mark 1 3 ==> mark 2 5 ==> sieve --> 7
[8..] ==> mark 2 2 ==> mark 2 3 ==> mark 3 5 ==> mark 1 7 ==> sieve
[9..] ==> mark 1 2 ==> mark 3 3 ==> mark 4 5 ==> mark 2 7 ==> sieve
[10..]==> mark 2 2 ==> mark 1 3 ==> mark 5 5 ==> mark 3 7 ==> sieve
[11..]==> mark 1 2 ==> mark 2 3 ==> mark 1 5 ==> mark 4 7 ==> sieve --> 11
The Turner sieve uses nomult p = filter ((/=0).(`rem`p)) in place of the mark _ p entries but otherwise looks the same:
[2..] ==> sieveT --> 2
[3..] ==> nomult 2 ==> sieveT --> 3
[4..] ==> nomult 2 ==> nomult 3 ==> sieveT
[5..] ==> nomult 2 ==> nomult 3 ==> sieveT --> 5
[6..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT
[7..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> sieveT --> 7
[8..] ==> nomult 2 ==> nomult 3 ==> nomult 5 ==> nomult 7 ==> sieveT
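In runnable form that is just (a sketch, with nomult as above and primesT/sieveT as illustrative names):

primesT :: [Int]
primesT = sieveT [2..]
  where
    sieveT (x:xs) = x : sieveT (nomult x xs)  -- one new filtering stage per prime
    nomult p = filter ((/=0).(`rem`p))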
Each such transducer can be implemented as a closure frame (a.k.a. "thunk"), or as a generator with mutable state; the distinction is unimportant. The output of each such producer goes directly as input into its successor in the chain. There are no unevaluated thunks piling up here: each is forced by its consumer to produce its next output.
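For instance, forcing a few elements of one marking stage (using markC from the sketch above) in GHCi:

> take 6 (markC [3..] 1 2)
[3,0,5,0,7,0]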
So, to answer your question,

"I suspect it has something to do with iterating each element in the list in the mark function."

Yes, exactly. They both run non-postponed schemes otherwise.
The code can thus be improved by postponing the start of the stream-marking:
primes = 2 : 3 : filter (>0) (sieve [5,7..] (tail primes) 9)

sieve (x:xs) ps@ ~(p:t) q
  | x < q  = x : sieve xs ps q                  -- below p*p: pass through as is
  | x == q = sieve (mark xs 1 p) t (head t^2)   -- p*p reached: start marking p's multiples
  where
    mark (y:ys) k p
      | k == p    = 0 : mark ys 1     p         -- mark each p-th number in the supply
      | otherwise = y : mark ys (k+1) p
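A quick sanity check in GHCi:

> take 10 primes
[2,3,5,7,11,13,17,19,23,29]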
It now runs at just above O(k^1.5), empirically, in k primes produced. But why count by ones when we can count by an increment? (Each 3rd odd number from 9 onward can be found by adding 6, again and again.) And then, instead of marking, we can weed out the numbers right away, getting ourselves a bona fide sieve of Eratosthenes (even if not the most efficient one):
primes = 2 : 3 : sieve [5,7..] (tail primes) 9

sieve (x:xs) ps@ ~(p:t) q
  | x < q  = x : sieve xs ps q                             -- below p*p: pass through as is
  | x == q = sieve (weedOut xs (q+2*p) (2*p)) t (head t^2)
  where
    weedOut i@(y:ys) m s                  -- remove m, m+s, m+2*s, ... from the supply
      | y < m  = y : weedOut ys m s
      | y == m =     weedOut ys (m+s) s   -- y is the next multiple: drop it
      | y > m  =     weedOut i  (m+s) s   -- m was removed upstream: skip ahead
This runs at just above O(k^1.2) in k primes produced (quick-and-dirty testing of the code compiled and loaded into GHCi, producing up to 100k..150k primes), deteriorating to about O(k^1.3) at about 0.5 million primes.
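One way to reproduce this kind of quick-and-dirty estimate, as a sketch (timePrimes and the k values here are illustrative, not from any library): time the production of k and 2k primes and take the base-2 logarithm of the ratio of run times as the local exponent.

import Data.Time.Clock (getCurrentTime, diffUTCTime)

-- time how long it takes to force the first k primes
timePrimes :: Int -> IO Double
timePrimes k = do
  t0 <- getCurrentTime
  print (primes !! (k-1))     -- forces the k-th prime, and all before it
  t1 <- getCurrentTime
  return (realToFrac (diffUTCTime t1 t0))

main :: IO ()
main = do
  t1 <- timePrimes 50000
  t2 <- timePrimes 100000     -- NB: primes is shared, so this only pays for the new ones
  putStrLn ("local empirical order: " ++ show (logBase 2 (t2 / t1)))

Because primes is a shared top-level list, the second measurement reuses the already-computed prefix; for cleaner numbers, run each k in a fresh process.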
So what kind of speedups are achieved by all this? Comparing the OP's code with the "Wikipedia" Turner sieve,
primes = sieve [2..] :: [Int]
  where
    sieve (x:xs) = x : sieve [y | y <- xs, rem y x /= 0]
there was an 8x speedup of W/OP at 2k (i.e. producing 2000 primes). But at 4k it was a 15x speedup. The Turner sieve seems to run at about O(k^1.9..2.3) empirical complexity in producing k = 1000..6000 primes, and the counting sieve at O(k^2.3..2.6) for the same range.
For the two versions here in this answer, v1/W was an additional 20x faster at 4k, and 43x at 8k. v2/v1 was 5.2x at 20k, 5.8x at 40k, and 6.5x faster at producing 80,000 primes.
(For comparison, the priority-queue code by Melissa O'Neill runs at about O(k^1.2) empirical complexity in k primes produced. It scales much better than the code here, of course.)
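That code is not reproduced here, but the same idea can be approximated with Data.Map standing in for the priority queue, as a sketch (primesPQ is an illustrative name; this is the principle, not O'Neill's actual code):

import qualified Data.Map as M

-- the table maps each upcoming composite to the step(s) that produced it
primesPQ :: [Int]
primesPQ = 2 : sieve [3,5..] M.empty
  where
    sieve (x:xs) table = case M.lookup x table of
      Nothing    -> x : sieve xs (M.insert (x*x) [2*x] table)  -- x is prime
      Just steps -> sieve xs (foldl reinsert (M.delete x table) steps)
        where reinsert t s = M.insertWith (++) (x+s) [s] t     -- push multiples ahead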
Here is the definition of the sieve of Eratosthenes:
P = {3,5, ...} \ ⋃ { {p*p, p*p + 2*p, ...} | p in P }
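This definition translates almost literally into Haskell; the following sketch is essentially Richard Bird's list-based sieve, where minus and union merge ordered increasing streams (primesB is an illustrative name):

primesB :: [Int]
primesB = 2 : 3 : ([5,7..] `minus` unionAll [[p*p, p*p+2*p ..] | p <- tail primesB])

minus :: Ord a => [a] -> [a] -> [a]        -- difference of ordered streams
minus (x:xs) (y:ys) = case compare x y of
  LT -> x : minus xs (y:ys)
  EQ ->     minus xs ys
  GT ->     minus (x:xs) ys

unionAll :: Ord a => [[a]] -> [a]          -- union of streams with increasing heads
unionAll ((x:xs):t) = x : union xs (unionAll t)
  where
    union (a:as) (b:bs) = case compare a b of
      LT -> a : union as (b:bs)
      EQ -> a : union as bs
      GT -> b : union (a:as) bs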
The key to the sieve of Eratosthenes's efficiency is the direct generation of the multiples of primes, by counting in increments of (twice) the prime's value from each prime; and their direct elimination, made possible by the conflation of value and address, much like in integer sorting algorithms (truly possible only with mutable arrays). It is immaterial whether it must produce a pre-set number of primes or work indefinitely, because it can always work by segments.