The sieve of Eratosthenes in F#
Asked Answered
E

17

37

I am interested in an implementation of the sieve of eratosthenes in purely functional F#. I am interested in an implementation of the actual sieve, not the naive functional implementation that isn't really the sieve, so not something like this:

let rec PseudoSieve list =
    match list with
    | hd::tl -> hd :: (PseudoSieve <| List.filter (fun x -> x % hd <> 0) tl)
    | [] -> []

The second link above briefly describes an algorithm that would require the use of a multimap, which isn't available in F# as far as I know. The Haskell implementation given uses a map that supports an insertWith method, which I haven't seen available in the F# functional map.

Does anyone know a way to translate the given Haskell map code to F#, or perhaps knows of alternative implementation methods or sieving algorithms that are as efficient and better suited for a functional implementation or F#?

Eight answered 7/1, 2011 at 20:1 Comment(12)
It is quite possible to use arrays in F# in a pure way. Do it that (traditional) way and it'll be as fast as though you'd written it in C#.Evan
@Evan - the traditional way requires modifying the array, which wouldn't be pure anymore, would it?Eight
Ah, but you can make it look pure! Say you want to update array a to produce array b and ensure that this is done in a pure fashion, what you do is this (in pseudocode): "a[i] := x; b = a; // Never use a again!" You can give this a pure semantics, even though you have an impure implementation. In Mercury, for example, the array update function does this and the Mercury mode system guarantees that your program will never be allowed to use a again.Evan
@Evan - that is not pure in any sense of the word, it's just a silly trick that actually means exactly nothing. You are still using the memory location of a because that's what b will point to. You cannot stop using a completely unless you copy it element-by-element in a newly allocated array, which would be terribly inefficient.Eight
Sorry, but you're wrong: this is exactly how state is managed in a pure fashion in Mercury and Haskell (Mercury uses uniqueness and Haskell uses monads, but what's happening under the covers is exactly the same). Indeed, it's how IO is managed in a pure fashion, too. There is nothing wrong with having an impure implementation with a pure interface provided your promise of purity is warranted.Evan
@Evan - but how is that promise warranted when you are still changing the same memory location? Changing the elements of b, in F#, will change those of a too in your example.Eight
@Eight - the promise is warranted because referential transparency isn't violated. That is, there's no way anyone calling your sieve function implemented in this way can decide whether the underlying implementation is impure or not. Sure, my proposed implementation does depend on sordid side effects, but those side effects are invisible to the caller. Seriously, take a look at the implementation of arrays in Mercury or Haskell!Evan
To explain a bit more, the "never use 'a' again" constraint is exactly what Haskell's State and IO monads guarantee or what Mercury's unique modes guarantee. In Haskell, you never actually get your hands on the array itself because its hidden inside the monad and the monad ain't never going to give it to you! In Mercury, any updates to the array produce a new "unique" array and render the old array "dead" (never to be accessed again). Of course, the new array is exactly the old array after a destructive update.Evan
@Evan - then what's the point of doing b = a in F# if b will still point to the exact same memory address? You might as well not do that, no? Either way, I was looking for a function without side effects.Eight
It is OK to copy-update an array, creating a new one. Both are separate pure immutable entities. Now if the 1st one goes out of scope just as 2nd is created, it is OK for an implementation to implement such update in a destructive fashion, as in here: (in Haskell) sieve p a = sieve (p+2) (a//[(i,False) | i<-[p*p, p*p+2*p..m]]). Here // is pure array-update operation. Unfortunately Haskell forces us to write it out exlicitly, using special monad. Similarly it is OK to use impure F# in isolated fashion 4 such update.Disentail
Although this question is old it's still valid. In fact, @Evan is correct that there aren't really any pure functional computer languages as behind the scenes there are always needs for mutable fields or variables else performance suffers. For instance, LazyList's have a mutable LazyList class field as a discriminated union that changes state depending on whether the cell has been evaluated or not; not doing this would mean that the whole list would need to be copied for every evaluaton. Sequences also need at least one mutable state to record the result of the last MoveNext() as Current.Varicella
@Varicella thunk memoization affects performance, not correctness. As long as mutation can not be observed from inside the language (i.e. w.r.t. correctness), it is as if it doesn't exist.Disentail
T
18

Reading that article I came up with an idea that doesn't require a multimap. It handles colliding map keys by moving the colliding key forward by its prime value again and again until it reaches a key that isn't in the map. Below primes is a map with keys of the next iterator value and values that are primes.

let primes = 
    let rec nextPrime n p primes =
        if primes |> Map.containsKey n then
            nextPrime (n + p) p primes
        else
            primes.Add(n, p)

    let rec prime n primes =
        seq {
            if primes |> Map.containsKey n then
                let p = primes.Item n
                yield! prime (n + 1) (nextPrime (n + p) p (primes.Remove n))
            else
                yield n
                yield! prime (n + 1) (primes.Add(n * n, n))
        }

    prime 2 Map.empty

Here's the priority queue based algorithm from that paper without the square optimization. I placed the generic priority queue functions at the top. I used a tuple to represent the lazy list iterators.

let primes() = 
    // the priority queue functions
    let insert = Heap.Insert
    let findMin = Heap.Min
    let insertDeleteMin = Heap.DeleteInsert

    // skips primes 2, 3, 5, 7
    let wheelData = [|2L;4L;2L;4L;6L;2L;6L;4L;2L;4L;6L;6L;2L;6L;4L;2L;6L;4L;6L;8L;4L;2L;4L;2L;4L;8L;6L;4L;6L;2L;4L;6L;2L;6L;6L;4L;2L;4L;6L;2L;6L;4L;2L;4L;2L;10L;2L;10L|]

    // increments iterator
    let wheel (composite, n, prime) =
        composite + wheelData.[n % 48] * prime, n + 1, prime

    let insertPrime prime n table =
        insert (prime * prime, n, prime) table

    let rec adjust x (table : Heap) =
        let composite, n, prime = findMin table

        if composite <= x then 
            table 
            |> insertDeleteMin (wheel (composite, n, prime))
            |> adjust x
        else
            table

    let rec sieve iterator table =
        seq {
            let x, n, _ = iterator
            let composite, _, _ = findMin table

            if composite <= x then
                yield! sieve (wheel iterator) (adjust x table)
            else
                if x = 13L then
                    yield! [2L; 3L; 5L; 7L; 11L]

                yield x
                yield! sieve (wheel iterator) (insertPrime x n table)
        }

    sieve (13L, 1, 1L) (insertPrime 11L 0 (Heap(0L, 0, 0L)))

Here's the priority queue based algorithm with the square optimization. In order to facilitate lazy adding primes to the lookup table, the wheel offsets had to be returned along with prime values. This version of the algorithm has O(sqrt(n)) memory usage where the none optimized one is O(n).

let rec primes2() : seq<int64 * int> = 
    // the priority queue functions
    let insert = Heap.Insert
    let findMin = Heap.Min
    let insertDeleteMin = Heap.DeleteInsert

    // increments iterator
    let wheel (composite, n, prime) =
        composite + wheelData.[n % 48] * prime, n + 1, prime

    let insertPrime enumerator composite table =
        // lazy initialize the enumerator
        let enumerator =
            if enumerator = null then
                let enumerator = primes2().GetEnumerator()
                enumerator.MoveNext() |> ignore
                // skip primes that are a part of the wheel
                while fst enumerator.Current < 11L do
                    enumerator.MoveNext() |> ignore
                enumerator
            else
                enumerator

        let prime = fst enumerator.Current
        // Wait to insert primes until their square is less than the tables current min
        if prime * prime < composite then
            enumerator.MoveNext() |> ignore
            let prime, n = enumerator.Current
            enumerator, insert (prime * prime, n, prime) table
        else
            enumerator, table

    let rec adjust x table =
        let composite, n, prime = findMin table

        if composite <= x then 
            table 
            |> insertDeleteMin (wheel (composite, n, prime))
            |> adjust x
        else
            table

    let rec sieve iterator (enumerator, table) = 
        seq {
            let x, n, _ = iterator
            let composite, _, _ = findMin table

            if composite <= x then
                yield! sieve (wheel iterator) (enumerator, adjust x table)
            else
                if x = 13L then
                    yield! [2L, 0; 3L, 0; 5L, 0; 7L, 0; 11L, 0]

                yield x, n
                yield! sieve (wheel iterator) (insertPrime enumerator composite table)
        }

    sieve (13L, 1, 1L) (null, insert (11L * 11L, 0, 11L) (Heap(0L, 0, 0L)))

Here's my test program.

type GenericHeap<'T when 'T : comparison>(defaultValue : 'T) =
    let mutable capacity = 1
    let mutable values = Array.create capacity defaultValue
    let mutable size = 0

    let swap i n =
        let temp = values.[i]
        values.[i] <- values.[n]
        values.[n] <- temp

    let rec rollUp i =
        if i > 0 then
            let parent = (i - 1) / 2
            if values.[i] < values.[parent] then
                swap i parent
                rollUp parent

    let rec rollDown i =
        let left, right = 2 * i + 1, 2 * i + 2

        if right < size then
            if values.[left] < values.[i] then
                if values.[left] < values.[right] then
                    swap left i
                    rollDown left
                else
                    swap right i
                    rollDown right
            elif values.[right] < values.[i] then
                swap right i
                rollDown right
        elif left < size then
            if values.[left] < values.[i] then
                swap left i

    member this.insert (value : 'T) =
        if size = capacity then
            capacity <- capacity * 2
            let newValues = Array.zeroCreate capacity
            for i in 0 .. size - 1 do
                newValues.[i] <- values.[i]
            values <- newValues

        values.[size] <- value
        size <- size + 1
        rollUp (size - 1)

    member this.delete () =
        values.[0] <- values.[size]
        size <- size - 1
        rollDown 0

    member this.deleteInsert (value : 'T) =
        values.[0] <- value
        rollDown 0

    member this.min () =
        values.[0]

    static member Insert (value : 'T) (heap : GenericHeap<'T>) =
        heap.insert value
        heap    

    static member DeleteInsert (value : 'T) (heap : GenericHeap<'T>) =
        heap.deleteInsert value
        heap    

    static member Min (heap : GenericHeap<'T>) =
        heap.min()

type Heap = GenericHeap<int64 * int * int64>

let wheelData = [|2L;4L;2L;4L;6L;2L;6L;4L;2L;4L;6L;6L;2L;6L;4L;2L;6L;4L;6L;8L;4L;2L;4L;2L;4L;8L;6L;4L;6L;2L;4L;6L;2L;6L;6L;4L;2L;4L;6L;2L;6L;4L;2L;4L;2L;10L;2L;10L|]

let primes() = 
    // the priority queue functions
    let insert = Heap.Insert
    let findMin = Heap.Min
    let insertDeleteMin = Heap.DeleteInsert

    // increments iterator
    let wheel (composite, n, prime) =
        composite + wheelData.[n % 48] * prime, n + 1, prime

    let insertPrime prime n table =
        insert (prime * prime, n, prime) table

    let rec adjust x (table : Heap) =
        let composite, n, prime = findMin table

        if composite <= x then 
            table 
            |> insertDeleteMin (wheel (composite, n, prime))
            |> adjust x
        else
            table

    let rec sieve iterator table =
        seq {
            let x, n, _ = iterator
            let composite, _, _ = findMin table

            if composite <= x then
                yield! sieve (wheel iterator) (adjust x table)
            else
                if x = 13L then
                    yield! [2L; 3L; 5L; 7L; 11L]

                yield x
                yield! sieve (wheel iterator) (insertPrime x n table)
        }

    sieve (13L, 1, 1L) (insertPrime 11L 0 (Heap(0L, 0, 0L)))

let rec primes2() : seq<int64 * int> = 
    // the priority queue functions
    let insert = Heap.Insert
    let findMin = Heap.Min
    let insertDeleteMin = Heap.DeleteInsert

    // increments iterator
    let wheel (composite, n, prime) =
        composite + wheelData.[n % 48] * prime, n + 1, prime

    let insertPrime enumerator composite table =
        // lazy initialize the enumerator
        let enumerator =
            if enumerator = null then
                let enumerator = primes2().GetEnumerator()
                enumerator.MoveNext() |> ignore
                // skip primes that are a part of the wheel
                while fst enumerator.Current < 11L do
                    enumerator.MoveNext() |> ignore
                enumerator
            else
                enumerator

        let prime = fst enumerator.Current
        // Wait to insert primes until their square is less than the tables current min
        if prime * prime < composite then
            enumerator.MoveNext() |> ignore
            let prime, n = enumerator.Current
            enumerator, insert (prime * prime, n, prime) table
        else
            enumerator, table

    let rec adjust x table =
        let composite, n, prime = findMin table

        if composite <= x then 
            table 
            |> insertDeleteMin (wheel (composite, n, prime))
            |> adjust x
        else
            table

    let rec sieve iterator (enumerator, table) = 
        seq {
            let x, n, _ = iterator
            let composite, _, _ = findMin table

            if composite <= x then
                yield! sieve (wheel iterator) (enumerator, adjust x table)
            else
                if x = 13L then
                    yield! [2L, 0; 3L, 0; 5L, 0; 7L, 0; 11L, 0]

                yield x, n
                yield! sieve (wheel iterator) (insertPrime enumerator composite table)
        }

    sieve (13L, 1, 1L) (null, insert (11L * 11L, 0, 11L) (Heap(0L, 0, 0L)))


let mutable i = 0

let compare a b =
    i <- i + 1
    if a = b then
        true
    else
        printfn "%A %A %A" a b i
        false

Seq.forall2 compare (Seq.take 50000 (primes())) (Seq.take 50000 (primes2() |> Seq.map fst))
|> printfn "%A"

primes2()
|> Seq.map fst
|> Seq.take 10
|> Seq.toArray
|> printfn "%A"

primes2()
|> Seq.map fst
|> Seq.skip 999999
|> Seq.take 10
|> Seq.toArray
|> printfn "%A"

System.Console.ReadLine() |> ignore
Tormoria answered 8/1, 2011 at 1:12 Comment(22)
Very nice! Your program finds the 100000 th prime in ~5 seconds on my machine. Fast and elegant, +1.Eight
@IVIad I just made a small tweak to the starting prime number by setting it to n * n instead of n + n. Saved a second on my machine.Tormoria
@Tormoria - yes, it's down to 4 seconds now. This is exactly what I wanted. Thank you.Eight
@Tormoria - have anything fancy in mind for the wheel? :) I managed to get it working by hardcoding the given values in an array and adding a k param to the prime function, which gets incremented and taken mod 48 at each call. Then instead of doing n + 1 I do n + wheel2357.[k]. Not sure if it's the best / most elegant way but now it gets the 100000th prime in under a second.Eight
@IVIad are you still getting the correct answer? My version of that fails to return the correct values. I think it has something to do with finding the key instead of finding the min key.Tormoria
@Tormoria - you're right. If I use the wheel, 8809 for example is returned as a prime, when in fact it isn't. Any idea how to fix it?Eight
I wonder if using a priority queue would fix this. Will try to find time to implement that solution too and post back.Eight
@IVIad The main problem I'm having with using the wheel is incrementing the prime iterators the correct amount. Using a wheel of only 2 works correctly but I'm having trouble with wheel [2,4].Tormoria
@gradbot: something's wrong here. On my machine, it takes a minute and 40 seconds to calculate the millionth prime with your implementation of the incremental sieve, yet Juliet's non-sieve takes only 16 seconds. Shouldn't a correct incremental sieve overtake any "isPrime" filter at some point?Zee
@Stephen Swensen - because a purely functional approach will always be slower here. The complexity of @gradbot's implementation is O(n log n log log n) instead of the O(n log log n) performance of the imperative sieve. That translates to over 80 million operations. This is an understatement however, as a map has pretty high constants. The isPrime implementation is O(n sqrt n), but this is an overstatement, as a lot of composites will be eliminated in much less than sqrt(n) operations, and about 1/6 numbers aren't even tested at all. The sieve will overtake it, but very slowly.Eight
-1: The 100,001th prime is 1299721 not 1299817 which is the 100,006th prime.Prothalamium
@Jon, @Tormoria - The simplest fix is probably to replace primes.Add(n*n,n) with if n > 46340 then primes else primes.Add(n*n,n) to prevent overflow when using ints.Barnyard
kvb is right. It's simple integer overflow. You can get to 100001 by using longs i.e. 1L instead of 1. On a side note Seq.nth starts from 0 and not 1 so primes |> Seq.nth 100000 would give the 100001 prime. That's what was throwing me off.Tormoria
masking the overflow problem with longs hides the real problem - premature additions into the Map. A prime's data should only be added into the Map when that prime's square is seen as a candidate. This will shrink the map to a square root in size, save lots of memory and speed up the code. And eliminate the spurious overflows.Disentail
@WillNess How would I delay adding into the map? Wouldn't I need a another queue for that?Tormoria
@Tormoria yes, you read from another source, at much slower pace. that one can read from itself, or from yet another, recursively.Disentail
@WillNess added a recursive version that has the square optimization.Tormoria
what complexities are we forced to deal with, by the languages we use! primes = [2 ..] \ [[p*p, p*p+p ..], for p in primes] is the definition, everything else is implementation cruft! :) :) But when fully declarative programming arrives, millions of coders will be driven out of work, like the car drivers are about to be. Oh well.Disentail
BTW that idea to "move the colliding key forward" is a rediscovery of some Python "ActiveState recipe". Just in case you didn't know about it. :)Disentail
this is what I had in mind -- a simple tweak of your initial code, done along the lines of the Haskell Map version linked above. (repl.it uses F# v4, which doesn't need any type annotations, which the old v3 Ideone made me add before).Disentail
@WillNess nice, too bad Seq.tail is broken in F#. repl.it/EwO8/7Tormoria
hmm, but the primes code still works okay, and its empirical complexity is within norm (~ k^1.35 .. 1.20) as far as I could tell (going up to 1.6 million primes).Disentail
V
10

Although there has been one answer giving an algorithm using a Priority Queue (PQ) as in a SkewBinomialHeap, it is perhaps not the right PQ for the job. What the incremental Sieve of Eratosthenes (iEoS) requires is a PQ that has excellent performance for getting the minimum value and reinserting values mostly slightly further down the queue but doesn't need the ultimate in performance for adding new values as iSoE only adds as new values a total of the primes up to the the square root of the range (which is a tiny fraction of the number of re-insertions that occur once per reduction). The SkewBinomialHeap PQ doesn't really give much more than using the built-in Map which uses a balanced binary search tree - all O(log n) operations - other than it changes the weighting of the operations slightly in favour of the SoE's requirements. However, the SkewBinaryHeap still requires many O(log n) operations per reduction.

A PQ implemented as a Heap in more particular as a Binary Heap and even more particularly as a MinHeap pretty much satisfies iSoE's requirements with O(1) performance in getting the minimum and O(log n) performance for re-insertions and adding new entries, although the performance is actually a fraction of O(log n) as most of the re-insertions occur near the top of the queue and most of the additions of new values (which don't matter as they are infrequent) occur near the end of the queue where these operations are most efficient. In addition, the MinHeap PQ can efficiently implement the delete minimum and insert function in one (generally a fraction of) one O(log n) pass. Then, rather than for the Map (which is implemented as an AVL tree) where there is one O(log n) operation with generally a full 'log n' range due to the minimum value we require being at the far left last leaf of the tree, we are generally adding and removing the minimum at the root and inserting on the average of a few levels down in one pass. Thus the MinHeap PQ can be used with only one fraction of O(log n) operation per culling reduction rather than multiple larger fraction O(log n) operations.

The MinHeap PQ can be implemented with pure functional code (with no "removeMin" implemented as the iSoE doesn't require it but there is an "adjust" function for use in segmentation), as follows:

[<RequireQualifiedAccess>]
module MinHeap =

  type MinHeapTreeEntry<'T> = class val k:uint32 val v:'T new(k,v) = { k=k;v=v } end
  [<CompilationRepresentation(CompilationRepresentationFlags.UseNullAsTrueValue)>]
  [<NoEquality; NoComparison>]
  type MinHeapTree<'T> = 
      | HeapEmpty 
      | HeapOne of MinHeapTreeEntry<'T>
      | HeapNode of MinHeapTreeEntry<'T> * MinHeapTree<'T> * MinHeapTree<'T> * uint32

  let empty = HeapEmpty

  let getMin pq = match pq with | HeapOne(kv) | HeapNode(kv,_,_,_) -> Some kv | _ -> None

  let insert k v pq =
    let kv = MinHeapTreeEntry(k,v)
    let rec insert' kv msk pq =
      match pq with
        | HeapEmpty -> HeapOne kv
        | HeapOne kv2 -> if k < kv2.k then HeapNode(kv,pq,HeapEmpty,2u)
                          else let nn = HeapOne kv in HeapNode(kv2,nn,HeapEmpty,2u)
        | HeapNode(kv2,l,r,cnt) ->
          let nc = cnt + 1u
          let nmsk = if msk <> 0u then msk <<< 1
                     else let s = int32 (System.Math.Log (float nc) / System.Math.Log(2.0))
                          (nc <<< (32 - s)) ||| 1u //never ever zero again with the or'ed 1
          if k <= kv2.k then if (nmsk &&& 0x80000000u) = 0u then HeapNode(kv,insert' kv2 nmsk l,r,nc)
                                                            else HeapNode(kv,l,insert' kv2 nmsk r,nc)
          else if (nmsk &&& 0x80000000u) = 0u then HeapNode(kv2,insert' kv nmsk l,r,nc)
                else HeapNode(kv2,l,insert' kv nmsk r,nc)
    insert' kv 0u pq

  let private reheapify kv k pq =
    let rec reheapify' pq =
      match pq with
        | HeapEmpty -> HeapEmpty //should never be taken
        | HeapOne kvn -> HeapOne kv
        | HeapNode(kvn,l,r,cnt) ->
            match r with
              | HeapOne kvr when k > kvr.k ->
                  match l with //never HeapEmpty
                    | HeapOne kvl when k > kvl.k -> //both qualify, choose least
                        if kvl.k > kvr.k then HeapNode(kvr,l,HeapOne kv,cnt)
                        else HeapNode(kvl,HeapOne kv,r,cnt)
                    | HeapNode(kvl,_,_,_) when k > kvl.k -> //both qualify, choose least
                        if kvl.k > kvr.k then HeapNode(kvr,l,HeapOne kv,cnt)
                        else HeapNode(kvl,reheapify' l,r,cnt)
                    | _ -> HeapNode(kvr,l,HeapOne kv,cnt) //only right qualifies
              | HeapNode(kvr,_,_,_) when k > kvr.k -> //need adjusting for left leaf or else left leaf
                  match l with //never HeapEmpty or HeapOne
                    | HeapNode(kvl,_,_,_) when k > kvl.k -> //both qualify, choose least
                        if kvl.k > kvr.k then HeapNode(kvr,l,reheapify' r,cnt)
                        else HeapNode(kvl,reheapify' l,r,cnt)
                    | _ -> HeapNode(kvr,l,reheapify' r,cnt) //only right qualifies
              | _ -> match l with //r could be HeapEmpty but l never HeapEmpty
                        | HeapOne(kvl) when k > kvl.k -> HeapNode(kvl,HeapOne kv,r,cnt)
                        | HeapNode(kvl,_,_,_) when k > kvl.k -> HeapNode(kvl,reheapify' l,r,cnt)
                        | _ -> HeapNode(kv,l,r,cnt) //just replace the contents of pq node with sub leaves the same
    reheapify' pq


  let reinsertMinAs k v pq =
    let kv = MinHeapTreeEntry(k,v)
    reheapify kv k pq

  let adjust f (pq:MinHeapTree<_>) = //adjust all the contents using the function, then rebuild by reheapify
    let rec adjust' pq =
      match pq with
        | HeapEmpty -> pq
        | HeapOne kv -> HeapOne(MinHeapTreeEntry(f kv.k kv.v))
        | HeapNode (kv,l,r,cnt) -> let nkv = MinHeapTreeEntry(f kv.k kv.v)
                                   reheapify nkv nkv.k (HeapNode(kv,adjust' l,adjust' r,cnt))
    adjust' pq

Using the above module, the iSoE can be written with the wheel factorization optimizations and using efficient Co-Inductive Streams (CIS's) as follows:

type CIS<'T> = class val v:'T val cont:unit->CIS<'T> new(v,cont) = { v=v;cont=cont } end
type cullstate = struct val p:uint32 val wi:int new(cnd,cndwi) = { p=cnd;wi=cndwi } end
let primesPQWSE() =
  let WHLPRMS = [| 2u;3u;5u;7u |] in let FSTPRM = 11u in let WHLCRC = int (WHLPRMS |> Seq.fold (*) 1u) >>> 1
  let WHLLMT = int (WHLPRMS |> Seq.fold (fun o n->o*(n-1u)) 1u) - 1
  let WHLPTRN =
    let wp = Array.zeroCreate (WHLLMT+1)
    let gaps (a:int[]) = let rec gap i acc = if a.[i]=0 then gap (i+1) (acc+1uy) else acc
                         {0..WHLCRC-1} |> Seq.fold (fun s i->
                           let ns = if a.[i]<>0 then wp.[s]<-2uy*gap (i+1) 1uy;(s+1) else s in ns) 0 |> ignore
    Array.init (WHLCRC+1) (fun i->if WHLPRMS |> Seq.forall (fun p->(FSTPRM+uint32(i<<<1))%p<>0u)
                                  then 1 else 0) |> gaps;wp
  let inline whladv i = if i < WHLLMT then i + 1 else 0 in let advcnd c i = c + uint32 WHLPTRN.[i]
  let inline culladv c p i = let n = c + uint32 WHLPTRN.[i] * p in if n < c then 0xFFFFFFFFu else n
  let rec mkprm (n,wi,pq,(bps:CIS<_>),q) =
    let nxt = advcnd n wi in let nxti = whladv wi
    if nxt < n then (0u,0,(0xFFFFFFFFu,0,MinHeap.empty,bps,q))
    elif n>=q then let bp,bpi = bps.v in let nc,nci = culladv n bp bpi,whladv bpi
                    let nsd = bps.cont() in let np,_ = nsd.v in let sqr = if np>65535u then 0xFFFFFFFFu else np*np
                    mkprm (nxt,nxti,(MinHeap.insert nc (cullstate(bp,nci)) pq),nsd,sqr)
    else match MinHeap.getMin pq with | None -> (n,wi,(nxt,nxti,pq,bps,q))
                                      | Some kv -> let ca,cs = culladv kv.k kv.v.p kv.v.wi,cullstate(kv.v.p,whladv kv.v.wi)
                                                   if n>kv.k then mkprm (n,wi,(MinHeap.reinsertMinAs ca cs pq),bps,q)
                                                   elif n=kv.k then mkprm (nxt,nxti,(MinHeap.reinsertMinAs ca cs pq),bps,q)
                                                   else (n,wi,(nxt,nxti,pq,bps,q))
  let rec pCID p pi pq bps q = CIS((p,pi),fun()->let (np,npi,(nxt,nxti,npq,nbps,nq))=mkprm (advcnd p pi,whladv pi,pq,bps,q)
                                                 pCID np npi npq nbps nq)
  let rec baseprimes() = CIS((FSTPRM,0),fun()->let np=FSTPRM+uint32 WHLPTRN.[0]
                                               pCID np (whladv 0) MinHeap.empty (baseprimes()) (FSTPRM*FSTPRM))
  let genseq sd = Seq.unfold (fun (p,pi,pcc) ->if p=0u then None else Some(p,mkprm pcc)) sd
  seq { yield! WHLPRMS; yield! mkprm (FSTPRM,0,MinHeap.empty,baseprimes(),(FSTPRM*FSTPRM)) |> genseq }

The above code calculates the first 100,000 primes in about 0.077 seconds, the first 1,000,000 primes in 0.977 seconds, the first 10,000,000 primes in about 14.33 seconds, and the first 100,000,000 primes in about 221.87 seconds, all on an i7-2700K (3.5GHz) as 64-bit code. This purely functional code is slightly faster than that of Dustin Cambell's mutable Dictionary based code with the added common optimizations of wheel factorization, deferred adding of base primes, and use of the more efficient CID's all added (tryfsharp and ideone) but is still pure functional code where his using the Dictionary class is not. However, for larger prime ranges of about of about two billion (about 100 million primes), the code using the hash table based Dictionary will be faster as the Dictionary operations do not have a O(log n) factor and this gain overcomes the computational complexity of using Dictionary hash tables.

The above program has the further feature that the factorization wheel is parameterized so that, for instance, one can use a extremely large wheel by setting WHLPRMS to [| 2u;3u;5u;7u;11u;13u;17u;19u |] and FSTPRM to 23u to get a run time of about two thirds for large ranges at about 9.34 seconds for ten million primes, although note that it takes several seconds to compute the WHLPTRN before the program starts to run, which is a constant overhead no matter the prime range.

Comparative Analysis: As compared to the pure functional incremental tree folding implementation, this algorithm is just slightly faster because the average used height of the MinHeap tree is less by a factor of two than the depth of the folded tree but that is offset by an equivalent constant factor loss in efficiency in ability to traverse the PQ tree levels due to it being based on a binary heap requiring processing of both the right and left leaves for every heap level and a branch either way rather than a single comparison per level for the tree folding with generally the less deep branch the taken one. As compared to other PQ and Map based functional algorithms, improvements are generally by a constant factor in reducing the number of O(log n) operations in traversing each level of the respective tree structures.

The MinHeap is usually implemented as a mutable array binary heap after a genealogical tree based model invented by Michael Eytzinger over 400 years ago. I know the question said there was no interest in non-functional mutable code, but if one must avoid all sub code that uses mutability, then we couldn't use list's or LazyList's which use mutability "under the covers" for performance reasons. So imagine that the following alternate mutable version of the MinHeap PQ is as supplied by a library and enjoy another factor of over two for larger prime ranges in performance:

[<RequireQualifiedAccess>]
module MinHeap =

  type MinHeapTreeEntry<'T> = class val k:uint32 val v:'T new(k,v) = { k=k;v=v } end
  type MinHeapTree<'T> = ResizeArray<MinHeapTreeEntry<'T>>

  let empty<'T> = MinHeapTree<MinHeapTreeEntry<'T>>()

  let getMin (pq:MinHeapTree<_>) = if pq.Count > 0 then Some pq.[0] else None

  let insert k v (pq:MinHeapTree<_>) =
    if pq.Count = 0 then pq.Add(MinHeapTreeEntry(0xFFFFFFFFu,v)) //add an extra entry so there's always a right max node
    let mutable nxtlvl = pq.Count in let mutable lvl = nxtlvl <<< 1 //1 past index of value added times 2
    pq.Add(pq.[nxtlvl - 1]) //copy bottom entry then do bubble up while less than next level up
    while ((lvl <- lvl >>> 1); nxtlvl <- nxtlvl >>> 1; nxtlvl <> 0) do
      let t = pq.[nxtlvl - 1] in if t.k > k then pq.[lvl - 1] <- t else lvl <- lvl <<< 1; nxtlvl <- 0 //causes loop break
    pq.[lvl - 1] <-  MinHeapTreeEntry(k,v); pq

  let reinsertMinAs k v (pq:MinHeapTree<_>) = //do minify down for value to insert
    let mutable nxtlvl = 1 in let mutable lvl = nxtlvl in let cnt = pq.Count
    while (nxtlvl <- nxtlvl <<< 1; nxtlvl < cnt) do
      let lk = pq.[nxtlvl - 1].k in let rk = pq.[nxtlvl].k in let oldlvl = lvl
      let k = if k > lk then lvl <- nxtlvl; lk else k in if k > rk then nxtlvl <- nxtlvl + 1; lvl <- nxtlvl
      if lvl <> oldlvl then pq.[oldlvl - 1] <- pq.[lvl - 1] else nxtlvl <- cnt //causes loop break
    pq.[lvl - 1] <- MinHeapTreeEntry(k,v); pq

  let adjust f (pq:MinHeapTree<_>) = //adjust all the contents using the function, then re-heapify
    if pq <> null then 
      let cnt = pq.Count
      if cnt > 1 then
        for i = 0 to cnt - 2 do //change contents using function
          let e = pq.[i] in let k,v = e.k,e.v in pq.[i] <- MinHeapTreeEntry (f k v)
        for i = cnt/2 downto 1 do //rebuild by reheapify
          let kv = pq.[i - 1] in let k = kv.k
          let mutable nxtlvl = i in let mutable lvl = nxtlvl
          while (nxtlvl <- nxtlvl <<< 1; nxtlvl < cnt) do
            let lk = pq.[nxtlvl - 1].k in let rk = pq.[nxtlvl].k in let oldlvl = lvl
            let k = if k > lk then lvl <- nxtlvl; lk else k in if k > rk then nxtlvl <- nxtlvl + 1; lvl <- nxtlvl
            if lvl <> oldlvl then pq.[oldlvl - 1] <- pq.[lvl - 1] else nxtlvl <- cnt //causes loop break
          pq.[lvl - 1] <- kv
    pq

Geek note: I had actually expected the mutable version to offer a much better improved performance ratio, but it bogs down in the re-insertions due to the nested if-then-else code structure and the random behavior of the prime cull values meaning that the CPU branch prediction fails for a large proportion of the branches resulting in many additional 10's of CPU clock cycles per cull reduction to rebuilt the instruction pre-fetch cache.

The only other constant factor performance gains on this algorithm would be segmentation and use of multi-tasking for a performance gain proportional to the number of CPU cores; however, as it stands, this is the fastest pure functional SoE algorithm to date, and even the pure functional form using the functional MinHeap beats simplistic imperative implementations such as Jon Harrop's code or Johan Kullbom's Sieve of Atkin (which is in error in his timing as he only calculated the primes to 10 million rather than the 10 millionth prime), but those algorithms would be about five times faster if better optimizations were used. That ratio of about five between functional and imperative code will be somewhat reduced when we add multi-threading of larger wheel factorization as the computational complexity of the imperative code increases faster than the functional code and multi-threading helps the slower functional code more than the faster imperative code as the latter gets closer to the base limit of the time required to enumerate through the found primes.

EDIT_ADD: Even though one could elect to continue to use the pure functional version of MinHeap, adding efficient segmentation in preparation for multi-threading would slightly "break" the "pureness" of the functional code as follows: 1) The most efficient way of transferring a representation of composite-culled primes is a packed-bit array the size of the segment, 2) While the size of the array is known, using an array comprehension to initialize it in a functional way isn't efficient as it uses "ResizeArray" under the covers which needs to copy itself for every x additions (I think 'x' is eight for the current implementation) and using Array.init doesn't work as many values at particular indexes are skipped, 3) Therefore, the easiest way to fill the culled-composite array is to zeroCreate it of the correct size and then run an initialization function which could write to each mutable array index no more than once. Although this isn't strictly "functional", it is close in that the array is initialized and then never modified again.

The code with added segmentation, multi-threading, programmable wheel factorial circumference, and many performance tweaks is as follows (other than some added new constants, the extra tuned code to implement the segmentation and multi-threading is the bottom approximately half of the code starting at the "prmspg" function):

type prmsCIS = class val pg:uint16 val bg:uint16 val pi:int val cont:unit->prmsCIS
                     new(pg,bg,pi,nxtprmf) = { pg=pg;bg=bg;pi=pi;cont=nxtprmf } end
type cullstate = struct val p:uint32 val wi:int new(cnd,cndwi) = { p=cnd;wi=cndwi } end
let primesPQOWSE() =
  let WHLPRMS = [| 2u;3u;5u;7u;11u;13u;17u |] in let FSTPRM = 19u in let WHLCRC = int(WHLPRMS |> Seq.fold (*) 1u)
  let MXSTP = uint64(FSTPRM-1u) in let BFSZ = 1<<<11 in let NUMPRCS = System.Environment.ProcessorCount
  let WHLLMT = int (WHLPRMS |> Seq.fold (fun o n->o*(n-1u)) 1u) - 1 in let WHLPTRN = Array.zeroCreate (WHLLMT+1)
  let WHLRNDUP = let gaps (a:int[]) = let rec gap i acc = if a.[i]=0 then gap (i+1) (acc+1)
                                                          else acc in let b = a |> Array.scan (+) 0
                                      Array.init (WHLCRC>>>1) (fun i->
                                        if a.[i]=0 then 0 else let g=2*gap (i+1) 1 in WHLPTRN.[b.[i]]<-byte g;1)
                 Array.init WHLCRC (fun i->if WHLPRMS |> Seq.forall (fun p->(FSTPRM+uint32(i<<<1))%p<>0u) then 1 else 0)
                 |> gaps |> Array.scan (+) 0
  let WHLPOS = WHLPTRN |> Array.map (uint32) |> Array.scan (+) 0u in let advcnd cnd cndi = cnd + uint32 WHLPTRN.[cndi]
  let MINRNGSTP = if WHLLMT<=31 then uint32(32/(WHLLMT+1)*WHLCRC) else if WHLLMT=47 then uint32 WHLCRC<<<1 else uint32 WHLCRC
  let MINBFRNG = uint32((BFSZ<<<3)/(WHLLMT+1)*WHLCRC)/MINRNGSTP*MINRNGSTP
  let MINBFRNG = if MINBFRNG=0u then MINRNGSTP else MINBFRNG
  let inline whladv i = if i < WHLLMT then i+1 else 0 in let inline culladv c p i = c+uint32 WHLPTRN.[i]*p
  let rec mkprm (n,wi,pq,(bps:prmsCIS),q,lstp,bgap) =
    let nxt,nxti = advcnd n wi,whladv wi
    if n>=q then let p = (uint32 bps.bg<<<16)+uint32 bps.pg
                 let nbps,nxtcmpst,npi = bps.cont(),culladv n p bps.pi,whladv bps.pi
                 let pg = uint32 nbps.pg in let np = p+pg in let sqr = q+pg*((p<<<1)+pg) //only works to p < about 13 million
                 let nbps = prmsCIS(uint16 np,uint16(np>>>16),nbps.pi,nbps.cont) //therefore, algorithm only works to p^2 or about
                 mkprm (nxt,nxti,(MinHeap.insert nxtcmpst (cullstate(p,npi)) pq),nbps,sqr,lstp,(bgap+1us)) //1.7 * 10^14
    else match MinHeap.getMin pq with 
           | None -> (uint16(n-uint32 lstp),bgap,wi,(nxt,nxti,pq,bps,q,n,1us)) //fix with q is uint64
           | Some kv -> let ca,cs = culladv kv.k kv.v.p kv.v.wi,cullstate(kv.v.p,whladv kv.v.wi)
                        if n>kv.k then mkprm (n,wi,(MinHeap.reinsertMinAs ca cs pq),bps,q,lstp,bgap)
                        elif n=kv.k then mkprm (nxt,nxti,(MinHeap.reinsertMinAs ca cs pq),bps,q,lstp,(bgap+1us))
                        else (uint16(n-uint32 lstp),bgap,wi,(nxt,nxti,pq,bps,q,n,1us))
  let rec pCIS p pg bg pi pq bps q = prmsCIS(pg,bg,pi,fun()->
    let (npg,nbg,npi,(nxt,nxti,npq,nbps,nq,nl,ng))=mkprm (p+uint32 WHLPTRN.[pi],whladv pi,pq,bps,q,p,0us)
    pCIS (p+uint32 npg) npg nbg npi npq nbps nq)
  let rec baseprimes() = prmsCIS(uint16 FSTPRM,0us,0,fun()->
                           let np,npi=advcnd FSTPRM 0,whladv 0
                           pCIS np (uint16 WHLPTRN.[0]) 1us npi MinHeap.empty (baseprimes()) (FSTPRM*FSTPRM))
  let prmspg nxt adj pq bp q =
    //compute next buffer size rounded up to next even wheel circle so at least one each base prime hits the page
    let rng = max (((uint32(MXSTP+uint64(sqrt (float (MXSTP*(MXSTP+4UL*nxt))))+1UL)>>>1)+MINRNGSTP)/MINRNGSTP*MINRNGSTP) MINBFRNG
    let nxtp() = async {
      let rec addprms pqx (bpx:prmsCIS) qx = 
        if qx>=adj then pqx,bpx,qx //add primes to queue for new lower limit
        else let p = (uint32 bpx.bg<<<16)+uint32 bpx.pg in let nbps = bpx.cont()
             let pg = uint32 nbps.pg in let np = p+pg in let sqr = qx+pg*((p<<<1)+pg)
             let nbps = prmsCIS(uint16 np,uint16(np>>>16),nbps.pi,nbps.cont)
             addprms (MinHeap.insert qx (cullstate(p,bpx.pi)) pqx) nbps sqr
      let adjcinpg low k (v:cullstate) = //adjust the cull states for the new page low value
        let p = v.p in let WHLSPN = int64 WHLCRC*int64 p in let db = int64 p*int64 WHLPOS.[v.wi]
        let db = if k<low then let nk = int64(low-k)+db in nk-((nk/WHLSPN)*WHLSPN)
                 else let nk = int64(k-low) in if db<nk then db+WHLSPN-nk else db-nk
        let r = WHLRNDUP.[int((((db>>>1)%(WHLSPN>>>1))+int64 p-1L)/int64 p)] in let x = int64 WHLPOS.[r]*int64 p
        let r = if r>WHLLMT then 0 else r in let x = if x<db then x+WHLSPN-db else x-db in uint32 x,cullstate(p,r)
      let bfbtsz = int rng/WHLCRC*(WHLLMT+1) in let nbuf = Array.zeroCreate (bfbtsz>>>5)
      let rec nxtp' wi cnt = let _,nbg,_,ncnt = mkprm cnt in let nwi = wi + int nbg
                             if nwi < bfbtsz then nbuf.[nwi>>>5] <- nbuf.[nwi>>>5] ||| (1u<<<(nwi&&&0x1F)); nxtp' nwi ncnt
                             else let _,_,pq,bp,q,_,_ = ncnt in nbuf,pq,bp,q //results incl buf and cont parms for next page
      let npq,nbp,nq = addprms pq bp q
      return nxtp' 0 (0u,0,MinHeap.adjust (adjcinpg adj) npq,nbp,nq-adj,0u,0us) }
    rng,nxtp() |> Async.StartAsTask
  let nxtpg nxt (cont:(_*System.Threading.Tasks.Task<_>)[]) = //(len,pq,bp,q) =
    let adj = (cont |> Seq.fold (fun s (r,_)  -> s+r) 0u)
    let _,tsk = cont.[0] in let _,pq,bp,q = tsk.Result
    let ncont = Array.init (NUMPRCS+1) (fun i -> if i<NUMPRCS then cont.[i+1]
                                                 else prmspg (nxt+uint64 adj) adj pq bp q)
    let _,tsk = ncont.[0] in let nbuf,_,_,_ = tsk.Result in nbuf,ncont
  //init cond buf[0], no queue, frst bp sqr offset
  let initcond = 0u,System.Threading.Tasks.Task.Factory.StartNew (fun()->
                   (Array.empty,MinHeap.empty,baseprimes(),FSTPRM*FSTPRM-FSTPRM))
  let nxtcond n = prmspg (uint64 n) (n-FSTPRM) MinHeap.empty (baseprimes()) (FSTPRM*FSTPRM-FSTPRM)
  let initcont = Seq.unfold (fun (n,((r,_)as v))->Some(v,(n+r,nxtcond (n+r)))) (FSTPRM,initcond)
                 |> Seq.take (NUMPRCS+1) |> Seq.toArray
  let rec nxtprm (c,ci,i,buf:uint32[],cont) =
    let rec nxtprm' c ci i =
      let nc = c + uint64 WHLPTRN.[ci] in let nci = whladv ci in let ni = i + 1 in let nw = ni>>>5
      if nw >= buf.Length then let (npg,ncont)=nxtpg nc cont in nxtprm (c,ci,-1,npg,ncont)
      elif (buf.[nw] &&& (1u <<< (ni &&& 0x1F))) = 0u then nxtprm' nc nci ni
      else nc,nci,ni,buf,cont
    nxtprm' c ci i
  seq { yield! WHLPRMS |> Seq.map (uint64);
        yield! Seq.unfold (fun ((c,_,_,_,_) as cont)->Some(c,nxtprm cont))
                 (nxtprm (uint64 FSTPRM-uint64 WHLPTRN.[WHLLMT],WHLLMT,-1,Array.empty,initcont)) }

Note that the MinHeap modules, both functional and array-based, have had an "adjust" function added to permit adjusting of the cull state of each thread's version of the PQ at the beginning of every new segment page. Also note that it was possible to adjust the code so that most of the computation is done using 32 bit ranges with the final sequence output as uint64's at little cost in computational time so that currently the theoretical range is something over 100 trillion (ten raised to the fourteen power) if one were willing to wait the about three to four months required to compute that range. The numeric range checks were removed as it is unlikely that anyone would use this algorithm to compute up to that range let alone past it.

Using the pure functional MinHeap and 2,3,5,7 wheel factorization, the above program computes the first hundred thousand, one million, ten million, and a hundred million primes in 0.062, 0.629, 10.53, and 195.62 seconds, respectively. Using the array-based MinHeap speeds this up to 0.097, 0.276, 3.48, and 51.60 seconds, respectively. Using the 2,3,5,7,11,13,17 wheel by changing WHLPRMS to "[| 2u;3u;5u;7u;11u;13u;17u |]" and FSTPRM to 19u speeds that up yet a little more to 0.181, 0.308, 2.49, and 36.58 seconds, respectively (for constant factor improvement with a constant overhead). This fastest tweak calculates the 203,280,221 primes in the 32-bit number range in about 88.37 seconds. The "BFSZ" constant can be adjusted with trade-offs between slower times for smaller ranges version faster times for larger ranges, with a value of "1<<<14" recommended to be tried for the larger ranges. This constant only sets the minimum buffer size, with the program adjusting the buffer size above that size automatically for larger ranges such that the buffer is sufficient so that the largest base prime required for the page range will always "strike" each page at least once; this means that the complexity and overhead of an additional "bucket sieve" is not required. This last fully optimized version can compute the primes up to 10 and 100 billion in about 256.8 and 3617.4 seconds (just over an hour for the 100 billion) as tested using "primesPQOWSE() |> Seq.takeWhile ((>=)100000000000UL) |> Seq.fold (fun s p -> s + 1UL) 0UL" for output. This is where the estimates of about half a day for the count of primes to a trillion, a week for up to ten trillion and about three to four months for up to a hundred trillion come from.

I don't think it's possible to make functional or almost functional code using the incremental SoE algorithm to run much faster than this. As one can see in looking at the code, optimizing the basic incremental algorithm has added greatly to the code complexity such that it is likely slightly more complex than equivalently optimized code based on straight array culling with that code able to run approximately ten times faster than this and without the extra exponent in the performance meaning that this functional incremental code has an ever increasing extra percentage overhead.

So is this useful other than from an interesting theoretical and intellectual viewpoint? Probably it's not. For smaller ranges of primes up to about ten million, the best of the basic not fully optimized incremental functional SoE's are probably adequate and quite simple to write or have less RAM memory use than the simplest imperative SoE's. However, they are much slower than more imperative code using an array so they "run out of steam" for ranges above that. While it has been demonstrated here that the code can be sped up by optimization, it is still 10's of times slower than a more imperative pure array-based version yet has added to the complexity to be at least as complex as that code with equivalent optimizations, and even that code under F# on DotNet is about four times slower than using a language such as C++ compiled directly to native code; if one really wanted to investigate large ranges of primes, one would likely use one of those other languages and techniques where primesieve can calculate the number of primes in the hundred trillion range in under four hours instead of the about three months required for this code. END_EDIT_ADD

Varicella answered 26/9, 2013 at 7:53 Comment(5)
Your code doesn't compile for me. The field, constructor or member 'pi' is not defined (using external F# compiler) - share.linqpad.net/e6imko.linqLongley
@Maslow, Fixed both codes just now: top code was missing the cullstate type, both codes erroneously referred to pi instead of the correct wi field in that struct type. Please excuse the coding style as this was written when I first started using F#; having since graduated into further functional programming languages,, I would not likely write it the same today, but it still is true to the Haskell incremental Sieve of Eratosthenes principle as per the O'Neil reference article.Varicella
@Maslow, Note comment that purely functional incremental sieves can't match the speed of array mutating page segmented sieve as in another of my answers; I have written a maximally wheel factorized, multi-threaded, page-segmented, Sieve of Eratosthenes in F# that finds the number of primes to a billion in under 300 milliseconds on an Intel Atom i5-Z8350 processor @ 1.44 GHz (4 core), making it much faster than any Sieve of Atkin implementation in any language and within about a factor of two of Kim Walisch's "primesieve" written in C++, slowed due to JIT compilation and array bounds checking.Varicella
I would think( I guess in the future based on your findings) that the level of parallelism functional programming can achieve, would eventually win out. Also thanks for writing it.Longley
@Maslow, the fast implementation I mentioned is functional as I originally wrote it in Haskell, just in Haskell the array mutations are locked away behind the ST Monad whereas in F# one has to use self discipline to achieve the same result. Non array versions such as this can never be as fast due to the high overhead in processing the priority queue and the extra "log n" multiplier term accessing it. Sieves such as this or (simpler as one doesn't need to implement the priority queue) the tree folding version are really only moderately useful up to ranges of a million or so. You're welcome.Varicella
V
6

Here is a pretty much maximally optimized as to algorithm incremental (and recursive) map based Sieve of Eratosthenes using sequences since there is no need for memoization of previous sequence values (other than there is a slight advantage to caching the base prime values using Seq.cache), with the major optimizations being that it uses wheel factorization for the input sequence and that it uses multiple (recursive) streams to maintain the base primes which are less than the square root of the latest number being sieved, as follows:

  let primesMPWSE =
    let whlptrn = [| 2;4;2;4;6;2;6;4;2;4;6;6;2;6;4;2;6;4;6;8;4;2;4;2;
                     4;8;6;4;6;2;4;6;2;6;6;4;2;4;6;2;6;4;2;4;2;10;2;10 |]
    let adv i = if i < 47 then i + 1 else 0
    let reinsert oldcmpst mp (prime,pi) =
      let cmpst = oldcmpst + whlptrn.[pi] * prime
      match Map.tryFind cmpst mp with
        | None -> mp |> Map.add cmpst [(prime,adv pi)]
        | Some(facts) -> mp |> Map.add cmpst ((prime,adv pi)::facts)
    let rec mkprimes (n,i) m ps q =
      let nxt = n + whlptrn.[i]
      match Map.tryFind n m with
        | None -> if n < q then seq { yield (n,i); yield! mkprimes (nxt,adv i) m ps q }
                  else let (np,npi),nlst = Seq.head ps,ps |> Seq.skip 1
                       let (nhd,ni),nxtcmpst = Seq.head nlst,n + whlptrn.[npi] * np
                       mkprimes (nxt,adv i) (Map.add nxtcmpst [(np,adv npi)] m) nlst (nhd * nhd)
        | Some(skips) -> let adjmap = skips |> List.fold (reinsert n) (m |> Map.remove n)
                         mkprimes (nxt,adv i) adjmap ps q
    let rec prs = seq {yield (11,0); yield! mkprimes (13,1) Map.empty prs 121 } |> Seq.cache
    seq { yield 2; yield 3; yield 5; yield 7; yield! mkprimes (11,0) Map.empty prs 121 |> Seq.map (fun (p,i) -> p) }

It finds the 100,000th primes up to 1,299,721 in about a 0.445 second, but not being a proper imperative EoS algorithm it doesn't scale near linearly with increased numbers of primes, takes 7.775 seconds to find the 1,000,000 prime up to 15,485,867 for a performance over this range of about O(n^1.2) where n is the maximum prime found.

There is a bit more tuning that could be done, but it probably isn't going to make much of a difference as to a large percentage in better performance as follows:

  1. As the F# sequence library is markedly slow, one could use an self defined type that implements IEnumerable to reduce the time spent in the inner sequence, but as the sequence operations only take about 20% of to overall time, even if these were reduced to zero time the result would only be a reduction to 80% of the time.

  2. Other forms of map storage could be tried such as a priority queue as mentioned by O'Neil or the SkewBinomialHeap as used by @gradbot, but at least for the SkewBinomialHeap, the improvement in performance is only a few percent. It seems that in choosing different map implementations, one is just trading better response in finding and removing items that are near the beginning of the list against time spent in inserting new entries in order to enjoy those benefits so the net gain is pretty much a wash and still has a O(log n) performance with increasing entries in the map. The above optimization using multi streams of entries just to the square root reduce the number of entries in the map and thus make those improvements of not much importance.

EDIT_ADD: I did do the little extra bit of optimization and the performance did improve somewhat more than expected, likely due to the improved way of eliminating the Seq.skip as a way of advancing through the base primes sequence. This optimization uses a replacement for the inner sequence generation as a tuple of integer value and a continuation function used to advance to the next value in the sequence, with the final F# sequence generated by an overall unfold operation. Code is as follows:

type SeqDesc<'a> = SeqDesc of 'a * (unit -> SeqDesc<'a>) //a self referring tuple type
let primesMPWSE =
  let whlptrn = [| 2;4;2;4;6;2;6;4;2;4;6;6;2;6;4;2;6;4;6;8;4;2;4;2;
                   4;8;6;4;6;2;4;6;2;6;6;4;2;4;6;2;6;4;2;4;2;10;2;10 |]
  let inline adv i = if i < 47 then i + 1 else 0
  let reinsert oldcmpst mp (prime,pi) =
    let cmpst = oldcmpst + whlptrn.[pi] * prime
    match Map.tryFind cmpst mp with
      | None -> mp |> Map.add cmpst [(prime,adv pi)]
      | Some(facts) -> mp |> Map.add cmpst ((prime,adv pi)::facts)
  let rec mkprimes (n,i) m (SeqDesc((np,npi),nsdf) as psd) q =
    let nxt = n + whlptrn.[i]
    match Map.tryFind n m with
      | None -> if n < q then SeqDesc((n,i),fun() -> mkprimes (nxt,adv i) m psd q)
                else let (SeqDesc((nhd,x),ntl) as nsd),nxtcmpst = nsdf(),n + whlptrn.[npi] * np
                     mkprimes (nxt,adv i) (Map.add nxtcmpst [(np,adv npi)] m) nsd (nhd * nhd)
      | Some(skips) -> let adjdmap = skips |> List.fold (reinsert n) (m |> Map.remove n)
                       mkprimes (nxt,adv i) adjdmap psd q
  let rec prs = SeqDesc((11,0),fun() -> mkprimes (13,1) Map.empty prs 121 )
  let genseq sd = Seq.unfold (fun (SeqDesc((n,i),tailfunc)) -> Some(n,tailfunc())) sd
  seq { yield 2; yield 3; yield 5; yield 7; yield! mkprimes (11,0) Map.empty prs 121 |> genseq }

The times required to find the 100,000th and 1,000,000th primes are about 0.31 and 5.1 seconds, respectively, so there is a considerable percentage gain for this small change. I did try my own implementation of the IEnumerable/IEnumerator interfaces that are the base of sequences, and although they are faster than the versions used by the Seq module they hardly make any further difference to this algorithm where most of the time is spent in the Map functions. END_EDIT_ADD

Other than map based incremental EoS implementations, there is another "pure functional" implementation using Tree Folding which is said to be slightly faster, but as it still has a O(log n) term in the tree folding I suspect that it will mainly be faster (if it is) due to how the algorithm is implemented as to numbers of computer operations as compared to using a map. If people are interested I will develop that version as well.

In the end, one must accept that no pure functional implementation of the incremental EoS will ever come close to the raw processing speed of a good imperative implementation for larger numerical ranges. However, one could come up with an approach where all the code is purely functional except for the segmented sieving of composite numbers over a range using a (mutable) array which would come close to O(n) performance and in practical use would be fifty to a hundred times faster than functional algorithms for large ranges such as the first 200,000,000 primes. This has been done by @Jon Harrop in his blog, but this could be tuned further with very little additional code.

Varicella answered 1/7, 2013 at 10:2 Comment(11)
5.42 secs for 1 mln primes is a bit slow. perhaps you could compare the performance for this 2357-wheel version with same code on odds only (wheel = [2]). I've seen 2.5x ratio between the two (TME/TMWE entries in the table at the bottom). If your ratio is significantly different it'd mean you don't handle the wheel optimally. Worth a check. (I can't see it right away from just reading your code; I've never used F#). :) -- Is this something you're well aware off? (again, I don't know F#, so can't evaluate this myself).Disentail
@WillNess, I did remove the wheel from the above algorithm and the ratio is actually a bit larger than 2.5 in favour of the wheel at 2.87 times better for the first million primes found. I think the speed is just related to F# as it is written in itself and closures as are used here require building new heap references to hold the closure states; this is as compared to GHC Haskell which uses C++ as the underneath implementation so can be much more efficient.Varicella
@WillNess, cont'd: The ratios between F# and Haskell are even worse on Ideone.com where your primesTMWE takes about 0.13 seconds to calculate the first 100,000 primes whereas my latest primesPMWSE takes about 1.2 seconds or almost ten times slower! I think this is because the Ideone server is Linux based and running GHC Haskell, which is quite efficient, where they are running mono-project F# version 2, with noted poor JIT compilation and garbage collection. The ratio is likely only about three times on a Windows box using DotNet.Varicella
thanks, the picture is now clear; your code does handle the wheel optimally. Will study it some more; for now I note you roll out your own coinductive streams, by the book. Nice. :) Re "thunks" etc., I think that's what (non-compiled) Haskell does too. -- I was never interested in getting the speed functionally, only in finding ways to get close to optimal complexity with a simplest code possible. For speed, of course, C/C++ (like primesieve does).Disentail
Of course a SufficientlySmartCompiler in any language should produce an extremely fast assembler code from any very-high-level description. Like, in English, ultimately. :) -- 10% speedup to your code: ideone.com/ZL2EyX (just moved out the constant functions from mkprimes to an outer scope above it).Disentail
@WillNess, thanks for the look at my code and the tweak; bad habit of mine to use scope to enclose requisites, forgetting that let bindings will be re-evaluated on every recursive call to mkprimes - I'm still fairly new to F# and functional programming. Interesting that I re-invented coinductive streams; it just seemed a good idea at the time given that sequences in F# aren't very efficient and I didn't need the built-in memoization of LazyList's given that they are somewhat built on top of sequences and thus are also somewhat inefficientVaricella
@WillNess, as to rolling out streams, even that should be minimized for maximum speed as each continuation is a function call taking about 30 clock cycles round trip. That isn't too important for these algorithms where most of the time is spent in the Map operations (or in Tree Merging), but for a segmented buffer approach it would quickly build up. For instance, for the sieving of the 32-bit number range for over 200 million primes, just this unrolling would take about a second and a half. That is why I use the array look up method for the wheel in order to avoid this.Varicella
@Will Ness, I've converted your Haskell Tree Merging code as listed here as an alternate answer to this question as another answer. Even optimizing as much as I could, it isn't much faster than this map based version.Varicella
very interesting. In Haskell too, the postponed map-based version is a bit slower but runs at the same complexity as tree version. It's just that in Haskell, tree version code (w/out the wheel stuff) is so much shorter and clearer than map's. :) It even allows for the ultimate one-liner (almost) formulation, 2 : _Y ( (3:) . gaps 5 . unionAll . map (\p->[p*p, p*p+2*p..]) ) where _Y g = g (_Y g) (sans the definitions for gaps and unionAll).Disentail
@WillNess, yes, I really liked the version using the "non-sharing fixpoint combinator for telescoping multistage primes production", although I'm still getting my head around how it works. I could implement that in a F# version as well, although it would be likely to be a little more verbose. I still have the version's using pattern matching rather than structs, and that is somewhat less verbose but still not as concise as Haskell.Varicella
Nothing too special about it, just a simple straightforward recursion - there's no self-reference in data (no loops, like here), so it creates as many separate stages as necessary, increasing/diminishing by squares/sqroots. Here it is in Python, doing exactly the same thing.Disentail
B
5

Here's my attempt at a reasonably faithful translation of the Haskell code to F#:

#r "FSharp.PowerPack"

module Map =
  let insertWith f k v m =
    let v = if Map.containsKey k m then f m.[k] v else v
    Map.add k v m

let sieve =
  let rec sieve' map = function
  | LazyList.Nil -> Seq.empty
  | LazyList.Cons(x,xs) -> 
      if Map.containsKey x map then
        let facts = map.[x]
        let map = Map.remove x map
        let reinsert m p = Map.insertWith (@) (x+p) [p] m
        sieve' (List.fold reinsert map facts) xs
      else
        seq {
          yield x
          yield! sieve' (Map.add (x*x) [x] map) xs
        }
  fun s -> sieve' Map.empty (LazyList.ofSeq s)

let rec upFrom i =
  seq {
    yield i
    yield! upFrom (i+1)
  }

let primes = sieve (upFrom 2)
Barnyard answered 7/1, 2011 at 21:27 Comment(11)
This actually takes longer than the algorithm I posted. For sieving the first 100 000 naturals my algorithm takes about 8 seconds, while this takes over 9 seconds on my machine. I didn't actually time the Haskell solution (having trouble getting it to even run), but this seems pretty slow. Could it be the implementation of LazyList, which seems to be using locks to avoid side effects?Eight
@Eight - Hmm... on my machine PseudoSieve [2..100000] takes about 2 seconds, while sieve [2..100000] |> List.ofSeq takes around 0.5 seconds. If you're only going to be sieving a finite sequence, then using a list rather than a LazyList will probably give a performance improvement. However, with a finite list, you could also just use a mutable array as in the classical algorithm, which should be faster still.Barnyard
Also note that the paper that you cited also provides a faster algorithm based on a priority queue, which could be implemented in F# too (with some effort).Barnyard
Implementations of F# priority queues can be found in the question https://mcmap.net/q/138884/-f-priority-queue/336455Mannos
Okay, nevermind, I was being stupid. I was actually making your program find the 100000 th prime :). Mine still takes 8 seconds, yours is 0.5 seconds indeed. +1. I know about the priority queue implementation, but I just wanted to understand this one first.Eight
FYI - the sieve @Barnyard linked is the fastest F# infinite sieve I've yet to see.Zee
This also seems to give the wrong answers: Seq.nth 100000 primes should be the 100,001th prime which is 1,299,721 but your program gives the 100,006th prime 1,299,817 (same wrong result as gradbot's earlier answer!).Prothalamium
Ok, if I convert it to 64-bit arithmetic then it works correctly but why?! Ah, you're adding squares to the Map. That doesn't smell like the real sieve...Prothalamium
@Jon - see the paper cited in the question. My code is a fairly direct translation of sieve on page 6. In the standard imperative sieve of Eratosthenes this optimization is just as applicable - when discovering a new prime p there's no need to cross off multiples less than p*p since these will each have a smaller prime factor and will be crossed off when going through that prime's multiples.Barnyard
stackoverflow.com/search?q=user%3A849891++postpone (and this) :) -- @Barnyard you're both right and the code on pg 6 is wrong/lacking: yes you can/should add starting from p*p, but only after p*p (not p!) is seen as a candidate number! :) The author of that article has corrected this in the accompanying ZIP file with the source code.Disentail
@WillNess - good point, though implementing that change requires threading a queue of primes through the function along with the existing inputs.Barnyard
S
5

Prime sieve implemented with mailbox processors:

let (<--) (mb : MailboxProcessor<'a>) (message : 'a) = mb.Post(message)
let (<-->) (mb : MailboxProcessor<'a>) (f : AsyncReplyChannel<'b> -> 'a) = mb.PostAndAsyncReply f

type 'a seqMsg =  
    | Next of AsyncReplyChannel<'a>   

type PrimeSieve() =   
    let counter(init) =   
        MailboxProcessor.Start(fun inbox ->   
            let rec loop n =   
                async { let! msg = inbox.Receive()   
                        match msg with
                        | Next(reply) ->   
                            reply.Reply(n)   
                            return! loop(n + 1) }   
            loop init)   

    let filter(c : MailboxProcessor<'a seqMsg>, pred) =   
        MailboxProcessor.Start(fun inbox ->   
            let rec loop() =   
                async {   
                    let! msg = inbox.Receive()   
                    match msg with
                    | Next(reply) ->
                        let rec filter prime =
                            if pred prime then async { return prime }
                            else async {
                                let! next = c <--> Next
                                return! filter next }
                        let! next = c <--> Next
                        let! prime = filter next
                        reply.Reply(prime)
                        return! loop()   
                }   
            loop()   
        )   

    let processor = MailboxProcessor.Start(fun inbox ->   
        let rec loop (oldFilter : MailboxProcessor<int seqMsg>) prime =   
            async {   
                let! msg = inbox.Receive()   
                match msg with
                | Next(reply) ->   
                    reply.Reply(prime)   
                    let newFilter = filter(oldFilter, (fun x -> x % prime <> 0))   
                    let! newPrime = oldFilter <--> Next
                    return! loop newFilter newPrime   
            }   
        loop (counter(3)) 2)   

    member this.Next() = processor.PostAndReply( (fun reply -> Next(reply)), timeout = 2000)

    static member upto max =
        let p = PrimeSieve()
        Seq.initInfinite (fun _ -> p.Next())
        |> Seq.takeWhile (fun prime -> prime <= max)
        |> Seq.toList
Siward answered 9/1, 2011 at 5:29 Comment(1)
Probably not very fast, but makes in up sheer awesomeness.Siward
R
3

Here is my two cents, though I am not sure it meets the OP's criterion for truely being the sieve of eratosthenes. It doesn't utilize modular division and implements an optimization from the paper cited by the OP. It only works for finite lists, but that seems to me to be in the spirit of how the sieve was originally described. As an aside, the paper the talks about complexiety in terms of the number of markings and the number of divisions. Seems that, as we have to traverse a linked list, that this perhaps ignoring some key aspects of the various algorithms in performance terms. In general though modular division with computers is an expensive operation.

open System

let rec sieve list =
    let rec helper list2 prime next =
        match list2 with
            | number::tail -> 
                if number< next then
                    number::helper tail prime next
                else
                    if number = next then 
                        helper tail prime (next+prime)
                    else
                        helper (number::tail) prime (next+prime)

            | []->[]
    match list with
        | head::tail->
            head::sieve (helper tail head (head*head))
        | []->[]

let step1=sieve [2..100]

EDIT: fixed an error in the code from my original post. I tried the follow the original logic of the sieve with a few modifications. Namely start with the first item and cross off the multiples of that item from the set. This algorithm literally looks for the next item that is a multiple of the prime instead of doing modular division on every number in the set. An optimization from the paper is that it starts looking for multiples of the prime greater than p^2.

The part in the helper function with the multi-level deals with the possibility that the next multiple of the prime might already be removed from the list. So for instance with the prime 5, it will try to remove the number 30, but it will never find it because it was already removed by the prime 3. Hope that clarifies the algorithm's logic.

Remains answered 7/1, 2011 at 23:32 Comment(3)
sieve [1..10000] takes about 2 second, while it's instant with my algorithm and @kvb's. Could you explain the logic behind the algorithm a bit?Eight
+1, That seems to be faster than the previous. However, I get a stack overflow exception if I try to sieve [2..100000].Eight
@Eight it should be possible to achieve substantial speedup by adding top_limit as another parameter to the sieve function, and have it test whether top_limit/head < head, and if so, just return head::tail. Detailed discussion (in Haskell) is here. Also, working with [3..2..100] and calling helper with (2*head) as a step value will help (though this one will only double your speed). :) Cheers,Disentail
S
3

For what its worth, this isn't a sieve of Erathothenes, but its very fast:

let is_prime n =
    let maxFactor = int64(sqrt(float n))
    let rec loop testPrime tog =
        if testPrime > maxFactor then true
        elif n % testPrime = 0L then false
        else loop (testPrime + tog) (6L - tog)
    if n = 2L || n = 3L || n = 5L then true
    elif n <= 1L || n % 2L = 0L || n % 3L = 0L || n % 5L = 0L then false
    else loop 7L 4L
let primes =
    seq {
        yield 2L;
        yield 3L;
        yield 5L;
        yield! (7L, 4L) |> Seq.unfold (fun (p, tog) -> Some(p, (p + tog, 6L - tog)))
    }
    |> Seq.filter is_prime

It finds the 100,000th prime in 1.25 seconds on my machine (AMD Phenom II, 3.2GHZ quadcore).

Siward answered 9/1, 2011 at 5:45 Comment(11)
That's 500x slower than the sieve of E that I blogged.Prothalamium
@JonHarrop interesting read. :) (couldn't comment there... so if I may, to comment here...) you enlarge the array there by doubling the top prime - but you could be squaring it. Optimal compxty is usually seen as n^1.2, in n primes produced. Your data shows n^1.4.Disentail
@Will, for reference, the link to Jon's work is: Mutable Arrays SoE. Jon's program misses maximum efficiency in several respects: 1) it adds all of the found primes to the output list rather than just the base root primes, 2) it uses new heap sieve buffers for every new prime rather than one fixed size buffer which should be limited to the CPU L1 cache size to avoid cache thrashing on marking composites, and 3) it doesn't use a wheel or even only odd numbers. The current implementation isn't that fast for large ranges.Varicella
@Will, the reason that Jon's sieve shows the poor O(n^1.4) performance for larger number ranges is actually twofold: 1) the cache thrashing on marking composites once the required buffer gets bigger than the CPU L1 cache size and 2) that new (very large) arrays are created on the heap for every grow() operation. Although the number of grow operations would be reduced by making the arrays the square of the last found prime, this would make the cache thrashing even worse). An algorithm fixing these would be a very concise segmented array EoS sieve with close to O(nloglogn) performance.Varicella
@Will, the advantage in using a mutable array(s) is that, unlike purely functional code where every step depends on preceding steps, the buffer setting and culling could easily be implemented by thread pools in async workflows so that (given enough cores), the main limiting factor to speed would be the enumeration through the found primes sequence as the thread pools would be delivering pre-sieved arrays well in advance of being required by the enumerator. Using these techniques in F#, the sequence of all primes in the 32-bit range (200 million plus) could be produced in about 10 seconds.Varicella
@Varicella fixed wheel gives only constant factor improvement; the algorithmic problem is doing too much work on each step - sieving [n,2n] range by primes below n instead of by primes below sqrt(2n). ~n^1.4 is not so poor, :) it's just above optimal. One fixed size buffer, if size is small, also means algorithmic degeneration as n increases. Optimal complexity is about not doing work for nothing. Every prime below sqrt(limit) must be consulted, but some will not contribute any multiples on small sized segment. Look up "prime buckets Silva". :)Disentail
@Will, yes, I've read Silva, but with sufficient buffer, one can efficiently use EoS to sieve up to about 10 billion or so, as follows: With a CPU L1 cache size of say 32 KBytes can be used to represent a range of about a million numbers with wheel packing, so there will be an average of at least one hit for the largest base root primes up to about a third of this, meaning we can sieve up to the above range. Above that one has to use the bucket sieve. However, up to that limit of about 10^10 or so, one gets close to O(nloglogn) performance. EoS has it's limits at about that range anyway.Varicella
@Will, it depends on one's needs as to primes: calculating primes up to the billions doesn't suit pure functional programs with their O(n^1.2-1.4) performance as it will take hours or days where it can take just seconds with more imperative segmented techniques. The bucket sieve advances a few more orders of magnitude. Further advances use numerical techniques as those developed by de Silva. I think for our purposes, calculating all of the primes in the 32-bit number range (all 203,280,221 of them) in a few seconds is adequate; the pure functional techniques discussed will never do that.Varicella
@Varicella ah, very interesting, thanks a lot for this detailed and practical discussion! :) I was just theorizing. :) :) Thanks!Disentail
@Will, just for completeness, even if we are able to get a version of map-based or tree folding pure functional sieves down to one second for primes up to 16 million, and even if the performance continues to be O(n^1.2) over this range, it will take about *13 minutes to calculate the 32-bit range of 203,280,221 primes, whereas it can likely be done in about 10 seconds on a multi-core desktop computer using segmented buffers. Even if one uses functional techniques for the bucket sieves from that point forward, we have gained from the base primes performance.Varicella
@WillNess, for further clarification of the concepts of using mutable arrays as the only "non-pure" item in the array, please see my recent new answer to this question. My latest answer goes much further than John Harrop's code discussed here and fixes the main objection one might have with it as in excessive memory use, while actually advancing performance.Varicella
Z
3

I know you explicitly stated that you were interested in a purely functional sieve implementation so I held off presenting my sieve until now. But upon re-reading the paper you referenced, I see the incremental sieve algorithm presented there is essentially the same as my own, the only difference being implementation details of using purely functional techniques versus decidedly imperative techniques. So I think I at least half-qualify in satisfying your curiosity. Moreover, I would argue that using imperative techniques when significant performance gains can be realized but hidden away by functional interfaces is one of the most powerful techniques encouraged in F# programming, as opposed to the everything pure Haskell culture. I first published this implementation on my Project Euler for F#un blog but re-publish here with pre-requisite code substituted back in and structural typing removed. primes can calculate the first 100,000 primes in 0.248 seconds and the first 1,000,000 primes in 4.8 seconds on my computer (note that primes caches its results so you'll need to re-evaluate it each time you perform a benchmark).

let inline infiniteRange start skip = 
    seq {
        let n = ref start
        while true do
            yield n.contents
            n.contents <- n.contents + skip
    }

///p is "prime", s=p*p, c is "multiplier", m=c*p
type SievePrime<'a> = {mutable c:'a ; p:'a ; mutable m:'a ; s:'a}

///A cached, infinite sequence of primes
let primes =
    let primeList = ResizeArray<_>()
    primeList.Add({c=3 ; p=3 ; m=9 ; s=9})

    //test whether n is composite, if not add it to the primeList and return false
    let isComposite n = 
        let rec loop i = 
            let sp = primeList.[i]
            while sp.m < n do
                sp.c <- sp.c+1
                sp.m <- sp.c*sp.p

            if sp.m = n then true
            elif i = (primeList.Count-1) || sp.s > n then
                primeList.Add({c=n ; p=n ; m=n*n ; s=n*n})
                false
            else loop (i+1)
        loop 0

    seq { 
        yield 2 ; yield 3

        //yield the cached results
        for i in 1..primeList.Count-1 do
            yield primeList.[i].p

        yield! infiniteRange (primeList.[primeList.Count-1].p + 2) 2 
               |> Seq.filter (isComposite>>not)
    }
Zee answered 9/1, 2011 at 15:46 Comment(5)
You're right of course, there's no good reason to use a purely functional approach for implementing the sieve, it was just a curiosity of mine. The imperative sieve supports a lot more optimizations, such as halving the memory used (you don't care about multiples of two), using a single bit for marking off composites (compare that to using a map for example) and others. And it will stay at O(n log log n), when a purely functional implementation won't. +1 for some interesting code.Eight
@Stephen, further to IVlad's comments desiring pure functional code with no mutable states, your code goes "mutable" for no good reason as to performance; this algorithm is similar to that of Jon Harrop except that rather than gaining in efficiency due to the use of the mutable arrays he uses, you give that all away through the use of a mutable list (ResizeArray) containing mutable records, which list you process serially by repeated indexing (an O(index) operation), for a performance hardly better than pure functional code.Varicella
@Varicella ResizeArray index operations are O(1) (they are backed by arrays under-the-hood)...Zee
@Stephen - yes, the List<'T> class is backed by an array so that indexing isn't a problem but adding items to the list has an a ratio of a O[n] term in it (depending on the capacity increase on overflow). The O(n^1.4) performance for large ranges of about 10 million is easily explained when one realizes that the algorithm is scanning across all base square root primes for every composite check. This actually isn't really the SoE in that it calculates the next composite by multiplication rather than by adding a prime, but that could easily by fixed (sp.m <- sp.m+sp.p, with no need for sp.c).Varicella
@Stephen, as you cull for odd primes, you could add sp.p twice in the while loop as in sp.m <- sp.m + sp.p + sp.p (no sp.c required) although the only difference for this change will be to half the number of cycles in that while loop. Also, separating out the base primes calculation from the main primes output generator would mean that only the base primes need to be memoized without adding the main primes to the list, so you would reduce computation time and memory requirements by quite a large factor, although the performance would still be O(n^1.4).Varicella
V
3

Here is yet another method of accomplishing the incremental Sieve of Eratosthenes (SoE) using only pure functional F# code. It is adapted from the Haskell code developed as "This idea is due to Dave Bayer, though he used a complex formulation and balanced ternary tree structure, progressively deepening in uniform manner (simplified formulation and a skewed, deepening to the right binary tree structure introduced by Heinrich Apfelmus, further simplified by Will Ness). Staged production idea due to M. O'Neill" as per the following link: Optimized Tree Folding code using a factorial wheel in Haskell.

The following code has several optimizations that make it more suitable for execution in F#, as follows:

  1. The code uses coinductive streams instead of LazyList's as this algorithm has no (or little) need of LazyList's memoization and my coinductive streams are more efficient than either LazyLists (from the FSharp.PowerPack) or the built in sequences. A further advantage is that my code can be run on tryFSharp.org and ideone.com without having to copy and paste in the Microsoft.FSharp.PowerPack Core source code for the LazyList type and module (along with copyright notice)

  2. It was discovered that there is quite a lot of overhead for F#'s pattern matching on function parameters so the previous more readable discriminated union type using tuples was sacrificed in favour of by-value struct (or class as runs faster on some platforms) types for about a factor of two or more speed up.

  3. Will Ness's optimizations going from linear tree folding to bilateral folding to multi-way folding and improvements using the wheel factorization are about as effective ratiometrically for F# as they are for Haskell, with the main difference between the two languages being that Haskell can be compiled to native code and has a more highly optimized compiler whereas F# has more overhead running under the DotNet Framework system.

    type prmstate = struct val p:uint32 val pi:byte new (prm,pndx) = { p = prm; pi = pndx } end
    type prmsSeqDesc = struct val v:prmstate val cont:unit->prmsSeqDesc new(ps,np) = { v = ps; cont = np } end
    type cmpststate = struct val cv:uint32 val ci:byte val cp:uint32 new (strt,ndx,prm) = {cv = strt;ci = ndx;cp = prm} end
    type cmpstsSeqDesc = struct val v:cmpststate val cont:unit->cmpstsSeqDesc new (cs,nc) = { v = cs; cont = nc } end
    type allcmpsts = struct val v:cmpstsSeqDesc val cont:unit->allcmpsts new (csd,ncsdf) = { v=csd;cont=ncsdf } end
    
    let primesTFWSE =
      let whlptrn = [| 2uy;4uy;2uy;4uy;6uy;2uy;6uy;4uy;2uy;4uy;6uy;6uy;2uy;6uy;4uy;2uy;6uy;4uy;6uy;8uy;4uy;2uy;4uy;2uy;
                       4uy;8uy;6uy;4uy;6uy;2uy;4uy;6uy;2uy;6uy;6uy;4uy;2uy;4uy;6uy;2uy;6uy;4uy;2uy;4uy;2uy;10uy;2uy;10uy |]
      let inline whladv i = if i < 47uy then i + 1uy else 0uy
      let inline advmltpl c ci p = cmpststate (c + uint32 whlptrn.[int ci] * p,whladv ci,p)
      let rec pmltpls cs = cmpstsSeqDesc(cs,fun() -> pmltpls (advmltpl cs.cv cs.ci cs.cp))
      let rec allmltpls (psd:prmsSeqDesc) =
        allcmpsts(pmltpls (cmpststate(psd.v.p*psd.v.p,psd.v.pi,psd.v.p)),fun() -> allmltpls (psd.cont()))
      let rec (^) (xs:cmpstsSeqDesc) (ys:cmpstsSeqDesc) = //union op for SeqDesc's
        match compare xs.v.cv ys.v.cv with
          | -1 -> cmpstsSeqDesc (xs.v,fun() -> xs.cont() ^ ys)
          | 0 -> cmpstsSeqDesc (xs.v,fun() -> xs.cont() ^ ys.cont())
          | _ -> cmpstsSeqDesc(ys.v,fun() -> xs ^ ys.cont()) //must be greater than
      let rec pairs (csdsd:allcmpsts) =
        let ys = csdsd.cont in
        allcmpsts(cmpstsSeqDesc(csdsd.v.v,fun()->csdsd.v.cont()^ys().v),fun()->pairs (ys().cont()))
      let rec joinT3 (csdsd:allcmpsts) = cmpstsSeqDesc(csdsd.v.v,fun()->
        let ys = csdsd.cont() in let zs = ys.cont() in (csdsd.v.cont()^(ys.v^zs.v))^joinT3 (pairs (zs.cont())))
      let rec mkprimes (ps:prmstate) (csd:cmpstsSeqDesc) =
        let nxt = ps.p + uint32 whlptrn.[int ps.pi]
        if ps.p >= csd.v.cv then mkprimes (prmstate(nxt,whladv ps.pi)) (csd.cont()) //minus function
        else prmsSeqDesc(prmstate(ps.p,ps.pi),fun() -> mkprimes (prmstate(nxt,whladv ps.pi)) csd)
      let rec baseprimes = prmsSeqDesc(prmstate(11u,0uy),fun() -> mkprimes (prmstate(13u,1uy)) initcmpsts)
      and initcmpsts = joinT3 (allmltpls baseprimes)
      let genseq sd = Seq.unfold (fun (psd:prmsSeqDesc) -> Some(psd.v.p,psd.cont())) sd
      seq { yield 2u; yield 3u; yield 5u; yield 7u; yield! mkprimes (prmstate(11u,0uy)) initcmpsts |> genseq }
    
    primesLMWSE |> Seq.nth 100000
    

EDIT_ADD: This has been corrected as the original code did not properly handle the tail of the stream and passed the tail of the parameter stream to the pairs function to the joinT3 function rather than the tail following the zs stream. The timing below has also accordingly been corrected, with about an extra 30% speed up. The tryFSharp and ideone link codes have also been corrected. END_EDIT_ADD

The above program works at about O(n^1.1) performance with n the maximum prime calculated or about O(n^1.18) when n is the number of primes calculated, and takes about 2.16 seconds to calculate the first million primes (about 0.14 second for the first 100,000 primes) on a fast computer running 64 bit code using struct types rather than classes (it seems that some implementations box and unbox the by-value struct's in forming closures). I consider that to be about the maximum practical range for any of these pure functional prime algorithms. This is probably about the fastest that one can run a pure functional SoE algorithm other than for some minor tweaking to reduce constant factors.

Other than combining segmentation and multi-threading to share the computation between multiple CPU cores, most of the "tweaks" that could be made to this algorithm are in increasing the circumference of the wheel factorization for a gain of up to about 40% in performance and minor gains due to tweaks as to the use of structures, classes, tuples, or more direct individual parameters in the passing of state between functions.

EDIT_ADD2: I've done the above optimizations with the result that the code is now almost twice as fast due to structure optimizations with the added bonus of optionally using larger wheel factorization circumferences for the added smaller reduction. Note that the below code avoids using continuations in the main sequence generation loop and only uses them where necessary for the base primes streams and the subsequent composite number cull streams derived from those base primes. The new code is as follows:

type CIS<'T> = struct val v:'T val cont:unit->CIS<'T> new(v,cont) = { v=v;cont=cont } end //Co-Inductive Steam
let primesTFOWSE =
  let WHLPRMS = [| 2u;3u;5u;7u |] in let FSTPRM = 11u in let WHLCRC = int (WHLPRMS |> Seq.fold (*) 1u) >>> 1
  let WHLLMT = int (WHLPRMS |> Seq.fold (fun o n->o*(n-1u)) 1u) - 1
  let WHLPTRN =
    let wp = Array.zeroCreate (WHLLMT+1)
    let gaps (a:int[]) = let rec gap i acc = if a.[i]=0 then gap (i+1) (acc+1uy) else acc
                         {0..WHLCRC-1} |> Seq.fold (fun s i->
                           let ns = if a.[i]<>0 then wp.[s]<-2uy*gap (i+1) 1uy;(s+1) else s in ns) 0 |> ignore
    Array.init (WHLCRC+1) (fun i->if WHLPRMS |> Seq.forall (fun p->(FSTPRM+uint32(i<<<1))%p<>0u)
                                  then 1 else 0) |> gaps;wp
  let inline whladv i = if i < WHLLMT then i+1 else 0 in let inline advcnd c ci = c + uint32 WHLPTRN.[ci]
  let inline advmltpl p (c,ci) = (c + uint32 WHLPTRN.[ci] * p,whladv ci)
  let rec pmltpls p cs = CIS(cs,fun() -> pmltpls p (advmltpl p cs))
  let rec allmltpls k wi (ps:CIS<_>) =
    let nxt = advcnd k wi in let nxti = whladv wi
    if k < ps.v then allmltpls nxt nxti ps
    else CIS(pmltpls ps.v (ps.v*ps.v,wi),fun() -> allmltpls nxt nxti (ps.cont()))
  let rec (^) (xs:CIS<uint32*_>) (ys:CIS<uint32*_>) = 
    match compare (fst xs.v) (fst ys.v) with //union op for composite CIS's (tuple of cmpst and wheel ndx)
      | -1 -> CIS(xs.v,fun() -> xs.cont() ^ ys)
      | 0 -> CIS(xs.v,fun() -> xs.cont() ^ ys.cont())
      | _ -> CIS(ys.v,fun() -> xs ^ ys.cont()) //must be greater than
  let rec pairs (xs:CIS<CIS<_>>) =
    let ys = xs.cont() in CIS(CIS(xs.v.v,fun()->xs.v.cont()^ys.v),fun()->pairs (ys.cont()))
  let rec joinT3 (xs:CIS<CIS<_>>) = CIS(xs.v.v,fun()->
    let ys = xs.cont() in let zs = ys.cont() in (xs.v.cont()^(ys.v^zs.v))^joinT3 (pairs (zs.cont())))
  let rec mkprm (cnd,cndi,(csd:CIS<uint32*_>)) =
    let nxt = advcnd cnd cndi in let nxti = whladv cndi
    if cnd >= fst csd.v then mkprm (nxt,nxti,csd.cont()) //minus function
    else (cnd,cndi,(nxt,nxti,csd))
  let rec pCIS p pi cont = CIS(p,fun()->let (np,npi,ncont)=mkprm cont in pCIS np npi ncont)
  let rec baseprimes() = CIS(FSTPRM,fun()->let np,npi = advcnd FSTPRM 0,whladv 0
                                           pCIS np npi (advcnd np npi,whladv npi,initcmpsts))
  and initcmpsts = joinT3 (allmltpls FSTPRM 0 (baseprimes()))
  let inline genseq sd = Seq.unfold (fun (p,pi,cont) -> Some(p,mkprm cont)) sd
  seq { yield! WHLPRMS; yield! mkprm (FSTPRM,0,initcmpsts) |> genseq }

The above code takes about 0.07, 1.02, and 14.58 seconds to enumerate the first hundred thousand, million, and ten million primes, respectively, all on the reference Intel i7-2700K (3.5 GHz) machine in 64 bit mode. This isn't much slower then the reference Haskell implementation from which this code was derived, although it is slightly slower on tryfsharp and ideone due to being in 32 bit mode for tryfsharp under Silverlight (about half again slower) and running on a slower machine under Mono 2.0 (which is inherently much slower for F#) on ideone so is up to about five times slower than the reference machine. Note that the running time reported by ideone includes the initialization time for the embedded look up table arrays, which time needs to be accounted for.

The above program has the further feature that the factorization wheel is parameterized so that, for instance, one can use a extremely large wheel by setting WHLPRMS to [| 2u;3u;5u;7u;11u;13u;17u;19u |] and FSTPRM to 23u to get a run time of about two thirds for large ranges at about 10.02 seconds for ten million primes, although note that it takes several seconds to compute the WHLPTRN before the program starts to run.

Geek note: I have not implemented a "non-sharing fixpoint combinator for telescoping multistage primes production" as per the reference Haskell code, although I tried to do this, because one needs to have something like Haskell's lazy list for this to work without running away into an infinite loop and stack overflow. Although my Co-Inductive Streams (CIS's) have some properties of laziness, they are not formally lazy lists or cached sequences (they become non-cached sequences and could be cached when passed so a function such as the "genseq" one I provide for the final output sequence). I did not want to use the PowerPack implementation of LazyList because it isn't very efficient and would require that I copy that source code into tryfsharp and ideone, which do not provide for imported modules. Using the built-in sequences (even cached) is very inefficient when one wants to use head/tail operations as are required for this algorithm as the only way to get the tail of a sequence is to use "Seq.skip 1" which on multiple uses produces a new sequence based on the original sequence recursively skipped many times. I could implement my own efficient LazyList class based on CIS's, but it hardly seems worth it to demonstrate a point when the recursive "initcmpsts" and "baseprimes" objects take very little code. In addition, passing a LazyList to a function to produce extensions to that LazyList which function only uses values from near the beginning of the Lazylist requires that almost the entire LazyList is memoized for a reduction in memory efficiency: a pass for the first 10 million primes will require a LazyList in memory with almost 180 million elements. So I took a pass on this.

Note that for larger ranges (10 million primes or more), this purely functional code is about the same speed as many simplistic imperative implementations of the Sieve of Eratosthenes or Atkins, although that is due to the lack of optimization of those imperative algorithms; a more imperative implementation than this using equivalent optimizations and segmented sieving arrays will still be about ten times faster than this as per my "almost functional" answer.

Also note that while it is possible to implement segmented sieving using tree folding, it is more difficult since the culling advance algorithms are buried inside the continuations used for the "union - ^" operator and working around this would mean that a continuously advancing cull range needs to be used; this is unlike other algorithms where the state of the cull variable can be reset for each new page including having their range reduced, so that if larger ranges than 32-bits are used, the internal culled range can still be reset to operate within the 32-bit range even when a 64-bit range of primes is determined at little cost in execution time per prime. END_EDIT_ADD2

Varicella answered 16/7, 2013 at 5:34 Comment(12)
I must correct you, the ideas were not mine. All the proper dues are at haskell.org/haskellwiki/Prime_numbers#Tree_merging. Tree folding (using balanced tree) originally posted by Dave Bayer, the structure made more optimal (skewed tree) by Heinrich Apfelmus, two-staged production idea by Melissa O'Neill. The code by Bayer and Apfelmus used much more complex prioritized merging, taking great care to avoid premature forcing; I found that being careless allowed for much simplified code which still worked. :)Disentail
And of course, the wheel technique was well known much before I ever saw it in Melissa O'Neill's article. :)Disentail
this takes 10 seconds to calculate a million primes, and the previous answer says 5 seconds for 1,000,000th prime?..Disentail
@Will Ness, thanks, I'll correct the attributions. Yes, the use of wheel factorization has been well known for many years before the O'Neil article. As to the differences in timings, the five second rate was on a faster computer which I don't have access to just now; based on tryFSharp and ideone times I believe this code when run on that machine will have comparable timing to the map based code.Varicella
@WillNess, although outside the bounds of this question in not being a Sieve of Eratosthenes (SoE), it occurs to me that incremental prime sieves are misapplied using SoE and that one would get a much more efficient implementation using numeric techniques such as quadratic sieving (a la Atkins) where the sub sequences are based on squares rather than primes (many less sub sequences) and advance by squares rather than linearly by primes summation for many less steps per sequence. This could be adapted to using the Map priority queue or to the tree folding of sequences as used in this answer.Varicella
I can't be much of help there. Try asking this as a separate question.Disentail
@WillNess, cont'd, it would be easy enough to modify the above code to check, but I guess the big question regarding the Atkins algorithm is whether practically having move sequences with bigger steps between values in each sequence (Atkin) is better than less sequences with smaller steps between the values in each sequence (SoE); theory says there is a slight advantage to Atkin but practically there may or may not be a difference. Using a priority queue to store the sequence state is likely not the way to go with the log n term, but using the merged tree might be.Varicella
@WillNess, please note the correction to the code which did not affect the results but saves reductions and is in line with and true to the Haskell code in properly handling the tail sequence passed to the pair function for the call to the joinT3 function. Timing is now more as expected and faster than the Map based version just as for Haskell. As to the Sieve of Atkin, I think that it does fit within this question as "other sieving algorithms" are mentioned in the question, and I may submit that code as an alternate answer.Varicella
interesting news, more power to you! btw Daniel Fischer in his testing (when we discussed this on haskell-cafe few years back (search: "mergeSP haskell")) compared speeds under the joinT3 structure vs the regular and simple joinT structure, with some mixed results. (he went on to create his arithmoi package afterwards). My goal was never the fastest code, but the prettiest :) purely-functional code (that wasn't unreasonably slow). ideone.com/se0bsB and ideone.com/vkXCXt are reasonably pretty IMO, and/because/ true 2the math.Disentail
@WillNess, I have tuned this algorithm so that now this tree folding algorithm is about the same speed as the pure functional version of the Priority Queue answer tuned to about the same level, although using a mutable implementation of the MinHeap Priority Queue is still about twice as fast. You will see that I investigated the use of a "non-sharing fixpoint combinator for telescoping multistage primes production" but decided not to use it for the given reasons in the answer.Varicella
the point of "non-sharing" is exactly to prevent caching in chaining output of one invocation as input to another (making those invocations independent; ensuring there are actually 2 (3,4... as much as needed) of them at run time). It's just a normal recursion; if your function has a name you can just call it by that name. Haskell can just be too optimizing; having ps = g ... ps, ps almost surely will be shared; having ps() = g ... ps(), the output of ps() might get shared. With _Y it's just less likely. But in F#/ML, the run time behaviour is much more precisely defined.Disentail
@WillNess, it seems that it's the "F# run-time being more precisely defined" limitation as compared to Haskell in my sharing problem. Still, it is harder to use in F# as it would mean I would have to import or copy in the LazyList implementation and execution time would suffer, or come up with an alternative that is more efficient and has "non-sharing" properties, which would be somewhat complex. Not using LazyList's or equivalent is why my F# code is now up to almost the same speed as optimized Haskell on my machine even though Haskell would normally be faster by a factor or two or three.Varicella
D
3

There has been some truly fascinating and illuminating discussions on this thread and, I know this thread is very old, but I wanted to address the original question of the OP. Recall it was wanting to have a purely functional version of Eratosthenes' Sieve.

let rec PseudoSieve list =
match list with
| hd::tl -> hd :: (PseudoSieve <| List.filter (fun x -> x % hd <> 0) tl)
| [] -> []

This has the flaws already discussed. Surely the simplest purely functional solution not using mutation, modulo arithmetic - with too many checks to crossed out candidates - would be something like this?

let rec sieve primes = function
    | []        -> primes |> List.rev
    | p :: rest -> 
         sieve  (p :: primes) (rest |> List.except [p*p..p..n])

This is clearly not for ultimate performance and memory usage and it would be interesting to check how List.except - which does the crossings in a way so that they are only done once, (which might make this an alternative to rather than an implementation of Eratosthenes' Sieve but has the same benefits over the naive solution as as argued in the paper linked in the OP) - is implemented and the Big O costs there.

Still I think this is the most succinct answer to the original OP. What do you think?

Update: Made it a proper sieve with p*p in List.except!

EDIT_ADD:

I am @GordonBGood and I am editing directly into your answer (as you ask for ideas) rather than making a series of extensive comments, as follows:

  1. First, your code won't compile as it is as n is unknown and must be given in an enclosing code which defines the list [ 2 .. n ] which defines the beginning starting list.

  2. Your code is basically the Euler Sieve, not the Sieve of Eratosthenes (SoE) as requested, and although you are correct that the "crossings" of composite numbers only happens once using List.exceptas that composite will lo longer exist in the candidates list afterwards, using the List.except just defines "under-the-covers" what one might do with a fold and filter function: think about what List.except is doing - for every element in the candidate list to be sieved it is scanning to see if that element matches any element in the base prime factor list, and if so filters it out. This is horrendously inefficient as these scans are compounded per element when implemented using list processing rather than with a mutable array. Following is your code fleshed out to be a complete answer for a uint32 argument to produce a sequence of primes of the same type:

let sieveTo n =
  let rec sieve primes = function
    | []        -> primes |> List.rev
    | p :: rest -> sieve  (p :: primes) (rest |> List.except [p*p..p..n])
  sieve [] [ 2u .. n ] |> List.toSeq```

This has an extremely high logarithmic complexity as it takes something about 2.3 seconds to sieve to a hundred thousand and 227 seconds to sieve to a million for a square law relationship - basically it is a useless functional sieve implemented for lists due to the rapidly increasing amount of work with range (all the scans per remaining element).

  1. The "Richard Bird" sieve from the O'Neill article referred to in the OP is a true list based SoE as she correctly identifies it, and it avoids the repeated processing by considering that the candidate list to be sieved is in order and a combined list of primes to be eliminated/excepted (a compound numbers list) is sorted in order so that only the head of each list needs to be compared. Using laziness, these lists also are infinite so don't need an n and produce an "infinite" list of primes. An expression of the Richard Bird sieve for all numbers (not odds-only) including laziness by way of Co Inductive Stream's (CIS's) is as follows in F#:
type 'a CIS = CIS of 'a * (unit -> 'a CIS) //'Co Inductive Stream for laziness
 
let primesBird() =
  let rec (^^) (CIS(x, xtlf) as xs) (CIS(y, ytlf) as ys) = // stream merge function
    if x < y then CIS(x, fun() -> xtlf() ^^ ys)
    elif y < x then CIS(y, fun() -> xs ^^ ytlf())
    else CIS(x, fun() -> xtlf() ^^ ytlf()) // eliminate duplicate!
  let pmltpls p = let rec nxt c = CIS(c, fun() -> nxt (c + p)) in nxt (p * p)
  let rec allmltps (CIS(p, ptlf)) = CIS(pmltpls p, fun() -> allmltps (ptlf()))
  let rec cmpsts (CIS(CIS(c, ctlf), amstlf)) =
    CIS(c, fun() -> (ctlf()) ^^ (cmpsts (amstlf())))
  let rec minusat n (CIS(c, ctlf) as cs) =
    if n < c then CIS(n, fun() -> minusat (n + 1u) cs)
    else minusat (n + 1u) (ctlf())
  let rec baseprms() = CIS(2u, fun() -> baseprms() |> allmltps |> cmpsts |> minusat 3u)
  Seq.unfold (fun (CIS(p, ptlf)) -> Some(p, ptlf())) (baseprms())

The above takes about 2.3 seconds on my machine to count the primes to a million. The above sieve already has the improvement that it uses a baseprms secondary stream of small primes to introduce the composite cull streams.

  1. The problem with this other than it doesn't have the very minor changes to make it odds-only or higher degrees of wheel factorization is that, although it isn't as bad as the square law complexity for the above, it still takes about double a linear amount of execution time with range (empirical order of growth of about 1.3) or perhaps about proportional to square (log n). The solution to this is to sort the composite numbers by using an infinite tree-like structure instead of linear sorting to reduce to a log n complexity, as in the following code (also has the minor changes to make it odds-only):
type 'a CIS = CIS of 'a * (unit -> 'a CIS) //'Co Inductive Stream for laziness
 
let primesTreeFold() =
  let rec (^^) (CIS(x, xtlf) as xs) (CIS(y, ytlf) as ys) = // stream merge function
    if x < y then CIS(x, fun() -> xtlf() ^^ ys)
    elif y < x then CIS(y, fun() -> xs ^^ ytlf())
    else CIS(x, fun() -> xtlf() ^^ ytlf()) // no duplication
  let pmltpls p = let rec nxt c = CIS(c, fun() -> let adv = p + p
                                                  nxt (c + adv)) in nxt (p * p)
  let rec allmltps (CIS(p, ptlf)) = CIS(pmltpls p, fun() -> allmltps (ptlf()))
  let rec pairs (CIS(cs0, cs0tlf)) = // implements infinite tree-folding
    let (CIS(cs1, cs1tlf)) = cs0tlf() in CIS(cs0 ^^ cs1, fun() -> pairs (cs1tlf()))
  let rec cmpsts (CIS(CIS(c, ctlf), amstlf)) = // pairs is used below...
    CIS(c, fun() -> ctlf() ^^ (cmpsts << pairs << amstlf)())
  let rec minusat n (CIS(c, ctlf) as cs) =
    if n < c then CIS(n, fun() -> minusat (n + 2u) cs)
    else minusat (n + 2u) (ctlf())
  let rec oddprms() = CIS(3u, fun() -> oddprms() |> allmltps |> cmpsts |> minusat 5u)
  Seq.unfold (fun (CIS(p, ptlf)) -> Some(p, ptlf())) (CIS(2u, fun() -> oddprms()))

Notice the very minor changes to make this use infinite tree-folding rather than linear sorting; it also needs the recursive secondary feed for have an addition level of initialization to 2/3/5 instead of just 2/3 in order to prevent a runaway. This small change increases the speed to count the primes to a million to 0.437 seconds, to ten million in 4.91 seconds, and to a hundred million in 62.4 seconds for a growth rate that is tending toward being proportional to log n.

  1. Conclusion: Your implementation is almost unusable over ranges that one could actually work out the primes using SoE by hand, and the best of these "functional" sieves is moderately useful up to ranges of a hundred million or so in a minute. However, although relatively simple, they suffer from not being able to be multi-threaded as each new state depends un the succession of the states before, and the overhead of the computations makes them hundreds of times slower than a sieve based on array mutation where we can find the primes to a billion in a fraction of a second.
Dovelike answered 4/12, 2020 at 17:50 Comment(1)
excuse me for editing my ideas into your answer, but you asked for a response and my ideas would be too long to post as comments...Varicella
X
2

Actually I tried to do the same, I tried first the same naive implementation as in question, but it was too slow. I then found this page YAPES: Problem Seven, Part 2, where he used real Sieve of Eratosthenes based on Melissa E. O’Neill. I took code from there, just a little modified it, because F# changed a little since publication.

let reinsert x table prime = 
   let comp = x+prime 
   match Map.tryFind comp table with 
   | None        -> table |> Map.add comp [prime] 
   | Some(facts) -> table |> Map.add comp (prime::facts) 

let rec sieve x table = 
  seq { 
    match Map.tryFind x table with 
    | None -> 
        yield x 
        yield! sieve (x+1I) (table |> Map.add (x*x) [x]) 
    | Some(factors) -> 
        yield! sieve (x+1I) (factors |> List.fold (reinsert x) (table |> Map.remove x)) } 

let primes = 
  sieve 2I Map.empty

primes |> Seq.takeWhile (fun elem -> elem < 2000000I) |> Seq.sum
Xhosa answered 14/7, 2012 at 2:35 Comment(1)
Isn't this very slow as well at 10's of seconds? This implementation is worse than your link to "YAPES..." in using the "LongInteger" type rather than the uint64 type he used. Both miss the obvious optimization of culling only odd composites; when these changes are made, the above code will run about 10 times faster. However, it's still slow as it adds all found primes to the map, which means that the performance decreases as about the log of the range sieved instead of as the log of the square root of the range sieved, memory requirements are very high, and uint64 numbers must be used.Varicella
V
2

I don't think this question is complete in only considering purely functional algorithms that hide state in either a Map or Priority Queue in the case of a few answers or a folded merge tree in the case of one of my other answers in that any of these are quite limited as to usability for large ranges of primes due to their approximate O(n^1.2) performance ('^' means raised to the power of where n is the top number in the sequence) as well as their computational overhead per culling operation. This means that even for the 32-bit number range, these algorithms will take something in the range of at least many minutes to generate the primes up to four billion plus, which isn't something that is very usable.

There have been several answers presenting solutions using various degrees of mutability, but they either haven't taken full advantage of their mutability and have been inefficient or have been just very simplistic translations of imperative code and ugly functionally. It seems to me that the F# mutable array is just another form of hiding mutable state inside a data structure, and that an efficient algorithm can be developed that has no other mutability used other than the mutable buffer array used for efficient culling of composite numbers by paged buffer segments, with the remainder of the code written in pure functional style.

The following code was developed after seeing the code by Jon Harrop, and improves on those ideas as follows:

  1. Jon's code fails in terms of high memory use (saves all generated primes instead of just the primes to the square root of the highest candidate prime, and continuously regenerates buffer arrays of ever increasing huge size (equal to the size of the last prime found) irrespective of the CPU cache sizes.

  2. As well, his code as presented does not including a generating sequence.

  3. Further, the code as presented does not have the optimizations to at least only deal with odd numbers let alone not considering the use of using wheel factorization.

If Jon's code were used to generate the range of primes to the 32-bit number range of four billion plus, it would have a memory requirement of Gigabytes for the saved primes in the list structure and another multi-Gigabytes for the sieve buffer, although there is no real reason that the latter cannot be of a fixed smaller size. Once the sieve buffer exceeds the size of the CPU cache sizes, performance will quickly deteriorate in "cache thrashing", with increasing loss of performance as first the L1, then the L2, and finally the L3 (if present) sizes are exceeded.

This is why Jon's code will only calculate primes up to about 25 million or so even on my 64-bit machine with eight Gigabytes of memory before generating an out-of-memory exception and also explains why there is a larger and larger drop in relative performance as the ranges get higher with about an O(n^1.4) performance with increasing range and is only somewhat saved because it has such low computational complexity to begin with.

The following code addresses all of these limitations, in that it only memoizes the base primes up to the square root of the maximum number in the range which are calculated as needed (only a few Kilobytes in the case of the 32-bit number range) and only uses very small buffers of about sixteen Kilobytes for each of the base primes generator and the main page segmented sieve filter (smaller than the L1 cache size of most modern CPU's), as well as including the generating sequence code and (currently) being somewhat optimized as to only sieving for odd numbers, which means that memory is used more efficiently. In addition, a packed bit array is used to further improve memory efficiency; it's computation cost is mostly made up for in less computations needing to be made in scanning the buffer.

let primesAPF32() =
  let rec oddprimes() =
    let BUFSZ = 1<<<17 in let buf = Array.zeroCreate (BUFSZ>>>5) in let BUFRNG = uint32 BUFSZ<<<1
    let inline testbit i = (buf.[i >>> 5] &&& (1u <<< (i &&& 0x1F))) = 0u
    let inline cullbit i = let w = i >>> 5 in buf.[w] <- buf.[w] ||| (1u <<< (i &&& 0x1F))
    let inline cullp p s low = let rec cull' i = if i < BUFSZ then cullbit i; cull' (i + int p)
                               cull' (if s >= low then int((s - low) >>> 1)
                                      else let r = ((low - s) >>> 1) % p in if r = 0u then 0 else int(p - r))
    let inline cullpg low = //cull composites from whole buffer page for efficiency
      let max = low + BUFRNG - 1u in let max = if max < low then uint32(-1) else max
      let sqrtlm = uint32(sqrt(float max)) in let sqrtlmndx = int((sqrtlm - 3u) >>> 1)
      if low <= 3u then for i = 0 to sqrtlmndx do if testbit i then let p = uint32(i + i + 3) in cullp p (p * p) 3u
      else baseprimes |> Seq.skipWhile (fun p -> //force side effect of culling to limit of buffer
          let s = p * p in if p > 0xFFFFu || s > max then false else cullp p s low; true) |> Seq.nth 0 |> ignore
    let rec mkpi i low =
      if i >= BUFSZ then let nlow = low + BUFRNG in Array.fill buf 0 buf.Length 0u; cullpg nlow; mkpi 0 nlow
      else (if testbit i then i,low else mkpi (i + 1) low)
    cullpg 3u; Seq.unfold (fun (i,lw) -> //force cull the first buffer page then doit
        let ni,nlw = mkpi i lw in let p = nlw + (uint32 ni <<< 1)
        if p < lw then None else Some(p,(ni+1,nlw))) (0,3u)
  and baseprimes = oddprimes() |> Seq.cache
  seq { yield 2u; yield! oddprimes() }

primesAPF32() |> Seq.nth 203280220 |> printfn "%A"

This new code calculates the 203,280,221 primes in the 32-bit number range in about ADDED/CORRECTED: 25.4 seconds with running times for the first 100000, one million, 10 million, and 100 million tested as 0.01, 0.088, 0.94, and 11.25 seconds, respectively on a fast desktop computer (i7-2700K @ 3.5 GHz), and can run on tryfsharp.org and ideone.com, although over a lesser range for the latter due to execution time constraints. It has a worse performance than Jon Harrop's code for small ranges of a few thousand primes due to it's increased computational complexity but very quickly passes it for larger ranges due to its better performance algorithm that makes up for that complexity such that it is about five times faster for the 10 millionth prime and about seven times faster just before Jon's code blows up at about the 25 millionth prime.

Of the total execution time, more than half is spent in the basic sequence enumeration and thus would not be helped to a great extent by running the composite number culling operations as background operations, although the wheel factorization optimizations in combination would help (although more computationally intensive, that complexity would run in the background) in that they reduce the number of buffer test operations required in enumeration. Further optimizations could be made if the order of the sequences didn't need to be preserved (as in just counting the number of primes or in summing the primes) as the sequences could be written to support the parallel enumeration interfaces or the code could be written as a class so that member methods could do the computation without enumeration. This code could easily be tuned to offer close to the same kind of performance as C# code but more concisely expressed, although it will never reach the performance of optimized C++ native code such as PrimeSieve which has been optimized primarily to the task of just counting the number of primes over a range and can calculate the number of primes in the 32-bit number range is a small fraction of a second (0.25 seconds on the i7-2700K).

Thus, further optimizations would be concentrated around further bit packing the sieving array using wheel factorization to minimize the work done in culling the composite numbers, trying to increase the efficiency of enumeration of the resulting primes, and relegating all composite culling to background threads where a four to eight core processor could hide the required extra computational complexity.

And it's mostly pure functional code, just that it uses a mutable array to merge composite culling....

Varicella answered 23/7, 2013 at 20:16 Comment(8)
I prefer to measure empirical complexity in n primes produced. your code shows n^1.08 at 10mln..15mln range. Theoretically it's n log n log (log n). Haskell's list-based tree-merge and PQ both show ~n^1.20..1.22, for an additional log n factor most probably. So we now have an additional data point. :)Disentail
if you could perhaps make a joint table with results for some of your various recent versions (and add it to one of the answers, or just put it somewhere) it would be great! (for a few important variants, that is). e.g. like this one. It's not much, but it is something. E.g. this code here finds 200 mln prime in 36 seconds. What about the same code, only using the tree folding structure instead of the array segment? :) etc. :)Disentail
@WillNess, I'll try to add a table somewhere, perhaps when I finish my current optimizations to the incremental Sieve of Atkin, which looks to be at least as fast as the incremental tree merging SoE. There are so many extra optimizations that could be made such as adding multi-threading to this version which can reduce the execution time to about 8 seconds on the reference machine. As to converting the above arrays to tree folding, the performance will revert to about the same as my tree folding answer in about ten times as slow as this or worse.Varicella
@WillNess cont'd, the reason that the above code is quite fast is that the only sequence operation used in an inner loop is the unfold to generate the final oddprimes output which only calls the mkpi recursive routine that makes tail recursive calls until a prime is found once per loop; the most work of culling the composites is performed by the tail call recursive (meaning the compiler turns it into a loop) cull' function. In short all most no virtual or otherwise function calls are being made in inner loops. Even at that, much of the 36 seconds run time is spent in the unfold operation.Varicella
@WillNess cont'd2, the reason the recursive tree folding is so much slower than this is that even though the form of the code is similar, the tree merging requires a function call/return sequence chain for every composite number cleared, which call/return chain can call several more successive call/return sequences down the tree; these call/return sequences take 10's (about 28) of CPU clock cycles each for an average of about 5 or 6 call/return's per composite for 100000 primes and more with increasing number range (which is why it's not O(n)). Array culling is pretty much proportional to n.Varicella
yes arrays have the advantage of direct access. arrays vs. merging is like integer sorting vs. comparison-based sorting: the integer value can be used as address and direct access give the extra performance. It's the same in Haskell too. Daniel Fischer's package arithmoi uses arrays and is much faster than lists.Disentail
@WillNess, more than the advantages of direct memory access, functional code as in the tree folding algorithm has the performance disadvantage that its "functional" meaning it has the chain of call/return continuations to chase with CPU overheads, whether the continuations be hidden in lazy lists or F# sequences or implemented explicitly in CoInductive Streams as I use; an implemention of the tree folding algorithm using a single tail recursive function could be almost as fast as an array based solution. However, I don't think it's possible as tree folding as a lazy structure based conceptVaricella
@WillNess cont'd, one could implement the SoE not using an array as in making a list of all found current base primes, but then incremental culling would still require at least one mutable entry in that list of base primes to record the current culling sequence position else the code would revert to something like the immutable Map based solution to avoid the requirement for mutability.Varicella
V
2

As this question specifically asks for other algorithms, I provide the following implementation:

or perhaps knows of alternative implementation methods or sieving algorithms

No submission of various Sieve of Eratosthenes (SoE) algorithms is really complete without mentioning the Sieve of Atkin (SoA), which is in fact a variation of SoE using the solutions to a set of quadratic equations to implement the composite culling as well as eliminating all multiples of the squares of the base primes (primes less or equal to the square root of the highest number tested for primality). Theoretically, the SoA is more efficient than the SoE in that there are slightly less operations over the range such that it should have have about 20% less complexity for the range of about 10 to 100 million, but practically it is generally slower due to the computational overhead of the complexity of solving several quadratic equations. Although, the highly optimized Daniel J. Bernstein's C implementation is faster than the SoE implementation against which he tested it for that particular range of test numbers, the SoE implementation against which he tested was not the most optimum and more highly optimized versions of straight SoE are still faster. This appears to be the case here, although I admit that there may be further optimizations that I have missed.

Since O'Neill in her paper on the SoE using incremental unbounded Sieves set out primarily to show that the Turner Sieve is not SoE both as to algorithm and as to performance, she did not consider many other variations of the SoE such as SoA. Doing a quick search of literature, I can find no application of SoA to the unbounded incremental sequences we are discussing here, so adapted it myself as in the following code.

Just as the pure SoE unbounded case can be considered to have as composite numbers an unbounded sequence of sequences of primes multiples, the SoA considers to have as potential primes the unbounded sequence of the unbounded sequences of all of the expressions of the quadratic equations with one of the two free variables, 'x' or 'y' fixed to a starting value and with a separate "elimination" sequence of the sequences of all of the multiples of the base primes, which last is very similar to the composite elimination sequences of sequences for SoE except that the sequences advance more quickly by the square of the primes rather than by a (lesser) multiple of the primes. I have tried to reduce the number of quadratic equations sequences expressed in recognizing that for the purposes of an incremental sieve, the "3*x^2 + y^2" and the "3*x^2 - y^2" sequences are really the same thing except for the sign of the second term and eliminating all solutions which are not odd, as well applying 2357 wheel factorization (although the SoA already has inherent 235 wheel factorization). It uses the efficient tree folding merging/combining algorithm as in SoE tree merging to deal with each sequence of sequences but with a simplification that the union operator does not combine in merging as the SoA algorithm depends on being able to toggle prime state based on the number of found quadratic solutions for a particular value. The code is slower than tree merging SoE due to about three times the number of overhead operations dealing with about three times the number of somewhat more complex sequences, but there is likely a range of sieving of very large numbers where it will pass SoE due to its theoretical performance advantage.

The following code is true to the formulation of the SoA, uses CoInductive Stream types rather than LazyList's or sequences as memoization is not required and the performance is better, also does not use Discriminated Unions and avoids pattern matching for performance reasons:

#nowarn "40"
type cndstate = class val c:uint32 val wi:byte val md12:byte new(cnd,cndwi,mod12) = { c=cnd;wi=cndwi;md12=mod12 } end
type prmsCIS = class val p:uint32 val cont:unit->prmsCIS new(prm,nxtprmf) = { p=prm;cont=nxtprmf } end
type stateCIS<'b> = class val v:uint32 val a:'b val cont:unit->stateCIS<'b> new(curr,aux,cont)= { v=curr;a=aux;cont=cont } end
type allstateCIS<'b> = class val ss:stateCIS<'b> val cont:unit->allstateCIS<'b> new(sbstrm,cont) = { ss=sbstrm;cont=cont } end

let primesTFWSA() =
  let WHLPTRN = [| 2uy;4uy;2uy;4uy;6uy;2uy;6uy;4uy;2uy;4uy;6uy;6uy;2uy;6uy;4uy;2uy;6uy;4uy;6uy;8uy;4uy;2uy;4uy;2uy;
                   4uy;8uy;6uy;4uy;6uy;2uy;4uy;6uy;2uy;6uy;6uy;4uy;2uy;4uy;6uy;2uy;6uy;4uy;2uy;4uy;2uy;10uy;2uy;10uy |]
  let rec prmsqrs v sqr = stateCIS(v,sqr,fun() -> let n=v+sqr+sqr in let n=if n<v then 0xFFFFFFFFu else n in prmsqrs n sqr)
  let rec allsqrs (prms:prmsCIS) = let s = prms.p*prms.p in allstateCIS(prmsqrs s s,fun() -> allsqrs (prms.cont()))
  let rec qdrtc v y = stateCIS(v,y,fun() -> let a=(y+1)<<<2 in let a=if a<=0 then (if a<0 then -a else 2) else a
                                            let vn=v+uint32 a in let vn=if vn<v then 0xFFFFFFFFu else vn in qdrtc vn (y+2))
  let rec allqdrtcsX4 x = allstateCIS(qdrtc (((x*x)<<<2)+1u) 1,fun()->allqdrtcsX4 (x+1u))
  let rec allqdrtcsX3 x = allstateCIS(qdrtc (((x*(x+1u))<<<1)-1u) (1 - int x),fun() -> allqdrtcsX3 (x+1u))
  let rec joinT3 (ass:allstateCIS<'b>) = stateCIS<'b>(ass.ss.v,ass.ss.a,fun()->
    let rec (^) (xs:stateCIS<'b>) (ys:stateCIS<'b>) = //union op for CoInductiveStreams
      match compare xs.v ys.v with
        | 1 -> stateCIS(ys.v,ys.a,fun() -> xs ^ ys.cont())
        | _ -> stateCIS(xs.v,xs.a,fun() -> xs.cont() ^ ys) //<= then keep all the values without combining
    let rec pairs (ass:allstateCIS<'b>) =
      let ys = ass.cont
      allstateCIS(stateCIS(ass.ss.v,ass.ss.a,fun()->ass.ss.cont()^ys().ss),fun()->pairs (ys().cont()))
    let ys = ass.cont() in let zs = ys.cont() in (ass.ss.cont()^(ys.ss^zs.ss))^joinT3 (pairs (zs.cont())))
  let rec mkprm (cs:cndstate) (sqrs:stateCIS<_>) (qX4:stateCIS<_>) (qX3:stateCIS<_>) tgl =
    let inline advcnd (cs:cndstate) =
      let inline whladv i = if i < 47uy then i + 1uy else 0uy
      let inline modadv m a = let md = m + a in if md >= 12uy then md - 12uy else md
      let a = WHLPTRN.[int cs.wi] in let nc = cs.c+uint32 a
      if nc<cs.c then failwith "Tried to enumerate primes past the numeric range!!!"
      else cndstate(nc,whladv cs.wi,modadv cs.md12 a)
    if cs.c>=sqrs.v then mkprm (if cs.c=sqrs.v then advcnd cs else cs) (sqrs.cont()) qX4 qX3 false //squarefree function
    elif cs.c>qX4.v then mkprm cs sqrs (qX4.cont()) qX3 false
    elif cs.c>qX3.v then mkprm cs sqrs qX4 (qX3.cont()) false
    else match cs.md12 with
            | 7uy -> if cs.c=qX3.v then mkprm cs sqrs qX4 (qX3.cont()) (if qX3.a>0 then not tgl else tgl) //only for a's are positive
                     elif tgl then prmsCIS(cs.c,fun() -> mkprm (advcnd cs) sqrs qX4 qX3 false)
                     else mkprm (advcnd cs) sqrs qX4 qX3 false
            | 11uy -> if cs.c=qX3.v then mkprm cs sqrs qX4 (qX3.cont()) (if qX3.a<0 then not tgl else tgl) //only for a's are negatve
                      elif tgl then prmsCIS(cs.c,fun() -> mkprm (advcnd cs) sqrs qX4 qX3 false)
                      else mkprm (advcnd cs) sqrs qX4 qX3 false
            | _ -> if cs.c=qX4.v then mkprm cs sqrs (qX4.cont()) qX3 (not tgl) //always must be 1uy or 5uy
                   elif tgl then prmsCIS(cs.c,fun() -> mkprm (advcnd cs) sqrs qX4 qX3 false)
                   else mkprm (advcnd cs) sqrs qX4 qX3 false
  let qX4s = joinT3 (allqdrtcsX4 1u) in let qX3s = joinT3 (allqdrtcsX3 1u)
  let rec baseprimes = prmsCIS(11u,fun() -> mkprm (cndstate(13u,1uy,1uy)) initsqrs qX4s qX3s false)
  and initsqrs = joinT3 (allsqrs baseprimes)
  let genseq ps = Seq.unfold (fun (psd:prmsCIS) -> Some(psd.p,psd.cont())) ps
  seq { yield 2u; yield 3u; yield 5u; yield 7u;
        yield! mkprm (cndstate(11u,0uy,11uy)) initsqrs qX4s qX3s false |> genseq }

As stated, the code is slower than the Tree Folding Wheel Optimized SoE as posted in another answer at about a half second for the first 100,000 primes, and has roughly the same empirical O(n^1.2) for primes found performance as the best of other pure functional solutions. Some further optimizations that could be tried are that the primes square sequences do not use wheel factorization to eliminate the 357 multiples of the squares or even use only the prime multiples of the prime squares to reduce the number of values in the squares sequence streams and perhaps other optimizations related to the quadratic equation expression sequence streams.

EDIT_ADD: I have take a little time to look into the SoA modulo optimizations and see that in addition to the above "squarefree" optimizations, which probably won't make much difference, that the quadratic sequences have a modulo pattern over each 15 elements that would allow many of the passed toggled composite test values to be pre-screened and would eliminate the need for the specific modulo 12 operations for every composite number. All of these optimizations are likely to result in a reduction in computational work submitted to the tree folding of up to about 50% to make a slightly more optimized version of the SoA run at close to or slightly better than the best tree fold merging SoE. I don't know when I might find the time to do these few days of investigation to determine the result. END_EDIT_ADD

EDIT_ADD2: In working on the above optimizations which will indeed increase performance by about a factor of two, I see why the current empirical performance with increasing n is not as good as SoE: while the SoE is particularly suited to the tree folding operations in that the first sequences are more dense and repeat more often with later sequences much less dense, the SoA "4X" sequences are denser for later sequences when they are added and while the "3X" sequences start out less dense, they become more dense as y approaches zero, then get less dense again; this means that the call/return sequences aren't kept to a minimum depth as for SoE but that this depth increases beyond being proportional to number range. Solutions using folding aren't pretty as one can implement left folding for the sequences that increase in density with time, but that still leaves the negative portions of the "3X" sequences poorly optimized, as does breaking the "3X" sequences into positive and negative portions. The easiest solution is likely to save all sequences to a Map meaning that access time will increase by something like the log of the square root of the range, but that will be better for larger number range than the current tree folding. END_EDIT_ADD2

Although slower, I present this solution here to show how code can be evolved to express ideas originally conceived imperatively to pure functional code in F#. It provides examples of use of continuations as in CoInductive Streams to implement laziness without using the Lazy type, implementing (tail) recursive loops to avoid any requirement for mutability, threading an accumulator (tgl) through recursive calls to obtain a result (the number of times the quadratic equations "struck" the tested number), presenting the solutions to equations as (lazy) sequences (or streams in this case), etcetera.

For those who would like to play further with this code even without a Windows based development system, I have also posted it to tryfsharp.org and Ideone.com, although it runs slower on both those platforms, with tryfsharp both proportional to the speed of the local client machine and slower due to running under Silverlight, and Ideone running on the Linux server CPU under Mono-project 2.0, which is notoriously slow in both implementation and in particular as to garbage collections.

Varicella answered 1/8, 2013 at 17:30 Comment(1)
@WillNess, although more complex than SoE (as all numerical analysis prime sieving algorithms are), it has the advantage that other than the minor amount of work in the square free eliminations, the SoA toggles potential prime numbers directly with less and less cases of toggling composite numbers with increasing range. Unfortunately, as noted in a recent addition to the above, tree folding doesn't really work well with the SoA and I may have to revert to using a Map for the saved sequences. SoA is quite suited to an incremental implementation as sequences strike less and less often with n.Varicella
C
1

I'm not very familiar with Haskell multimaps, but the F# Power Pack has a HashMultiMap class, whose xmldoc summary is: "Hash tables, by default based on F# structural "hash" and (=) functions. The table may map a single key to multiple bindings." Perhaps this might help you?

Cornela answered 7/1, 2011 at 20:44 Comment(3)
If I'm reading the source right, that class seems to be using a .net library Dictionary<_,_> under the hood, which unfortunately isn't immutable.Eight
I haven't peered that closely at the source... I wonder if a completely pure implementation would be horribly inefficient on the .NET runtime?Cornela
Pure implementations are horribly inefficient anyway.Prothalamium
V
1

I have already submitted an answer that is "Almost Functional" and didn't want to confuse it by further additions/refinements so am submitting this answer which includes maximal wheel factorization an multi-threading here - it seems to me that buying a computer with multi-threads (even smartphones are multi-core) and running single threaded is like buying a car with a multi-cylinder engine and racing it firing on only one.

Again, the following code is mostly functional except for the mutation of the culling buffer's contents and the optimizations to do with enumeration, if used, which always require the idea of current state (although these details are hidden by some slower ways of doing this such as by using F#'s built-in seq's - but they are slow); the code as follows:

/// F# version of the Wheel Factorized  Sieve of Eratosthenes...

/// This is a "combo" sieve where
///   it is fully wheel factorized by the primes of 2, 3, 5, and 7; then
///   pre-sieved by the pattern of the 11, 13, 17, and 19 primes...

/// This version is almost fully functional with no mutation used except for
/// the contents of the sieve buffer arrays on composite number culling/sieving.

module SoE =

  type private Prime = uint64 // JavaScript doesn't have anything bigger!
  type private PrimeNdx = int64
  type private BasePrimeRep = uint32

  let inline public prime n = uint64 n // match these convenience conversions
  let inline private primendx n = int64 n // with the types above!
  let inline private bprep n = uint32 n // with the types above!

  let private cPGSZBTS = (1 <<< 14) * 8 // sieve buffer size in bits = CPUL1CACHE - THIS SHOULD REALLY BE AN ARGUMENT!!!!

  type private BasePrimeRepArr = BasePrimeRep[]
  type private SieveBuffer = uint8[][] // multiple levels by residue index, by segment, by byte

  /// a Co-Inductive Stream (CIS) of an "infinite" non-memoized series...
  type private CIS<'T> = CIS of 'T * (unit -> CIS<'T>) //' apostrophe formatting adjustment

  /// lazy list (memoized) series of base prime page arrays...
  type private BasePrime = uint32
  type private BasePrimeRepArrs =
    BasePrimeRepArrs of BasePrimeRepArr * Option<Lazy<BasePrimeRepArrs>>

// constants and Look Up Tables to do with culling start address calculation...
  let private FRSTSVPRM = prime 23 // past the precull primes!
  let private WHLNDXCNST = primendx (FRSTSVPRM * (FRSTSVPRM - prime 1) >>> 1)
  let private WHLPRMS = [| prime 2; prime 3; prime 5; prime 7;
                           prime 11; prime 13; prime 17; prime 19 |]
  let private WHLHITS = 48 // (3 - 1) * (5 - 1) * (7 - 1)!
  let private WHLODDCRC = 105 // 3 * 5 * 7; no evens!
  let private WHLPTRNLEN = 11 * 13 * 17 * 19 // repeating pattern of pre-cull primes
  let private NUMPCULLPRMS = 4
  let private PCULLPRMREPS: BasePrimeRepArrs =
    BasePrimeRepArrs( [| uint32 (-1 <<< 6) + 44u; uint32 (-1 <<< 6) + 45u;
                         uint32 (-1 <<< 6) + 46u; uint32 (-1 <<< 6) + 47u |], None)
  // number of primes to a million minus number wheel prims; go sieving to 10^12
  let private NUMSTRTSBASEPRMS = 78498 + WHLPRMS.Length + 1 // +1 for end 0xFFFFFFFFu
  let private NUMSTRTSPRMS = (6542 - WHLPRMS.Length + 1) // enough for  65536 squared
  let private RESIDUES = [|
    23; 29; 31; 37; 41; 43; 47; 53; 59; 61; 67; 71;
    73; 79; 83; 89; 97; 101; 103; 107; 109; 113; 121; 127;
    131; 137; 139; 143; 149; 151; 157; 163; 167; 169; 173; 179;
    181; 187; 191; 193; 197; 199; 209; 211; 221; 223; 227; 229; 233 |]
  let private WHLNDXS = [|
    0; 0; 0; 1; 2; 2; 2; 3; 3; 4; 5; 5; 6; 6; 6;
    7; 7; 7; 8; 9; 9; 9; 10; 10; 11; 12; 12; 12; 13; 13;
    14; 14; 14; 15; 15; 15; 15; 16; 16; 17; 18; 18; 19; 20; 20;
    21; 21; 21; 21; 22; 22; 22; 23; 23; 24; 24; 24; 25; 26; 26;
    27; 27; 27; 28; 29; 29; 29; 30; 30; 30; 31; 31; 32; 33; 33;
    34; 34; 34; 35; 36; 36; 36; 37; 37; 38; 39; 39; 40; 41; 41;
    41; 41; 41; 42; 43; 43; 43; 43; 43; 44; 45; 45; 46; 47; 47; 48 |]
  let private WHLRNDUPS = [| // two rounds to avoid overflow; used in start address calcs...
      0; 3; 3; 3; 4; 7; 7; 7; 9; 9; 10; 12; 12; 15; 15;
      15; 18; 18; 18; 19; 22; 22; 22; 24; 24; 25; 28; 28; 28; 30;
      30; 33; 33; 33; 37; 37; 37; 37; 39; 39; 40; 42; 42; 43; 45;
      45; 49; 49; 49; 49; 52; 52; 52; 54; 54; 57; 57; 57; 58; 60;
      60; 63; 63; 63; 64; 67; 67; 67; 70; 70; 70; 72; 72; 73; 75;
      75; 78; 78; 78; 79; 82; 82; 82; 84; 84; 85; 87; 87; 88; 93;
      93; 93; 93; 93; 94; 99; 99; 99; 99; 99; 100; 102; 102; 103; 105;
      105; 108; 108; 108; 109; 112; 112; 112; 114; 114; 115; 117; 117; 120; 120;
      120; 123; 123; 123; 124; 127; 127; 127; 129; 129; 130; 133; 133; 133; 135;
      135; 138; 138; 138; 142; 142; 142; 142; 144; 144; 145; 147; 147; 148; 150;
      150; 154; 154; 154; 154; 157; 157; 157; 159; 159; 162; 162; 162; 163; 165;
      165; 168; 168; 168; 169; 172; 172; 172; 175; 175; 175; 177; 177; 178; 180;
      180; 183; 183; 183; 184; 187; 187; 187; 189; 189; 190; 192; 192; 193; 198;
      198; 198; 198; 198; 199; 204; 204; 204; 204; 204; 205; 207; 207; 208; 210; 210 |]
  /// LUT of relative cull start points given the residual bit plane index (outer index),
  /// and the combination of the base prime residual index and the bit plane index of
  /// the first cull position for the page (multiply combined for the inner index), giving
  /// a 16-bit value which contains the multipier (the upper byte) and the extra
  /// cull index offset (the lower byte) used to multiply by the base prime wheel index
  /// and add the offset with the result added with the start wheel index to obtain
  /// the residual bit plane start wheel index...
  /// for "PG11", these arrays get huge (quarter meg elements with elements of 4 bytes for
  /// a megabyte size), which will partially (or entirely) cancell out the benefit for
  /// smaller prime ranges; may help for the huge prime ranges...
  let private WHLSTRTS: uint16[][] =
    let arr = Array.init WHLHITS <| fun _ -> Array.zeroCreate (WHLHITS * WHLHITS)
    for pi = 0 to WHLHITS - 1 do
      let mltsarr = Array.zeroCreate WHLHITS
      let p = RESIDUES.[pi] in let s = (p * p - int FRSTSVPRM) >>> 1 
      // build array of relative mults and offsets to `s`...
      { 0 .. WHLHITS - 1 } |> Seq.iter (fun ci ->
        let rmlt0 = (RESIDUES.[(pi + ci) % WHLHITS] - RESIDUES.[pi]) >>> 1
        let rmlt = rmlt0 + if rmlt0 < 0 then WHLODDCRC else 0 in let sn = s + p * rmlt
        let snd = sn / WHLODDCRC in let snm = sn - snd * WHLODDCRC
        mltsarr.[WHLNDXS.[snm]] <- rmlt) // new rmlts 0..209!
      let ondx = pi * WHLHITS
      { 0 .. WHLHITS - 1 } |> Seq.iter (fun si ->
        let s0 = (RESIDUES.[si] - int FRSTSVPRM) >>> 1 in let sm0 = mltsarr.[si]
        { 0 .. WHLHITS - 1 } |> Seq.iter (fun ci ->
          let smr = mltsarr.[ci]
          let rmlt = if smr < sm0 then smr + WHLODDCRC - sm0 else smr - sm0
          let sn = s0 + p * rmlt in let rofs = sn / WHLODDCRC
          // we take the multiplier times 2 so it multiplies by the odd wheel index...
          arr.[ci].[ondx + si] <- (rmlt <<< 9) ||| rofs |> uint16))
    arr

  let private makeSieveBuffer btsz: SieveBuffer =
    let sz = ((btsz + 31) >>> 5) <<< 2 // rounded up to nearest 32 bit boundary
    { 1 .. WHLHITS } |> Seq.map (fun _ -> Array.zeroCreate sz) |> Array.ofSeq

  // a dedicated BITMSK LUT may be faster than bit twiddling...
  let private BITMSK = [| 1uy; 2uy; 4uy; 8uy; 16uy; 32uy; 64uy; 128uy |]

  /// all the heavy lifting work is done here...
  let private cullSieveBuffer (lwi: PrimeNdx) (bpras: BasePrimeRepArrs)
                              (strtsa: uint32[]) (sb: SieveBuffer) =
    let sz = sb.[0].Length in let szbits = sz <<< 3 in let bplmt = sz >>> 4
    let lowndx = lwi * primendx WHLODDCRC
    let nxti = (lwi + primendx szbits) * primendx WHLODDCRC
    // set up strtsa for use by each modulo residue bit plane...
    let rec looppi ((BasePrimeRepArrs(bpra, bprastl)) as obpras) pi j =
      if pi < bpra.Length then
        let ndxdprm = bpra.[pi] in let rsd = RESIDUES.[int ndxdprm &&& 0x3F]
        let bp = (int ndxdprm >>> 6) * (WHLODDCRC <<< 1) + rsd
        let i = (bp - int FRSTSVPRM) >>> 1 |> primendx
        let s = (i + i) * (i + primendx FRSTSVPRM) + WHLNDXCNST
        if s >= nxti then strtsa.[j] <- 0xFFFFFFFFu else // enough base primes!
          let si = if s >= lowndx then int (s - lowndx) else
                    let wp = (rsd - int FRSTSVPRM) >>> 1
                    let r = (lowndx - s) %
                              (primendx bp * primendx WHLODDCRC) |> int
                    if r = 0 then 0 else
                    bp * (WHLRNDUPS.[wp + (int r + bp - 1) / bp] - wp) - r
          let sd = si / WHLODDCRC in let sn = WHLNDXS.[si - sd * WHLODDCRC]
          strtsa.[j] <- (uint32 sn <<< 26) ||| uint32 sd
          looppi obpras (pi + 1) (j + 1)
      else match bprastl with
           | None -> ()
           | Some lv -> looppi lv.Value 0 j       
    looppi bpras 0 0
    // do the culling based on the preparation...
    let rec loopri ri =
      if ri < WHLHITS then
        let pln = sb.[ri] in let plnstrts = WHLSTRTS.[ri]
        let rec looppi (BasePrimeRepArrs(bpra, bprastl) as obpras) pi =
          if pi < bpra.Length then
            let prmstrt = strtsa.[pi]
            if prmstrt < 0xFFFFFFFFu then
              let ndxdprm = bpra.[pi]
              let pd = int ndxdprm >>> 6 in let prmndx = int ndxdprm &&& 0x3F
              let bp = pd * (WHLODDCRC <<< 1) + RESIDUES.[prmndx]
              let sd = int prmstrt &&& 0x3FFFFFF in let sn = int (prmstrt >>> 26)
              let adji = prmndx * WHLHITS + sn in let adj = plnstrts.[adji]
              let s0 = sd + int (adj >>> 8) * pd + (int adj &&& 0xFF)
              if bp < bplmt then
                let slmt = min szbits (s0 + (bp <<< 3))
                let rec loops s8 =
                  if s8 < slmt then
                    let msk = BITMSK.[s8 &&& 7]
                    let rec loopc c =
                      if c < pln.Length then pln.[c] <- pln.[c] ||| msk; loopc (c + bp)
                    loopc (s8 >>> 3); loops (s8 + bp) in loops s0
              else
                let rec loopsi si =
                  if si < szbits then
                    let w = si >>> 3 in pln.[w] <- pln.[w] ||| BITMSK.[si &&& 7]
                    loopsi (si + bp) in loopsi s0
              looppi obpras (pi + 1)
          else match bprastl with
               | None -> ()
               | Some lv -> looppi lv.Value 0        
        looppi bpras 0; loopri (ri + 1) in loopri 0

  /// pre-culled wheel pattern with a 131072 extra size to avoid overflow...
  /// (copy by 16 Kilobytes per time!)
  let private WHLPTRN: SieveBuffer = // rounded up to next 32-bit alignmenet!
    let sb = makeSieveBuffer ((WHLPTRNLEN <<< 3) + 131072 + 31)
    let strtsa = Array.zeroCreate NUMPCULLPRMS
    cullSieveBuffer (primendx 0) PCULLPRMREPS strtsa sb; sb

  /// fill the SieveBuffer from the WHLPTRN according to the modulo of the low wheel index...
  let private fillSieveBuffer (lwi: PrimeNdx) (sb: SieveBuffer) =
    let len = sb.[0].Length in let cpysz = min len 16384 in let mdlo0 = lwi / (primendx 8)
    { 0 .. WHLHITS - 1 } |> Seq.iter (fun i ->
      { 0 .. 16384 .. len - 1 } |> Seq.iter (fun j ->
        let mdlo = (mdlo0 + primendx j) % (primendx WHLPTRNLEN) |> int
        Array.blit WHLPTRN.[i] mdlo sb.[i] j cpysz))

  /// fast value set bit count Look Up Table (CLUT) for 16-bit input...
  let private CLUT: uint8[] =
    let arr = Array.zeroCreate 65536
    let rec cntem i cnt = if i <= 0 then cnt else cntem (i &&& (i - 1)) (cnt + 1)
    for i = 0 to 65535 do arr.[i] <- cntem i 0 |> uint8
    arr

  /// count the zero (prime) bits in the SieveBuffer up to the "lsti" odd index...
  let private countSieveBuffer (bitlmt: int) (sb: SieveBuffer): int =
    let lstwi = bitlmt / WHLODDCRC
    let lstri = WHLNDXS.[bitlmt - lstwi * WHLODDCRC]
    let lst = (lstwi >>> 5) <<< 2 in let lstm = lstwi &&& 31
    let rec loopr ri cr =
      if ri >= WHLHITS then cr else
      let pln = sb.[ri]
      let rec cntem i cnt =
        if i >= lst then
          let msk = (0xFFFFFFFFu <<< lstm) <<< if ri <= lstri then 1 else 0
          let v = (uint32 pln.[lst] + (uint32 pln.[lst + 1] <<< 8) +
                   (uint32 pln.[lst + 2] <<< 16) + (uint32 pln.[lst + 3] <<< 24)) ||| msk
          cnt - int CLUT.[int v &&& 0xFFFF] - int CLUT.[int (v >>> 16)] else
        let v = uint32 pln.[i] + (uint32 pln.[i + 1] <<< 8) +
                (uint32 pln.[i + 2] <<< 16) + (uint32 pln.[i + 3] <<< 24)
        cntem (i + 4) (cnt - int CLUT.[int v &&& 0xFFFF] - int CLUT.[int (v >>> 16)])
      let cnti = cntem 0 cr in loopr (ri + 1) cnti
    loopr 0 ((lst * 8 + 32) * WHLHITS)

  /// it's rediculously easy to make this multi-threaded with the following change...
// (*
  /// a CIS series of pages from the given start index with the given SieveBuffer size,
  /// and provided with a polymorphic converter function to produce
  /// and type of result from the culled page parameters...
  let cNUMPROCS = System.Environment.ProcessorCount
  let rec private makePrimePages strtwi btsz strtsasz
                                 (cnvrtrf: PrimeNdx -> SieveBuffer -> 'T): CIS<'T> =
    let bpas = makeBasePrimes() in let tsks = Array.zeroCreate cNUMPROCS
    let sbs = Array.init cNUMPROCS (fun _ -> Array.zeroCreate (btsz >>> 3))
    let mktsk lwi i = System.Threading.Tasks.Task.Run(fun() ->
      let sb = makeSieveBuffer btsz in let strtsa = Array.zeroCreate strtsasz
      fillSieveBuffer lwi sb; cullSieveBuffer lwi bpas strtsa sb
      cnvrtrf lwi sb)
    let rec jobfeed lwi i =
      CIS(lwi, fun() -> let ni = i + 1
                        jobfeed (lwi + primendx btsz)
                                (if ni >= cNUMPROCS then 0 else ni))
    let rec strttsks (CIS(lwi, jbfdtlf) as jbfd) i =
      if i >= cNUMPROCS then jbfd else
      tsks.[i] <- mktsk lwi i; strttsks (jbfdtlf()) (i + 1)
    let rec mkpgrslt (CIS(lwi, jbfdtlf)) i =
      let rslt = tsks.[i].Result in tsks.[i] <- mktsk lwi i
      CIS(rslt,
          fun() -> mkpgrslt (jbfdtlf())
                            (if i >= cNUMPROCS - 1 then 0 else i + 1))
    mkpgrslt <| strttsks (jobfeed strtwi 0) 0 <| 0
// *)

  // the below is single threaded...
(*
  /// a CIS series of pages from the given start index with the given SieveBuffer size,
  /// and provided with a polymorphic converter function to produce
  /// and type of result from the culled page parameters...
  let rec private makePrimePages strtwi btsz strtsasz
                                 (cnvrtrf: PrimeNdx -> SieveBuffer -> 'T): CIS<'T> =
    let bpas = makeBasePrimes() in let sb = makeSieveBuffer btsz
    let strtsa = Array.zeroCreate strtsasz
    let rec nxtpg lwi =
      fillSieveBuffer lwi sb; cullSieveBuffer lwi bpas strtsa sb
      CIS(cnvrtrf lwi sb, fun() -> nxtpg (lwi + primendx btsz))
    nxtpg strtwi
// *)

  /// secondary feed of lazy list of memoized pages of base primes...
  and private makeBasePrimes(): BasePrimeRepArrs =
    let sb2bpa lwi (sb: SieveBuffer) =
      let btsz = sb.[0].Length <<< 3
      let arr =
        Array.zeroCreate <| countSieveBuffer ((btsz * WHLODDCRC) - 1) sb
      let rec loop ri i j =
        if i < btsz then
          if ri < WHLHITS then
            if sb.[ri].[i >>> 3] &&& BITMSK.[i &&& 7] <> 0uy then
              loop (ri + 1) i j
            else arr.[j] <- ((bprep lwi + bprep i) <<< 6) ||| bprep ri
                 loop (ri + 1) i (j + 1)
          else loop 0 (i + 1) j in loop 0 0 0; arr
    // finding the first page as not part of the loop and making succeeding
    // pages lazy breaks the recursive data race!
    let fksb = makeSieveBuffer 64 in fillSieveBuffer (primendx 0) fksb
    let fkbpra = sb2bpa (primendx 0) fksb
    let fkbpas = BasePrimeRepArrs(fkbpra, None)
    let strtsa = Array.zeroCreate (fkbpra.Length + 1)
    let frstsb = makeSieveBuffer 512 in fillSieveBuffer (primendx 0) frstsb
    cullSieveBuffer (primendx 0) fkbpas strtsa frstsb
    let rec nxtbpas (CIS(bpa, tlf)) =
      BasePrimeRepArrs(bpa, Some(lazy (nxtbpas (tlf()))))
    let restbpras =
      Some(lazy (nxtbpas <|
                   makePrimePages (primendx 512) 512 NUMSTRTSPRMS sb2bpa))
    let frstbpa = sb2bpa (primendx 0) frstsb
    BasePrimeRepArrs(frstbpa, restbpras)                    

  /// produces a generator of primes; uses mutability for better speed...
  let primes(): unit -> Prime =
    let sb2prmsarr lwi (sb: SieveBuffer) =
      let btsz = sb.[0].Length <<< 3
      let arr = Array.zeroCreate <| countSieveBuffer (btsz * WHLODDCRC - 1) sb
      let baseprm = prime (lwi + lwi) * prime WHLODDCRC
      let inline notprm ri i = sb.[ri].[i >>> 3] &&& BITMSK.[i &&& 7] <> 0uy
      let rec loop ri i j =
        if ri >= WHLHITS then loop 0 (i + 1) j else
        if i < btsz then
          if notprm ri i then loop (ri + 1) i j
          else arr.[j] <- baseprm + prime (i + i) * prime WHLODDCRC
                            + prime RESIDUES.[ri]
               loop (ri + 1) i (j + 1) in loop 0 0 0
      arr
    let mutable i = -WHLPRMS.Length
    let (CIS(nprms, npgtlf)) = // use page generator function above!
      makePrimePages (primendx 0) cPGSZBTS NUMSTRTSPRMS sb2prmsarr
    let mutable prmarr = nprms in let mutable pgtlf = npgtlf
    fun() -> 
      if i >= 0 && i < prmarr.Length then
        let oi = i in i <- i + 1; prmarr.[oi] else // ready next call!      
        if i < 0 then i <- i + 1; WHLPRMS.[7 + i] else
        let (CIS(nprms, npgtlf)) = pgtlf() // use page generator function above!
        i <- 1; prmarr <- nprms; pgtlf <- npgtlf; prmarr.[0]

  let countPrimesTo (limit: Prime): int64 = // much faster!
    let precnt = WHLPRMS |> Seq.takeWhile ((>=) limit) |> Seq.length |> int64
    if limit < FRSTSVPRM then precnt else
    let topndx = (limit - FRSTSVPRM) >>> 1 |> primendx
    let lmtlwi = topndx / primendx WHLODDCRC
    let sb2cnt lwi (sb: SieveBuffer) =
      let btsz = sb.[0].Length <<< 3 in let lmti = lwi + primendx (btsz - 1)
      countSieveBuffer
        (if lmti < lmtlwi then btsz * WHLODDCRC - 1
         else int (topndx - lwi * primendx WHLODDCRC)) sb |> int64, lmti
    let rec loop (CIS((cnt, nxti), tlf)) count =
      if nxti < lmtlwi then loop (tlf()) (count + cnt)
      else count + cnt
    loop <| makePrimePages (primendx 0) cPGSZBTS NUMSTRTSBASEPRMS sb2cnt <| precnt

open System
open SoE

[<EntryPoint>]
let main argv =
  let limit = prime 2000000000

  let frstprms = primes()
  printf "The first 23 primes are:  "
  for _ in 1 .. 25 do printf "%d " (frstprms())
  printfn ""

  let numprms = primes() in let mutable cnt = 0
  printf "Number of primes up to a million:  "
  while numprms() <= prime 1000000 do cnt <- cnt + 1
  printfn "%d" cnt

  let strt = DateTime.Now.Ticks

(*  // the slow way of enumerating and counting...
  let primegen = primes() in let mutable answr = 0
  while primegen() <= limit do answr <- answr + 1
// *)

  // the fast way of counting...
  let answr = countPrimesTo (prime 2000000000)

  let elpsd = (DateTime.Now.Ticks - strt) / 10000L

  printfn "Found %d primes up to %d in %d milliseconds" answr limit elpsd

  0 // return an integer exit code

And the output as run on an old Intel I3-2100 at 3.1 GHz with two cores/four threads is:

The first 23 primes are:  2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
Number of primes up to a million:  78498
Found 98222287 primes to 2000000000 in 468 milliseconds

for about 5.8 CPU clock cycles per culling operation (half a billion culling operations to this range). It will be proportionately faster given more real (not Hyper Threaded) threads, higher CPU clock rates, and newer CPU generations with improved Instructions Per Clock (IPC).

This is about the Ultimate in speeds for DotNet code up to this range, but for larger ranges over about 17 billion, a further refinement of adjusting the cull buffer size to be proportional to the square root of the maximum number being sieved will help maintain the speed as the range increases up to huge ranges taking days...weeks...months to complete if the entire range is sieved and not just a narrower span of the overall range.

Varicella answered 6/4, 2020 at 10:22 Comment(0)
V
1

As mentioned in another Almost Functional answer to this thread, purely functional algorithms are not really practical for large sieving ranges such as 1e9 and higher. The below answer differs from the previous in that it is not limited to ranges of only about 1.7e10 but could be extended to sieve over the full 64 bit range of about 1.2e19, although the below code is limited by a integer overflow to about 1.76e17. This multi-threaded page-segmented wheel-factorized Almost Functional (only mutating the contents of arrays) answer is about the most complex that can be fit in a single StackOverflow answer

let setupstrt = System.DateTime.Now.Ticks // time the setups before code starts

#if !FABLE
// for use with DotNet, we can use 64 bit values
type private Prime = uint64 // JavaScript doesn't have anything bigger!
type private PrimeNdx = int64

let inline private prime n = uint64 n // match these convenience conversions
let inline private primendx n = int64 n // with the types above!
#else

// for use with Fable producing JavaScript from F#, we must use float as
type private Prime = float // JavaScript doesn't have anything bigger!
type private PrimeNdx = float

let inline private prime n = float n // match these convenience conversions
let inline private primendx n = float n // with the types above!
#endif

/// Just one byte per base prime is used so that the maximum used storage is
/// only about 0.8 Gigabytes for all the primes in the 32 bit number range as
/// might be required for a 64 bit sieving range.  The encoding is partially
/// incremental, where the least significant 6 bits are the modulo index 0 to 47)
/// and the uuper two bits represent the 0 to 3 wheel ofset from the last base
/// prime.  This works because the maximum prime gap over this range is 368 and
/// 3 wheel spands are 3 time 210 is 630 so easily covers it.
type private BasePrimeRep = uint8
let inline private bprep n = uint8 n // with the types above!

/// delta format with first element the first number in the page;
/// second elemsnt is an array of successive prime increments...
type PrimeRepsPg = Prime * uint16[]

/// sieve buffer size in bits = CPUL2CACHE; all CPU's >= this...
let private cPGSZBTS = (1 <<< 18) * 8

type private BasePrimeRepArr = BasePrimeRep[] * int
type private SieveBuffer = uint8[][] // multiple levels by residue index, by segment, by byte

/// a Co-Inductive Stream (CIS) of an "infinite" non-memoized series...
type CIS<'T> = CIS of 'T * (unit -> CIS<'T>) //' apostrophe formatting adjustment

/// lazy list (memoized) series of base prime page arrays...
type private BasePrime = uint32
type private BasePrimeRepArrs =
  BasePrimeRepArrs of BasePrimeRepArr * Option<Lazy<BasePrimeRepArrs>>

/// constants and Look Up Tables to do with culling start address calculation;
/// starting sieving at 11 makes this finding k-tuple primes friendly, as
/// all k-tuples will be at the same wheel index, just different moduloes...
let private cFRSTSVPRM = prime 11 // past the precull primes!
let private cWHLNDXCNST = primendx (cFRSTSVPRM * (cFRSTSVPRM - prime 1) / prime 2)
let private cWHLPRMS = [| prime 2; prime 3; prime 5; prime 7 |]
let private cWHLHITS = 48 // (3 - 1) * (5 - 1) * (7 - 1)!
let private cWHLODDCRC = 105 // 3 * 5 * 7; no evens!
let private cWHLPTRNLEN = 11 * 13 * 17 * 19 // repeating pattern of pre-cull primes
let private cNUMPCULLPRMS = 4
let private cPCULLPRMREPS: BasePrimeRepArrs =
  BasePrimeRepArrs( ([| bprep 0; bprep 1; bprep 2; bprep 3 |], 0), None)
/// number of primes to a million (2^20) minus number wheel prims;
/// enough FOR "bit plane" algorithm to sieve to over 4 * 10^14 (400 * 2^40)...
let private cNUMSTRTSPRMS = 82025 - cWHLPRMS.Length + 1 // +1 for end 0xFFFFFFFFu
let private cNUMSTRTSBASEPRMS = 6542 - cWHLPRMS.Length + 1 // enough for 65536^2
let private cWHLGAPS = [| // prime gaps for one 2/3/5/7 wheel from cFRSTSVPRM
  1; 2; 1; 2; 3; 1; 3; 2; 1; 2; 3; 3; 1; 3; 2; 1;
  3; 2; 3; 4; 2; 1; 2; 1; 2; 4; 3; 2; 3; 1; 2; 3;
  1; 3; 3; 2; 1; 2; 3; 1; 3; 2; 1; 2; 1; 5; 1; 5 |]
let private cWHLRSDUS = [|
  11; 13; 17; 19; 23; 29; 31; 37; 41; 43; 47; 53;
  59; 61; 67; 71; 73; 79; 83; 89; 97; 101; 103; 107;
  109; 113; 121; 127; 131; 137; 139; 143; 149; 151; 157; 163;
  167; 169; 173; 179; 181; 187; 191; 193; 197; 199; 209; 211; 221 |]
let private cWHLNDXS = [| // LUT: whl position -> index; rounded down
  0; 1; 1; 2; 3; 3; 4; 4; 4; 5; 6; 6; 6; 7; 7;
  8; 9; 9; 10; 10; 10; 11; 11; 11; 12; 13; 13; 13; 14; 14;
  15; 16; 16; 16; 17; 17; 18; 18; 18; 19; 19; 19; 19; 20; 20;
  21; 22; 22; 23; 24; 24; 25; 25; 25; 25; 26; 26; 26; 27; 27;
  28; 28; 28; 29; 30; 30; 31; 31; 31; 32; 33; 33; 33; 34; 34;
  34; 35; 35; 36; 37; 37; 38; 38; 38; 39; 40; 40; 40; 41; 41;
  42; 43; 43; 44; 45; 45; 45; 45; 45; 46; 47; 47; 47; 47; 47; 48 |]
// LUT: wheel position -> wheel position; rounded up;
// used in start address calcs...
let private cWHLRNDUPS = [| // two rounds to avoid overflow
    0; 1; 3; 3; 4; 6; 6; 9; 9; 9; 10; 13; 13; 13; 15;
    15; 16; 18; 18; 21; 21; 21; 24; 24; 24; 25; 28; 28; 28; 30;
    30; 31; 34; 34; 34; 36; 36; 39; 39; 39; 43; 43; 43; 43; 45;
    45; 46; 48; 48; 49; 51; 51; 55; 55; 55; 55; 58; 58; 58; 60;
    60; 63; 63; 63; 64; 66; 66; 69; 69; 69; 70; 73; 73; 73; 76;
    76; 76; 78; 78; 79; 81; 81; 84; 84; 84; 85; 88; 88; 88; 90;
    90; 91; 93; 93; 94; 99; 99; 99; 99; 99; 100; 105; 105; 105; 105; 
    105; 106; 108; 108; 109; 111; 111; 114; 114; 114; 115; 118; 118; 118; 120;
    120; 121; 123; 123; 126; 126; 126; 129; 129; 129; 130; 133; 133; 133; 135;
    135; 136; 139; 139; 139; 141; 141; 144; 144; 144; 148; 148; 148; 148; 150;
    150; 151; 153; 153; 154; 156; 156; 160; 160; 160; 160; 163; 163; 163; 165;
    165; 168; 168; 168; 169; 171; 171; 174; 174; 174; 175; 178; 178; 178; 181;
    181; 181; 183; 183; 184; 186; 186; 189; 189; 189; 190; 193; 193; 193; 195;
    195; 196; 198; 198; 199; 204; 204; 204; 204; 204; 205; 210; 210; 210; 210; 210 |]
/// LUT of relative cull start points given the residual bit plane index (outer index),
/// and the combination of the base prime residual index and the bit plane index of
/// the first cull position for the page (multiply combined for the inner index), giving
/// a 16-bit value which contains the multipier (the upper byte) and the extra
/// cull index offset (the lower byte) used to multiply by the base prime wheel index
/// and add the offset with the result added with the start wheel index to obtain
/// the residual bit plane start wheel index...
/// for "PG11", these arrays get huge (quarter meg elements with elements of 4 bytes for
/// a megabyte size), which will partially (or entirely) cancell out the benefit for
/// smaller prime ranges; may help for the huge prime ranges...
let private cWHLSTRTS: uint16[][] =
  let arr = Array.init cWHLHITS <| fun _ -> Array.zeroCreate (cWHLHITS * cWHLHITS)
  for pi = 0 to cWHLHITS - 1 do
    let mltsarr = Array.zeroCreate cWHLHITS
    let p = cWHLRSDUS.[pi] in let s = (p * p - int cFRSTSVPRM) >>> 1 
    // build array of relative mults and offsets to `s`...
    { 0 .. cWHLHITS - 1 } |> Seq.iter (fun ci ->
      let rmlt0 = (cWHLRSDUS.[(pi + ci) % cWHLHITS] - cWHLRSDUS.[pi]) >>> 1
      let rmlt = rmlt0 + if rmlt0 < 0 then cWHLODDCRC else 0 in let sn = s + p * rmlt
      let snd = sn / cWHLODDCRC in let snm = sn - snd * cWHLODDCRC
      mltsarr.[cWHLNDXS.[snm]] <- rmlt) // new rmlts 0..209!
    let ondx = pi * cWHLHITS
    { 0 .. cWHLHITS - 1 } |> Seq.iter (fun si ->
      let s0 = (cWHLRSDUS.[si] - int cFRSTSVPRM) >>> 1 in let sm0 = mltsarr.[si]
      { 0 .. cWHLHITS - 1 } |> Seq.iter (fun ci ->
        let smr = mltsarr.[ci]
        let rmlt = if smr < sm0 then smr + cWHLODDCRC - sm0 else smr - sm0
        let sn = s0 + p * rmlt in let rofs = sn / cWHLODDCRC
        // we take the multiplier times 2 so it multiplies by the odd wheel index...
        arr.[ci].[ondx + si] <- (rmlt <<< 9) ||| rofs |> uint16))
  arr
/// a Wheel State Look Up Table (LUT) for culling of large base primes,
/// whose smallest span exceeds the range of the largest used bit plane which is
/// about 20 million (20 * 2^20) bits.  This use of a LUT is much slower than the
/// highly optimized culling over single bit planes used for small primes,
/// but only a small fraction of the actual culling operations actually need to
/// use this, so the avaerage culling speed doesn't increase that much even for
/// extremely large sieving upper limits.  In fact, the time to run the LUT-based
/// culling loop isn't much different than the actual memory access speed since
/// the memory span jumps work well outside the CPU cache ranges, and thus
/// much of the execution time is masked by the memory access wait times.
/// The LUT is indexed by a combination of the wheel apan modulo of the
/// current base prime and the wheel span modulo of the current cull possition;
/// the contents are the combined base prime multiple (mlt) to muultiply by
/// the bit plane index of the current base prime to be added to
/// the extra (xtra) carry bit plane index resulting from the modulo multiply
/// operation to offset to the next cull position index and the modulo residual
/// index of the next bit plane to cull for the current base prime (nxt).
let private cWHLSTATES: uint16[] =
  let arr = Array.zeroCreate <| cWHLHITS * cWHLHITS
  let rec loopx x =
    if x < cWHLHITS then
      let ondx = x * cWHLHITS // below get bp modulo as we actually use!
      let bp = cWHLRSDUS.[x] // in let bpm = bp % cWHLODDCRC
      let rec loopy y pos =
        if y < cWHLHITS then
          let mlt = cWHLGAPS.[(x + y) % cWHLHITS]
          let opos = pos % cWHLODDCRC in let npos = opos + mlt * bp
          let xtra = npos / cWHLODDCRC
          let nxt = cWHLNDXS.[npos - xtra * cWHLODDCRC]
          arr.[ondx + cWHLNDXS.[opos]] <- // mlt: 3/xtr: 4/nxt: 6 bits
            (uint16 mlt <<< 11) + (uint16 xtra <<< 6) + uint16 nxt
            // mlt is shifted an extra above to result in a times two!
          loopy (y + 1) npos in loopy 0 ((bp * bp - int cFRSTSVPRM) >>> 1)
      loopx (x + 1) in loopx 0
  arr

let private makeSieveBuffer btsz: SieveBuffer =
  let sz = ((btsz + 31) >>> 5) <<< 2 // rounded up to nearest 32 bit boundary
  { 1 .. cWHLHITS } |> Seq.map (fun _ -> Array.zeroCreate sz) |> Array.ofSeq

// a dedicated cBITMSK LUT may be faster than bit twiddling...
let private cBITMSK = [| 1uy; 2uy; 4uy; 8uy; 16uy; 32uy; 64uy; 128uy |]

/// all the heavy lifting work is done here culling composite bits...
let private cullSieveBuffer (lwi: PrimeNdx) (bpras: BasePrimeRepArrs)
                            (strtsa: uint32[]) (sb: SieveBuffer) =
  let sz = sb.[0].Length in let szbits = sz <<< 3 in let bplmt = sz >>> 2
  let lowndx = lwi * primendx cWHLODDCRC
  let dlta = primendx szbits * primendx cWHLODDCRC
  let nxti = if dlta > primendx 0x7FFFFFFFFFFFFFFFUL - lowndx then
               primendx 0x7FFFFFFFFFFFFFFFUL else lowndx + dlta
  // set up strtsa for use by each modulo residue bit plane...
  let rec loopbpa (BasePrimeRepArrs((bpra, lstg), bprastl)) oobpd osi =
    let rec looppi pi obpd isi =
      if pi >= bpra.Length then isi, obpd + lstg else
        let ndxdprm = int bpra.[pi]
        let bpd = obpd + (ndxdprm >>> 6) in let bpn = ndxdprm &&& 0x3F
        let rsdi = (cWHLRSDUS.[bpn] - int cFRSTSVPRM) >>> 1
        let i = primendx bpd * primendx cWHLODDCRC + primendx rsdi 
        let bp = primendx 2 * i + primendx cFRSTSVPRM // over uint32 range!
        if bp > primendx 0xFFFFFFFFu then strtsa.[isi] <- 0xFFFFFFFFu; 0, 0 else
        let s = (i + i) * (i + primendx cFRSTSVPRM) + cWHLNDXCNST // not too big
        if s >= nxti then strtsa.[isi] <- 0xFFFFFFFFu; 0, 0 else // enough base primes!
          let strti = if s >= lowndx then int (s - lowndx) else
                        let wp = primendx rsdi
                        let r = (lowndx - s) % (bp * primendx cWHLODDCRC)
                        if r = primendx 0 then 0 else
                        let wrus =
                          cWHLRNDUPS.[wp + (r + bp - primendx 1) / bp |> int]
                        bp * (primendx wrus - wp) - r |> int
          let sd = strti / cWHLODDCRC in let sn = cWHLNDXS.[strti - sd * cWHLODDCRC]
          // use LUT culling method and "bucket sieving" for large base primes...
          // this avoids much cache thrashing with a cost of some code execution time, at
          // a net advantage of about three times in execution time...
          if isi >= strtsa.Length - 1 then
            let bpdadj = bp / primendx cWHLODDCRC |> int
            let ondx = bpn * cWHLHITS
            let rec cull cwi ndx =
              if cwi < szbits then
                let w = cwi >>> 3
                sb.[ndx].[w] <- sb.[ndx].[w] ||| cBITMSK.[cwi &&& 7]
                let ws = int cWHLSTATES.[ondx + ndx] in let mlt = ws >>> 10
                let xtr = (ws >>> 6) &&& 0x0F in let nxtndx = ws &&& 0x3F
                let ncwi = cwi + mlt * bpd + xtr
                cull ncwi nxtndx in cull sd sn; looppi (pi + 1) bpd isi
          else
            strtsa.[isi] <- (uint32 sn <<< 26) ||| uint32 sd
            looppi (pi + 1) bpd (isi + 1)
    let nxtsi, lstbpd = looppi 0 oobpd osi
    if nxtsi <> 0 then match bprastl with
                         | None -> ()
                         | Some lv -> loopbpa lv.Value lstbpd nxtsi
  loopbpa bpras 0 0
  // do the culling based on the preparation...
  let rec loopri ri =
    if ri < cWHLHITS then
      let pln = sb.[ri] in let plnstrts = cWHLSTRTS.[ri]
      let rec loopbpa (BasePrimeRepArrs((bpra, lstg), bprastl)) oobpd osi =
        let rec looppi pi obpd isi =
          if pi >= bpra.Length then isi, obpd + lstg else
            let prmstrt = strtsa.[isi]
            if prmstrt >= 0xFFFFFFFFu then 0, 0 else
              let ndxdprm = int bpra.[pi]
              let bpd = obpd + (ndxdprm >>> 6) in let rsdi = ndxdprm &&& 0x3F
              let bp = bpd * (cWHLODDCRC <<< 1) + cWHLRSDUS.[rsdi]
              let sd = int prmstrt &&& 0x3FFFFFF in let sn = int (prmstrt >>> 26)
              let adji = rsdi * cWHLHITS + sn in let adj = plnstrts.[adji]
              let s0 = sd + int (adj >>> 8) * bpd + (int adj &&& 0xFF)
              if bp < bplmt then
                let slmt = min szbits (s0 + (bp <<< 3))
                let rec loops s8 =
                  if s8 < slmt then
                    let msk = cBITMSK.[s8 &&& 7]
                    let rec loopc c =
                      if c < pln.Length then pln.[c] <- pln.[c] ||| msk; loopc (c + bp)
                    loopc (s8 >>> 3); loops (s8 + bp) in loops s0
              else
                let rec loopsi si =
                  if si < szbits then
                    let w = si >>> 3 in pln.[w] <- pln.[w] ||| cBITMSK.[si &&& 7]
                    loopsi (si + bp) in loopsi s0
              looppi (pi + 1) bpd (isi + 1)
        let nxtsi, lstbpd = looppi 0 oobpd osi
        if nxtsi <> 0 then match bprastl with
                             | None -> () // only for precull when not LUT
                             | Some lv -> loopbpa lv.Value lstbpd nxtsi
      loopbpa bpras 0 0; loopri (ri + 1) in loopri 0

/// pre-culled wheel pattern with a 131072 extra size to avoid overflow...
/// (copy by 16 Kilobytes per time!)
let private cWHLPTRN: SieveBuffer = // rounded up to next 32-bit alignmenet!
  let sb = makeSieveBuffer ((cWHLPTRNLEN <<< 3) + 131072 + 31)
  let strtsa = Array.zeroCreate <| cNUMPCULLPRMS + 1
  cullSieveBuffer (primendx 0) cPCULLPRMREPS strtsa sb; sb

/// fill the SieveBuffer from the cWHLPTRN according to the modulo of the low wheel index;
/// lwi must be an even number of bytes to work!!!...
let private fillSieveBuffer (lwi: PrimeNdx) (sb: SieveBuffer) =
  let len = sb.[0].Length in let cpysz = min len 16384 in let mdlo0 = lwi / (primendx 8)
  { 0 .. cWHLHITS - 1 } |> Seq.iter (fun i ->
    { 0 .. 16384 .. len - 1 } |> Seq.iter (fun j ->
      let mdlo = (mdlo0 + primendx j) % (primendx cWHLPTRNLEN) |> int
      Array.blit cWHLPTRN.[i] mdlo sb.[i] j cpysz))

/// fast value set bit count Look Up Table (cCLUT) for 16-bit input...
let private cCLUT: uint8[] =
  let arr = Array.zeroCreate 65536
  let rec cntem i cnt = if i <= 0 then cnt else cntem (i &&& (i - 1)) (cnt + 1)
  for i = 0 to 65535 do arr.[i] <- cntem i 0 |> uint8
  arr

/// count the zero (prime) bits in the SieveBuffer up to the "lsti" odd index...
let private countSieveBuffer (bitlmt: int) (sb: SieveBuffer): int =
  let lstwi = bitlmt / cWHLODDCRC
  let lstri = cWHLNDXS.[bitlmt - lstwi * cWHLODDCRC]
  let lst = (lstwi >>> 5) <<< 2 in let lstm = lstwi &&& 31
  let rec loopr ri cr =
    if ri >= cWHLHITS then cr else
    let pln = sb.[ri]
    let rec cntem i cnt =
      if i >= lst then
        let msk = (0xFFFFFFFFu <<< lstm) <<< if ri <= lstri then 1 else 0
        let v = (uint32 pln.[lst] + (uint32 pln.[lst + 1] <<< 8) +
                  (uint32 pln.[lst + 2] <<< 16) + (uint32 pln.[lst + 3] <<< 24)) ||| msk
        cnt - int cCLUT.[int v &&& 0xFFFF] - int cCLUT.[int (v >>> 16)] else
      let v = uint32 pln.[i] + (uint32 pln.[i + 1] <<< 8) +
              (uint32 pln.[i + 2] <<< 16) + (uint32 pln.[i + 3] <<< 24)
      cntem (i + 4) (cnt - int cCLUT.[int v &&& 0xFFFF] - int cCLUT.[int (v >>> 16)])
    let cnti = cntem 0 cr in loopr (ri + 1) cnti
  loopr 0 ((lst * 8 + 32) * cWHLHITS)

/// a CIS series of pages from the given start index with the given SieveBuffer size,
/// and provided with a polymorphic converter function to produce
/// and type of result from the culled page parameters...
let rec private makePrimePages strtwi btsz strtsasz thrds
                                (cnvrtrf: PrimeNdx -> SieveBuffer -> 'T): CIS<'T> =
  let bpas = makeBasePrimes thrds

  #if !FABLE
  if thrds then
    // it's ridiculously easy to make this multi-threaded with the following change...
    let cNUMPROCS = System.Environment.ProcessorCount
    let tsks = Array.zeroCreate cNUMPROCS
    let sbs = Array.init cNUMPROCS (fun _ -> makeSieveBuffer btsz)
    let strtsas = Array.init cNUMPROCS (fun _ -> Array.zeroCreate strtsasz)
    let mktsk lwi i = System.Threading.Tasks.Task.Run(fun() ->
      fillSieveBuffer lwi sbs.[i]; cullSieveBuffer lwi bpas strtsas.[i] sbs.[i]
      cnvrtrf lwi sbs.[i])
    let rec jobfeed lwi i =
      CIS(lwi, fun() -> let ni = i + 1
                        jobfeed (lwi + primendx btsz)
                                (if ni >= cNUMPROCS then 0 else ni))
    let rec strttsks (CIS(lwi, jbfdtlf) as jbfd) i =
      if i >= cNUMPROCS then jbfd else
      tsks.[i] <- mktsk lwi i; strttsks (jbfdtlf()) (i + 1)
    let rec mkpgrslt (CIS(lwi, jbfdtlf)) i =
      let rslt = tsks.[i].Result in tsks.[i] <- mktsk lwi i
      CIS(rslt,
          fun() -> mkpgrslt (jbfdtlf())
                            (if i >= cNUMPROCS - 1 then 0 else i + 1))
    mkpgrslt <| strttsks (jobfeed strtwi 0) 0 <| 0
  else
    // the below is the even easier single threaded method...
    let strtsa = Array.zeroCreate strtsasz in let sb = makeSieveBuffer btsz
    let rec nxtpg lwi =
      fillSieveBuffer lwi sb; cullSieveBuffer lwi bpas strtsa sb
      CIS(cnvrtrf lwi sb, fun() -> nxtpg (lwi + primendx btsz)) in nxtpg strtwi
  #else
  // the below is the even easier single threaded method...
  let strtsa = Array.zeroCreate strtsasz in let sb = makeSieveBuffer btsz
  let rec nxtpg lwi =
    fillSieveBuffer lwi sb; cullSieveBuffer lwi bpas strtsa sb
    CIS(cnvrtrf lwi sb, fun() -> nxtpg (lwi + primendx btsz)) in nxtpg strtwi
  #endif

/// secondary feed of lazy list of memoized pages of base primes...
and private makeBasePrimes thrds: BasePrimeRepArrs =
  let sb2bpa lwi (sb: SieveBuffer) =
    let btsz = sb.[0].Length <<< 3
    let arr =
      Array.zeroCreate <| countSieveBuffer ((btsz * cWHLODDCRC) - 1) sb
    let rec loop ri i oi j =
      if i >= btsz then btsz - oi else
        if ri < cWHLHITS then
          if sb.[ri].[i >>> 3] &&& cBITMSK.[i &&& 7] <> 0uy then
            loop (ri + 1) i oi j
          else arr.[j] <- (bprep (i - oi) <<< 6) ||| bprep ri
               loop (ri + 1) i i (j + 1)
        else loop 0 (i + 1) oi j in arr, loop 0 0 0 0
  // finding the first page as not part of the loop and making succeeding
  // pages lazy breaks the recursive data race!
  let fksb = makeSieveBuffer 64 in fillSieveBuffer (primendx 0) fksb
  let (fkbpra, lstdlta) as fk = sb2bpa (primendx 0) fksb
  let fkbpas = BasePrimeRepArrs(fk, None)
  let strtsa = Array.zeroCreate (fkbpra.Length + 1)
  let frstsb = makeSieveBuffer 512 in fillSieveBuffer (primendx 0) frstsb
  cullSieveBuffer (primendx 0) fkbpas strtsa frstsb
  let rec nxtbpas (CIS(bpa, tlf)) =
    BasePrimeRepArrs(bpa, Some(lazy (nxtbpas (tlf()))))
  let restbpras =
    Some(lazy (nxtbpas <| makePrimePages (primendx 512) 65536
                                         cNUMSTRTSBASEPRMS thrds sb2bpa))
  let frstbpa = sb2bpa (primendx 0) frstsb
  BasePrimeRepArrs(frstbpa, restbpras)                    

/// produces a generator of primes; uses mutability for better speed...
let primePagesFrom (lwr: Prime) (thrds: bool): CIS<PrimeRepsPg> =
  let btmndx = if lwr <= cFRSTSVPRM then primendx 0 else // round up if odd
               (lwr + prime 1 - cFRSTSVPRM) / prime 2 |> primendx
  let lowd = btmndx / primendx cWHLODDCRC
  let lown = cWHLNDXS.[cWHLRNDUPS.[btmndx - lowd * primendx cWHLODDCRC |> int]]
  let lowndx = if lown >= cWHLHITS then 0 else lown // round up; adj 4 ovrflw
  let lowi = lowd / primendx 8 * primendx 8 // to even byte
  let cntlmti = int (lowd - lowi) + if lown >= cWHLHITS then 1 else 0
  let sb2prmsarr lwi (sb: SieveBuffer) =
    let btsz = sb.[0].Length <<< 3
    let inline notprm ri i = sb.[ri].[i >>> 3] &&& cBITMSK.[i &&& 7] <> 0uy
    let adjcnt = if lwi <> lowi then 0 else
                 let rec loop i m acnt =
                   if i >= cntlmti && m >= lowndx then acnt else
                   if m >= cWHLHITS then loop (i + 1) 0 acnt else
                   let dlta = if notprm m i then 0 else 1
                   loop i (m + 1) (acnt - dlta) in loop 0 0 0
    let arr = Array.zeroCreate
                <| adjcnt + countSieveBuffer (btsz * cWHLODDCRC - 1) sb
    let basei = prime lwi * prime 2 * prime cWHLODDCRC
    let basev = basei + cFRSTSVPRM
    let rec loop ri i op pi =
      if ri >= cWHLHITS then loop 0 (i + 1) op pi else
      if i < btsz then
        if notprm ri i then loop (ri + 1) i op pi
        else let p = prime i * prime 2 * prime cWHLODDCRC +
                        prime cWHLRSDUS.[ri] + basei
             arr.[pi] <- p - op |> uint16
             loop (ri + 1) i p (pi + 1)
    (if lwi = lowi then loop lowndx cntlmti basev 0
     else loop 0 0 basev 0); basev, arr
  let dfrd =
    fun() -> makePrimePages lowi cPGSZBTS cNUMSTRTSPRMS thrds sb2prmsarr
  if lwr >= cFRSTSVPRM then dfrd()
  else
    let frstpg = let fltrd = cWHLPRMS |> Seq.skipWhile ((>) lwr)
                 Seq.zip fltrd ((Seq.append (Seq.singleton (prime 0))) fltrd)
                   |> Seq.map (fun (v1, v2) -> uint16 (v1 - v2)) |> Seq.toArray
    CIS((prime 0, frstpg), dfrd)

/// extremely slow convenience function to convert above
/// CIS of primes pages to a sequence of Prime's...
let primesFrom (strt: Prime): seq<Prime> =
  Seq.unfold (fun (lstv, CIS((frstv, prmdlts): PrimeRepsPg, prmdltstlf)) ->
      if frstv < lstv then None else
      Some(Seq.unfold (fun (l, i) ->
        if i >= prmdlts.Length then None else
        let np = l + prime prmdlts.[i]
        if np < l then None else
        Some(np, (np, i + 1))) (frstv, 0), (frstv, prmdltstlf())))
    (prime 0, primePagesFrom (prime strt) false) |> Seq.concat

/// the slow way of enumerating and counting but much faster due to
/// enumerating by prime pages and then just doing array processing...
let slowCountPrimesFromTo lolmt hilmt thrdd prgrs =
  let (CIS((frstv, fprma): PrimeRepsPg, fprmpgstlf)) = primePagesFrom lolmt thrdd
  let ttl = float (hilmt - lolmt) / float 100
  let rec tkwhl oprm i (prma: _[]) prmpgstlf tm cnt =
    if i >= prma.Length then let ctm = System.DateTime.Now.Ticks
                             let ntm = if ctm - tm >= 10000000L then tm else
                                       let pc = float (oprm - lolmt) / ttl
                                       prgrs pc; ctm         
                             let (CIS((nfrstv, nprma), nprmpgstlf) )= prmpgstlf()
                             tkwhl nfrstv 0 nprma nprmpgstlf ntm cnt else
    let nprm = oprm + prime prma.[i]
    if nprm > hilmt || nprm < oprm then cnt
    else tkwhl nprm (i + 1) prma prmpgstlf tm (cnt + primendx 1)
  tkwhl frstv 0 fprma fprmpgstlf System.DateTime.Now.Ticks <| primendx 0

/// fast counter of number of primes over a range, inclusive...
let countPrimesFromTo (strt: Prime) (stop: Prime) (thrds: bool)
                      (prgrs: float -> unit): PrimeNdx =
  let precnt = cWHLPRMS |> Array.fold (fun c p ->
                 if p >= strt && p <= stop then c + 1 else c) 0 |> primendx
  if stop < cFRSTSVPRM then precnt else
  let btmndx = if strt <= cFRSTSVPRM then primendx 0 else // round up if odd
               (strt + prime 1 - cFRSTSVPRM) / prime 2 |> primendx
  let lowd = btmndx / primendx cWHLODDCRC
  let lown = cWHLNDXS.[cWHLRNDUPS.[btmndx - lowd * primendx cWHLODDCRC |> int]]
  let lowndx = if lown >= cWHLHITS then 0 else lown // round up; adj 4 ovrflw
  let lowi = lowd / primendx 8 * primendx 8 // to even byte
  let cntlmti = int (lowd - lowi) + if lown >= cWHLHITS then 1 else 0
  let topndx = (stop - cFRSTSVPRM) / prime 2 |> primendx
  let lmtlwi = topndx / primendx cWHLODDCRC
  let jbszf = float (lmtlwi - lowi) / float 100
  let sb2cnt lwi (sb: SieveBuffer) =
    let btsz = sb.[0].Length <<< 3 in let lmti = lwi + primendx (btsz - 1)
    let adjcnt = if lwi <> lowi then 0 else
                 let rec loop i m acnt =
                   if i >= cntlmti && m >= lowndx then acnt else
                   if m >= cWHLHITS then loop (i + 1) 0 acnt else
                   let dlta = if sb.[m].[0] &&& cBITMSK.[i &&& 7] = 0uy
                                then 1 else 0
                   loop i (m + 1) (acnt - dlta) in loop 0 0 0
    adjcnt + countSieveBuffer
      (if lmti < lmtlwi then btsz * cWHLODDCRC - 1
        else int (topndx - lwi * primendx cWHLODDCRC)) sb |> primendx, lmti
  let rec loop (CIS((cnt, nxti), tlf)) count tm =    
    if nxti < lmtlwi then
      let ntm = System.DateTime.Now.Ticks
      if ntm - tm < 10000000L then loop (tlf()) (count + cnt) tm
      else prgrs (float (nxti - lowi) / jbszf); loop (tlf()) (count + cnt) ntm
    else count + cnt
  loop <| makePrimePages lowi cPGSZBTS cNUMSTRTSPRMS thrds sb2cnt
       <| precnt <| System.DateTime.Now.Ticks

printfn "Took %d milliseconds to setup." <| (System.DateTime.Now.Ticks -  setupstrt) / 10000L

/// testing it...
let lowlimit = prime 0e17
let highlimit = lowlimit + prime 1e11
let threaded = true
let progress = true

printfn "Wheel-Factorized Page-Segmented Bit-Packed Sieve of Eratosthenes:"
printfn "Written in F# for DotNet.\n"

#if !FABLE
if threaded then
 printfn "Running Multi-threaded with %d threads."
           System.Environment.ProcessorCount
else printfn "Running single threaded."
#else
printfn "Running single threaded."
#endif

// extremely slow way, but adequate for small ranges as here...
printf "The first 25 primes are:  "
primesFrom (prime 0) |> Seq.take 25 |> Seq.iter (printf "%d " << uint64)
printfn ""

printfn "Number of primes to a million:  %d"
  <| int64 (countPrimesFromTo (prime 0) (prime 1000000) threaded ignore)

let shwprgrs =
  if progress then fun pc -> printf "%7.3f%%\b\b\b\b\b\b\b\b" pc else ignore

printfn "Sieving from %d to %d:" (uint64 lowlimit) (uint64 highlimit)
let strt = System.DateTime.Now.Ticks

// somewhat slow way of counting by partial enumeration...
// let answr = slowCountPrimesFromTo lowlimit highlimit threaded shwprgrs                 

// the fast way by just doing a population count of bits...
let answr = countPrimesFromTo lowlimit highlimit threaded shwprgrs

let elpsd = (System.DateTime.Now.Ticks - strt) / 10000L
printfn "Found %d primes from %d to %d in %d milliseconds"
          answr lowlimit highlimit elpsd

On my AMD 7840HS (about 3.8 GHz running 8 cores/16 threads), it can count the number of primes to 1e11 in about 3.2 seconds, which is just twice as long as Kim Walisch's version 11.1 of the highly optimized C++ code primesieve program. It is slower for the following three reasons:

  1. For smaller sieving base primes, it can't have the extreme loop unrolling techniques used due to the limitations of Just-In-Time (JIT) compilation.
  2. For middle sized base primes, it lacks the optmizations of "bucket-sieving" by bit plane, which could be added but would make the program size too large to post here.
  3. For large sized base primes, it lacks further optimizations of "bucket-sieving", which could be added but would make the program size too large to post here.

The program does not currently compile for Fable, the F# to JavaScript compiler, but if it did it would likely be just a little slower than as run under DotNet 7.0 single-threaded.

Varicella answered 18/1, 2024 at 5:27 Comment(0)
B
0

2 * 10^6 in 1 sec on Corei5

let n = 2 * (pown 10 6)
let sieve = Array.append [|0;0|] [|2..n|]

let rec filterPrime p = 
    seq {for mul in (p*2)..p..n do 
            yield mul}
        |> Seq.iter (fun mul -> sieve.[mul] <- 0)

    let nextPrime = 
        seq { 
            for i in p+1..n do 
                if sieve.[i] <> 0 then 
                    yield sieve.[i]
        }
        |> Seq.tryHead

    match nextPrime with
        | None -> ()
        | Some np -> filterPrime np

filterPrime 2

let primes = sieve |> Seq.filter (fun x -> x <> 0)
Brannan answered 7/11, 2015 at 20:59 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.