The last answer by Aaron Mugatroyd as from the translated Python source for a Sieve of Atkin (SoA) isn't too bad, but it can be improved in several respects as it misses some important optimizations, as follows:
His answer doesn't use the full modulo 60 original Atkin and Bernstein version of the Sieve but rather a slightly improved variation of the pseudo code from the Wikipedia article so uses about a factor of 0.36 of the numerical sieve range combined toggle/cull operations; my code below uses the reasonably efficient non-page segment pseudo code as per my comments in an answer commenting on the Sieve of Atkin which uses a factor of about 0.26 times the numerical range to reduce the amount of work done to about a factor of about two sevenths.
His code reduces the buffer size by only having odd number representations, whereas my code further bit packs to eliminate any representation of the numbers divisible by three and five as well as those divisible by two implied by "odds-only"; this reduces the memory requirement by a further factor of almost half (to 8/15) and helps make better use of the CPU caches for a further increase in speed due to reduced average memory access time.
My code counts the number of primes using a fast Look Up Table (LUT) pop count technique to take almost no time to count as compared to the approximately one second using the bit-by-bit technique he uses; however, in this sample code even that small time is taken out of the timed portion of the code.
Finally, my code optimizes the bit manipulation operations for a minimum of code per inner loop. For instance, it does not use continual right shift by one to produce the odd representation index and in fact little bit shifting at all by writing all of the inner loops as constant modulo (equals bit position) operations. As well, Aaron's translated code is quite inefficient in operations as for instance in prime square free culling it adds the square of the prime to the index then checks for an odd result rather than just adding two times the square so as not to require the check; then it makes even the check redundant by shifting the number right by one (dividing by two) before doing the cull operation in the inner loop, just as it does for all the loops. This inefficient code won't make much of a difference in execution time for large ranges using this "large sieve buffer array" technique, as most of the time per operation is used in RAM memory access (about 37 CPU clock cycles or more for a range of one billion), but will make the execution time much slower than it needs to be for smaller ranges which fit into the CPU caches; in other words it sets a too high lowest limit in execution speed per operation.
The code is as follows:
//Sieve of Atkin based on full non page segmented modulo 60 implementation...
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
namespace NonPagedSoA {
//implements the non-paged Sieve of Atkin (full modulo 60 version)...
class SoA : IEnumerable<ulong> {
private ushort[] buf = null;
private long cnt = 0;
private long opcnt = 0;
private static byte[] modPRMS = { 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 49, 53, 59, 61 };
private static ushort[] modLUT;
private static byte[] cntLUT;
//initialize the private LUT's...
static SoA() {
modLUT = new ushort[60];
for (int i = 0, m = 0; i < modLUT.Length; ++i) {
if ((i & 1) != 0 || (i + 7) % 3 == 0 || (i + 7) % 5 == 0) modLUT[i] = 0;
else modLUT[i] = (ushort)(1 << (m++));
}
cntLUT = new byte[65536];
for (int i = 0; i < cntLUT.Length; ++i) {
var c = 0;
for (int j = i; j > 0; j >>= 1) c += j & 1;
cntLUT[i] = (byte)c;
}
}
//initialization and all the work producing the prime bit array done in the constructor...
public SoA(ulong range) {
this.opcnt = 0;
if (range < 7) {
if (range > 1) {
cnt = 1;
if (range > 2) this.cnt += (long)(range - 1) / 2;
}
this.buf = new ushort[0];
}
else {
this.cnt = 3;
var nrng = range - 7; var lmtw = nrng / 60;
//initialize sufficient wheels to non-prime
this.buf = new ushort[lmtw + 1];
//Put in candidate primes:
//for the 4 * x ^ 2 + y ^ 2 quadratic solution toggles - all x odd y...
ulong n = 6; // equivalent to 13 - 7 = 6...
for (uint x = 1, y = 3; n <= nrng; n += (x << 3) + 4, ++x, y = 1) {
var cb = n; if (x <= 1) n -= 8; //cancel the effect of skipping the first one...
for (uint i = 0; i < 15 && cb <= range; cb += (y << 2) + 4, y += 2, ++i) {
var cbd = cb / 60; var cm = modLUT[cb % 60];
if (cm != 0)
for (uint c = (uint)cbd, my = y + 15; c < buf.Length; c += my, my += 30) {
buf[c] ^= cm; // ++this.opcnt;
}
}
}
//for the 3 * x ^ 2 + y ^ 2 quadratic solution toggles - x odd y even...
n = 0; // equivalent to 7 - 7 = 0...
for (uint x = 1, y = 2; n <= nrng; n += ((x + x + x) << 2) + 12, x += 2, y = 2) {
var cb = n;
for (var i = 0; i < 15 && cb <= range; cb += (y << 2) + 4, y += 2, ++i) {
var cbd = cb / 60; var cm = modLUT[cb % 60];
if (cm != 0)
for (uint c = (uint)cbd, my = y + 15; c < buf.Length; c += my, my += 30) {
buf[c] ^= cm; // ++this.opcnt;
}
}
}
//for the 3 * x ^ 2 - y ^ 2 quadratic solution toggles all x and opposite y = x - 1...
n = 4; // equivalent to 11 - 7 = 4...
for (uint x = 2, y = x - 1; n <= nrng; n += (ulong)(x << 2) + 4, y = x, ++x) {
var cb = n; int i = 0;
for ( ; y > 1 && i < 15 && cb <= nrng; cb += (ulong)(y << 2) - 4, y -= 2, ++i) {
var cbd = cb / 60; var cm = modLUT[cb % 60];
if (cm != 0) {
uint c = (uint)cbd, my = y;
for ( ; my >= 30 && c < buf.Length; c += my - 15, my -= 30) {
buf[c] ^= cm; // ++this.opcnt;
}
if (my > 0 && c < buf.Length) { buf[c] ^= cm; /* ++this.opcnt; */ }
}
}
if (y == 1 && i < 15) {
var cbd = cb / 60; var cm = modLUT[cb % 60];
if ((cm & 0x4822) != 0 && cbd < (ulong)buf.Length) { buf[cbd] ^= cm; /* ++this.opcnt; */ }
}
}
//Eliminate squares of base primes, only for those on the wheel:
for (uint i = 0, w = 0, pd = 0, pn = 0, msk = 1; w < this.buf.Length ; ++i) {
uint p = pd + modPRMS[pn];
ulong sqr = (ulong)p * (ulong)p; //to handle ranges above UInt32.MaxValue
if (sqr > range) break;
if ((this.buf[w] & msk) != 0) { //found base prime, square free it...
ulong s = sqr - 7;
for (int j = 0; s <= nrng && j < modPRMS.Length; s = sqr * modPRMS[j] - 7, ++j) {
var cd = s / 60; var cm = (ushort)(modLUT[s % 60] ^ 0xFFFF);
//may need ulong loop index for ranges larger than two billion
//but buf length only good to about 2^31 * 60 = 120 million anyway,
//even with large array setting and half that with 32-bit...
for (ulong c = cd; c < (ulong)this.buf.Length; c += sqr) {
this.buf[c] &= cm; // ++this.opcnt;
}
}
}
if (msk >= 0x8000) { msk = 1; pn = 0; ++w; pd += 60; }
else { msk <<= 1; ++pn; }
}
//clear any overflow primes in the excess space in the last wheel/word:
var ndx = nrng % 60; //clear any primes beyond the range
for (; modLUT[ndx] == 0; --ndx) ;
this.buf[lmtw] &= (ushort)((modLUT[ndx] << 1) - 1);
}
}
//uses a fast pop count Look Up Table to return the total number of primes...
public long Count {
get {
long cnt = this.cnt;
for (int i = 0; i < this.buf.Length; ++i) cnt += cntLUT[this.buf[i]];
return cnt;
}
}
//returns the number of toggle/cull operations used to sieve the prime bit array...
public long Ops {
get {
return this.opcnt;
}
}
//generate the enumeration of primes...
public IEnumerator<ulong> GetEnumerator() {
yield return 2; yield return 3; yield return 5;
ulong pd = 0;
for (uint i = 0, w = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
if ((this.buf[w] & msk) != 0) //found a prime bit...
yield return pd + modPRMS[pn]; //add it to the list
if (msk >= 0x8000) { msk = 1; pn = 0; ++w; pd += 60; }
else { msk <<= 1; ++pn; }
}
}
//required for the above enumeration...
IEnumerator IEnumerable.GetEnumerator() {
return this.GetEnumerator();
}
}
class Program {
static void Main(string[] args) {
Console.WriteLine("This program calculates primes by a simple full version of the Sieve of Atkin.\r\n");
const ulong n = 1000000000;
var elpsd = -DateTime.Now.Ticks;
var gen = new SoA(n);
elpsd += DateTime.Now.Ticks;
Console.WriteLine("{0} primes found to {1} using {2} operations in {3} milliseconds.", gen.Count, n, gen.Ops, elpsd / 10000);
//Output prime list for testing...
//Console.WriteLine();
//foreach (var p in gen) {
// Console.Write(p + " ");
//}
//Console.WriteLine();
//Test options showing what one can do with the enumeration, although more slowly...
// Console.WriteLine("\r\nThere are {0} primes with the last one {1} and the sum {2}.",gen.Count(),gen.Last(),gen.Sum(x => (long)x));
Console.Write("\r\nPress any key to exit:");
Console.ReadKey(true);
Console.WriteLine();
}
}
}
This code runs about twice as fast as Aaron's code (about 2.7 seconds using 64-bit or 32-bit mode on an i7-2700K (3.5 GHz) with the buffer about 16.5 Megabytes and about 0.258 billion combined toggle/prime square free cull operations (which can be shown by uncommenting the "++this.opcnt" statements) for a sieve range of one billion, as compared to 5.4/6.2 seconds (32-bit/64-bit) for his code without the count time and almost twice the memory use using about 0.359 billion combined toggle/cull operations for sieving up to one billion.
Although it is faster than his most optimized naive odds-only implementation of the non-paged Sieve of Eratosthenes (SoE), that does not make the Sieve of Atkin faster than the Sieve of Eratosthenes, as if one applies similar techniques as used in the above SoA implementation to the SoE plus uses maximal wheel factorization, the SoE will about the same speed as this.
Analysis: Although the number of operations for the fully optimized SoE are about the same as the number of operations for the SoA for a sieve range of one billion, the main bottleneck for these non-paged implementations is memory access once the sieve buffer size exceeds the CPU cache sizes (32 KiloBytes L1 cache at one clock cycle access, 256 Kilobytes L2 cache at about four clock cycles access time and 8 Megabytes L3 cache at about 20 clock cycles access time for my i7), after which memory access can exceed a hundred clock cycles.
Now both have a factor of about eight improvement in memory access speeds when one adapts the algorithms to page segmentation so one can sieve ranges that would not otherwise fit into available memory. However, the SoE continues to gain over the SoA as the sieve range starts to get very large due to difficulties in implementing the "primes square free" part of the algorithm due to the huge strides in culling scans that quickly grow to many hundreds of times the size of the page buffers. As well, and perhaps more serious, it gets very memory and/or computationally intensive to compute the new start point for each value of 'x' as to the value of 'y' at the lowest representation of each page buffer for a further quite large loss in efficiency of the paged SoA comparaed to the SoE as the range grows.
EDIT_ADD: The odds-only SoE as used by Aaron Murgatroyd uses about 1.026 billion cull operations for a sieve range of one billion so about four times as many operations as the SoA and thus should run about four times slower, but the SoA even as implemented here has a more complex inner loop and especially due to a much higher proportion of the odds-only SoE culls have a much shorter stride in the culling scans than the strides of the SoA the naive odds-only SoE has much better average memory access times in spite of the sieve buffer greatly exceeding the CPU cache sizes (better use of cache associativity). This explains why the above SoA is only about twice as fast as the odds-only SoE even though it would theoretically seem to be doing only one quarter of the work.
If one were to use a similar algorithm using constant modulo inner loops as for the above SoA and implemented the same 2/3/5 wheel factorization, the SoE would reduce the number of cull operations to about 0.405 billion operations so only about 50% more operations than the SoA and would theoretically run just slightly slower than the SoA, but may run at about the same speed due to the cull strides still being a little smaller than for the SoA on the average for this "naive" large memory buffer use. Increasing the wheel factorization to the 2/3/5/7 wheel means the SoE cull operations are reduced to about 0.314 for a cull range of one billion and may make that version of the SoE run about the same speed for this algorithm.
Further use of wheel factorization can be made by pre-culling the sieve array (copying in a pattern) for the 2/3/5/7/11/13/17/19 prime factors at almost no cost in execution time to reduce the total number of cull operations to about 0.251 billion for a sieve range of one billion and the SoE will run faster or about the same speed than even this optimized version of the SoA, even for these large memory buffer versions, with the SoE still having much less code complexity than the above.
Thus, it can be seen that the number of operations for the SoE can be greatly reduced from a naive or even odds-only or 2/3/5 wheel factorization version such that the number of operations are about the same as for the SoA while at the same time the time per operation may actually be less due to both less complex inner loops and more efficient memory access. END_EDIT_ADD
EDIT_ADD2: I here add the code for a SoE using a similiar constant modulo/bit position technique for the innermost loops as for the SoA above according to the pseudo code further down the answer as linked above. The code is quite a bit less complex than the above SoA in spite of having high wheel factorization and pre-culling applied such that the total number of cull operations are actually less than the combined toggle/cull operations for the SoA up to a sieving rang of about two billion. The code as follows:
EDIT_FINAL Corrected code below and comments related to it END_EDIT_FINAL
//Sieve of Eratosthenes based on maximum wheel factorization and pre-culling implementation...
using System;
using System.Collections;
using System.Collections.Generic;
namespace NonPagedSoE {
//implements the non-paged Sieve of Eratosthenes (full modulo 210 version with preculling)...
class SoE : IEnumerable<ulong> {
private ushort[] buf = null;
private long cnt = 0;
private long opcnt = 0;
private static byte[] basePRMS = { 2, 3, 5, 7, 11, 13, 17, 19 };
private static byte[] modPRMS = { 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, //positions + 23
97, 101, 103, 107, 109, 113, 121, 127, 131, 137, 139, 143, 149, 151, 157, 163,
167, 169, 173, 179, 181 ,187 ,191 ,193, 197, 199, 209, 211, 221, 223, 227, 229 };
private static byte[] gapsPRMS = { 6, 2, 6, 4, 2, 4, 6, 6, 2, 6, 4, 2, 6, 4, 6, 8,
4, 2, 4, 2, 4, 8, 6, 4, 6, 2, 4, 6, 2, 6, 6, 4,
2, 4, 6, 2, 6, 4, 2, 4, 2, 10, 2, 10, 2, 4, 2, 4 };
private static ulong[] modLUT;
private static byte[] cntLUT;
//initialize the private LUT's...
static SoE() {
modLUT = new ulong[210];
for (int i = 0, m = 0; i < modLUT.Length; ++i) {
if ((i & 1) != 0 || (i + 23) % 3 == 0 || (i + 23) % 5 == 0 || (i + 23) % 7 == 0) modLUT[i] = 0;
else modLUT[i] = 1UL << (m++);
}
cntLUT = new byte[65536];
for (int i = 0; i < cntLUT.Length; ++i) {
var c = 0;
for (int j = i ^ 0xFFFF; j > 0; j >>= 1) c += j & 1; //reverse logic; 0 is prime; 1 is composite
cntLUT[i] = (byte)c;
}
}
//initialization and all the work producing the prime bit array done in the constructor...
public SoE(ulong range) {
this.opcnt = 0;
if (range < 23) {
if (range > 1) {
for (int i = 0; i < modPRMS.Length; ++i) if (modPRMS[i] <= range) this.cnt++; else break;
}
this.buf = new ushort[0];
}
else {
this.cnt = 8;
var nrng = range - 23; var lmtw = nrng / 210; var lmtwt3 = lmtw * 3;
//initialize sufficient wheels to prime
this.buf = new ushort[lmtwt3 + 3]; //initial state of all zero's is all potential prime.
//initialize array to account for preculling the primes of 11, 13, 17, and 19;
//(2, 3, 5, and 7 already eliminated by the bit packing to residues).
for (int pn = modPRMS.Length - 4; pn < modPRMS.Length; ++pn) {
uint p = modPRMS[pn] - 210u; ulong pt3 = p * 3;
ulong s = p * p - 23;
ulong xrng = Math.Min(9699709, nrng); // only do for the repeating master pattern size
ulong nwrds = (ulong)Math.Min(138567, this.buf.Length);
for (int j = 0; s <= xrng && j < modPRMS.Length; s += p * gapsPRMS[(pn + j++) % 48]) {
var sm = modLUT[s % 210];
var si = (sm < (1UL << 16)) ? 0UL : ((sm < (1UL << 32)) ? 1UL : 2UL);
var cd = s / 210 * 3 + si; var cm = (ushort)(sm >> (int)(si << 4));
for (ulong c = cd; c < nwrds; c += pt3) { //tight culling loop for size of master pattern
this.buf[c] |= cm; // ++this.opcnt; //reverse logic; mark composites with ones.
}
}
}
//Now copy the master pattern so it repeats across the main buffer, allow for overflow...
for (long i = 138567; i < this.buf.Length; i += 138567)
if (i + 138567 <= this.buf.Length)
Array.Copy(this.buf, 0, this.buf, i, 138567);
else Array.Copy(this.buf, 0, this.buf, i, this.buf.Length - i);
//Eliminate all composites which are factors of base primes, only for those on the wheel:
for (uint i = 0, w = 0, wi = 0, pd = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
uint p = pd + modPRMS[pn];
ulong sqr = (ulong)p * (ulong)p;
if (sqr > range) break;
if ((this.buf[w] & msk) == 0) { //found base prime, mark its composites...
ulong s = sqr - 23; ulong pt3 = p * 3;
for (int j = 0; s <= nrng && j < modPRMS.Length; s += p * gapsPRMS[(pn + j++) % 48]) {
var sm = modLUT[s % 210];
var si = (sm < (1UL << 16)) ? 0UL : ((sm < (1UL << 32)) ? 1UL : 2UL);
var cd = s / 210 * 3 + si; var cm = (ushort)(sm >> (int)(si << 4));
for (ulong c = cd; c < (ulong)this.buf.Length; c += pt3) { //tight culling loop
this.buf[c] |= cm; // ++this.opcnt; //reverse logic; mark composites with ones.
}
}
}
++pn;
if (msk >= 0x8000) { msk = 1; ++w; ++wi; if (wi == 3) { wi = 0; pn = 0; pd += 210; } }
else msk <<= 1;
}
//clear any overflow primes in the excess space in the last wheel/word:
var ndx = nrng % 210; //clear any primes beyond the range
for (; modLUT[ndx] == 0; --ndx) ;
var cmsk = (~(modLUT[ndx] - 1)) << 1; //force all bits above to be composite ones.
this.buf[lmtwt3++] |= (ushort)cmsk;
this.buf[lmtwt3++] |= (ushort)(cmsk >> 16);
this.buf[lmtwt3] |= (ushort)(cmsk >> 32);
}
}
//uses a fast pop count Look Up Table to return the total number of primes...
public long Count {
get {
long cnt = this.cnt;
for (int i = 0; i < this.buf.Length; ++i) cnt += cntLUT[this.buf[i]];
return cnt;
}
}
//returns the number of cull operations used to sieve the prime bit array...
public long Ops {
get {
return this.opcnt;
}
}
//generate the enumeration of primes...
public IEnumerator<ulong> GetEnumerator() {
yield return 2; yield return 3; yield return 5; yield return 7;
yield return 11; yield return 13; yield return 17; yield return 19;
ulong pd = 0;
for (uint i = 0, w = 0, wi = 0, pn = 0, msk = 1; w < this.buf.Length; ++i) {
if ((this.buf[w] & msk) == 0) //found a prime bit...
yield return pd + modPRMS[pn];
++pn;
if (msk >= 0x8000) { msk = 1; ++w; ++wi; if (wi == 3) { wi = 0; pn = 0; pd += 210; } }
else msk <<= 1;
}
}
//required for the above enumeration...
IEnumerator IEnumerable.GetEnumerator() {
return this.GetEnumerator();
}
}
class Program {
static void Main(string[] args) {
Console.WriteLine("This program calculates primes by a simple maximually wheel factorized version of the Sieve of Eratosthenes.\r\n");
const ulong n = 1000000000;
var elpsd = -DateTime.Now.Ticks;
var gen = new SoE(n);
elpsd += DateTime.Now.Ticks;
Console.WriteLine("{0} primes found to {1} using {2} operations in {3} milliseconds.", gen.Count, n, gen.Ops, elpsd / 10000);
// Console.WriteLine();
// foreach (var p in gen) {
// Console.Write(p + " ");
// }
// Console.WriteLine();
// Console.WriteLine("\r\nThere are {0} primes with the last one {1} and the sum {2}.",gen.Count(),gen.Last(),gen.Sum(x => (long)x));
Console.Write("\r\nPress any key to exit:");
Console.ReadKey(true);
Console.WriteLine();
}
}
}
This code actually runs a few percent faster than the above SoA as it should as there are slightly less operations and the main bottleneck for this large array size for a range of a billion is memory access time of something like 40 to over 100 CPU clock cycles depending on CPU and memory specifications; this means that code optimizations (other than reducing the total number of operations) are ineffective as most of the time is spend waiting on memory access. At any rate, using a huge memory buffer isn't the most efficient way to sieve large ranges, with a factor of up to about eight times improvement for the SoE using page segmentation with the same maximum wheel factorization (which also paves the way for multi-processing).
It is in implementing page segmentation and multi-processing that the SoA is really deficient for ranges much above four billion as compared to the SoE as any gains due to the reduced asymptotic complexity of the SoA rapidly get eaten up by page processing overhead factors related to the prime square free processing and calculating the much larger number of page start addresses; alternatively, one overcomes this by storing markers in RAM memory at a huge cost in memory consumption and further inefficiencies in accessing these marker store structures. END_EDIT_ADD2
In short, the SoA isn't really a practical sieve as compared to the the fully wheel factorized SoE since just as the gain in asymptotic complexity starts to bring it close in performance to the fully optimized SoE, it starts to lose efficiency due to the details of practical implementation as to relative memory access time and page segmentation complexities as well as generally being more complex and difficult to write. In my opinion it is more of an interesting intellectual concept and mental exercise than a practical sieve as compared to the SoE.
Some day I will adapt these techniques to a multi-threaded page segmented Sieve of Eratosthenes to be about as fast in C# as Atkin and Bernstein's "primegen" implementation of the SoA in 'C' and will blow it out of the water for large ranges above about four billion even single threaded, with an extra boost in speed of up to about four when multi-threading on my i7 (eight cores including Hyper Threading).