In my original answer to this question, in order to assist in better understanding of the Sieve of Atkin alogithm I extended the Wikipedia Sieve of Atkin (SoA) pseudo code to correct some of the deficiencies in the simplified "modulo 12" algorithm given, which many find confusing as compared to the full "modulo 60" algorithm. As well, the original Wikipedia pseudo code algorithm leads to inefficient implementations; specifically, the ranges of the 'x' and 'y' variables are each wider than necessary and the modulo 12 simplification results in in an additional 20% redundant work. However, as discussed in that answer, that pseudo code is still not efficient as the modulo tests are still in the innermost quadratic solution loops and thus almost half of those solution loops are unproductive for those solutions that do not pass the modulo tests; this means that a program implementing that pseudo code would likely be almost twice as slow as necessary even when there is no memory access speed bottleneck.
The algorithm as used by Atkin and Bernstein in their "primegen" program to avoid that loop redundancy can be better understood by a progression through the following observations based on the "hit" tables in my previous answer:
The symmetry of the "hit" tables is dependent on whether 'y' is odd (vertical symmetry) or even (horizontal symmetry).
Thus, if we wanted the advance of the patterns to proceed in the horizontal direction as shown there, we can flip the order of the 'x' and 'y' loops for the "3x+" and "3x-;odd x -> even y" cases meaning odd 'x' cases.
Now it's easy to eliminate the zero cases for the two flipped "3x" tables that only have five repetitions horizontally by a condition in an outer loop as those cases where "y mod 3 equals 0" plus the three extra cases only for the middle "axis" column where "y mod 5 equals 0".
For the last "3x-;even x -> odd y" case, it is easy to eliminate most of the zero's by just filtering those cases where "(y + 2) mod 3 equals 0" plus the extra two cases only for the last line where the "(x mod 60) equals 59" where "(y + 2) mod 5 equals 0" (duplicate eliminations for the symmetrical vertical axis column, which is always eliminated). Although these "y tests" would appear to have to happen in an inner loop, as there are only five "hits" per side of the vertical symmetry axis, these can be inline coded by the 'x' loop.
The most difficult eliminations are for the "4x" table where the pattern is more complex: there are five distinct horizontal patterns where each can be recognized because the "(x mod 60) for the first column" is distinct, with one of the leading zero row patterns in the first table above having a modulo of 5 and the other a modulo of 25; now the five different patterns can be determined by "inlining" the cases as offsets on either side of the vertical symmetrical axis for each pattern.
The spacing to the next row is exactly 4 × ((x + 1)² - x²) or 8x + 4 for the first table, (y + 2)² - y² or 4y + 4 for the second two (swapped) new tables, and 3 × ((x + 2)² - x²) or 12x + 12 for the last table.
For all (now) vertically symmetrical tables, the centre axis offset can be determined from the offset of the first column as either (y + 16)² - y² or 32y + 256 for the long rows and 3 × ((x + 4)² - x²) or 24x + 48 for the short rows.
In a similar way, the symmetrical offsets on either side of the vertical axis can easily be calculated from the offset of the vertical axis, for instance by the pair (y + 2)² - y² or 4 + 4x with (y - 2)² - y² or 4 - 4y where y is the centre axis offset for the long row cases and plus and minus two positions; the case for the short row 'x' cases is similar but just multiplied by three for the overall scale of the "3x" 'x' values.
The above filters determined by horizontal variable conditions would seem to break our objective of eliminating conditional code in inner loops, but that rule does not apply when determining the pattern is not the inner loop but rather starts a whole series of new inner loops across all of the patterns horizontally in the array for each valid element of the pattern. This works because we can show mathematically that each of the same equivalent position in the next pattern (which has the same modulo) are separated as follows: For horizontal pattern advances by 30 for the long row cases, the patterns are separated by (x + 30)² - x² or 60x + 900 so 60 × (x + 15); for the horizontal pattern advances by 10 for the short row cases, the patterns are separated by 3 × ((x + 10)² - x²) so 60 * (x + 5). Since both have a modulo 60 of zero, this is the exact explanation of why there are repeating patterns of the same modulo in every position. Thus, once we find the offset for each top left corner position, we can determine the modulo and offset for the starting locations for the non-zero pattern positions and just roll an inner loop using the pattern relationships from that point to the sieve array limit.
All of the above column and row increments can be done without using a "divide" operation using modulo addition where the algorithm keeps track of the pairs of quotients and modulo's of 60 with just an "overflow check and adjust" operation on the modulo/quotient pair after every operation, but the conditional code to do the "check and adjust" is likely more computationally expensive than the way as written, for which most modern optimizing compilers will generate a multiply operation using overflow truncation characteristics to replace the computationally expensive division operation for division by small integers, which usually takes less time than the combination of operations plus the conditional code required for the "check and adjust" method or using a direct divide operation.
The above sets of eliminations work very well for a packed bit representation of the candidate primes in that each of the 16 valid modulo values can be represented as one bit in a 16-bit word and the word index presents the the indexes of each modulo 60 wheel; thus we can advance by even increments evenly divisible by 60 by just advancing an integer number of 16-bit words.
For use as a non-page segmented algorithm as for the pseudo code here, it is sufficient to just move the modulo checks up out of the innermost loop without eliminating all of the "hit misses" for the resulting middle loop, as although they still are not maximally efficient in that the middle loop still processes "hit misses", that doesn't add more than about one percent to the run time; however, for paged implementations, "missed hits" even in the outer loops can add greatly to the computational overhead as these loops are run once per page over the range of 'x' values for that page so there is a factor of about the logarithm of the square root of the sieved range more of the outer loops for the SoA as compared to the Sieve of Eratosthenes (SoE). A version of more complex pseudo code that represents a (optionally bit packed) non-page-segmented version of the "hit optimized" Atkin and Bernstein method can then be written as follows:
// arbitrary search limit
limit ← 1000000000
// the set of base wheel prime candidates
s ∈ {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61}
// Initialize the sieve as an array of wheels with
// enough wheels to include the representation for limit
for n ← 60 * w + x where w ∈ {0,...(limit - 7) ÷ 60}, x in s:
sieve(n) ← false // start with no primes.
// Put in candidate primes:
// integers which have an odd number of
// representations by certain quadratic forms.
while n ≤ limit when n ← 4x²+y² // all x's and only odd y's
where x ∈ {1,...}, y ∈ {1,31,...}: // skip by pattern width in y
for cb ≤ limit when midy ← y + i, cb ← n - y² + midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {1,13,17,29,37,41,49,53}:
while c ≤ limit when c ← 60 × (cbd - midy² + ((midy + 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
while n ≤ limit when n ← 3x²+y² // only odd x's and even y's
where x ∈ {1,3,...}, y ∈ {2,32,...}: // skip by pattern width in y
for cb ≤ limit when midy ← y + i, cb ← n - y² + midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {7,19,31,43}:
while c ≤ limit when c ← 60 × (cbd - midy² + ((midy + 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
while n ≤ limit when n ← 3x²-y² //even/odd & odd/even combos
where x ∈ {2,3,...}, y ∈ {x-1,x-31,...,1}: // skip by pattern width in y
for midy ≥ 1 and cb ≤ limit when midy ← y - i, cb ← n + y² - midy²
where i ∈ {0,2,...28}: // middle level loop to "hit" test modulos
cbd ← (cb - 7) ÷ 60, cm ← cb mod 60
if cm in {11,23,47,59}:
while yn ≥ 1 and c ≤ limit when yn = midy - 30 * j and
c ← 60 × (cbd + midy² - ((midy - 15 × j) × j)²) + cm
where j {0,...}:
sieve(c) ← ¬sieve(c) // toggles the affected bit
// Eliminate prime squares by sieving, only for those occurrences on the wheel
for sqr ≤ limit where sqr ← n², n in {7,11,13,17,19,23,29,31, 37,41,43,47,49,53,59,61,...}:
if is_prime(n):
// n is prime, omit multiples of its square; this is
// sufficient because composites which managed to get
// on the list cannot be square-free
if c0 ≤ limit where c0 ← sqr × (m mod 60) where m in s:
c0d ← (c0 - 7) ÷ 60, cm ← (c0 - 7) mod 60 + 7 // means cm in s
while c ≤ limit for c ← 60 × (c0d + sqr × j) + cm
where j ∈ {0,...}:
sieve(c) ← false // cull composites of this prime square
output 2, 3, 5
for n ≤ limit when n ← 60 × k + x where k ∈ {0..}, x in s:
if sieve(n): output n
In particular, notice how the innermost quadratic loops are very simple but increment by a linearly increasing variable amount per loop, whereas the prime squares free elimination, which uses a type of Eratosthenes progression, increments by a fixed amount "sqr". These simple loops will execute quite quickly on a modern CPU, taking about 4 CPU clock cycles per loop if memory access is fast enough as for small sieve ranges or as in using a page segmented version where the pages fit within the L2/L1 CPU caches that have memory access times of about this range.
When the array is large as in many Megabytes, then the average memory access time is likely to average something between 20 and over a 100 CPU clock cycles (depending on the speed of RAM, efficiency of memory access by the CPU, and efficiency of use of intermediate caches) and this will be the main limiting factor in execution speed rather than the tightness of the loops other than the number of toggle/cull operations that the algorithm requires. These particular algorithms, both SoA and SoE, put a particular strain on the memory interface for large buffers in that they skip through the buffer incrementing by whole wheel circumferences times a multiplier at a time and repeat for a different offsets rather than handling all the "hits" of the wheel in sequence; this is due to moving the "hit" tests into a middle loop, which saves "hit" testing but has this larger buffer "jump" as a consequence, with the average "jump" for the SoA higher than for the SoE. Thus, an estimate of the execution time for this algorithm over the range of numbers to a billion is about 60 CPU clock cycles per operation (which will be less by a factor for the SoE due to a smaller average "jump") multiplied by the number of toggle/cull operations (about 260,000 for the Sieve of Atkin) divided by the CPU clock speed. Therefore, a 3.5 GHz CPU will execute this algorithm over a range to a billion in about four seconds and the main difference between execution speed for non-paged implementations of the SoA and the SoE in this style is the difference in memory access efficiency for large buffers, where a highly wheel factorized SoE is more efficient by about a factor of two.
Given that for large buffers the memory access time is the limiting speed factor, it becomes very important to find an algorithm with a minimum number of operations. The SoA as implemented by Atkin and Bernstein in their "primegen" sample program takes about a constant factor of "hit" operations as a ratio to the range sieved, and by implementing even a simple version of a program from my first answer we can determine that this factor is about 0.26 of the prime range sieved: meaning that there are about 260,000 operations for a sieve range of a billion.
For a SoE implemented with a high wheel factorization of a 210 element wheel with 48 prime candidate "hits" per rotation (small wheel for primes of 2;3;5;7) in combination with a pre-cull by the additional primes of 11;13;17;19, the number of cull operations for a sieve range n is approximately n times (ln((ln n) / 2) + M - 1/(ln n) - 1/2 - 1/3 - 1/5 - 1/7 - 1/11 - 1/13 - 1/17 - 1/19) * 48 / 210 where '*' means multiplied by, '/' means divided by, M is the Meissel-Mertens constant M ≈ 0.2614972128... correcting the number of operations arising from the (ln ln n) estimate, the reciprocal "ln n" term is the adjustment for the culling starting at the square of the base primes (less than the square root of the range n); for n of one billion, the factor compared to n is about 0.250485568 or about 250 million cull operations for a range to a billion.
This is about the same number of operations as for the SoA and contrasts with the simple 2;3;5 wheel used in the Atkin and Bernstein comparison test, which has a factor of 0.404805008 or about 400,000 operations for a sieve range of one billion. This better ratio for the more highly factorized SoE means that the constant factors for this quite realizable implementation are about the same as for the SoA except for any differences in constant factors due to loop complexity.
EDIT_ADD1: For comparison, the following is equivalent pseudo code for a highly factorized and pre-culled SoE written in the same style as the above SoA algorithm:
// arbitrary search limit
limit ← 1000000000
// the set of base wheel primes
r ∈ {23,29,31,37,41,43,47,53, 59,61,67,71,73,79,83, //positions + 19
89,97,101,103,107,109,113,121,127, 131,137,139,143,149,151,157,163,
167,169,173,179,181,187,191,193, 197,199,209,211,221,223,227,229}
// an array of length 11 times 13 times 17 times 19 = 46189 wheels initialized
// so that it doesn't contain multiples of the large wheel primes
for n where n ← 210 × w + x where w ∈ {0,...46189}, x in r: // already
if (n mod cp) not equal to 0 where cp ∈ {11,13,17,19}: // no 2,3,5,7
mstr(n) ← true else mstr(n) ← false // factors
// Initialize the sieve as an array of the smaller wheels with
// enough wheels to include the representation for limit
for n where n ← 210 × w + x, w ∈ {0,...(limit - 19) ÷ 210}, x in r:
sieve(n) ← mstr(n mod (210 × 46189)) // init pre-culled primes.
// Eliminate composites by sieving, only for those occurrences on the
// wheel using wheel factorization version of the Sieve of Eratosthenes
for n² ≤ limit when n ← 210 × k + x where k ∈ {0..}, x in r
if sieve(n):
// n is prime, cull its multiples
s ← n² - n × (x - 23) // zero'th modulo cull start position
while c0 ≤ limit when c0 ← s + n × m where m in r:
c0d ← (c0 - 23) ÷ 210, cm ← (c0 - 23) mod 210 + 23 //means cm in r
while c ≤ limit for c ← 210 × (c0d + n × j) + cm
where j ∈ {0,...}:
sieve(c) ← false // cull composites of this prime
output 2, 3, 5, 7, 11, 13, 17, 19,
for n ≤ limit when n ← 210 × k + x where k ∈ {0..}, x in r:
if sieve(n): output n
Notice how much less complex this code is in spite of it being more highly wheel factorized so that there is less of a constant overhead of operations than the SoA code while the innermost culling loop is actually slightly simpler and therefore (if anything) likely slightly faster. Also notice how many more outer loops must be run for the SoA code in there are a total of very approximately the same number of outer loops as the square root of the sieving range for the SoA where there are only about the sieve range divided by the natural logarithm of the square root of that range outer loops for the SoE - a factor of about ten times less for the SoE compared to the SoA for sieve ranges of one billion. This is the trade-off that the SoA makes: more outer loops for less operations per inner loop with bigger buffer "jumps" between operations; however, this is also what makes the SoA less efficient per operation, especially for larger sieve ranges. END_EDIT_ADD1
As to the code complexity for larger ranges, these are always implemented as page segmented sieves in order to avoid the necessity of having the whole sieve array in RAM memory (memory constraints would limit the range) and in order to take advantage of CPU caching locality to reduce average memory access times as discussed above. For the SoE, this can easily be implemented based on a saved copy of the a base primes representation where there are approximately the square root of the range divided by log of that square root number of that range; for the SoA one needs to refer to that same base primes representation for the prime squares free elimination but as well needs to use values of sequence offsets plus saved values of 'x' or 'y' (or save the current values of both 'x' and 'y') in order to resume at the next page offset with a total number of saved sequences as about the square root of the range (not divided by natural logarithm of root n), or alternatively new 'x'/'y' pair offsets can be calculated for the start of each new page at a considerable computational overhead. The first method adds greatly to the memory use overhead, especially for multi-threaded implementations where a copy needs to be kept for each thread; the second doesn't use more memory but uses more computational power as compared to the SoE for these larger ranges.
Page segmented algorithms are also important in order to take full advantage of modern multi-core CPU's which can reduce the clock time required for a given task by the ratio of the number of cores used, especially for page divided algorithms where each page processing can be assigned to a separate thread.
The way that the "primesieve" SoE implementation manages average operation times of about 2.5 CPU clock cycles as compared to the about 4.5 CPU clock cycles for Atkin and Bernstein's "primegen" sample implementation of the SoA is by extreme in-lining of loop unrolling with page segmentation; this technique can also be applied to the SoA but this particular pseudo code is unsuitable for doing that and an alternate technique of "hit rate" optimization that handles all of the modulos in a single loop (per quadratic equation loop) needs to be used; however, that is a complex algorithm that would seem to be beyond the scope of this question. Suffice it to say that the jumps in complexity so far are nothing as compared to what is required to add this extra optimization to the SoA, and even with that the ratio of outer loops still makes the SoE much more efficient.
Thus, while it is fairly easy to implement a naive SoA algorithm as in the original Wikipedia pseudo code or even using the more complex SoA pseudo code as in these two answers, it is extremely complex to write something that would compete with even Bernstein's "primegen" (SoA) let alone with the faster "primesieve" (SoE), whereas it is quite easy to write code that comes close to the same performance as "primegen" and just a little more work to write something that competes with "primesieve" using the SoE algorithm with the slight modifications to support page segmentation.
EDIT_ADD2: To put this theory into practical terms, I would normally write an example program to demonstrate these principles; however, in this case we have all we really need in the "primegen" source code, which is the reference implementation by the inventors of the SoA. On my machine (i7-2600K, 3.5 GHz) with the buffer size adjusted to 32768 from 8192 to reflect my L1 cache size, this "primespeed" sieves the primes to one billion in about 0.365 seconds. In fact, Atkin and Bernstein cheated a little in their comparison to their Eratosthenes sieve algorithm in that they only assigned about four Kilobyte buffers to it; adjusting that buffer size to about the same size (the "#define B32 1001" to "#define B32 8008") results in an execution time of about 0.42 seconds for "eratspeed" to about one billion, leaving it still slightly slower. However, as discussed above, it is doing more work in that it is doing about 0.4048 billion cull operations whereas the SoA is only doing about 0.26 billion combined toggle/cull operations. Using the algorithm from the above SoE pseudo code to change this is have maximum wheel factorization means that the SoE would actually do slightly less operation than the SoA at about 0.2505 billion, meaning the the execution time would be reduced to about two thirds or more or about 0.265 to 0.3 seconds making it faster than "primegen".
Meanwhile, if one compares the "primespeed" to the "eratspeed" code one sees that the paged SoE is much less complex than the SoA code, just as indicated by the above pseudo code.
Of course, one could counter that the SoA maintains the ratio of 0.26 times the sieving range operations to as high a sieving range as one wants whereas the SoE ratio climbs as per the equations given above - by about an extra 20% for a sieving range even to one hundred billion. However, given a slight tweak to both algorithms to increase buffer size and "sub slice" it to process it in L1 caches sized slices, the SoE will perform as predicted whereas the overhead of the SoA continues to rise due to the extra ratio of middle/inner loops for the SoA as compared to the SoE. While the SoE has an increase in culling operations by this about 20%, the SoA has an increase in quadratic processing loops of about this same 20% for a very small net gain if any. If these optimizations were made to the SoA, then it might just catch up to the maximally wheel factorized SoE at some very large sieving range such as ten to the nineteenth power or so, but we are talking hundreds of years of processing on a desktop computer even using multi-processing.
The SoA might be practical for sieving a sub-range (ie. a single page segment) from a very large range offset (say ten raised to the twenty-second power), but there is still that cost of extra memory or computation in dealing with all those values of 'x' and 'y' that are not necessary with the SoE. END_EDIT_ADD2
Just as for my first answer, the comparison says that SoA loses to the highly wheel factorized SoE on almost all fronts, as follows:
The SoA code is much more complex and it is harder to write and maintain, thus much harder to implement page segmentation which is so important for larger ranges in order to take advantage of maximum CPU capabilities as to memory access speeds.
The SoA likely loses on CPU clock cycles by at least 20% due to slightly more complex inner loop code and higher numbers of outer loops.
The SoA has slightly more operations as compared to a highly wheel factorized SoE (higher wheel factorization than as per the above discussion) for a reasonably large range of one to four billion.
The SoA wins for very large ranges by having an improvement factor of log (log n) as ranges get bigger, but that factor would only allow the SoA to catch up to the highly wheel factorized SoE at a very high number range (perhaps at about 10 raised to the nineteenth power) if ever, and then only if the relative complexity stays the same, which it isn't likely to do.
So I say again: "Why use the Sieve of Atkin when the Sieve of Eratosthenes is better in almost every respect and especially is less complex while having less of a constant computational factor overhead?" Based on my conclusions here that I feel the SoA is an inferior sieve for any kind of reasonable sieve range and will likely never write a page segmented version of the Sieve of Atkin but have written many of them in various computer languages for the Sieve of Eratosthenes. While the Sieve of Atkin is interesting intellectually and mathematically, it isn't really practical and the Atkin and Bernstein comparison was flawed in overly restricting the wheel factorization of their implementation of the Sieve of Eratosthenes used in their comparison to show an advantage of about 30%, which should have actually been about 60% other than for the constant implementation overheads of the Sieve of Atkin.