Implementing the Page Segmented Sieve of Eratosthenes in Javascript
I recently read about a faster implementation of Segmented Sieve of Eratosthenes for really big numbers.

Following is an implementation of the same:

function sieve(low, high) {

    var primeArray = [], ll = Math.sqrt(low), output = [];

    for (var i = 0; i < high; i++) {
        primeArray[i] = true;
    }

    for (var i = 2; i <= ll; i++) {
        if (primeArray[i]) {
            for (var j = i * i; j < high; j += i) {
                primeArray[j] = false;
            }
        }
    }

    for (var i = 2; i < ll; i++) {
        if(primeArray[i])
        {
            var segmentStart = Math.floor(low/i) * i;

            for(var j = segmentStart; j <= high; j+=i)
            {
                primeArray[j] = false;
            }
        }
    }

    for(var i = low; i <= high; i++)
    {
        if(primeArray[i])
        {
            output.push(i);
        }
    }

    return output;
};

I cannot figure out where I've gone wrong. I've probably been working at it too long.

For example:

sieve(4,10) should return [5,7], but it is returning [5,7,9]

Bethea answered 3/9, 2016 at 23:12 Comment(5)
And what exactly does this implementation do wrong? Does it throw an unexpected error? Does it get the answer wrong?Miseno
Yes, it fails in a few cases.Bethea
Specifically, when I begin from 4.Bethea
Could you post these cases, both expected and actual results?Miseno
In fact, it's not just a few cases. It is skipping over a few numbers to find out the primes in all the cases. Updated.Bethea

Although you have gathered from your reading that a Page Segmented Sieve of Eratosthenes is a fast way of finding the primes over a large range, your question code (even when corrected) does not implement a Page Segmented SoE, does not test the code over a large range, and is not particularly fast as SoE implementations go. The following discussion will show how to progress toward using a true Page Segmented SoE over a large range.

Synopsis

Following is a staged progression of increasingly fast implementations leading to your intention, with commentary explaining the reasons and implementation details of every step. It includes runnable snippets in JavaScript, but these techniques are not limited to JavaScript; other languages permit some further refinements, such as multi-threading of the resulting pages (available in JavaScript only via Web Workers, which are difficult to control as to order of processing) and extreme loop unrolling. These further gains are related to the limited efficiency of JavaScript code, which must be Just In Time (JIT) compiled to native code by the JavaScript engine in your browser, as compared to languages that compile directly to very efficient native code such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

Chapter 1 - Setting a Baseline

First, let's start with a working version of your current code algorithm, used over a reasonably large range and with timing information, to establish a baseline. The following code starts culling per prime at the square of the culling prime, which avoids both culling the given prime values and some redundant starting culls. There is also no reason to generate an output array of the resulting primes for a large range, as we can produce the primes directly from the culling array. As well, the determination of the answer is outside the timing, because we will be developing better techniques to find an "answer", which for large ranges is usually the count of primes found, the sum of the primes, the first occurrences of prime gaps, etc., none of which need to actually view the found primes:

"use strict";

function soePrimesTo(limit) {
  var sz = limit - 1;
  var cmpsts = new Uint8Array(sz); // index 0 represents 2; sz-1 is limit
  // no need to zero above composites array; zeroed on creation...
  for (var p = 2; ; ++p) {
    var sqr = p * p;
    if (sqr > limit) break; // while p is the square root of limit -> cull...
    if (cmpsts[p - 2] == 0 >>> 0) // 0/1 is false/true; false means prime...
      for (var c = sqr - 2; c < sz; c += p) // set true for composites...
          cmpsts[c] = 1 >>> 0; // use asm.js for some extra efficiency...
  }
  var bi = 0
  return function () {
    while (bi < sz && cmpsts[bi] != 0 >>> 0) ++bi;
    if (bi >= sz) return null;
    return bi++ + 2;
  };
}

// show it works...
var gen = soePrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soePrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

Now, there is little purpose in quoting my run times for the above code sieving to a billion, as your times will almost certainly be faster; I am using a very low end tablet CPU, an Intel x5-Z8350 running at 1.92 Gigahertz (CPU clock speed for a single active thread). I will quote my execution time exactly once, as an example of how to calculate average CPU clock cycles per cull: take the execution time for the above code of about 43350 milliseconds, which is 43.35 seconds, times 1.92 billion clocks per second, divided by about 2.514 billion culling operations required to sieve to a billion (which can be calculated from the formula, or from the table for no wheel factorization, on the SoE Wikipedia page), to get about 33.1 CPU clock cycles per cull. From now on, we will only use CPU clock cycles per cull to compare performance.
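The clocks-per-cull arithmetic can be captured in a tiny helper (my own, for illustration only), which reproduces the figure just quoted:

```javascript
// Average CPU clock cycles per culling operation:
// elapsed seconds * CPU clock rate (Hz) / number of culling operations.
function clocksPerCull(elapsedMs, clockHz, cullOps) {
  return (elapsedMs / 1000) * clockHz / cullOps;
}

// Figures quoted above: 43350 ms at 1.92 GHz over ~2.514 billion culls.
console.log(clocksPerCull(43350, 1.92e9, 2.514e9).toFixed(1)); // "33.1"
```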

Further note that these performance scores are very dependent on the browser JavaScript engine used, with the above score as run on Google Chrome (version 72); Microsoft Edge (version 44) is about seven times slower for the version of Page Segmented SoE towards which we are moving, and Firefox and Safari are likely to be close to Chrome in performance.

This performance is likely better than that of the question's code due to using a Uint8Array TypedArray and more asm.js-style coercions, but the timing for such "one huge array" sieves (a GigaByte of memory used here) is subject to bottlenecks from main RAM memory access speed outside the CPU cache limits. This is why we are working toward a Page Segmented Sieve, but first let's do something about reducing the amount of memory used and the number of required culling cycles.

Chapter 2 - Bit Packing and Odds Only Wheel Factorization

The following code does bit packing, which requires slightly more complexity in the tight inner culling loop, but as it uses just one bit per composite number considered, it reduces the memory use by a factor of eight; as well, since two is the only even prime, it uses basic wheel factorization to sieve odds only for a further reduction of memory use by a factor of two and a reduction in number of culling operations by a factor of about 2.5.

This minimal wheel factorization of odds-only works as follows:

  1. We make a "wheel" with two positions on it: one half that "hits" on the even numbers, and the other half that hits on the odd numbers; thus this "wheel" has a circumferential span of two numbers as we "roll" it over all the potential prime numbers, but it only "windows" one out of two (half) of the numbers over which it "rolls".
  2. We then divide all of the numbers over which we "roll" into two bit planes of successive values: in one bit plane we place all of the evens starting at four, and in the other bit plane we place all of the odds starting at three.
  3. Now we discard the even plane, as none of the numbers represented there can be prime.
  4. The starting index of p * p for a considered odd base prime is always an odd number, because multiplying an odd number by an odd number is always odd.
  5. When we increment an index into the odd bit plane by the prime value, we are actually advancing by two times the prime in number terms: adding an odd to an odd produces an even, which would land in the discarded bit plane, so adding the prime again lands us back in the odd bit plane.
  6. The odd bit plane index positions automatically account for this, since we have thrown away all the even values that previously lay between each pair of odd bit position indexes.
  7. That is why we can cull by just repeatedly adding the prime to the index in the following code.
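A tiny sketch of the index mapping the steps above describe (the helper names are mine, not taken from the code below):

```javascript
// Odds-only plane: bit index i represents the odd number 2*i + 3.
const indexToValue = i => 2 * i + 3;    // index 0 -> 3, index 1 -> 5, ...
const valueToIndex = v => (v - 3) >> 1; // inverse mapping for odd v >= 3

// Culling for p = 3 starting at p*p = 9: stepping the index by p
// advances the represented value by 2*p, staying in the odd plane.
const i = valueToIndex(9); // 3
console.log([i, i + 3, i + 6].map(indexToValue)); // [ 9, 15, 21 ]
```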

"use strict";

function soeOddPrimesTo(limit) {
  var lmti = (limit - 3) >> 1; // bit index for limit value
  var sz = (lmti >> 3) + 1; // size in bytes
  var cmpsts = new Uint8Array(sz); // index 0 represents 3
  // no need to zero above composites array; zeroed on creation...
  for (var i = 0; ; ++i) {
    var p = i + i + 3; // the square index is (p * p - 3) / 2 but we
    var sqri = (i << 1) * (i + 3) + 3; // calculate start index directly 
    if (sqri > lmti) break; // while p is < square root of limit -> cull...
    // following does bit unpacking to test the prime bit...
    // 0/1 is false/true; false means prime...
    // use asm.js with the shifts to make uint8's for some extra efficiency...
    if ((cmpsts[i >> 3] & ((1 >>> 0) << (i & 7))) == 0 >>> 0)
      for (var c = sqri; c <= lmti; c += p) // set true for composites...
          cmpsts[c >> 3] |= (1 >>> 0) << (c & 7); // masking in the bit
  }
  var bi = -1
  return function () { // return function to return successive primes per call...
    if (bi < 0) { ++bi; return 2 } // the only even prime is a special case
    while (bi <= lmti && (cmpsts[bi >> 3] & ((1 >>> 0) << (bi & 7))) != 0 >>> 0) ++bi;
    if (bi > lmti) return null; // return null following the last prime
    return (bi++ << 1) + 3; // else calculate the prime from the index
  };
}

// show it works...
var gen = soeOddPrimesTo(100);
var p = gen();
var output = [];
while (p != null) { output.push(p); p = gen(); }
console.log("Primes to 100 are:  " + output + ".");

var n = 1000000000; // set the range here...
var elpsd = -new Date();
gen = soeOddPrimesTo(n);
elpsd += +new Date();
var count = 0;
while (gen() != null) { count++; }
console.log("Found " + count + " primes up to " + n + " in " + elpsd + " milliseconds.");

The performance is now about 34.75 CPU clock cycles per cull operation for 1.0257 billion culling operations for a range of a billion for odds only (from Wikipedia), meaning that the reduction in time is almost entirely due to the reduction in number of culling operations, with the extra complexity of the "bit-twiddling" bit packing only taking about the same amount of extra time as the savings due to the reduction in memory use by a factor of sixteen. Thus, this version uses one sixteenth the memory and is about 2.5 times faster.

But we aren't done yet, page segmentation can speed us up even more just as your source told you.

Chapter 3 - Page Segmentation Including Faster Primes Counting

So what is Page Segmentation applied to the SoE and what does it do for us?

Page Segmentation is dividing the work of sieving from one huge array sieved at once to a series of smaller pages that are sieved successively. It then requires a little more complexity in that there must be a separate stream of base primes available, which can be obtained by sieving recursively with an inner sieve generating the list of memoized base primes to be used by the main sieve. As well, the generation of output results is generally a little more complex as it involves successive scanning and reduction for each of the generated sieved pages.

Page Segmentation has many benefits as follows:

  1. It further reduces memory requirements. Sieving to a billion with the previous odds-only code takes about 64 MegaBytes of RAM, but we couldn't use that algorithm to sieve beyond a range of about 16 to 32 billion (even if we wanted to wait quite a long time) because it would exhaust all the memory available to JavaScript. By sieving with a page size of (say) the square root of that amount, we could sieve to a range of that amount squared, or beyond anything for which we would have time to wait. We also have to store the base primes up to the square root of the desired range, but that is only a few tens of MegaBytes for any range we would care to consider.
  2. It increases memory access speed, which has a direct impact on the number of CPU clock cycles per culling operation. If culling operations mostly occur within the CPU caches, memory access changes from tens of CPU clock cycles per access for main RAM (depending on the CPU and the quality and type of RAM) to about one clock cycle for the CPU L1 cache (smaller, at about 16 KiloBytes) and about eight clock cycles for the CPU L2 cache (larger, at least about 128 KiloBytes). We can design the culling algorithms to use all the caches to their best advantage: the small fast L1 cache for the majority of the operations with small base prime values, the larger slightly slower L2 cache for middle sized base primes, and main RAM access only for the very few operations using large base primes over extremely huge ranges.
  3. It opens the possibility of multi-threading by assigning the work of sieving each largish page to a different CPU core for a fairly coarse grained concurrency, although this doesn't apply to JavaScript other than through the use of Web Workers (messy).

Other than the added complexity, Page Segmentation has one other problem to work around: unlike the "one huge array" sieve, where start indexes are calculated easily once and then used for the entire array, segmented sieves require either that the start addresses be calculated by modulo (division) operations per prime per page (computationally expensive), or that additional memory be used to store the index reached per base prime per page so the start index doesn't have to be recalculated; this last technique precludes multi-threading, as these arrays get out of sync between threads. The best solution, which will be used in the ultimate version, is a combination of these techniques: several page segments are grouped to form a reasonably large work unit for threading, so that the address calculations take an acceptably small portion of the total time, and index storage tables are used for the base primes across these larger work units per thread, so that the complex calculations need only be done once per larger work unit. Thus we get both the possibility of multi-threading and a reduced overhead. However, the following code doesn't reduce this overhead, which costs about 10% to 20% when sieving to a billion.
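The per-page start index arithmetic just described can be sketched as a standalone function (the function name is mine; the same calculation appears inline in the Chapter 3 code):

```javascript
// Given an odd base prime p and the low bit index lowi of the current
// page (odds-only plane), return the first in-page cull index for p.
function startIndex(p, lowi) {
  let s = (p * p - 3) >>> 1;      // bit index of p*p in the odds-only plane
  if (s >= lowi) return s - lowi; // p*p starts within or after this page
  const r = (lowi - s) % p;       // else advance by the modulo remainder to
  return r !== 0 ? p - r : 0;     // the first multiple of p inside the page
}

console.log(startIndex(3, 0));  // 3  (9 sits at bit index 3)
console.log(startIndex(3, 10)); // 2  (next cull index is 12; offset 12 - 10)
```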

In addition to Page Segmentation, the following code adds an efficient counting of the found primes by using a Counting Look Up Table (CLUT) population counting algorithm that counts 32 bits at a time, so that the overhead of continuously finding the number of found primes is a small percentage of the sieving time. If this were not done, enumerating the individual found primes just to determine how many there are would take at least as long as the sieving itself over a given range. Similar fast routines can easily be developed to do such things as sum the found primes, find prime gaps, etc.
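A minimal sketch of the CLUT idea (standalone, with my own names; the full table appears in the code below): precompute the population count of every 16-bit value, then count a 32-bit word with just two table lookups.

```javascript
// Table of bit counts for all 16-bit values, built by the recurrence
// popcount(i) = popcount(i >> 1) + (i & 1).
const POPCNT16 = new Uint8Array(65536); // zero-initialized, so POPCNT16[0] = 0
for (let i = 1; i < 65536; ++i)
  POPCNT16[i] = POPCNT16[i >> 1] + (i & 1);

// Count the one bits of a 32-bit word with two lookups...
const popCount32 = v => POPCNT16[v & 0xFFFF] + POPCNT16[v >>> 16];
console.log(popCount32(0xF0F0F0F0)); // 16
```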

START_EDIT:

The following code adds one other speed-up: for smaller primes (where this optimization is effective), the code does a form of loop separation by recognizing that the culling operations follow an eight-step pattern. This is because a byte has an even number of bits and we are culling by odd primes, so each cull returns to the same bit position within a byte every eight culls; this means that for each of the eight bit positions we can simplify the inner culling loop to mask in a constant bit, greatly simplifying the inner loop and making culling up to about two times faster, since each culling loop in the pattern no longer needs to do the "bit-twiddling" bit packing operations. This change saves about 35% of the execution time sieving to a billion. It can be disabled by changing the small-prime threshold in the code below to zero. It also sets the stage for the native code extreme unrolling of the eight loops implied by this pattern, which can increase culling speed by another factor of about two when native code compilers are used.

A further minor modification makes the loop faster for the larger primes (greater than 8192) by using a Look Up Table (LUT) for the mask value rather than the left shift operation to save about half a CPU clock cycle per culling operation on average when culling a range of a billion; this saving will be increased slightly as ranges go up from a billion but isn't that effective in JavaScript and has been removed.

END_EDIT

ANOTHER_EDIT:

As well as the above edits, we have removed the LUT BITMASK but now zero the Sieve Buffer by fast byte copying from a zero buffer of the same size, and added a Counting LUT population count technique for about a 10% overall gain in speed.

END_ANOTHER_EDIT

FINAL_EDIT:

Made the sieve buffer size about the CPU L2 cache size instead of the L1, as the culling loop speed will never be limited by the L2 cache access speed. This results in a slight increase in speed and a 64 times increase in range while maintaining efficiency.

END_FINAL_EDIT

// JavaScript implementation of Page Segmented Sieve of Eratosthenes...
// This takes almost no memory as it is bit-packed and odds-only,
// and only uses memory proportional to the square root of the range;
// it is also quite fast for large ranges due to improved cache associativity...

"use strict";

const PGSZBYTES = 16384 * 8;
const PGSZBITS = PGSZBYTES * 8;
const ZEROSPTRN = new Uint8Array(PGSZBYTES);

function soePages(bitsz) {
  let len = bitsz >> 3; // size of the page buffer in bytes
  let bpa = []; // the array of memoized base primes
  let buf = new Uint8Array(len);
  let lowi = 0; // the least bit index of the current page
  let gen;
  return function () {
    let nxt = 3 + ((lowi + bitsz) << 1); // just beyond the current page
    buf.set(ZEROSPTRN.subarray(0, buf.length)); // clear sieve array!
    if (lowi <= 0 && bitsz < 131072) { // special culling for the first page of
      // the small recursive base primes sieve, as no base primes exist yet:
      for (let i = 0, p = 3, sqr = 9; sqr < nxt; ++i, p += 2, sqr = p * p)
        if ((buf[i >> 3] & (1 << (i & 7))) === 0)
          for (let j = (sqr - 3) >> 1; j < bitsz; j += p)
            buf[j >> 3] |= 1 << (j & 7);
    } else { // other than the first "zeroth" page:
      if (!bpa.length) { // if this is the first page after the zeroth one:
        gen = basePrimes(); // initialize the separate base primes stream
        bpa.push(gen()); // and keep the first base prime (3 in this case)
      }
      // memoize enough base primes to cover the page range...
      for (let p = bpa[bpa.length - 1], sqr = p * p; sqr < nxt;
           p = gen(), bpa.push(p), sqr = p * p);
      for (let i = 0; i < bpa.length; ++i) { // for each base prime in the array
        let p = bpa[i] >>> 0;
        let s = (p * p - 3) >>> 1; // compute the start index of the prime squared
        if (s >= lowi) // adjust start index based on page lower limit...
          s -= lowi;
        else { // for the case where the prime squared is before this page:
          let r = (lowi - s) % p;
          s = (r != 0) ? p - r : 0;
        }
        if (p <= 32) { // for small primes, do the eight-step loop separation:
          // for each of the eight bit positions, cull bytes with a constant mask
          for (let slmt = Math.min(bitsz, s + (p << 3)); s < slmt; s += p) {
            let msk = ((1 >>> 0) << (s & 7)) >>> 0;
            for (let c = s >>> 3; c < len; c += p)
              buf[c] |= msk;
          }
        }
        else // inner tight composite culling loop for the given prime across the page
          for (; s < bitsz; s += p)
            buf[s >> 3] |= ((1 >>> 0) << (s & 7)) >>> 0;
      }
    }
    let olowi = lowi;
    lowi += bitsz;
    return [olowi, buf];
  };
}

function basePrimes() {
  var bi = 0;
  var lowi;
  var buf;
  var len;
  var gen = soePages(256); // tiny pages suffice for the base primes stream
  return function () {
    while (true) {
      if (bi < 1) { // need a new page buffer?
        var pg = gen();
        lowi = pg[0];
        buf = pg[1];
        len = buf.length << 3;
      }
      // find the next bit still marked with prime status...
      while (bi < len && buf[bi >> 3] & ((1 >>> 0) << (bi & 7))) bi++;
      if (bi < len) // within buffer: output the computed prime
        return 3 + ((lowi + bi++) << 1);
      // beyond buffer range: advance buffer
      bi = 0;
      lowi += len; // and loop to make a new page buffer
    }
  };
}

const CLUT = function () { // Counting Look Up Table of bit counts per 16-bit value
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0|0; let v = i|0;
    while (v > 0) { ++nmbts; v &= (v - 1)|0; } // Kernighan bit-count loop
    arr[i] = nmbts|0;
  }
  return arr;
}();

function countPage(bitlmt, sb) { // count the primes (zero bits) up to bitlmt
  let lst = bitlmt >> 5; // index of the last 32-bit word
  let pg = new Uint32Array(sb.buffer); // view the page 32 bits at a time
  let cnt = (lst << 5) + 32; // total bits considered
  for (let i = 0 | 0; i < lst; ++i) { // subtract the composite (one) bits...
    let v = pg[i]; cnt -= CLUT[v & 0xFFFF]; cnt -= CLUT[v >>> 16];
  }
  // mask the bits above bitlmt in the last word as if composite...
  var n = pg[lst] | (0xFFFFFFFE << (bitlmt & 31));
  cnt -= CLUT[n & 0xFFFF]; cnt -= CLUT[n >>> 16];
  return cnt;
}

function countSoEPrimesTo(limit) {
  if (limit < 3) { // handle the trivial cases
    if (limit < 2) return 0;
    return 1;
  }
  var cnt = 1; // count the only even prime of two
  var lmti = (limit - 3) >>> 1; // the bit index of the limit
  var lowi;
  var buf;
  var len;
  var nxti;
  var gen = soePages(PGSZBITS);
  while (true) { // sieve and count successive pages up to the limit...
    var pg = gen();
    lowi = pg[0];
    buf = pg[1];
    len = buf.length << 3;
    nxti = lowi + len;
    if (nxti > lmti) { // for the final partial page...
      cnt += countPage(lmti - lowi, buf);
      break;
    }
    cnt += countPage(len - 1, buf);
  }
  return cnt;
}

var limit = 1000000000; // sieve to this limit...
var start = +new Date();
var answr = countSoEPrimesTo(limit);
var elpsd = +new Date() - start;
console.log("Found " + answr + " primes up to " + limit + " in " + elpsd + " milliseconds.");

As implemented here, that code takes about 13.0 CPU clock cycles per cull sieving to a billion. The roughly 20% overhead from the start address calculations shrinks automatically when using the following maximum wheel factorization algorithm, since the effective page size increases by a factor of 105; the overhead then becomes only a few percent, comparable to the few percent used for array filling and result counting.

Chapter 4 - Further Advanced Work to add Maximal Wheel Factorization

Now we consider the more extensive changes to use maximum wheel factorization (by not only 2 for "odds-only", but also 3, 5, and 7 for a wheel that covers a span of 210 potential primes instead of a span of just 2) and also pre-culling on initialization of the small sieving arrays so that it is not necessary to cull by the following primes of 11, 13, 17, and 19. This reduces the number of composite number culling operations when using the page segmented sieve by a factor of about four to a range of a billion (as shown in the tables/calculated from the formulas in the Wikipedia article - "combo wheel") and can be written so that it runs about four times as fast due to the reduced operations with each culling operation about the same speed as for the above code.

The way to do the 210-span wheel factorization efficiently is to extend the "odds-only" method. The current algorithm above can be thought of as sieving one bit-packed plane out of two, as explained in the previous chapter, where the other plane can be eliminated because it contains only the even numbers above two. For the 210-span, we can define 48 bit-packed arrays of this size representing the possible primes of 11 and above, where all the other 162 planes contain numbers divisible by two, three, five, or seven and thus don't need to be considered. Each bit plane can then be culled just by repeated indexing in increments of the base prime, exactly as the odd number plane was, with the multiplications automatically taken care of by the structure. In this way it is just as efficient to sieve, with lower memory requirements (48/210 as compared to 1/2 for "odds-only") and with as much efficiency as for odds-only: one 48-plane "page", at 16 KiloBytes = 131072 bits per plane, times a span of 210, represents a range of 27,525,120 numbers per sieve page segment, thus only about forty page segments to sieve to a billion (instead of almost four thousand as above), and therefore less start address calculation overhead per base prime per page segment, for a further gain in efficiency.
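The 48 planes correspond to the residues modulo 210 that are coprime to 2, 3, 5, and 7; a quick sketch (helper names mine) confirms the count:

```javascript
// Collect the residues of one full 210-number span (starting at 11,
// just past the wheel primes) that are coprime to 2 * 3 * 5 * 7 = 210.
const gcd = (a, b) => b ? gcd(b, a % b) : a;
const residues = [];
for (let n = 11; n < 11 + 210; n += 2)      // odd numbers only handles the 2
  if (gcd(n, 105) === 1) residues.push(n % 210); // 105 = 3 * 5 * 7
console.log(residues.length); // 48 planes, as described above
```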

Although the extended code described above is a couple of hundred lines and long to post here, it can count the number of primes to a billion in about two seconds on my low end Intel 1.92 Gigahertz CPU using the Google V8 JavaScript engine, which is about five times slower than the same algorithm run in native code.

Although the above code is quite efficient up to a range of about 16 billion, other improvements can help maintain the efficiency to even larger ranges of several tens of thousands of billions, such as 1e14 or more. We accomplish this by adjusting the effective page sizes upward so that they are never smaller than the square root of the full range being sieved, yet sieving them incrementally: by 16 KiloByte chunks for small primes, by 128 KiloByte chunks for medium primes, and only sieving one huge array (as per our baseline implementation) for the very few culling operations used for the largest base primes. In this way, our clocks per cull doesn't increase by more than a small factor of up to about two for the largest range we would likely consider.

As this answer is close to the limited size of 30,000 characters, further discussion on Maximal Wheel Factorization is continued on my followup Chapter 4.5a and (future) Chapter 4.5b answers for actual implementations of the techniques described above.

Chapter 5 - What JavaScript (and other VM Languages) Can't Do

For JavaScript and other Virtual Machine languages, the minimum culling loop time is on the order of ten CPU cycles per culling operation, and that is not likely to change much. This is about three to four times slower than the roughly three CPU clock cycles that can easily be achieved with languages that compile directly to efficient machine code using the same algorithm, such as C/C++, Nim, Rust, Free Pascal, Haskell, Julia, etc.

In addition, there are extreme loop unrolling techniques that can be used with at least some of those languages that can produce a reduction in average culling operation cycles by a further factor of about two that is denied to JavaScript.

Multi-threading can reduce the execution time by about the factor of effective CPU cores used, but with JavaScript the only way to get this is through the use of Web Workers, and that is messy to synchronize. On my machine, I have four cores but gain only a factor of three in speed due to the CPU clock rate being reduced to three quarters with all cores active; this factor of three is not easy to gain for JavaScript.

So this is about the state of the art using JavaScript; other current VM languages have about the same limitations, except that they can more easily use multi-threading. The combination of the above factors means that native code compilers can be about twenty times faster than JavaScript (including multi-threading, and even more with new CPUs with huge numbers of cores).

However, I believe that the future of Web programming in three to five years will be Web Assembly, and that has the potential to overcome all of these limitations. It is very near to supporting multi-threading now, and although currently only about 30% faster than JavaScript for this algorithm on Chrome, it is only a little slower than native code on some current browsers when compiled from some languages using some Web Assembly compilers. It is still early days for efficient compilers to Web Assembly and for efficient browser compilation of Web Assembly to native code, but as Web Assembly is closer to native code than most VMs, it could easily be improved to produce native code as fast, or nearly as fast, as code from those other native-code compiling languages.

However, other than for compilation of JavaScript libraries and Frameworks into Web Assembly, I don't believe the future of the Web will be JavaScript-to-Web-Assembly compilers, but rather compilation from some other language. My favourite picks for the future of Web programming are F# (perhaps with the Fable implementation converted to produce Web Assembly rather than JavaScript/asm.js) or Nim. There is even some possibility that Web Assembly could be produced that supports and shows the advantage of the extreme loop unrolling, for very close to the fastest known Page Segmented SoE speeds.

Conclusion

We have built a Page Segmented Sieve of Eratosthenes in JavaScript that is suitable for sieving large ranges in the billions, and have a means to further extend this work. The resulting code has about ten times fewer culling operations (when fully wheel factorized), and the culling operations are about three times faster, meaning the code for a given (large) range is about 30 times faster; the reduced memory use means that one can sieve to ranges near the 53-bit mantissa limit of about 9e15 in something on the order of a year (just leave a Chrome browser tab open and back up the power supply).

Although some other minor tweaks are possible, this is about the state of the art in sieving primes using JavaScript: while it isn't as fast as native code for the given reasons, it is fast enough to find the number of primes up to 1e14 in a day or two (even on a mid range smartphone) by leaving a browser tab open for the required computation time. This is quite amazing, as the number of primes in this range wasn't known until 1985, and then only by using numerical analysis techniques rather than a sieve, because the computers of that day weren't fast enough, even with the fastest coding techniques, to do it in a reasonable and economic amount of time. Although the best native code compilers on modern desktop computers can do this in just a few hours using these algorithms, we can now do it in an acceptable time with JavaScript on a smartphone!

Kevyn answered 19/4, 2019 at 11:17 Comment(9)
Sorry, I ran out of room without being able to post a code snippet that maximally wheel factorized to be about four times faster and capable to sieve over some huge ranges over 16 billion; I'm working on producing another answer and linking to it in here...Kevyn
This is a great tutorial-like answer. I think it covers everything one needs to know about sieving primes, but more than that, it also demonstrates an efficient use of typed arrays and bit packing. This is not easy-to-find information. I will curiously convert my modified SoS according to the knowledge found here. This topic should be protected. Waiting to see the maximal wheel factorization version.Opt
@Redu, I have added an answer with a further expansion of the Maximally Wheel Factorized technique with an example, and will likely add at least one more answer that is capable of extending these techniques to huge ranges. This didn't quite make four times faster than the previous best example, but it's not bad at about three and a half times faster.Kevyn
@Redu, As to your further work with the Sieve of Sundaram (SoS), as discussed privately, the benefits of the SoS over the SoE is primarily that it is (slightly) easier to implement, but will always be slower by at least the ratio of ln(range) because it sieves using all odd numbers rather than only the odd primes <= the square root of the range. With minor changes to an optimized implementation of the SoS algorithm to use odd primes instead of all odds, it is no longer the SoS but a full odds-only SoE. Wheel factorization improvements are just constant factor improvements for both.Kevyn
I have tried all implementations here, and they are all plagued by the same issue - failure to go beyond 32 bits, which translates into about 103mln primes. I cannot fathom how to amend your solutions here to go beyond 32 bits. If you can include this into your answer, this would make it hugely more useful, IMO (JavaScript can handle 53 bits for integers). Cheers!Lefebvre
P.S. Here's your code that I used. How to change it so it can work beyond 32 bits? Thank you!Lefebvre
@vitaly-t, Yes, that's my code, but that is not the code that is the answer to the question and is just a corrected version of the question code to be used as a speed baseline - it is not page-segmented as is required to be able to sieve to your desired range. Although the code at the bottom of the answer above in Chapter 3 is page segmented (takes about a second and a half on a common desktop to sieve to a billion), it uses JavaScript Int's as it's data type so can not directly be extended to the range you want. Chapter 4.5 below uses Number's as the controlling data type so can.Kevyn
@vitaly-t, (cont'd), in addition, the answer that is Chapter 4.5a adds maximal wheel factorization for another gain of a factor of almost four; you should start from there. The reason that code has the wider range is that the "lwi" page wheel index is consistently treated as a "Number", with 32-bit Ints derived from it internally. As previously noted, if you want to do something with the huge number of generated primes other than count them, you need to change the count function to something else. Compare that answer to the last code here and you'll see what's done for the extended range.Kevyn
@vitaly-t, (cont'd) of course you likely know that using "bit" operations like shifting, logical or'ing, and'ing, etc. is the signal to the JavaScript engine that we want to use the result as a 32-bit Int and not a Number, so to preserve Number semantics we can't use those but must use the equivalent normal mathematical operations, as in shifting becomes multiplying/dividing by 2/4/8..., etc. We must use Int for the internal composite marking operations so that we can use the JavaScript ByteArray for speed...Kevyn
K
11

This expands my previous answer, adding what was promised but for which there wasn't room within the 30,000-character-per-answer limit:

Chapter 4.5a - the Basics of Full Wheel Factorization

The non-Maximally-Wheel-Factorized Page-Segmented Sieve of Eratosthenes from Chapter 3 in the previous answer was written as a Prime Generator with the output recursively fed back as an input for the base prime number feed. Although that was very elegant and expandable, in the following work I have taken a step back to a more imperative style of code so that readers can more readily understand the core concepts of Maximal Wheel Factorization. In a future Chapter 4.5b I will combine the concepts developed in the following example back into a Prime Generator style and add some extra refinements that won't make it any faster for smaller ranges of a few billion, but will make the concept usable without much loss of speed up to hundreds or thousands of trillions; the Prime Generator format is then more useful in making the program adapt as the range gets larger.

The following example's main extra refinements are in the various Look Up Tables (LUT's) used for efficient addressing of the wheel modulo residues, in the generation of the special start-addresses LUT, which allows one to calculate the culling start address for each modulo residue bit plane, given the start-address wheel index and the first modulo residue bit plane index of the very first cull in the entire structure, without additional (slow) integer divisions, and in the Sieve Buffer composite number representation culling function that uses these.

This example is based on a 210-number-circumference wheel using the small primes of two, three, five, and seven, as that seems to hit the "sweet spot" of efficiency for the size of the arrays and the number of bit planes. Experiments have shown that about another 5% in performance can be gained by adding the next prime of eleven to the wheel for a circumference of 2310 numbers; the reason this wasn't done is that the initialization time goes up greatly, and it makes the code hard to time for smaller ranges of "only" a billion, as there are then only about four segments needed to reach that point and granularity becomes a problem.
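As an aside, the count of 48 modulo residue bit planes is just Euler's totient of the wheel circumference: of the 210 numbers in one wheel rotation, exactly 48 are coprime to 2, 3, 5, and 7 (the RESIDUES table in the code below lists this set shifted into the 23..233 window). A minimal sanity check, separate from the sieve code itself:

```javascript
// Count the residues of a 210-number wheel that are coprime to the wheel
// primes 2, 3, 5 and 7; these are the 48 residue bit planes the sieve uses.
const wheelPrimes = [2, 3, 5, 7];
const circumference = wheelPrimes.reduce((acc, p) => acc * p, 1); // 210
const residues = [];
for (let n = 0; n < circumference; ++n)
  if (wheelPrimes.every(p => n % p !== 0)) residues.push(n);
console.log(circumference, residues.length); // 210 48
```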

Note that the first sieved number is 23 which is the first prime past the wheel primes and pre-cull primes; by using this, we avoid the problems of dealing with arrays that start from "one" and the problems of restoring the wheel primes that are eliminated and must be added back in by some algorithms.

EDIT_ADD: Adding a word of explanation of the purpose of the various WHL LUT's - most of these serve to reduce the need for computationally expensive division operations to only about two per base prime per huge page segment. As described below, the WHLSTARTS LUT is used to convert start addresses to the bit index required per base prime per residual modulo index of the sieve page segment (48 in this case) with a very trivial look-up, multiply, and add operation. END_EDIT_ADD

Basically, for every Page Segment culling sweep there is a starting loop that fills a start-address array with the wheel index and modulo residue index of the first cull address within the segment for each of the base primes less than the square root of the maximum number represented in the Page Segment. This start-address array is then used in sieving each modulo residue bit plane (48 of them) completely in turn, with all base primes scanned for each bit plane and the appropriate offset calculated from the segment start address per base prime by use of the multiplier and offset in the WHLSTARTS LUT: the wheel index of the base prime is multiplied by the looked-up multiplier and the looked-up offset is added to obtain the wheel index start position for the given modulo residue bit plane. From there on, the per-bit-plane culling is just as for the odd-number bit plane of Chapter 3. This is done 48 times, once for each of the bit planes, but the effective range per Page Segment for a 16 Kilobyte buffer (per bit plane) is 131072 times the 210 wheel span, or 27,525,120 numbers per Page Segment, scaling up linearly from that for larger bit plane sizes.
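The per-segment range figure quoted above is simple arithmetic on the buffer geometry; a quick sanity check, separate from the sieve code:

```javascript
// Each residue bit plane of a 16 KiB buffer holds 16384 * 8 = 131072 wheel
// indices, and each wheel index spans one 210-number wheel rotation, so one
// Page Segment covers 131072 * 210 numbers.
const bytesPerPlane = 16384;            // 16 Kilobytes per bit plane
const bitsPerPlane = bytesPerPlane * 8; // 131072 wheel indices
const wheelSpan = 210;                  // numbers per wheel rotation
console.log(bitsPerPlane * wheelSpan);  // 27525120 numbers per segment
```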

Use of this sieve reduces memory use, as a ratio of the effective range of the segment, by a factor of 48 over 105 (less than half) compared to the Chapter 3 odds-only sieve, but because each segment has all 48 bit planes, the full Sieve Buffer is 16 Kilobytes times the 48 bit planes, or 768 Kilobytes (three quarters of a Megabyte), and a multiple of that for larger sizes. However, a Sieve Buffer of this size is only efficient up to about 16 billion, and our next example in the next chapter will adapt the size of the buffer for huge ranges so that it grows to about a hundred Megabytes for the largest ranges. For multi-threaded languages (not JavaScript), this would be a requirement per thread.

Additional memory is required for the storage of the base primes array of 32-bit values representing the wheel index for each base prime and its modulo residue index (necessary for the modulo address calculations as explained above); for a range of a billion there are about 3600 base primes at four bytes each, or about 14,000 bytes, with the additional start-address array the same size. These arrays grow as the square root of the maximum value to be sieved, so grow to about 5,761,455 values (of four bytes each) for the base primes less than the hundred million necessary to sieve to 10^16 (ten thousand trillion), or about 23 Megabytes each.

A further refinement in the following example is the use of a "combo" sieve, where the Sieve Buffer is pre-filled from a larger wheel pattern from which the factors of the primes of eleven, thirteen, seventeen, and nineteen have been eliminated. Bigger ranges of eliminations than this are impractical: just by adding the extra factor of the prime of 23, the saved pre-culled pattern grows from only about 64 Kilobytes per modulo bit plane to about twenty times that large, or about one and a half Megabytes for each of the 48 modulo residue bit planes, or about sixty Megabytes in total - quite a large cost in memory and initialization for only a few percent in performance. Do note that this array can be shared for use across all threads.
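The period of that pre-culled pattern is just the product of the four pre-cull primes, which is the PTRNLEN constant in the code below; a quick check, separate from the sieve code:

```javascript
// The combo pre-cull pattern repeats with a period of 11 * 13 * 17 * 19
// wheel indices; since that period is odd, a byte-aligned copy of the
// pattern repeats every 46189 bytes per residue bit plane, which is the
// PTRNLEN constant used in the code below.
const PTRNLEN = 11 * 13 * 17 * 19;
console.log(PTRNLEN);     // 46189
console.log(PTRNLEN % 2); // 1 (odd period)
```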

As implemented, the WHLPTRN array is about 64 Kilobytes times 48 modulo bit planes, or about 3 Megabytes, which isn't that large and is a fixed size that doesn't change with increasing sieving range; this is a quite workable size as to access speed and initialization time.

These "Maximal Wheel Factorization" refinements reduce the total number of composite number culling operation loops for sieving a range of a billion from about a billion operations for the Chapter 3 odds-only example to about a quarter billion for this "combo" sieve, with the goal to try to keep the number of CPU clock cycles the same per culling operation and thus a gain of a factor of four in speed.

EDITED: The following snippet has been adjusted to add a rudimentary HTML Single Web Page Application user interface so that parameters can easily be adjusted for experimentation. For best ease of use, one should use the upper right "Full page" link after clicking the "Run code snippet" button, and can close the full page with a top right link when finished. To run on a smartphone (preferably in Chrome), use the "Desktop site" checkbox in the settings menu (the triple dot menu).

EDIT_CORRECTION: with the range limit able to be easily changed within the specified upper bound, the dummy indexed base prime array size is no longer adequate to cover the square root of the square root of the specified upper bound of LIMIT, which is 362 (previously only 229), so it has been increased to two wheel spans, or 439.

FURTHER_EDIT_CORRECTION: There was a slight error in the fillSieveBuffer function when filling SieveBuffer residual bit plane buffers larger than 16384 bytes, which has been corrected.

SPEED_OMISSION_CORRECTION: From working with other languages, it was realized that this version was about 20% slower than it should be because "loop unpeeling" wasn't being used effectively, as an appropriate limit below which it should be applied wasn't being calculated. The "bplmt" value has been added and applied to correct this. Upon first running the code, one should press the Sieve button several times to allow the JavaScript engine to hot-tune the generated code for optimizations and thus improve speed, which it will reach after four or five iterations.

EDIT_POLISHING: It seems that there is really no advantage to setting the sieve buffer size to the CPU L1 cache size, as JavaScript isn't fast enough to take advantage of its speed, and about the CPU L2 cache size or less is better. The only problem with this is that the "granularity" of the sieve increases, since the sieve buffer size now represents a range of about two hundred million, four hundred million, or eight hundred million for each of the sieve buffer choice sizes, respectively. This makes the timing measurements for trivial ranges such as a billion inaccurate, as the calculations may overrun the requested range by up to a whole extra buffer.

As well, since the sieve range capability has been greatly increased, a progress indication and cancel capability have been added.

However, although the sieving range capability has been increased, the extra improvement of a "bucket sieve" would need to be added in order to maintain efficiency above about 1e12, so it isn't recommended that one sieve much beyond this point.

The JavaScript example as described above is implemented as follows:

"use strict";

const WHLPRMS = new Uint32Array([2,3,5,7,11,13,17,19]);
const FRSTSVPRM = 23;
const WHLODDCRC = 105 | 0;
const WHLHITS = 48 | 0;
const WHLODDGAPS = new Uint8Array([
  3, 1, 3, 2, 1, 2, 3, 3, 1, 3, 2, 1, 3, 2, 3, 4,
  2, 1, 2, 1, 2, 4, 3, 2, 3, 1, 2, 3, 1, 3, 3, 2,
  1, 2, 3, 1, 3, 2, 1, 2, 1, 5, 1, 5, 1, 2, 1, 2 ]);
const RESIDUES = new Uint32Array([
  23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
  73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127,
  131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179,
  181, 187, 191, 193, 197, 199, 209, 211, 221, 223, 227, 229, 233 ]);
const WHLNDXS = new Uint8Array([
  0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6,
  7, 7, 7, 8, 9, 9, 9, 10, 10, 11, 12, 12, 12, 13, 13,
  14, 14, 14, 15, 15, 15, 15, 16, 16, 17, 18, 18, 19, 20, 20,
  21, 21, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 25, 26, 26,
  27, 27, 27, 28, 29, 29, 29, 30, 30, 30, 31, 31, 32, 33, 33,
  34, 34, 34, 35, 36, 36, 36, 37, 37, 38, 39, 39, 40, 41, 41,
  41, 41, 41, 42, 43, 43, 43, 43, 43, 44, 45, 45, 46, 47, 47, 48 ]);
const WHLRNDUPS = new Uint8Array( // two rounds to avoid overflow, used in start address calcs...
  [ 0, 3, 3, 3, 4, 7, 7, 7, 9, 9, 10, 12, 12, 15, 15,
    15, 18, 18, 18, 19, 22, 22, 22, 24, 24, 25, 28, 28, 28, 30,
    30, 33, 33, 33, 37, 37, 37, 37, 39, 39, 40, 42, 42, 43, 45,
    45, 49, 49, 49, 49, 52, 52, 52, 54, 54, 57, 57, 57, 58, 60,
    60, 63, 63, 63, 64, 67, 67, 67, 70, 70, 70, 72, 72, 73, 75,
    75, 78, 78, 78, 79, 82, 82, 82, 84, 84, 85, 87, 87, 88, 93,
    93, 93, 93, 93, 94, 99, 99, 99, 99, 99, 100, 102, 102, 103, 105,
    105, 108, 108, 108, 109, 112, 112, 112, 114, 114, 115, 117, 117, 120, 120,
    120, 123, 123, 123, 124, 127, 127, 127, 129, 129, 130, 133, 133, 133, 135,
    135, 138, 138, 138, 142, 142, 142, 142, 144, 144, 145, 147, 147, 148, 150,
    150, 154, 154, 154, 154, 157, 157, 157, 159, 159, 162, 162, 162, 163, 165,
    165, 168, 168, 168, 169, 172, 172, 172, 175, 175, 175, 177, 177, 178, 180,
    180, 183, 183, 183, 184, 187, 187, 187, 189, 189, 190, 192, 192, 193, 198,
    198, 198, 198, 198, 199, 204, 204, 204, 204, 204, 205, 207, 207, 208, 210, 210 ]);
const WHLSTARTS = function () {
  let arr = new Array(WHLHITS);
  for (let i = 0; i < WHLHITS; ++i) arr[i] = new Uint16Array(WHLHITS * WHLHITS);
  for (let pi = 0; pi < WHLHITS; ++pi) {
    let mltsarr = new Uint16Array(WHLHITS);
    let p = RESIDUES[pi]; let i = (p - FRSTSVPRM) >> 1;
    let s = ((i << 1) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) >> 1))) | 0;
    // build array of relative mults and offsets to `s`...
    for (let ci = 0; ci < WHLHITS; ++ci) {
      let rmlt = (RESIDUES[((pi + ci) % WHLHITS) | 0] - RESIDUES[pi | 0]) >> 1;
      rmlt += rmlt < 0 ? WHLODDCRC : 0; let sn = s + p * rmlt;
      let snd = (sn / WHLODDCRC) | 0; let snm = (sn - snd * WHLODDCRC) | 0;
      mltsarr[WHLNDXS[snm]] = rmlt | 0; // new rmlts 0..209!
    }
    let ondx = (pi * WHLHITS) | 0
    for (let si = 0; si < WHLHITS; ++si) {
      let s0 = (RESIDUES[si] - FRSTSVPRM) >> 1; let sm0 = mltsarr[si];
      for (let ci = 0; ci < WHLHITS; ++ci) {
        let smr = mltsarr[ci];
        let rmlt = smr < sm0 ? smr + WHLODDCRC - sm0 : smr - sm0;
        let sn = s0 + p * rmlt; let rofs = (sn / WHLODDCRC) | 0;
        // we take the multiplier times 2 so it multiplies by the odd wheel index...
        arr[ci][ondx + si] = ((rmlt << 9) | (rofs | 0)) >>> 0;
      }
    }
  }
  return arr;
}();
const PTRNLEN = (11 * 13 * 17 * 19) | 0;
const PTRNNDXDPRMS = new Int32Array([ // the wheel index plus the modulo index
  (-1 << 6) + 44, (-1 << 6) + 45, (-1 << 6) + 46, (-1 << 6) + 47 ]);

function makeSieveBuffer(szbits) { // round up to 32 bit boundary!
  let arr = new Array(WHLHITS); let sz = ((szbits + 31) >> 5) << 2;
  for (let ri = 0; ri < WHLHITS; ++ri) arr[ri] = new Uint8Array(sz);
  return arr;
}

function cullSieveBuffer(lwi, bps, prmstrts, sb) {
  let len = sb[0].length; let szbits = len << 3; let bplmt = len >> 1;
  let lowndx = lwi * WHLODDCRC; let nxti = (lwi + szbits) * WHLODDCRC;
  // set up prmstrts for use by each modulo residue bit plane...
  for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
    let ndxdprm = bps[pi] | 0;
    let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
    let rsd = RESIDUES[prmndx] | 0; let bp = (pd * (WHLODDCRC << 1) + rsd) | 0;
    let i = (bp - FRSTSVPRM) / 2;
    let s = (i + i) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) / 2));
    if (s >= nxti) { prmstrts[pi] = 0xFFFFFFFF >>> 0; break; } // enough base primes!
    if (s >= lowndx) s = (s - lowndx) | 0;
    else {
      let wp = (rsd - FRSTSVPRM) >>> 1; let r = ((lowndx - s) % (WHLODDCRC * bp)) >>> 0;
      s = r == 0
        ? 0 | 0
        : (bp * (WHLRNDUPS[(wp + ((r + bp - 1) / bp) | 0) | 0] - wp) - r) | 0;
    }
    let sd = (s / WHLODDCRC) | 0; let sn = WHLNDXS[(s - sd * WHLODDCRC) | 0];
    prmstrts[pi | 0] = ((sn << 26) | sd) >>> 0;
  }
//  if (szbits == 131072) return;
  for (let ri = 0; ri < WHLHITS; ++ri) {
    let pln = sb[ri]; let plnstrts = WHLSTARTS[ri];
    for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
      let prmstrt = prmstrts[pi | 0] >>> 0; if (prmstrt == 0xFFFFFFFF) break;
      let ndxdprm = bps[pi | 0] | 0;
      let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
      let bp = (((pd * (WHLODDCRC << 1)) | 0) + RESIDUES[prmndx]) | 0;
      let sd = prmstrt & 0x3FFFFFF; let sn = prmstrt >>> 26;
      let adji = (prmndx * WHLHITS + sn) | 0; let adj = plnstrts[adji];
      sd += ((((adj >> 8) * pd) | 0) + (adj & 0xFF)) | 0;
      if (bp < bplmt) {
        for (let slmt = Math.min(szbits, sd + (bp << 3)) | 0; sd < slmt; sd += bp) {
          let msk = (1 << (sd & 7)) >>> 0;
//          for (let c = sd >> 3, clmt = len == 16384 ? 0 : len; c < clmt; c += bp) pln[c] |= msk;
          for (let c = sd >> 3; c < len; c += bp) pln[c] |= msk;
        }
      }
//      else for (let sdlmt = szbits == 131072 ? 0 : szbits; sd < sdlmt; sd += bp) pln[sd >> 3] |= (1 << (sd & 7)) >>> 0;
      else for (; sd < szbits; sd += bp) pln[sd >> 3] |= (1 << (sd & 7)) >>> 0;
    }
  }
}

const WHLPTRN = function () {
  let sb = makeSieveBuffer((PTRNLEN + 16384) << 3); // avoid overflow when filling!
  cullSieveBuffer(0, PTRNNDXDPRMS, new Uint32Array(PTRNNDXDPRMS.length), sb);
  return sb;
}();

const CLUT = function () {
  let arr = new Uint8Array(65536);
  for (let i = 0; i < 65536; ++i) {
    let nmbts = 0|0; let v = i;
    while (v > 0) { ++nmbts; v &= (v - 1)|0; }
    arr[i] = nmbts|0;
  }
  return arr;
}();

function countSieveBuffer(bitlmt, sb) {
  let lstwi = (bitlmt / WHLODDCRC) | 0;
  let lstri = WHLNDXS[(bitlmt - lstwi * WHLODDCRC) | 0];
  let lst = lstwi >> 5; let lstm = lstwi & 31;
  let count = (lst * 32 + 32) * WHLHITS;
  for (let ri = 0; ri < WHLHITS; ++ri) {
    let pln = new Uint32Array(sb[ri].buffer);
    for (let i = 0; i < lst; ++i) {
      let v = pln[i]; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
    }
    let msk = 0xFFFFFFFF << lstm; if (ri <= lstri) msk <<= 1;
    let v = pln[lst] | msk; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
  }
  return count;
}

function fillSieveBuffer(lwi, sb) {
  let len = sb[0].length; let cpysz = len > 16384 ? 16384 : len;
  let mod0 = lwi / 8;
  for (let ri = 0; ri < WHLHITS; ++ri) {
    for (let i = 0; i < len; i += 16384) {
      let mod = ((mod0 + i) % PTRNLEN) | 0;
      sb[ri].set(WHLPTRN[ri].subarray(mod, (mod + cpysz) | 0), i);
    }
  }
}

// a mutable cancelled flag...
let cancelled = false;

function doit() {
  const LIMIT =  Math.floor(parseFloat(document.getElementById('limit').value));
  if (!Number.isInteger(LIMIT) || (LIMIT < 0) || (LIMIT > 9e15)) {
    document.getElementById('output').innerText = "Top limit must be an integer between 0 and 9e15!";
    return;
  }
  const SIEVEBUFFERSZ = parseInt(document.getElementById('L1').value, 10);
  let startx = +Date.now();
  let count = 0;
  for (let i = 0; i < WHLPRMS.length; ++i) {
    if (WHLPRMS[i] > LIMIT) break;
    ++count;
  }
  if (LIMIT >= FRSTSVPRM) {
    const cmpsts = makeSieveBuffer(SIEVEBUFFERSZ);
    const bparr = function () {
      let szbits = (((((((Math.sqrt(LIMIT) | 0) - 23) >> 1) + WHLODDCRC - 1) / WHLODDCRC)
                                                                         + 31) >> 5) << 5;
      let cmpsts = makeSieveBuffer(szbits); fillSieveBuffer(0, cmpsts);
      let ndxdrsds = new Int32Array(2 * WHLHITS);
      for (let i = 0; i < ndxdrsds.length; ++i)
        ndxdrsds[i] = ((i < WHLHITS ? 0 : 64) + (i % WHLHITS)) >>> 0;
      cullSieveBuffer(0, ndxdrsds, new Uint32Array(ndxdrsds.length), cmpsts);
      let len = countSieveBuffer(szbits * WHLODDCRC - 1, cmpsts);
      let ndxdprms = new Uint32Array(len); let j = 0;
      for (let i = 0; i < szbits; ++i)
        for (let ri = 0; ri < WHLHITS; ++ri)
          if ((cmpsts[ri][i >> 3] & (1 << (i & 7))) == 0) {
            ndxdprms[j++] = ((i << 6) + ri) >>> 0;
          }
      return ndxdprms;
    }();
    let lwilmt = (LIMIT - FRSTSVPRM) / 2 / WHLODDCRC;
    let strts = new Uint32Array(bparr.length);
    let lwi = 0;
    const pgfnc = function () {
      if (cancelled) {
        document.getElementById('output').innerText = "Cancelled!!!";
        document.getElementById('go').value = "Start Sieve...";
        document.getElementById('go').disabled = false;
        cancelled = false;
        return;
      }
      const smlllmt = lwi + 4194304;
      const lmt = (smlllmt < lwilmt) ? smlllmt : lwilmt;
      for (; lwi <= lmt; lwi += SIEVEBUFFERSZ) {
        const nxti = lwi + SIEVEBUFFERSZ;
        fillSieveBuffer(lwi, cmpsts);
        cullSieveBuffer(lwi, bparr, strts, cmpsts);
        if (nxti <= lwilmt) count += countSieveBuffer(SIEVEBUFFERSZ * WHLODDCRC - 1, cmpsts);
        else count += countSieveBuffer((LIMIT - FRSTSVPRM) / 2 - lwi * WHLODDCRC, cmpsts);
      }
      if (lwi <= lwilmt) {
        document.getElementById('output').innerText = "Sieved " + ((lwi / lwilmt * 100).toFixed(3)) + "%";
        setTimeout(pgfnc, 7);
      }
      else {
       const elpsdx = +Date.now() - startx;
       document.getElementById('go').onclick = strtclick;
       document.getElementById('output').innerText = "Found " + count 
+ " primes up to " + LIMIT + " in " + elpsdx + " milliseconds.";
       document.getElementById('go').value = "Start Sieve...";
       document.getElementById('go').disabled = false;
      }
    };
    pgfnc();
  }
}

const cancelclick = function () {
  cancelled = true;
  document.getElementById('go').disabled = true;
  document.getElementById('go').value = "Cancelled!!!";
  document.getElementById('go').onclick = strtclick;
}

const strtclick = function () {
  document.getElementById('output').innerText = "Sieved 0%";
  document.getElementById('go').onclick = cancelclick;
  document.getElementById('go').value = "Running, click to cancel...";
  cancelled = false;
  setTimeout(doit, 7);
};

document.getElementById('go').onclick = strtclick;
html,
body {
  justify-content: center;
  align-items: center;
  text-align: center;
  font-weight: bold;
  font-size: 120%;
  margin-bottom: 10px;
}

.title {
  font-size: 200%;
}

.input {
  font-size: 100%;
  padding:5px 15px 5px 15px;
}

.output {
  padding:7px 15px 7px 15px;
}

.doit {
  font-weight: bold;
  font-size: 110%;
  border:3px solid black;
  background:#F0E5D1;
  padding:7px 15px 7px 15px;
}
<!doctype html>
<html>
<head>
  <meta http-equiv='Content-Type' content='text/html; charset=utf-8'>
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>Page-Segmented Sieve of Eratosthenes...</title>
</head>
<body>
  <div class="title">
    <text>
      Page-Segmented Sieve of Eratosthenes
    </text>
  </div>
  <div>
    <text>
      Top sieve limit value:
    </text>
    <input class="input" type="textbox" id="limit" value="1e9" />
  </div>
  <div class="output">
    <text>The enforced limit is zero to 9e15, but values greater than about 1e12 can take a very long time!</text>
  </div>
  <div>
    <text>Sieve buffer size (CPU L2 cache?)</text>
    <select class="input" id="L1">
      <option value="1048576">128 Kilobytes</option>
      <option value="2097152">256 Kilobytes</option>
      <option value="4194304">512 Kilobytes</option>
    </select>
  </div>
  <div class="output">
    <text>Refresh the page to reset to default values or stop processing!</text>
  </div>
  <div class="output">
    <text id="output">
      Waiting to start...
    </text>
  </div>
  <div>
    <input class="doit" type="button" id="go" value="Start Sieve..." />
  </div>
</body>
</html>

Note that the above code only comes into its own for non-trivial sieving ranges above a billion (1e9), as the granularity of the quite large sieving page sizes makes timing not that accurate for ranges below about this size. The program works best for sieving ranges of 1e11 and above.

The above code may not be quite as fast as one might expect for the following reasons:

  1. The speed-up technique using a special simplified culling loop with a fixed mask pattern is no longer as effective, as there are hardly any small culling base primes left; this increases the average number of clock cycles per composite number culling operation by about 20%, although this applies more to slower languages such as JavaScript than to the more efficient machine-code-compiling languages, since those can use further techniques such as extreme loop unrolling to reduce the cost to as little as about 1.25 clock cycles per culling loop.

  2. Although the overhead of counting the resulting primes is reduced by about a factor of two due to the fewer modulo bit planes (by about that factor), that is not the required factor of four; this is made worse in JavaScript, which has no means of tapping into the CPU POP_COUNT machine instruction that would be about ten times faster than the Counting LUT (CLUT) technique used here.

  3. While the LUT techniques used here reduce the start-address calculation overhead by a factor of five or so from what it would be for the more complex modulo calculations otherwise needed, these calculations are still about two and a half to three times more complex than those required for the "odds-only" sieve in Chapter 3, so that isn't enough to give us a ratio-metric reduction; we would need a technique to reduce the time by a further factor of two or so in order for these calculations not to contribute to the reduced ratio. Believe me, I tried, but haven't been able to make this any better. That said, these calculations are likely to be more efficient in a more efficient computer language than JavaScript and/or on a better CPU than my very low-end Atom processor, but then the composite number culling speed is likely to be more efficient as well!

Still, almost a four-times speed-up with only about a 50% increase in the number of lines of code isn't too bad, is it? This JavaScript code is only three to five times slower (depending on CPU, closer in performance on higher-end CPU's) when run on newer versions of node/the Google Chrome browser (Chrome version 75 is still about 25% faster than Firefox version 68) than Kim Walisch's "primesieve" written in "C" and compiled to x86_64 native code!

I've added an Appendix answer that shows that one doesn't really need to write JavaScript to generate JavaScript when the project increases in size to more than a couple of hundred lines, as here. In the future, I expect such transpilers as Fable might emit WebAssembly code rather than JavaScript, and the advantage of writing in Fable now is that one will not then have to make changes (or at least few changes) to the code in order to take advantage of the newer technology, which supports faster code execution and (eventually) the multi-threading that JavaScript does not.

Coming is Chapter 4.5b, which will be about two and a half times more code but will be a Prime Generator capable of sieving extremely large ranges, limited partly by the fact that JavaScript can only efficiently represent numbers up to the 53-bit mantissa of a 64-bit float, or about 9e15, and also by the amount of time one wants to wait: sieving to 1e14 on a better CPU will take on the order of a day, but that isn't much of a problem - just keep a browser tab open!
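As a footnote to the 9e15 figure used above and throughout, it is simply the largest integer a JavaScript Number can represent exactly, rounded down; a quick check:

```javascript
// JavaScript Numbers are IEEE-754 doubles with a 53-bit mantissa, so the
// largest exactly-representable integer is 2^53 - 1, just over 9e15.
console.log(Number.MAX_SAFE_INTEGER);                 // 9007199254740991
console.log(Number.MAX_SAFE_INTEGER === 2 ** 53 - 1); // true
```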

Kevyn answered 19/7, 2019 at 8:4 Comment(24)
Wow! :O My record of sieve is 220ms per 1e7. Your code takes 15ms. Amazing!Crammer
@MrHIDEn, Glad you like my code. If you read this series of tutorial type answers carefully, you can see how it's done. Actually 1e7 is a pretty small range for this algorithm as it is just getting started, and 137 only represents a part of one Page Segment even when the CPU L1 cache size is chosen to be the smallest at 16 Kilobytes, as that one page covers a sieving range of about 2.7525e7 so less than half of one page segment. This algorithm is most useful for large sieving ranges of say 1e8 and above, and really comes into its own above 1e9.Kevyn
@GordoneBGood I did not read carefully. I tried to understand by reading the code, but I found it difficult to me. I need to spend much more time to get the idea. I hope I will find a day to dive in to this.Crammer
Comparing to your previous answer, where you were extracting actual primes, this codes doesn't do it. At which point can we yield an actual next prime found?Lefebvre
@vitaly-t, your question sounds reasonable until one thinks "what am I going to do with a sequence of millions of prime numbers?"; this answer provides a benchmark and so just counts the primes as that is often all one is going to do with the results; however if one wanted to actually do something with the prime values over the range such as sum them, one would change the code near the bottom of the doit function just after filling and culling to, instead of calling the counting function, to call a function that does something else. My next comment will show a function that sums them...Kevyn
@vitaly-t, (cont'd) For instance, if you wanted to sum the actual primes over a range rather than just count them, one would use a function as follows: function sumSieveBuffer(lwi, bitndxlmt, cmpsts) { let sum = 0; const whlfctr = WHLODDCRC + WHLODDCRC; for (let i = 0; i <= bitndxlmt; ++i) for (let ri = 0; ri < WHLHITS; ++ri) if ((cmpsts[ri][i >> 3] & (1 << (i & 7))) == 0) { sum += (lwi + i) * whlfctr + RESIDUES[ri]; } return sum; } with the above function called in place of countSieveBuffer with the additional prefix argument of lwi.Kevyn
@Kevyn Thank you! I'm just trying to optimize my primes-generator module for better performance. It is probably fast enough for where it is intended to be used - Web UI apps and such, but there's no harm optimizing. Your code here seems to be the fastest I have found, I just want to make it usable within my library.Lefebvre
I've been playing with your solutions within my primes-generator library, but in the end could not use it, because: 1) You are pre-allocating full-length array, which is memory-inefficient for lite web processes, and 2) Your algorithm fails beyond 32 bits (if u set 3bln+ limit, for instance), while I need to support full 53 bits that JavaScript supports.Lefebvre
@vitaly-t, if you are running into the above two problems, then you are not using the chapter 4a method: It only allocates arrays of the 16 Kilobytes page size although this version does allocate all of the base primes used in one go, so would use about 5 Megabytes for that range (I plan to change this but haven't gotten to it), and this version definitely works for more than 32-bit ranges - try the snippet above to 1e12, about 10 minutes on my machine (that's about 40 bits). That said, you won't want to sieve the entire range of 53 bits on a web page, as that will take over 1000 hours.Kevyn
@Lefebvre (cont'd), if you want to handle that range, you'll want to refactor the code to sieve a sub range, which the above code doesn't do as it is just a benchmark to demonstrate the techniques of page-segmentation and maximum wheel factorization, but I have done in other applications of this. It will still need the storage of all the base primes to about 95 million though, so will still need the about five million bytes of storage to sieve even a portion of the 53-bit range. Also, an optimization using a larger CPU-L2-cache sized SieveBuffer sieved in L1 sized sub chunks is faster still.Kevyn
@Kevyn Thank you, I appreciate very much your input on this. I will also recognize I'm not smart enough or advanced on the subject to understand what all that bodes for, or what should I change in my code, if anything. At least the latter gives you a link to where this all ended, in case you find interest in contributing to that library ;)Lefebvre
@vitaly-t, You're welcome. Sorry to hear you are giving up at your current level of sieving. It's not about being smart, it's about being persistent: the various codes that I've posted here are the result of almost 50 years of experience in writing sieves - believe me that when I started, I had no idea that the algorithms would lead to this. As to joining your repo as contributor, I must decline as I often get such requests and have other projects I must prioritize; however, if you do something new just tag me in the commit and I'll try to find time to review and comment.Kevyn
I'm not giving up :) I have brought it to a logically finished state, with fairly decent enough performance to be considered production-ready. I surely will revisit it in the future, and my library will see further improvements when I have time again. Thank you for your help!Lefebvre
Would it not be much faster and more efficient to build an algorithm that uses Pierre Dusart's formula as the approximate basis, and then calculates only what's inside that margin? Here's the playground I created.Lefebvre
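For context on the approximation being discussed: Dusart's published bounds on π(x) can be evaluated directly. The following is a minimal sketch assuming the commonly cited bounds π(x) ≥ (x/ln x)(1 + 1/ln x) for x ≥ 599 and π(x) ≤ (x/ln x)(1 + 1.2762/ln x) for x > 1; the function names are illustrative and not taken from the linked playground:

```javascript
// Dusart-style bounds on pi(x), the prime-counting function.
// Lower bound valid for x >= 599, upper bound for x > 1.
function dusartLower(x) {
  const ln = Math.log(x);
  return (x / ln) * (1 + 1 / ln);
}

function dusartUpper(x) {
  const ln = Math.log(x);
  return (x / ln) * (1 + 1.2762 / ln);
}

// Example: pi(1e6) = 78498 lies between the two bounds.
console.log(dusartLower(1e6) < 78498 && 78498 < dusartUpper(1e6)); // true
```

The gap between the two bounds narrows (relatively) as x grows, which is what makes this kind of approximation attractive as a starting margin to refine.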
@vitaly-t, so now you have an approximation of the count of primes over a range, but where are you going to start refining it, given that the approximation is not recursive? That said, although this question is about sieving, there is a family of algorithms that can count/sum the primes to a limit using combinatorial number algorithms that don't need to determine each and every prime, so their performance is about O(n^0.67) rather than the about O(n) of sieving. Using the best of these, the count of primes to 2^64 can be found in about a minute on a desktop computer rather than the about 60 years using sieving.Kevyn
@vitaly-t, (cont'd) if you are interested in such algorithms, you could start by implementing the Meissel-Lehmer prime counting algorithm, which can calculate the number of primes to 10^16 in about an hour on a desktop computer (although not in JavaScript) rather than the almost month it takes using sieving. The advances I mentioned on this combinatorial technique reduce this time to about a second for this range, but at the cost of quite a large increase in code complexity.Kevyn
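For readers wanting a concrete entry point into this family: the simplest ancestor of Meissel-Lehmer is Legendre's formula, π(n) = φ(n, a) + a − 1 with a = π(√n), where φ(x, a) counts the integers in 1..x not divisible by any of the first a primes. Below is a minimal, unoptimized sketch; the function names are mine, and a real Meissel-Lehmer implementation adds memoization and further correction terms:

```javascript
// Simple sieve for the base primes up to `limit`.
function primesUpTo(limit) {
  const composite = new Uint8Array(limit + 1);
  const primes = [];
  for (let i = 2; i <= limit; i++) {
    if (!composite[i]) {
      primes.push(i);
      for (let j = i * i; j <= limit; j += i) composite[j] = 1;
    }
  }
  return primes;
}

// phi(x, a): how many of 1..x are not divisible by any of the first
// `a` primes; recursion phi(x, a) = phi(x, a-1) - phi(x / p_a, a-1).
function phi(x, a, primes) {
  if (x < 1) return 0;          // prune dead branches
  if (a === 0) return Math.floor(x);
  return phi(x, a - 1, primes) -
         phi(Math.floor(x / primes[a - 1]), a - 1, primes);
}

// Legendre: pi(n) = phi(n, a) + a - 1, where a = pi(sqrt(n)).
function primeCount(n) {
  if (n < 2) return 0;
  const primes = primesUpTo(Math.floor(Math.sqrt(n)));
  const a = primes.length;
  return phi(n, a, primes) + a - 1;
}

console.log(primeCount(1000)); // 168
```

Note that, just as the comment above says, the sub-linear behavior comes from never enumerating the primes between √n and n; only the base primes up to √n are actually sieved.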
I don't have interest in very slow algorithms. I am working on my own version of the counter now, one which uses Pierre Dusart's approximation and then narrows it down to a precise value in logarithmic progression, based on the fact that his approximation produces a margin within 1%.Lefebvre
@vitaly-t, you seem to have missed my point - the techniques I referenced using combinatorial numeric algorithms are not slow, and indeed are the fastest currently known general algorithms for finding the number of primes up to a large limit, at the cost of being quite complex. Pierre Dusart's work and other later extensions of it find finer and finer approximations of the count of primes up to a limit but never claim to resolve to the exact correct number, else we would be using these methods to set records such as the count of primes above 10^28, which is not yet known.Kevyn
Thank you for the clarification! I will have a look into that convoluted algo ;)Lefebvre
@vitaly-t, the basic Meissel algorithm only seems convoluted due to the mathematical symbols in which it is expressed; in actual implementation, it is rather simple. After all, Meissel was able to use it to calculate the number of primes to a billion by hand in about 1880. However, it does need to use a sieve to find the primes up to n^(2/3) (a million for a total range of a billion). For a slightly more concise definition see the Meissel section here; Lehmer was able to do this and more on a machine with about 10 kilobytes of memory in 1959!Kevyn
Actually, my attention got caught today by an answer by Ovinus Real. I'm testing and optimizing it now, and I'm getting some fascinating performance numbers, much better than those you got here. I will update later when I finish it completely, but I am already very impressed to have found such a fast and efficient counter in JavaScript, which has received almost no attention, by the way.Lefebvre
@vitaly-t, yes, that's the Meissel method applied to a fairly small range. As I said, of course it is much faster than a pure sieve, as it doesn't actually find all the primes in the range, although you'll notice it contains a sieve for the primes up to range^(2/3). I write these too, but not for a question that specifically asks for a page-segmented sieve, as here. You'll find that in order to maintain good performance for larger ranges, such as the 53-bit range that JavaScript can handle, it needs many more optimizations that can be quite complex.Kevyn
My optimized version of Meissel-Lehmer gave me performance about 14 times better than here: 1e9 => 18ms, 1e10 => 130ms, 1e11 => 1200ms, 1e12 => 13s and 1e13 => 160s. Since I was looking for the fastest-performing JavaScript counter, this is the best I've got so far :)Lefebvre
@vitaly-t, if you really want the fastest, you need to move beyond Meissel-Lehmer to at least Lagarias-Miller-Odlyzko (LMO), which is at least an order of magnitude faster, and more so for larger ranges. Since your goal is to have a web page that can count the primes in the 53-bit number range, you don't want it to take days, but a maximum of minutes with a progress bar. This technique can also find the nth prime by finding the count of primes up to an approximate limit and then just using conventional sieving to find the prime at the difference between that approximate limit and the desired index...Kevyn

Appendix: Using other JavaScript-generating languages

Up to my Chapter 4a answer, only JavaScript has been used in this thread, as per the question's request and code. However, in my mind, writing JavaScript by hand is no longer the correct approach when writing more than a couple of hundred lines of code. There are several reasons for this, as follows:

  1. JavaScript was designed long ago, so its programming models are quite dated. Much of this shows in its support for legacy code, even though it has been (much) updated as per ECMAScript 2015 and newer. However, the updates have led to too many ways of doing things, not all of them efficient, leaving the programmer confused as to the best way to go.
  2. JavaScript is FUGLY in its use of the prototype model to support a version of Object-Oriented Programming (OOP)!
  3. JavaScript is dynamically typed, which can lead to accidental code errors and make it harder to maintain and refactor code.
  4. Hand coding for speed (usually using asm.js-style and newer features) doesn't always produce the optimal code unless one knows a lot about the optimizations used by the JavaScript engines that will actually execute the code.

There are a couple of mainstream options for using another language to produce JavaScript by transpiling, with two of the most common (and best) being as follows:

  1. Microsoft's TypeScript, which is an (optionally) statically typed OOP language that has become quite popular for the above reasons.
  2. My favorite is Fable, which compiles Microsoft's mostly functional, ML-based language F# to efficient JavaScript; it is an even better statically type-checked language, offering type inference and all kinds of neat features.

I don't do OOP any more if I can avoid it (Functional Programming - FP - is the way to go for me whenever it makes sense, as here), but here is a Fable version of the Chapter 4a code (on mobile, view in your browser as a desktop site to use it). As with the Chapter 4a code, one should press the Sieve button several times to allow the JavaScript engine to hot-tune the generated code for optimizations and thus improve speed, which it will reach after four or five iterations. With Fable, one can avoid imperative code entirely, even for the User Interface (UI), by using the Elmish React-based library, which I didn't do here in order not to confuse the sample code; as well, for speed, I continue to use the sieve culling buffer as a mutable array - even the ultimate FP language, Haskell, allows this sort of mutation, although protected by an "ST" monad (it is possible to implement the idea of monads in Fable/F#, but they aren't performant, as there aren't the custom optimizations that Haskell has).

CORRECTIONS TO COMMENTS: It turns out that the Chrome V8 JavaScript engine already seems to optimize the immutable cull mutations away, and the difference in speed for the first change is due to using imperative JavaScript for/while loops rather than Fable's emulation of tail-recursive functions using loops, which is slightly slower. The biggest effect is in the better array copy calling the JavaScript directly. Use of the different bit width for counting is also quite a small improvement. The total improvement is something like 25%, with the copy improvement worth about the same as the other two combined.

In the page that opens from the above link, one can view the generated JavaScript code and will see that pure asm.js-style code is generated, better (more consistent) than what one would write by hand, but the Fable code needed a little help for performance by forcing it to emit JavaScript directly in three places, as follows:

  1. Fable/F# code is immutable by default, and to be true to the functional spirit I use immutable tail-call recursive function loops, but these are slightly slower than imperative for/while loops.
  2. The Fable library function for array copy/blit doesn't (yet, at least) use the fastest JavaScript method (setting from a subarray/slice), so I force it to emit JavaScript that does.
  3. In counting unset bits in the culling array, Fable doesn't (I believe) offer other bit widths of TypedArray views, so I add this by emitting JavaScript, as using a Uint32 view is faster than doing four successive Uint8 reads and assembling the Uint32 by hand with bit shifting.
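Points 2 and 3 above refer to standard JavaScript typed-array idioms; a minimal sketch of both follows (the function names and buffer layout are mine for illustration, not from the Fable code, and the byte buffer length is assumed to be a multiple of four):

```javascript
// Point 2: block copy between typed arrays via set() on a subarray,
// which engines implement as a fast memmove-style operation.
function blit(src, srcPos, dst, dstPos, len) {
  dst.set(src.subarray(srcPos, srcPos + len), dstPos);
}

// Point 3: count the zero (unset) bits of a byte buffer 32 bits at a
// time through a Uint32Array view over the same underlying buffer.
function countUnsetBits(bytes) {
  const words = new Uint32Array(bytes.buffer, bytes.byteOffset, bytes.length >> 2);
  let count = words.length << 5; // total bits in the whole 32-bit words
  for (let i = 0; i < words.length; i++) {
    let v = words[i]; // classic SWAR popcount, subtracted from the total
    v -= (v >>> 1) & 0x55555555;
    v = (v & 0x33333333) + ((v >>> 2) & 0x33333333);
    v = (v + (v >>> 4)) & 0x0f0f0f0f;
    count -= Math.imul(v, 0x01010101) >>> 24;
  }
  return count;
}
```

In a sieve where a zero bit marks a prime candidate that survived culling, `countUnsetBits` over a page buffer yields the prime count for that page without enumerating the primes individually.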

You'll find that the resulting code is just as fast as the hand written JavaScript code in Chapter 4a!

Kevyn answered 2/4, 2020 at 3:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.