As a hobby project I'm taking a crack at finding really large prime numbers. The primality tests for this contain modular exponentiation calculations, i.e. a^e mod n. Let's call this the modpow operation to keep the explanation simple. I am wanting to speed up this particular calculation.
Currently I am using GMP's mpz_pown function, but, it is kind of slow. The reason I think it's too slow, is because a function call to GMP's modpow is slower than a full-blown primality test of the software called PFGW for the same large number. (So to be clear, this is just the GMP's modpow part, not my whole custom primality testing routine I am comparing). PFGW is considered the fastest in it's field and for my use case it uses a Brillhart-Lehmer-Selfridge primality test - which also uses the modpow procedure - so it's not because of mathematical cleverness that PFGW is faster in that aspect (please correct me if I'm wrong here). It looks like the bottleneck in GMP is the modpow operation. An example runtime for numbers which have a little over 20,000 digits: GMP's modpow operation takes about 45 seconds and PFGW finishes the whole primality test (involving a modpow) in 9 seconds flat. The difference gets even more impressive with even bigger numbers. GMP uses FFT multiplication and Montgomery reduction for this test comparison, see comments on this post below.
I did some research. So far I understand that the modpow algorithm uses exponentiation by squaring, integer multiplication and modulo reduction - these all sound very familiar to me. Several helper methods could improve the running time of integer multiplication:
To improve the running time of the exponentiation by squaring part, one may use a signed digit representation to reduce the number of multiplications (i.e. bits are represented as 0, 1 or -1, and the bit string is represented in such a way so that it contains many more zeros than in it's original base-2 representation - this reduces the running time of exponentiation by squaring).
For optimizing the modulo part of the operation, I know of these methods:
So here is the 150,000 dollar question: is there a software library available to do a modpow operation efficiently given a very large base, exponent and modulus? (I'm aiming for several millions of digits). If you would like to suggest an option, please try to explain the inner workings of the algorithm for the case with millions of digits for the base, modulus and exponents, as some libraries use different algorithms based on the number of digits. Basically I am looking for a library which supports the techniques mentioned above (or possibly more clever techniques) and it should perform well while running the algorithm (well, better than GMP at least). So far I've searched, found and tried GMP and PFGW, but didn't find these satisfying (PFGW is fast, but I'm just interested in the modpow operation and there is no direct programming interface to that). I'm hoping that maybe an expert in the field can suggest a library with these capabilities, as there seem to be very few that are able to handle these requirements.
Edit: made the question more concise, as it was marked too broad.
mpn_mul
,mpn_sqr
are used bympn_powm
, which will use FFT if the size of the multiplicands are over the Toom-Cook thresholds. Unless GMP has been built with--disable-fft
.mpn_powm
uses Montgomery reduction, including the case of an even modulus, asmpz_powm
decomposes the modulus into an odd factor, and an even (2^i) factor, and applies the Chinese remainder theorem to the results. The Montgomery reduction routinempn_redc_n
will also use FFT if the operands are large. – Repeatermpn_powm
isn't documented (as mentioned here). That page saysmpn_powm
is faster thanmpz_powm
. Do you happen to know why that is and what the differences between the two are? The GMP function index also doesn't mentionmpn_redc_n
which sounds like something very useful! Where can I find the documentation of these functions you mentioned? Thanks! – Gibbympz_
functions are the recommended interface. thempn_
functions do the real work, implementing the lower-level routines. That said, thempz_
functions also setup working memory, massage the data, select the best routines for the operands, etc. For such a large calculation, you would save nothing in bypassing the MPZ interface, and would have to essentially replicate the setup code. – Repeater