Brute-force, single-threaded prime factorization

Up for consideration is the following function, which can be used to (relatively quickly) factor a 64-bit unsigned integer into its prime factors. Note that the factoring is not probabilistic (i.e., it is exact). The algorithm is already fast enough to determine, within several seconds on modern hardware, that a number is prime or has only a few very large factors.

The question: Can any improvements be made to the algorithm presented, while keeping it single-threaded, so that it can factor (arbitrary) very large unsigned 64-bit integers faster, preferably without using a probabilistic approach (e.g., Miller-Rabin) for determining primality?

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  // Num has to be at least 2 to contain "prime" factors
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2; // Will be used to skip multiples of 3, later

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  // If all of the factors were 2s and 3s, done...
  if (workingNum==1)
    return;

  // sqrtNum is the (inclusive) upper bound of our search for factors
  ulong sqrtNum=(ulong) sqrt(double(workingNum+0.5));

  // Factor out potential factors that are greater than or equal to 5
  // The variable n represents the next potential factor to be tested
  for (ulong n=5;n<=sqrtNum;)
  {
    // Is n a factor of the current working number?
    if (workingNum%n==0)
    {
      // n is a factor, so add it to the list of factors
      factors.push_back(n);

      // Divide current working number by n, to get remaining number to factor
      workingNum/=n;

      // Check if we've found all factors
      if (workingNum==1)
        return;

      // Recalculate the new upper bound for remaining factors
      sqrtNum=(ulong) sqrt(double(workingNum+0.5));

      // Recheck if n is a factor of the new working number, 
      // in case workingNum contains multiple factors of n
      continue;
    }

    // n is not or is no longer a factor, try the next odd number 
    // that is not a multiple of 3
    n+=nextOffset;
    // Adjust nextOffset to be an offset from n to the next non-multiple of 3
    nextOffset=(nextOffset==2UL ? 4UL : 2UL);
  }

  // Current workingNum is prime, add it as a factor
  factors.push_back(workingNum);
}

Thanks

Edit: I've added even more comments. The reason a vector is passed in by reference is to allow the vector to be reused between calls and avoid dynamic allocations. The reason the vector is not emptied in the function is to allow for the odd requirement of appending the current num's factors to factors already in the vector.
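
For illustration, a caller that takes advantage of this reuse might look roughly like the following sketch (FactorAll and its input list are hypothetical; only GetFactors is from the code above):

#include <vector>

typedef unsigned __int64 ulong;                    // as above; use uint64_t elsewhere
typedef std::vector<ulong> ULongVector;
void GetFactors(ULongVector &factors, ulong num);  // the function under review

void FactorAll(const ULongVector &inputs)
{
  ULongVector factors;
  factors.reserve(64); // a 64-bit value has at most 63 prime factors (2^63)

  for (ULongVector::size_type i=0; i<inputs.size(); ++i)
  {
    factors.clear();   // keeps the capacity from previous calls, drops the old results
    GetFactors(factors, inputs[i]);
    // ... consume the factors of inputs[i] here ...
  }
}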

The function itself is not pretty and can be refactored, but the question is about how to make the algorithm faster. So, please, no suggestions about how to make the function prettier, more readable, or more C++ish. That's child's play. Improving this algorithm so that it can find (proven) factors faster is the difficult part.

Update: Potatoswatter has some excellent solutions so far, be sure to check out his MMX solution near the bottom, as well.

Taritariff answered 12/10, 2010 at 20:57 Comment(26)
Your code doesn't seem to clear out factors before starting. Why not make it a function that returns a ULongVector?Postfree
What is your question?Bayless
The question is how to speed up this algorithm, hopefully via algorithmic improvements/additions.Taritariff
What makes you think this code is "highly efficient"? Not saying it is not, just wondering...Hadden
It performs its job very fast for very large integers. If it is possible to make it perform its job faster, I'm all ears. In fact this is the question being asked!Taritariff
How fast is very fast? How large is very large?Shindig
At no loss of efficiency, you can make this a template that accepts an output iterator, removing the dependence on vector. A user can then still use a vector with GetFactors(n, back_inserter(v)).Quinidine
Fast is about 6 seconds for the hardest 64-bit unsigned integers to factor (e.g., primes and composites with very large factors). I want to know if this simple algorithm can be improved to be even faster.Taritariff
@wok: That little trick with nextOffset saves 33% of the calls to the integer modulo operator, which is very expensive.Tighe
241 * 251 * 65521 * 4294967291 == 17022805693386603001, which makes it an interesting test candidate.Peridot
Test 15631: you are trying to divide by 25 while you have already tried dividing by 5 as many times as possible. While the algorithm is effective, it is not perfect here. But maybe no algorithm is for the purpose you are trying to achieve, where a sieve is not conceivable.Shindig
@Ben Voigt: I realized it later. That's a nice trick. :)Shindig
@wok: A sieve for a 64-bit number would need a prohibitive amount of memory.Tighe
It occurs to me that it might be a really good idea to do a probabilistic primality test for your number once you start trying factors over 2^16, basically once the expected frequency of prime numbers is small enough that the cost of doing the test is outweighed by the cost of dividing until you get to the next one.Peridot
The key here is to avoid such probabilistic tests, at least for the purposes of this algorithm, because 1) they are quite complex and 2) they can produce errors (even if improbably so). I was hoping for rather straightforward evolutionary (complexity/performance) improvements to the algorithm as stated.Taritariff
@Michael: you could use a probabilistic test, if you could also somehow come up with a list of all false positives of that test (and their largest prime factors). There are "only" a few million Carmichael numbers under 2^64. Could fit in RAM, and searching an array that size even on disk with an interpolation chop would not take long compared with your 6 seconds, and anyway could be done in parallel since it's not CPU intensive. I think. I'm very much not an expert on factorization.Apoenzyme
@Michael: you could probably try unrolling the loop once, so that you don't have the conditional +2 or +4. Instead you unconditionally increment by 2 in the middle of the loop, and 4 at the end. Might not make any difference, might prevent a branch misprediction 50-100% of the time. Who knows.Apoenzyme
That's an interesting idea. I'll have to give that a try and post back...Taritariff
Don't underestimate how good a sieve can be. I'm generating one with a multibyte encoding, so most primes are represented by just one byte. So far it's run for about 1 hour and generated all the primes up to a billion into a ~100 MB file. Will write the 64-bit factorizer and post results tomorrow when it's done.Folsom
I would be very interested to hear the results of this attempt, but as I said previously I am very pessimistic on it improving performance. I think this is one of those cases where an improvement in complexity (a sieve vs brute force) may be strongly overweighed by "the constant" of real world performance. Still, it would be interesting to see how it performs, especially for large N!Taritariff
@Michael: As an aside, there is a deterministic variant of Miller-Rabin, but I'm not sure how fast it is compared to your code. Also, it relies on the generalized Riemann hypothesis...which is probably only a concern if you're writing this algorithm for a technical paper. By the way, the errors produced by a probabilistic primality test should not be of major concern to you, assuming you use such a primality test properly. Keep in mind that usually you can get a fast result and still have the probability of error be lower than that of an error being introduced by a PC glitch.Fraser
@Brian: "only a concern if you're writing this algorithm for a technical paper" - if there's a false positive of the deterministic Miller-Rabin below 2^64, proving that the GRH is false, then as someone with a maths degree, I strongly encourage people to write code that's buggy if they encounter it. That result is worth way more than 99.9% of software systems ;-) Seriously, though, there are probably various published lower bounds for the first possible false positive, so the thing to do would be to look for one that's big enough for the domain of the test.Apoenzyme
@Steve: I think you are focusing too much on the possibility of false positives. Unless your RNG is broken or you pick an overly low number of trials, the probability of a false positive in the nondeterministic Miller-Rabin algorithm is too low to be worthy of notice, considering that CPU/RAM failures do happen occasionally. In fact, some of the optimizations used on many popular CPUs actually have known timing bugs (in the name of MUCH faster CPUs) which happen absurdly rarely but cause broken execution order of CPU instructions.Fraser
@Brian: sure, I'm aware of the general point, it also comes up when randomly generating GUIDs and when using secure hashes. But since 2^64 is within the area where you can deterministically factorize, it seems silly to me ignore that option even if you don't end up taking it. I don't think I'm "focusing" on anything, just pointing out that in cases where an algorithm isn't proven for all integers, it might still be proven for the pitifully small numbers you care about. Even if Michael ends up using a probabilistic check, surely it's of interest what deterministic checks are available?Apoenzyme
So what are the rules, here? You don't want to use inline assembler, how about macros? Or is that "cheating"?Malinger
Keep it to C++, please. The question is algorithmic in nature.Taritariff

Compare such an approach to a (pre-generated) sieve. Modulo is expensive, so both approaches essentially do two things: generate potential factors, and perform modulo operations. Either program should reasonably generate a new candidate factor in fewer cycles than a modulo takes, so either program is modulo bound.

The given approach filters out a constant proportion of all integers, namely the multiples of 2 and 3, which is two-thirds of them. Roughly one in three numbers (as given) is used as an argument to the modulo operator. I'll call it a skip filter.

On the other hand, a sieve uses only primes as arguments to the modulo operator, and the density of primes near N is governed by the prime number theorem to be about 1/ln(N). For example, e^20 is just under 500 million, so numbers over 500 million have under a 5% chance of being prime. If all numbers up to 2^32 are considered, 5% is a good rule of thumb.

Therefore, a sieve will spend roughly five times less time on div operations than your skip filter. The next factor to consider is the speed at which the sieve produces primes, i.e. reads them from memory or disk. If fetching one prime is faster than 4 divs, then the sieve is faster. According to my tables, div throughput on my Core2 is at most one per 12 cycles. These will be hard division problems, so let's conservatively budget 50 cycles per prime. For a 2.5 GHz processor, that's 20 nanoseconds.

In 20 ns, a 50 MB/sec hard drive can read about one byte. The simple solution is to use 4 bytes per prime, but then the drive can't keep up. We can be more clever, though: if we want to encode all the primes in order, we can just encode their differences, and the expected difference is only about ln(N). Also, the differences are all even, which saves an extra bit, and they are never zero, which makes extension to a multibyte encoding free. So, using one byte per prime, differences up to 512 can be stored in one byte, which gets us up to 303371455241 according to that Wikipedia article.

Therefore, depending on the hard drive, a stored list of primes should be about equal in speed at verifying primality. If it can be stored in RAM (it's 203 MB, so subsequent runs will probably hit the disk cache), then the problem goes away entirely, as the FSB speed typically differs from the processor speed by a factor less than the FSB width in bytes — i.e., the FSB can transfer more than one prime per cycle. The factor of improvement is then the reduction in division operations, i.e. roughly five times. This is borne out by the experimental results below.

Of course, then there is multithreading. Ranges of either primes or skip-filtered candidates can be assigned to different threads, making either approach embarrassingly parallel. There are no optimizations that don't involve increasing the number of parallel divider circuits, unless you somehow eliminate the modulo.

Here is such a program. It's templated so you could add bignums.

/*
 *  multibyte_sieve.cpp
 *  Generate a table of primes, and use it to factorize numbers.
 *
 *  Created by David Krauss on 10/12/10.
 *
 */

#include <cmath>
#include <bitset>
#include <limits>
#include <memory>
#include <fstream>
#include <sstream>
#include <iostream>
#include <iterator>
#include <stdint.h>
using namespace std;

char const primes_filename[] = "primes";
enum { encoding_base = (1<< numeric_limits< unsigned char >::digits) - 2 };

template< typename It >
unsigned decode_gap( It &stream ) {
    unsigned gap = static_cast< unsigned char >( * stream ++ );

    if ( gap ) return 2 * gap; // only this path is tested

    gap = ( decode_gap( stream )/2-1 ) * encoding_base; // deep recursion
    return gap + decode_gap( stream ); // shallow recursion
}

template< typename It >
void encode_gap( It &stream, uint32_t gap ) {
    unsigned len = 0, bytes[4];

    gap /= 2;
    do {
        bytes[ len ++ ] = gap % encoding_base;
        gap /= encoding_base;
    } while ( gap );

    while ( -- len ) { // loop not tested
        * stream ++ = 0;
        * stream ++ = bytes[ len + 1 ];
    }
    * stream ++ = bytes[ 0 ];
}

template< size_t lim >
void generate_primes() {
    auto_ptr< bitset< lim / 2 > > sieve_p( new bitset< lim / 2 > );
    bitset< lim / 2 > &sieve = * sieve_p;

    ofstream out_f( primes_filename, ios::out | ios::binary );
    ostreambuf_iterator< char > out( out_f );

    size_t count = 0;

    size_t last = sqrtl( lim ) / 2 + 1, prev = 0, x = 1;
    for ( ; x != last; ++ x ) {
        if ( sieve[ x ] ) continue;
        size_t n = x * 2 + 1; // translate index to number
        for ( size_t m = x + n; m < lim/2; m += n ) sieve[ m ] = true;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    for ( ; x != lim / 2; ++ x ) {
        if ( sieve[ x ] ) continue;
        encode_gap( out, ( x - prev ) * 2 );
        prev = x;
    }

    cout << prev * 2 + 1 << endl;
}

template< typename I >
void factorize( I n ) {
    ifstream in_f( primes_filename, ios::in | ios::binary );
    if ( ! in_f ) {
        cerr << "Could not open primes file.\n"
                "Please generate it with 'g' command.\n";
        return;
    }

    while ( n % 2 == 0 ) {
        n /= 2;
        cout << "2 ";
    }
    unsigned long factor = 1;

    for ( istreambuf_iterator< char > in( in_f ), in_end; in != in_end; ) {
        factor += decode_gap( in );

        while ( n % factor == 0 ) {
            n /= factor;
            cout << factor << " ";
        }

        if ( n == 1 ) goto finish;
    }

    cout << n;
finish:
    cout << endl;
}

int main( int argc, char *argv[] ) {
    if ( argc != 2 ) goto print_help;

    unsigned long n;

    if ( argv[1][0] == 'g' ) {
        generate_primes< (1ul<< 32) >();
    } else if ( ( istringstream( argv[1] ) >> n ).rdstate() == ios::eofbit ) {
        factorize( n );
    } else goto print_help;

    return 0;

print_help:
    cerr << "Usage:\n\t" << argv[0] << " <number> -- factorize number.\n"
            "\t" << argv[0] << " g -- generate primes file in current directory.\n";
}

Performance on a 2.2 GHz MacBook Pro:

dkrauss$ time ./multibyte_sieve g
4294967291

real    2m8.845s
user    1m15.177s
sys    0m2.446s
dkrauss$ time ./multibyte_sieve 18446743721522234449
4294967231 4294967279 

real    0m5.405s
user    0m4.773s
sys 0m0.458s
dkrauss$ time ./mike 18446743721522234449
4294967231 4294967279
real    0m25.147s
user    0m24.170s
sys 0m0.096s
Folsom answered 13/10, 2010 at 20:27 Comment(8)
Well, the theory sure sounds interesting, but ultimately, let's see how it performs. So, when you fix the bug, please post your results here. I would especially be interested to hear how it handles very large primes (e.g., 18446744073709542713), and also how it handles large numbers consisting of only two large prime factors (e.g., 18446743721522234449).Taritariff
… and it takes 25 sec for me. Wow, you have a fast machine.Folsom
So the conclusion is the one we could expect: if you want to factor one big number once in a while, mike's code looks better, but if you are interested in factoring many numbers and disk space is not a problem, the sieve is way faster (five times faster for a 20-digit number that is the product of two 10-digit primes). Compare the constant cost (generating the primes file) against the cost of one factorization times the number of calls.Shindig
@wok: Pretty much. The overhead of having the table is surprisingly low, though. 75 seconds to generate that table, excluding I/O but including some OS overhead. I haven't tried the prime generator on a regular array instead of the streambuf_iterator, but it should be a bit faster if you run it that way. Right now I'm trying an optimization so it won't churn through memory so much. The break-even point is already just 3 primes to verify, I might get it down to 2.Folsom
I wonder if testing against the sieve could be effectively parallelized.Peridot
@Omni: Certainly. Like I said, just divide it into subranges and assign one to each thread. Each thread will also need a checkpoint to be able to begin in the middle. After joining the threads, you'll need to merge their primes and get the final quotient. But all that should be easy.Folsom
I will let this question sit for another few days. If nobody else comes up with a significant improvement that doesn't involve a look up table, or perhaps a better lookup table variant, I will mark this answer as accepted, unless I find deficiencies, time-performance wise, with this method in that time.Taritariff
@Michael: I'm not sure SO handles multiple answers from the same person well, so make sure you check out my new program down at the bottom of the page. It requires SSE2 but doesn't put anything on the heap.Folsom

My other answer is rather long and quite different from this one, so here's something else.

Rather than just filtering out multiples of the first two primes (as the question does), or encoding all relevant primes in one byte each (as my other answer does), this program filters out multiples of the first 47 primes, 2 through 211. So instead of passing 33% of numbers on to the division operator, it passes about 10%.

The primes are kept in three SSE registers, and a countdown to each prime's next multiple of the running counter is kept in another three. If any countdown reaches zero, the counter is divisible by that prime and cannot be prime. Also, if any countdown is equal to one, then (counter+2) cannot be prime, etc., up to (counter+30). Even numbers are ignored, and offsets like +3, +6, and +5 are skipped. Vector processing allows updating sixteen byte-sized variables at once.

After throwing a kitchen sink full o' micro-optimizations at it (but nothing more platform specific than an inline directive), I got a 1.78x performance boost (24 s vs 13.4 s on my laptop). If using a bignum library (even a very fast one), the advantage is greater. See below for a more readable, pre-optimization version.

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 47 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;

inline void remove_factor( unsigned long &n, unsigned long factor ) __attribute__((always_inline));
void remove_factor( unsigned long &n, unsigned long factor ) {
    while ( n % factor == 0 ) {
        n /= factor;
        cout << factor << " ";
    }
}

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
                     53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127,
                     131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        remove_factor( n, * p );
    }

    //int histo[8] = {}, total = 0;

    enum { bias = 15 - 128 };
    __m128i const prime1 =       _mm_set_epi8( 21, 21, 21, 22, 22, 26, 26, 17, 19, 23, 29, 31, 37, 41, 43, 47 ),
            prime2 =             _mm_set_epi8( 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 127 ),
            prime3 =             _mm_set_epi8( 131, 137, 139, 149, 151, 157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211 ),
            vbias = _mm_set1_epi8( bias ),
            v3 = _mm_set1_epi8( 3+bias ), v5 = _mm_set1_epi8( 5+bias ), v6 = _mm_set1_epi8( 6+bias ), v8 = _mm_set1_epi8( 8+bias ),
            v9 = _mm_set1_epi8( 9+bias ), v11 = _mm_set1_epi8( 11+bias ), v14 = _mm_set1_epi8( 14+bias ), v15 = _mm_set1_epi8( 15+bias );
    __m128i mod1 = _mm_add_epi8( _mm_set_epi8(  3, 10, 17,  5, 16,  6, 19,  8,  9, 11, 14, 15, 18, 20, 21, 23 ), vbias ),
            mod2 = _mm_add_epi8( _mm_set_epi8( 26, 29, 30, 33, 35, 36, 39, 41, 44, 48,  50,  51,  53,  54,  56,  63 ), vbias ),
            mod3 = _mm_add_epi8( _mm_set_epi8(  65,  68,  69,  74,  75,  78,  81,  83,  86,  89,  90,  95,  96,  98,  99, 105 ), vbias );

    for ( unsigned long factor = 1, limit = sqrtl( n ); factor <= limit + 30; factor += 30 ) {
        if ( n == 1 ) goto done;

        // up to 2^32, distribution of number candidates produced (0 up to 7) is
        // 0.010841     0.0785208   0.222928    0.31905     0.246109    0.101023    0.0200728   0.00145546 
        unsigned candidates[8], *cand_pen = candidates;
        * cand_pen = 6;
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v3 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v3 ), _mm_cmpeq_epi8( mod3,  v3 ) ) ) );
        * cand_pen = 10;                                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v5 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v5 ), _mm_cmpeq_epi8( mod3,  v5 ) ) ) );
        * cand_pen = 12;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v6 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v6 ), _mm_cmpeq_epi8( mod3,  v6 ) ) ) );
        * cand_pen = 16;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v8 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v8 ), _mm_cmpeq_epi8( mod3,  v8 ) ) ) );
        * cand_pen = 18;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1,  v9 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2,  v9 ), _mm_cmpeq_epi8( mod3,  v9 ) ) ) );
        * cand_pen = 22;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v11 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v11 ), _mm_cmpeq_epi8( mod3, v11 ) ) ) );
        * cand_pen = 28;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v14 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v14 ), _mm_cmpeq_epi8( mod3, v14 ) ) ) );
        * cand_pen = 30;                                                                                                                            
        cand_pen += !( _mm_movemask_epi8( _mm_cmpeq_epi8( mod1, v15 ) ) | _mm_movemask_epi8( _mm_or_si128( _mm_cmpeq_epi8( mod2, v15 ), _mm_cmpeq_epi8( mod3, v15 ) ) ) );

        /*++ total;
        ++ histo[ cand_pen - candidates ];

        cout << "( ";
        while ( cand_pen != candidates ) cout << factor + * -- cand_pen << " ";
        cout << ")" << endl; */

        mod1 = _mm_sub_epi8( mod1, _mm_set1_epi8( 15 ) ); // update residuals
        __m128i mask1 = _mm_cmplt_epi8( mod1, _mm_set1_epi8( 1+bias ) );
        mask1 = _mm_and_si128( mask1, prime1 ); // residual goes to zero or negative?
        mod1 = _mm_add_epi8( mask1, mod1 ); // combine reset into zero or negative

        mod2 = _mm_sub_epi8( mod2, _mm_set1_epi8( 15 ) );
        __m128i mask2 = _mm_cmplt_epi8( mod2, _mm_set1_epi8( 1+bias ) );
        mask2 = _mm_and_si128( mask2, prime2 );
        mod2 = _mm_add_epi8( mask2, mod2 );

        mod3 = _mm_sub_epi8( mod3, _mm_set1_epi8( 15 ) );
        __m128i mask3 = _mm_cmplt_epi8( mod3, _mm_set1_epi8( 1+bias ) );
        mask3 = _mm_and_si128( mask3, prime3 );
        mod3 = _mm_add_epi8( mask3, mod3 );

        if ( cand_pen - candidates == 0 ) continue;
        remove_factor( n, factor + candidates[ 0 ] );
        if ( cand_pen - candidates == 1 ) continue;
        remove_factor( n, factor + candidates[ 1 ] );
        if ( cand_pen - candidates == 2 ) continue;
        remove_factor( n, factor + candidates[ 2 ] );
        if ( cand_pen - candidates == 3 ) continue;
        remove_factor( n, factor + candidates[ 3 ] );
        if ( cand_pen - candidates == 4 ) continue;
        remove_factor( n, factor + candidates[ 4 ] );
        if ( cand_pen - candidates == 5 ) continue;
        remove_factor( n, factor + candidates[ 5 ] );
        if ( cand_pen - candidates == 6 ) continue;
        remove_factor( n, factor + candidates[ 6 ] );
    }

    cout << n;
done:
    /*cout << endl;
    for ( int hx = 0; hx < 8; ++ hx ) cout << (float) histo[hx] / total << " ";*/
    cout << endl;
}


dkrauss$ /usr/local/bin/g++ main.cpp -o factorize_sse -O3 --profile-use
dkrauss$ time ./factorize_sse 18446743721522234449
4294967231 4294967279 

real    0m13.437s
user    0m13.393s
sys 0m0.011s

Below is the first draft of the above. Optimizations included

  • Make the cyclic counter merge unconditional (avoid a branch).
  • Obtain ILP by unrolling loop by a factor of 15, increasing stride to 30.
    • Inspired by your optimization.
    • 30 seems to be a sweet spot, as it removes multiples of 2, 3, and 5 for free.
    • Primes between 5 and 15 may have several multiples in one stride, so put several copies at varied phase in the vector.
  • Factor out remove_factor.
  • Change conditional, unpredictable remove_factor calls to non-branching array writes.
  • Fully unroll final loop with remove_factor call and make sure the function is always inlined.
    • Eliminate the final unrolled iteration as there is always a multiple of 7 among the candidates.
  • Add another vector containing all the remaining primes that are small enough.
  • Make more space by adding a bias to the counters, and add yet another vector. Now there are only six more primes that could possibly be filtered without bumping up to 16 bits, and I've also run out of registers: the loop needs 3 prime vectors, 3 modulus vectors, 8 constants to search for, and one constant each to increment by and to do the range check. That makes 16.
    • The gain is minimal (but positive) in this application, but the original purpose of this technique was to filter primes for the sieve in the other answer. So stay tuned…

Readable version:

/*
 *  factorize_sse.cpp
 *  Filter out multiples of the first 17 primes while factorizing a number.
 *
 *  Created by David Krauss on 10/14/10.
 *
 */

#include <cmath>
#include <sstream>
#include <iostream>
#include <xmmintrin.h>
using namespace std;

int main( int argc, char *argv[] ) {
    unsigned long n;

    if ( argc != 2
        || ( istringstream( argv[1] ) >> n >> ws ).rdstate() != ios::eofbit ) {
        cerr << "Usage: " << argv[0] << " <number>\n";
        return 1;
    }

    int primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 };
    for ( int *p = primes; p < primes + sizeof primes/sizeof *primes; ++ p ) {
        while ( n % * p == 0 ) {
            n /= * p;
            cout << * p << " ";
        }
    }

    if ( n != 1 ) {
        __m128i       mod   = _mm_set_epi8( 1, 2, 3,  5,  6,  8,  9, 11, 14, 15, 18, 20, 21, 23, 26, 29 );
        __m128i const prime = _mm_set_epi8( 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59 ),
                      one = _mm_set1_epi8( 1 );

        for ( unsigned long factor = 1, limit = sqrtl( n ); factor < limit; ) {
            factor += 2;
            __m128i mask = _mm_cmpeq_epi8( mod, one ); // residual going to zero?
            mod = _mm_sub_epi8( mod, one ); // update other residuals
            if ( _mm_movemask_epi8( mask ) ) {
                mask = _mm_and_si128( mask, prime ); // reset cycle if going to zero
                mod = _mm_or_si128( mask, mod ); // combine reset into zeroed position

            } else while ( n % factor == 0 ) {
                n /= factor;
                cout << factor << " ";
                if ( n == 1 ) goto done;
            }
        }
        cout << n;
    }
done:
    cout << endl;
}
Folsom answered 15/10, 2010 at 11:53 Comment(5)
I don't think I'll give this particular version the checkmark, barring a revolutionary performance improvement, but I did give it a plus-one vote for the extra effort and hope others do the same!Taritariff
@Michael: Updated with micro-optimizations. Now 25% faster… I'd put the upper bound on remaining possible gains around 10% compounded.Folsom
@Michael: Updated again. I underestimated the remaining improvement (positive jinx, I guess)… I've cut another 3 seconds off, which is 16% compound or 38% less time than your program. That is, close to twice as fast. Also PGO got me down to 14.5 sec once but I can't seem to reproduce it.Folsom
It's very possible that a stride of 42 (2*3*7), producing up to 10 candidates from a pool of 12, would be slightly faster, but I'll leave that as an exercise to the reader. (Hint: It could only filter out primes up to 37.)Folsom
@Michael: Updated again with more primes, cutting another 1.5 sec off the runtime. I haven't run it through the profiler, but I suspect there could be a further gain by balancing the OR operations between the integer and vector ALUs.Folsom

Fermat's factorization method is simple and quick for finding pairs of large prime factors as long as you stop it before it goes too far and becomes slow. However, in my tests on random numbers such cases have been too rare to see any improvement.
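
For reference, a minimal sketch of Fermat's method with an iteration cap (illustrative only, written for clarity rather than speed) could look like this:

#include <cmath>
#include <stdint.h>

// Fermat's method for an odd n > 1: look for a^2 - n == b^2, which gives
// n = (a - b) * (a + b). Returns a divisor of n (possibly the trivial 1),
// or 0 if no split was found within maxSteps.
uint64_t fermat_split(uint64_t n, uint64_t maxSteps)
{
    // x86 long double has enough mantissa bits for an exact ceiling here;
    // platforms where long double == double may need an explicit adjustment.
    uint64_t a = (uint64_t) ceill(sqrtl((long double) n));

    for (uint64_t step = 0; step < maxSteps; ++step, ++a)
    {
        // Unsigned wraparound keeps a*a - n correct as long as the true
        // difference fits in 64 bits, which holds for small step counts.
        uint64_t diff = a * a - n;
        uint64_t b = (uint64_t) sqrtl((long double) diff);
        while (b * b > diff) --b;                  // undo any rounding error
        while ((b + 1) * (b + 1) <= diff) ++b;
        if (b * b == diff)
            return a - b;                          // n == (a - b) * (a + b)
    }
    return 0;                                      // gave up within the budget
}

If fermat_split returns a nontrivial divisor d, the factorization continues recursively on d and n/d; if it returns 0 or 1, the caller simply moves on to the next method.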

...without using a probabilistic approach (e.g., Miller-Rabin) for determining primality

With uniform distribution, 75% of your inputs are going to need a billion loop iterations, so it's worth spending a million operations on less deterministic techniques first, even if you get an inconclusive answer and have to fall back to trial division.

I've found the Brent variation of Pollard's Rho Method to be very good, though more complicated to code and understand. The best example I've seen is from this forum discussion. The method relies on luck, but helps often enough to be worthwhile.

Used with a fixed set of witness bases, the Miller-Rabin primality test is actually deterministic up to about 10^15, which can save you the trouble of a fruitless search.

I tried a few dozen variations and settled on the following for factoring int64 values:

  1. Trial division on small factors. (I use first 8000 precomputed primes.)
  2. 10 attempts with Pollard's Rho, each using 16 iterations
  3. Trial division to sqrt(n).

Note that Pollard's Rho finds factors that are not necessarily prime, so recursion can be used to factor those.
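
To make step 2 concrete, here is a rough sketch of a single Pollard rho attempt (not the exact code I settled on): it uses plain Floyd cycle detection rather than the Brent refinement, and it leans on the GCC/Clang __uint128_t extension for the 64-bit modular multiply.

#include <stdint.h>

// 64-bit modular multiplication through a 128-bit intermediate.
// __uint128_t is a GCC/Clang extension; MSVC would need _umul128 or a
// shift-and-add fallback instead.
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)((__uint128_t) a * b % m);
}

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// One Pollard rho attempt on an odd composite n > 3, iterating x -> x^2 + c (mod n)
// for a small constant c (say 1 to 20). Returns a nontrivial divisor of n, or n
// itself if this attempt failed; the caller retries with a different c, and
// recursively factors whatever comes back, since the divisor found is not
// necessarily prime.
static uint64_t pollard_rho(uint64_t n, uint64_t c)
{
    uint64_t x = 2, y = 2, d = 1;
    while (d == 1)
    {
        x = (mulmod(x, x, n) + c) % n;     // tortoise advances one step
        y = (mulmod(y, y, n) + c) % n;     // hare advances two steps
        y = (mulmod(y, y, n) + c) % n;
        d = gcd64(x > y ? x - y : y - x, n);
    }
    return d;
}

A driver strips small factors by trial division first, calls this a handful of times with different values of c, recursively factors whatever comes back, and falls back to trial division up to sqrt(n) only when every attempt returns n, which matches the three-step recipe above.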

Otherworldly answered 16/10, 2010 at 17:24 Comment(0)

This code is rather slower, and I'm pretty sure I understand why. It's not dramatically slower, but definitely slower, in the 10-20% range. The division should not be done on every pass through the loop, but the only way to avoid that is to call sqrt or something similar.

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;
  ulong nextOffset=2;

  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  ulong n = 5;
  while ((workingNum != 1) && ((workingNum / n) >= n)) {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;
    } else {
      n+=nextOffset;
      nextOffset=(nextOffset==2UL ? 4UL : 2UL);
    }
  }

  if (workingNum != 1) {
    // workingNum is prime, add it to the list of factors        
    factors.push_back(workingNum);
  }
}
Peridot answered 12/10, 2010 at 21:20 Comment(16)
The code alternates between 2 and 4 in order to skip numbers divisible by 3, since all factors of 3 have already been factored out. Thank you for your effort. I will give your approach a try later today and time it vs the original, to see if there is an incremental improvement in speed with this refactoring of the code...Taritariff
@Michael Goldshteyn - I just finished verifying that the offset trick you're doing will indeed filter out only factors of 3. So yes, it will work.Peridot
elimination of sqrt == good. Moving the workingNum == 1 and division into the loop condition == bad.Tighe
@Ben Voigt - Unfortunately, I don't think elimination of the sqrt really helps. That can be done in a single instruction fairly quickly on modern processors.Peridot
@Ben: regarding the division in the loop, if he's really lucky Omnifarious could benefit from an instruction like idiv that does division and modulus together. Since the modulus is necessary, that would make the division basically free.Apoenzyme
@Steve Jessop: An optimising compiler might be smart enough to figure that out on its own.Postfree
@Greg: sorry, that's what I meant - he needs to have idiv, and his compiler has to use both results of a single op. If he's not lucky, his compiler doesn't figure it out, and does the same idiv twice (then again if n is a factor). I guess you thought I meant he could stick some asm in there? Which isn't a bad idea if the compiler hasn't handled it, just not what I was trying to suggest.Apoenzyme
OK, let's keep inline assembly out of the equation. I was hoping more for algorithmic improvements or at least improvements within the confines of C++.Taritariff
@Omnifarious: single instruction doesn't mean fast. Some instructions have much higher latency than others (although with pipelining, speed is as much dependent on the inherent dependencies as on the actual instructions used). Besides, the old code did an int->float conversion, float addition, float sqrt, then float->int conversion, which is quite a bit more than just one instruction.Tighe
The reason that sqrt was so fast is because it was called once per factor, not once per iteration. I did anticipate that your divisions would take longer than the sqrt for that reason.Tighe
Also, I wonder how much it would help to make n only half the width of num.Tighe
@Greg Hewgill - I was hoping that it would figure it out on its own. I'm using gcc 4.5 to test. The assembly code was confusing enough that I couldn't quite tell. It really re-arranged things.Peridot
@Ben Voigt - Integer square root using Newton-Raphson wouldn't take all that long.Peridot
You guys are too obsessed with the role of the square root calculation. But, as mentioned already, this is only done once per factor, and there can be at most 63 factors versus the billions of potential mod tests. So, please, concentrate on the bottleneck(s) in the "hot" part of the code.Taritariff
I implemented an approximate integer sqrt using gcc's __builtin_clzll function and shifting slightly fewer than half the bits out. It was a lot slower.Peridot
@Omnifarious: not surprised. Hardware support for sqrt helps a lot. But yeah, that's the approach I would take if I didn't have an FPU. Note that using a lower bound on the sqrt (such as the next lower power-of-2) is good enough for finding all but the very last factor, once n reaches that number you could use binary search to find the real sqrt and you'd only have to do that once.Tighe

Incorporating some ideas from Omnifarious, plus other improvements:

// system specific typedef for ulong should go here (or use boost::uint64_t)
typedef unsigned __int64 ulong;
typedef std::vector<ulong> ULongVector;

// Caller needs to pass in an empty factors vector
void GetFactors(ULongVector &factors, ulong num)
{
  if (num<2)
    return;

  ulong workingNum=num;

  // Factor out factors of 2
  while (workingNum%2==0)
  {
    factors.push_back(2);
    workingNum/=2;
  }

  // Factor out factors of 3
  while (workingNum%3==0)
  {
    factors.push_back(3);
    workingNum/=3;
  }

  if (workingNum==1)
    return;

  // Factor out factors >=5
  ulong nextOffset=2;
  char nextShift = 1;
  ulong n = 5;
  ulong nn = 25;
  do {
    // Is workingNum divisible by n?
    if (workingNum%n==0)
    {
      // n is a factor!
      // so is the number multiplied by n to get workingNum

      // Insert n into the list of factors
      factors.push_back(n);

      // Divide working number by n
      workingNum/=n;

      // Test for done...
      if (workingNum==1)
        return;

      // Try n again
    }  
    else {
      nn += (n << (nextShift+1)) + (1<<(nextShift*2)); // (n+b)^2 = n^2 + 2*n*b + b^2
      n += nextOffset;
      nextOffset ^= 6;
      nextShift ^= 3;
      // invariant: nn == n*n
      if (n & 0x100000000LL) break; // careful of integer wraparound in n^2
    }
  } while (nn <= workingNum);

  // workingNum is prime, add it to the list of factors        
  factors.push_back(workingNum);
}
Tighe answered 12/10, 2010 at 21:37 Comment(6)
You need to swap the n += and nn += lines -- incrementing n first results in an nn value that is a little bit too large, so you might end the loop too soon. Try you program factoring 121 (11^2) to see.Landan
@Chris: You are correct, good catch. Could also fix it by changing the sign on the final term (n^2 = (n-b)^2 + 2*n*b - b^2), but nice bonus of this order is that it will pipeline better because it isn't dependent on the just-calculated value of n.Tighe
@PigBen: Yeah I just found that problem, missing parentheses. Would you try again?Tighe
Downvote removed. Retesting speed to see if it deserves an upvote.Squiffy
@Ben Voigt - Unfortunately, according to my testing yours is slower than mine, and takes nearly twice as long as the OP's. Mine takes approximately 10% longer. sqrt being a fast single-instruction operation on modern CPUs really makes removing it not that big of an improvement.Peridot
Strange. How does it compare to nuke nn and nextShift and just do n*n in the while condition? I guess that the term (1<<nextShift*2) doesn't actually need to be computed at runtime, it could be toggled between the two values (4, 16) using xor as well. Also would have to see what code the compiler came up with for the wraparound test.Tighe

The natural generalization is to precompute skippages using more known primes than just 2 and 3. Like 2, 3, 5, 7, 11, for a pattern period of 2310 (huh, nice number). And perhaps more, but it has diminishing returns -- a graph of run times could establish exactly where the precomputation starts to have negative impact, but of course it depends on the number of numbers to factor...
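
To make the idea concrete, a rough, untuned sketch for the smaller {2, 3, 5} wheel (period 30) might look like this; the gap table is the standard cycle of distances between numbers coprime to 30, and the {2, 3, 5, 7, 11} wheel of period 2310 is the same thing with a 480-entry table:

#include <cmath>
#include <vector>

typedef unsigned __int64 ulong;   // as in the question; uint64_t elsewhere
typedef std::vector<ulong> ULongVector;

void GetFactorsWheel30(ULongVector &factors, ulong num)
{
  // Distances between consecutive integers coprime to 30, starting from 7:
  // 7, 11, 13, 17, 19, 23, 29, 31, 37, ...
  static const unsigned gaps[8] = { 4, 2, 4, 2, 4, 6, 2, 6 };

  if (num < 2)
    return;

  ulong workingNum = num;
  static const ulong small[3] = { 2, 3, 5 };
  for (int i = 0; i < 3; ++i)
    while (workingNum % small[i] == 0)
    {
      factors.push_back(small[i]);
      workingNum /= small[i];
    }

  ulong sqrtNum = (ulong) sqrtl((long double) workingNum) + 1; // +1 guards against rounding
  unsigned g = 0;
  for (ulong n = 7; n <= sqrtNum; )
  {
    if (workingNum % n == 0)
    {
      factors.push_back(n);
      workingNum /= n;
      sqrtNum = (ulong) sqrtl((long double) workingNum) + 1;
      continue;
    }
    n += gaps[g];
    g = (g + 1) & 7;   // cycle through the eight gaps
  }

  if (workingNum > 1)
    factors.push_back(workingNum);
}

The same loop structure works for the 2310 wheel; only the gap table and the list of small primes to strip first change.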

Hah, I leave the coding details to you folks. :-)

Cheers & hth.,

– Alf

Spraggins answered 13/10, 2010 at 4:0 Comment(4)
The point of diminishing returns is going to be determined by the icache probably, which will be an unrolling factor quite a bit smaller than 2000.Tighe
@Ben: note that with a pattern period of 6 (for the known primes 2 and 3) there are just 2 skip numbers, namely 2 and 4, so the data retained after precomputation is very much less than the period. I think the only thing to do is measure. It's very hard to predict such things! :-)Spraggins
I had the same thought, but the efficiency increase is significantly less (I suspect n^-x) for each set of skips that you add. So I think the point of diminishing returns would come fairly quickly. You might get away with adding 5, and maybe 7, but I bet after that it's not worth it. The only way to know, though, as you point out, is to measure.Peridot
I tried many ideas, but they only slow the function down. I think the choice of literal (2 or 4) can be precompiled into the code, without any use of non-register or literal memory. This is not true of a lookup table, which would incur a cache hit - significantly slower than registers or literals baked into the actual instructions.Taritariff

I'm not sure how effective these would be, but instead of

while (workingNum%2==0)

you could do

while ((workingNum & 1) == 0)

I'm not sure if gcc or msvc (or whatever compiler you are using) is clever enough to change the workingNum%2 expression, but odds are it's doing the division and looking at the modulus in edx...

My next suggestion could be completely unnecessary depending on your compiler, but you could try putting the workingNum /= 3 before the method call. G++ might be smart enough to see the unnecessary division and just use the quotient in eax (you could do this inside your larger loop too). Or, a more thorough (but painful) approach would be to write the following code in inline assembly.

while (workingNum%3==0)
{
  factors.push_back(3);
  workingNum/=3;
}

The compiler is probably translating the modulo operation into a division and then looking at the remainder in edx. The issue is, you are performing the division again (and I doubt the compiler sees that you just implicitly performed the division in the condition of the loop). So, you could inline assemble this. This presents two issues:

1) The method call for push_back(3). This might mess with the registers, making this completely unnecessary.

2) Getting a register for workingNum, but this can be determined by doing an initial modular check (to force it into %eax), or at the current moment, it will/should be in eax.

You could write the loop as (assuming workingNum is in eax, and this is 32bit AT&T syntax, only because I don't know 64bit assembly or Intel syntax)

asm( "
     movl     $3, %ebx
  WorkNumMod3Loop: 
     movl     %eax, %ecx    # back up workingNum in case it is not divisible
     movl     $0, %edx      # zero out edx (high half of the dividend)
     divl     %ebx          # unsigned divide by 3: quotient in eax, remainder in edx
     cmpl     $0, %edx      # is the remainder 0?
     jne      AfterNumMod3Loop    # if the remainder is not 0, jump out

     # no need to divide again because the new workingNum is already in eax
     #factors.push_back(3) call

     jmp      WorkNumMod3Loop
  AfterNumMod3Loop: 
     movl     %ecx, %eax    # restore the value from before the failed division"
);

You should look at the assembly output for those loops. It is possible your compiler is already making these optimizations, but I doubt it. I would not be surprised if putting the workingNum /= n before the method call improves performance a little in some cases.

Rebuttal answered 13/10, 2010 at 21:40 Comment(0)
