This answer gives a C implementation that I believe is the fastest and most memory-efficient.
Overview of algorithm. This algorithm is based on the bottom-up merge approach introduced by Will Ness in another answer, but is further simplified so that the lists being merged do not actually ever exist anywhere in memory. The head element of each list is groomed and kept in a small array, while all other elements of the lists are constructed on-the-fly as needed. This use of “phantom lists”—figments of the imagination of the running code—greatly reduces the memory footprint, as well as the volume of memory accesses, both read and write, and also improves spatial locality, which in turn significantly increases the speed of the algorithm. Factors at each level are written directly into their final resting place in the output array, in order.
The basic idea is to produce the factors using mathematical induction on the prime-power factorization. For example:
- To produce the factors of 360, the factors of 72 are computed and then multiplied by the relevant powers of 5, in this case {1,5}.
- To produce the factors of 72, the factors of 8 are computed and then multiplied by the relevant powers of 3, in this case {1,3,9}.
- To produce the factors of 8, the base case 1 is multiplied by the relevant powers of 2, in this case {1,2,4,8}.
Thus, at each inductive step, a Cartesian product is calculated between sets of existing factors and sets of prime powers, and the results are reduced back to integers via multiplication.
Below is an illustration for the number 360. Shown left-to-right are memory cells; one row represents one time step. Time is represented as flowing vertically downward.
Spatial complexity. Temporary data structures to produce the factors are extremely small: only O(log₂(n)) space is used, where n is the number whose factors are being generated. For example, in the 128-bit implementation of this algorithm in C, a number such as 333,939,014,887,358,848,058,068,063,658,770,598,400 (whose base-2 logarithm is ≈127.97) allocates 5.1 GB to store the list of its 318,504,960 factors, but uses only 360 bytes of additional overhead to produce that list. At most, approximately 5 KB overhead is needed for any 128-bit number.
Runtime complexity. Runtime depends entirely on the exponents of the prime-power factorization (i.e., the prime signature). For best results, the largest exponents should be processed first and the smallest exponents last, so that the runtime is dominated in the final stages by low-complexity merges, which in practice often turn out to be binary merges. Asymptotic runtime is O(c(n) d(n)), where d(n) is the divisor count of n and where c(n) is some constant dependent on the prime signature of n. That is, each equivalence class associated with a prime signature has a different constant. Prime signatures dominated by small exponents have smaller constants; prime signatures dominated by large exponents have larger constants. Thus, runtime complexity is clustered by prime signature.
Graphs. Runtime performance was profiled on a 3.4 GHz Intel Core i7 for a sampling of 66,591 values of n having d(n) factors for unique d(n) up to 160 million. The largest value of n profiled for this graph was 14,550,525,518,294,259,162,294,162,737,840,640,000, producing 159,744,000 factors in 2.35 seconds.
The number of sorted factors produced per second is essentially the inverse of the above. Clustering by prime signature is apparent in the data. Performance is also affected by L1, L2, and L3 cache sizes, as well as physical RAM limitations.
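Source Code. Attached below is a working program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well. Two non-standard features are used: the __uint128_t type, a compiler extension available in both Clang and GCC on 64-bit targets, and mergesort(), a BSD/macOS library function that other C libraries may not provide.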
/*==============================================================================
DESCRIPTION
This is a small proof-of-concept program to test the idea of generating the
factors of a number in ascending order using an ultra-efficient sortless
method.
INPUT
Input is given on the command line, either as a single argument giving the
number to be factored or an even number of arguments giving the 2-tuples that
comprise the prime-power factorization of the desired number. For example,
the number
75600 = 2^4 x 3^3 x 5^2 x 7
can be given by the following list of arguments:
2 4 3 3 5 2 7 1
Note: If a single number is given, it will require factoring to produce its
prime-power factorization. Since this is just a small test program, a very
crude factoring method is used that is extremely fast for small prime factors
but extremely slow for large prime factors. This is actually fine, because
the largest factor lists occur with small prime factors anyway, and it is the
production of large factor lists at which this program aims to be proficient.
It is simply not interesting to be fast at producing the factor list of a
number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
it has only 32 factors. Numbers with tens or hundreds of thousands of
factors are much more interesting.
OUTPUT
Results are written to standard output. A list of factors in ascending order
is produced, followed by runtime required to generate the list (not including
time to print it).
AUTHOR
Todd Lehman
2015/05/10
*/
//-----------------------------------------------------------------------------
#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <math.h>
#include <assert.h>
//-----------------------------------------------------------------------------
typedef unsigned int uint;
typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef uint64_t uint64;
typedef __uint128_t uint128;
#define UINT128_MAX ((uint128)(-1))
#define UINT128_MAX_STRLEN 39
//-----------------------------------------------------------------------------
#define ARRAY_CAPACITY(x) (sizeof(x) / sizeof((x)[0]))
//-----------------------------------------------------------------------------
// This structure encodes a single prime-power pair for the prime-power
// factorization of numbers, for example 3 to the 4th power.
#pragma pack(push, 8)
typedef struct
{
uint128 p; // Prime.
uint8 e; // Power (exponent).
}
PrimePower; // 24 bytes using 8-byte packing
#pragma pack(pop)
//-----------------------------------------------------------------------------
// Prime-power factorization structure.
//
// This structure is sufficient to represent the prime-power factorization of
// all 128-bit values. The field names ω and Ω are derived from the standard
// number theory functions ω(n) and Ω(n), which count the number of unique and
// non-unique prime factors of n, respectively. The field name d is derived
// from the standard number theory function d(n), which counts the number of
// divisors of n, including 1 and n.
//
// The maximum possible value here of ω is 26, which occurs at
// n = 232862364358497360900063316880507363070 = 2 x 3 x 5 x 7 x 11 x 13 x 17 x
// 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x 73 x 79 x
// 83 x 89 x 97 x 101, which has 26 unique prime factors.
//
// The maximum possible value of Ω here is 127, which occurs at n = 2^127 and
// n = 2^126 x 3, both of which have 127 non-unique prime factors.
//
// The maximum possible value of d here is 318504960, which occurs at
// n = 333939014887358848058068063658770598400 = 2^9 x 3^5 x 5^2 x 7^2 x 11^2 x
// 13^2 x 17 x 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x
// 73 x 79.
//
#pragma pack(push, 8)
typedef struct
{
PrimePower f[32]; // Primes and their exponents.
uint8 ω; // Count of prime factors without multiplicity.
uint8 Ω; // Count of prime factors with multiplicity.
uint32 d; // Count of factors of n, including 1 and n.
uint128 n; // Value of n on which all other fields depend.
}
PrimePowerFactorization; // 792 bytes using 8-byte packing
#pragma pack(pop)
#define MAX_ω 26
#define MAX_Ω 127
//-----------------------------------------------------------------------------
// Fatal error: print error message and abort.
void fatal_error(const char *format, ...)
{
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fputc('\n', stderr); // Callers pass messages without a trailing newline.
exit(1);
}
//------------------------------------------------------------------------------
uint128 uint128_from_string(const char *const str)
{
assert(str != NULL);
uint128 n = 0;
for (int i = 0; isdigit((unsigned char)str[i]); i++)
n = (n * 10) + (uint)(str[i] - '0');
return n;
}
//------------------------------------------------------------------------------
void uint128_to_string(const uint128 n,
char *const strbuf, const uint strbuflen)
{
assert(strbuf != NULL);
assert(strbuflen >= UINT128_MAX_STRLEN + 1);
// Extract digits into string buffer in reverse order.
uint128 a = n;
char *s = strbuf;
do { *(s++) = '0' + (uint)(a % 10); a /= 10; } while (a != 0);
*s = '\0';
// Reverse the order of the digits.
uint l = strlen(strbuf);
for (uint i = 0; i < l/2; i++)
{ char t = strbuf[i]; strbuf[i] = strbuf[l-1-i]; strbuf[l-1-i] = t; }
// Verify result.
assert(uint128_from_string(strbuf) == n);
}
//------------------------------------------------------------------------------
char *uint128_to_static_string(const uint128 n, const uint i)
{
static char str[2][UINT128_MAX_STRLEN + 1];
assert(i < ARRAY_CAPACITY(str));
uint128_to_string(n, str[i], ARRAY_CAPACITY(str[i]));
return str[i];
}
//------------------------------------------------------------------------------
// Compute sorted list of factors, given a prime-power factorization.
uint128 *compute_factors(const PrimePowerFactorization ppf)
{
const uint128 n = ppf.n;
const uint d = (uint)ppf.d;
const uint ω = (uint)ppf.ω;
if (n == 0)
return NULL;
uint128 *factors = malloc((d + 1) * sizeof(*factors));
if (!factors)
fatal_error("Failed to allocate array of %u factors.", d);
uint128 *const factors_end = &factors[d];
// --- Seed the factors[] array.
factors_end[0] = 0; // Dummy value to simplify looping in bottleneck code.
factors_end[-1] = 1; // Seed value.
if (n == 1)
return factors;
// --- Iterate over all prime factors.
uint range = 1;
for (uint i = 0; i < ω; i++)
{
const uint128 p = ppf.f[i].p;
const uint e = ppf.f[i].e;
// --- Initialize phantom input lists and output list.
assert(e < 128);
assert(range < d);
uint128 *restrict in[128];
uint128 pe[128], f[128];
for (uint j = 0; j <= e; j++)
{
in[j] = &factors[d - range];
pe[j] = (j == 0)? 1 : pe[j-1] * p;
f[j] = pe[j];
}
uint active_list_count = 1 + e;
range *= 1 + e;
uint128 *restrict out = &factors[d - range];
// --- Merge phantom input lists to output list, until all input lists are
// extinguished.
while (active_list_count > 0)
{
if (active_list_count == 1)
{
assert(out == in[0]);
while (out != factors_end)
*(out++) *= pe[0];
in[0] = out;
}
else if (active_list_count == 2)
{
// This section of the code is the bottleneck of the entire factor-
// producing algorithm. Other portions need to be fast, but this
// *really* needs to be fast; therefore, it has been highly optimized.
// In fact, it is by far most frequently the case here that pe[0] is 1,
// so further optimization is warranted in this case.
uint128 f0 = f[0], f1 = f[1];
uint128 *in0 = in[0], *in1 = in[1];
const uint128 pe0 = pe[0], pe1 = pe[1];
if (pe[0] == 1)
{
while (true)
{
if (f0 < f1)
{ *(out++) = f0; f0 = *(++in0);
if (in0 == factors_end) break; }
else
{ *(out++) = f1; f1 = *(++in1) * pe1; }
}
}
else
{
while (true)
{
if (f0 < f1)
{ *(out++) = f0; f0 = *(++in0) * pe0;
if (in0 == factors_end) break; }
else
{ *(out++) = f1; f1 = *(++in1) * pe1; }
}
}
f[0] = f0; f[1] = f1;
in[0] = in0; in[1] = in1;
}
else if (active_list_count == 3)
{
uint128 f0 = f[0], f1 = f[1], f2 = f[2];
uint128 *in0 = in[0], *in1 = in[1], *in2 = in[2];
const uint128 pe0 = pe[0], pe1 = pe[1], pe2 = pe[2];
while (true)
{
if (f0 < f1)
{
if (f0 < f2)
{ *(out++) = f0; f0 = *(++in0) * pe0;
if (in0 == factors_end) break; }
else
{ *(out++) = f2; f2 = *(++in2) * pe2; }
}
else
{
if (f1 < f2)
{ *(out++) = f1; f1 = *(++in1) * pe1; }
else
{ *(out++) = f2; f2 = *(++in2) * pe2; }
}
}
f[0] = f0; f[1] = f1; f[2] = f2;
in[0] = in0; in[1] = in1; in[2] = in2;
}
else if (active_list_count > 3)
{
while (true)
{
// Choose the smallest multiplier.
uint k_min = 0;
uint128 f_min = f[0];
for (uint k = 0; k < active_list_count; k++)
if (f[k] < f_min)
{ f_min = f[k]; k_min = k; }
// Write the output factor, advance the input pointer, and
// produce a new factor in the array f[] of list heads.
*(out++) = f_min;
f[k_min] = *(++in[k_min]) * pe[k_min];
if (in[k_min] == factors_end)
{ assert(k_min == 0); break; }
}
}
// --- Remove the newly emptied phantom input list. Note that this is
// guaranteed *always* to be the first remaining non-empty list.
assert(in[0] == factors_end);
for (uint j = 1; j < active_list_count; j++)
{
in[j-1] = in[j];
pe[j-1] = pe[j];
f[j-1] = f[j];
}
active_list_count -= 1;
}
assert(out == factors_end);
}
// --- Validate array of sorted factors.
#ifndef NDEBUG
{
for (uint k = 0; k < d; k++)
{
if (factors[k] == 0)
fatal_error("Produced a factor of 0 at index %u.", k);
if (n % factors[k] != 0)
fatal_error("Produced non-factor %s at index %u.",
uint128_to_static_string(factors[k], 0), k);
if ((k > 0) && (factors[k-1] == factors[k]))
fatal_error("Duplicate factor %s at index %u.",
uint128_to_static_string(factors[k], 0), k);
if ((k > 0) && (factors[k-1] > factors[k]))
fatal_error("Out-of-order factors %s and %s at indexes %u and %u.",
uint128_to_static_string(factors[k-1], 0),
uint128_to_static_string(factors[k], 1),
k-1, k);
}
}
#endif
return factors;
}
//------------------------------------------------------------------------------
// Print prime-power factorization of a number.
void print_ppf(const PrimePowerFactorization ppf)
{
printf("%s = ", uint128_to_static_string(ppf.n, 0));
if (ppf.n == 0)
{
printf("0");
}
else
{
for (uint i = 0; i < ppf.ω; i++)
{
if (i > 0)
printf(" x ");
printf("%s", uint128_to_static_string(ppf.f[i].p, 0));
if (ppf.f[i].e > 1)
printf("^%"PRIu8"", ppf.f[i].e);
}
}
printf("\n");
}
//------------------------------------------------------------------------------
int compare_powers_ascending(const void *const pf1,
const void *const pf2)
{
const PrimePower f1 = *((const PrimePower *)pf1);
const PrimePower f2 = *((const PrimePower *)pf2);
return (f1.e < f2.e)? -1:
(f1.e > f2.e)? +1:
0; // Not an error; duplicate exponents are common.
}
//------------------------------------------------------------------------------
int compare_powers_descending(const void *const pf1,
const void *const pf2)
{
const PrimePower f1 = *((const PrimePower *)pf1);
const PrimePower f2 = *((const PrimePower *)pf2);
return (f1.e < f2.e)? +1:
(f1.e > f2.e)? -1:
0; // Not an error; duplicate exponents are common.
}
//------------------------------------------------------------------------------
int compare_primes_ascending(const void *const pf1,
const void *const pf2)
{
const PrimePower f1 = *((const PrimePower *)pf1);
const PrimePower f2 = *((const PrimePower *)pf2);
return (f1.p < f2.p)? -1:
(f1.p > f2.p)? +1:
0; // Error; duplicate primes must never occur.
}
//------------------------------------------------------------------------------
int compare_primes_descending(const void *const pf1,
const void *const pf2)
{
const PrimePower f1 = *((const PrimePower *)pf1);
const PrimePower f2 = *((const PrimePower *)pf2);
return (f1.p < f2.p)? +1:
(f1.p > f2.p)? -1:
0; // Error; duplicate primes must never occur.
}
//------------------------------------------------------------------------------
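// Sort prime-power factorization. The two passes below rely on mergesort()
// being a stable sort: the second pass establishes the major key while
// preserving the order that the first pass gave to equal elements.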
void sort_ppf(PrimePowerFactorization *const ppf,
const bool primes_major, // Best false
const bool primes_ascending, // Best false
const bool powers_ascending) // Best false
{
int (*compare_primes)(const void *, const void *) =
primes_ascending? compare_primes_ascending : compare_primes_descending;
int (*compare_powers)(const void *, const void *) =
powers_ascending? compare_powers_ascending : compare_powers_descending;
if (primes_major)
{
mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
}
else
{
mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
}
}
//------------------------------------------------------------------------------
// Compute prime-power factorization of a 128-bit value. Note that this
// function is designed to be fast *only* for numbers with very simple
// factorizations, e.g., those that produce large factor lists. Do not attempt
// to factor large semiprimes with this function. (The author does know how to
// factor large numbers efficiently; however, efficient factorization is beyond
// the scope of this small test program.)
PrimePowerFactorization compute_ppf(const uint128 n)
{
PrimePowerFactorization ppf;
if (n == 0)
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
}
else if (n == 1)
{
ppf = (PrimePowerFactorization){ .f[0] = { .p = 1, .e = 1 },
.ω = 1, .Ω = 1, .d = 1, .n = 1 };
}
else
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };
uint128 m = n;
for (uint128 p = 2; p * p <= m; p += 1 + (p > 2))
{
if (m % p == 0)
{
assert(ppf.ω <= MAX_ω);
ppf.f[ppf.ω].p = p;
ppf.f[ppf.ω].e = 0;
while (m % p == 0)
{ m /= p; ppf.f[ppf.ω].e += 1; }
ppf.d *= (1 + ppf.f[ppf.ω].e);
ppf.Ω += ppf.f[ppf.ω].e;
ppf.ω += 1;
}
}
if (m > 1)
{
assert(ppf.ω <= MAX_ω);
ppf.f[ppf.ω].p = m;
ppf.f[ppf.ω].e = 1;
ppf.d *= 2;
ppf.Ω += 1;
ppf.ω += 1;
}
}
return ppf;
}
//------------------------------------------------------------------------------
// Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
// The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
// Primes must not exceed 2^128 - 1 and must not be repeated. Exponents must
// not exceed 2^8 - 1, but can of course be repeated. The constructed value
// must not exceed 2^128 - 1.
PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
{
assert(pairs <= MAX_ω);
PrimePowerFactorization ppf;
if (pairs == 0)
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
}
else
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };
for (uint i = 0; i < pairs; i++)
{
ppf.f[i].p = uint128_from_string(values[(i*2)+0]);
ppf.f[i].e = (uint8)strtoumax(values[(i*2)+1], NULL, 10);
// Validate prime value.
if (ppf.f[i].p < 2) // (Ideally this would actually do a primality test.)
fatal_error("Factor %s is invalid.",
uint128_to_static_string(ppf.f[i].p, 0));
// Accumulate count of unique prime factors.
if (ppf.ω > UINT8_MAX - 1)
fatal_error("Small-omega overflow at factor %s^%"PRIu8".",
uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
ppf.ω += 1;
// Accumulate count of total prime factors.
if (ppf.Ω > UINT8_MAX - ppf.f[i].e)
fatal_error("Big-omega overflow at factor %s^%"PRIu8".",
uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
ppf.Ω += ppf.f[i].e;
// Accumulate total divisor count.
if (ppf.d > UINT32_MAX / (1 + ppf.f[i].e))
fatal_error("Divisor count overflow at factor %s^%"PRIu8".",
uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
ppf.d *= (1 + ppf.f[i].e);
// Accumulate value.
for (uint8 k = 1; k <= ppf.f[i].e; k++)
{
if (ppf.n > UINT128_MAX / ppf.f[i].p)
fatal_error("Value overflow at factor %s.",
uint128_to_static_string(ppf.f[i].p, 0));
ppf.n *= ppf.f[i].p;
}
}
}
return ppf;
}
//------------------------------------------------------------------------------
// Main control. Parse command line and produce list of factors.
int main(const int argc, const char *const argv[])
{
bool primes_major = false;
bool primes_ascending = false;
bool powers_ascending = false;
PrimePowerFactorization ppf;
// --- Parse prime-power sort specifier (if present).
uint value_base = 1;
uint value_count = (uint)argc - 1;
if ((argc > 1) && (argv[1][0] == '-'))
{
static const struct
{
char *str; bool primes_major, primes_ascending, powers_ascending;
}
sort_options[] =
{
// Sorting criteria:
// ----------------------------------------
{ "ep", 0,0,0 }, // Exponents descending, primes descending
{ "Ep", 0,0,1 }, // Exponents ascending, primes descending
{ "eP", 0,1,0 }, // Exponents descending, primes ascending
{ "EP", 0,1,1 }, // Exponents ascending, primes ascending
{ "p", 1,0,0 }, // Primes descending (exponents irrelevant)
{ "P", 1,1,0 }, // Primes ascending (exponents irrelevant)
};
bool valid = false;
for (uint i = 0; i < ARRAY_CAPACITY(sort_options); i++)
{
if (strcmp(&argv[1][1], sort_options[i].str) == 0)
{
primes_major = sort_options[i].primes_major;
primes_ascending = sort_options[i].primes_ascending;
powers_ascending = sort_options[i].powers_ascending;
valid = true;
break;
}
}
if (!valid)
fatal_error("Bad sort specifier: \"%s\"", argv[1]);
value_base += 1;
value_count -= 1;
}
// --- Prime factorization from either a number or a raw prime factorization.
if (value_count == 1)
{
uint128 n = uint128_from_string(argv[value_base]);
ppf = compute_ppf(n);
}
else
{
if (value_count % 2 != 0)
fatal_error("Odd number of arguments (%u) given.", value_count);
uint pairs = value_count / 2;
ppf = parse_ppf(pairs, &argv[value_base]);
}
// --- Sort prime factorization by either the default or the user-overridden
// configuration.
sort_ppf(&ppf, primes_major, primes_ascending, powers_ascending);
print_ppf(ppf);
// --- Run for (as close as possible to) a fixed amount of time, tallying the
// elapsed CPU time.
uint128 iterations = 0;
double cpu_time = 0.0;
const double cpu_time_limit = 0.10;
uint128 memory_usage = 0;
while (cpu_time < cpu_time_limit)
{
clock_t clock_start = clock();
uint128 *factors = compute_factors(ppf);
clock_t clock_end = clock();
cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;
memory_usage = sizeof(*factors) * ppf.d;
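// Printing of the factor list is disabled here so that only the timing
// summary is produced; compare against 1 instead of 0 to print the list
// once, on the first iteration.
if (++iterations == 0)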
{
for (uint32 i = 0; i < ppf.d; i++)
printf("%s\n", uint128_to_static_string(factors[i], 0));
}
if (factors) free(factors);
}
// --- Print the average amount of CPU time required for each iteration.
uint memory_scale = (memory_usage >= 1e9)? 9:
(memory_usage >= 1e6)? 6:
(memory_usage >= 1e3)? 3:
0;
char *memory_units = (memory_scale == 9)? "GB":
(memory_scale == 6)? "MB":
(memory_scale == 3)? "KB":
"B";
printf("%s %"PRIu32" factors %.6f ms %.3f ns/factor %.3f %s\n",
uint128_to_static_string(ppf.n, 0),
ppf.d,
cpu_time/iterations * 1e3,
cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
(double)memory_usage / pow(10, memory_scale),
memory_units);
return 0;
}