Enumerate factors of a number directly in ascending order without sorting?
Asked Answered
B

4

19

Is there an efficient algorithm to enumerate the factors of a number n, in ascending order, without sorting? By “efficient” I mean:

  1. The algorithm avoids a brute-force search for divisors by starting with the prime-power factorization of n.

  2. The runtime complexity of the algorithm is O(d log₂ d) or better, where d is the divisor count of n.

  3. The spatial complexity of the algorithm is O(d).

  4. The algorithm avoids a sort operation. That is, the factors are produced in order rather than produced out of order and sorted afterward. Although enumerating using a simple recursive approach and then sorting is O(d log₂ d), there is a very ugly cost for all the memory accesses involved in sorting.

A trivial example is n = 360 = 2³ × 3² × 5, which has d=24 factors: { 1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 18, 20, 24, 30, 36, 40, 45, 60, 72, 90, 120, 180, 360 }.

A more serious example is n = 278282512406132373381723386382308832000 = 2⁸ × 3⁴ × 5³ × 7² × 11² × 13² × 17 × 19 × 23 × 29 × 31 × 37 × 41 × 43 × 47 × 53 × 59 × 61 × 67 × 71 × 73 × 79, which has d=318504960 factors (obviously too many to list here!). This number, incidentally, has the greatest number of factors of any number up to 2^128.

I could swear that I saw a description of such algorithm a few weeks ago, with sample code, but now I can't seem to find it anywhere. It used some magic trick of maintaining a list of progenitor indexes in the output list for each prime factor. (Update: I was confusing factor generation with Hamming numbers, which operate similarly.)

Update

I ended up using a solution which is O(d) in runtime, has extremely low memory overhead creating the O(d) output in-place, and is significantly faster than any other method I am aware of. I've posted this solution as answer, with C source code. It is a heavily optimized, simplified version of a beautiful algorithm presented here in Haskell by another contributor, Will Ness. I've selected Will's answer as the accepted answer, as it provided a very elegant solution that matched all the requirements as originally stated.

Bolte answered 1/5, 2015 at 18:35 Comment(10)
Maybe just run Dijkstra's algorithm on the implicit graph of factors where two factors have an edge between them if their ratio is a prime?Awful
"The spatial complexity of the algorithm is O(d)." - wait, what? If you're fine with O(d) spatial complexity, why not just use the generate-and-sort algorithm?Awful
Well, O(d) or better. O(1) would be fine. But I want to avoid the sorting because it's slow — too many memory operations.Bolte
As far as O(d) space algorithms go, I'd expect a sort to be one of the faster ones. There's good locality of reference and a lot of pre-existing order in the input to take advantage of.Awful
Also, here's a blog post with example code that basically does the Dijkstra thing. In that guy's tests, the sort outperfoms Dijkstra by a factor of about 14. It's worth noting that the Dijkstra traversal is implemented in pure Python, while the sort is Python's highly-optimized, adaptive mergesort; a Dijkstra-based algorithm written in C or Cython might do a lot better.Awful
If there was such algorithm, with runtime complexity O(d log_2 d), wouldn't factorization of semiprimes (d=2) be constant time, rendering RSA useless?Aspirant
@JuanLopes: We're given the prime factorization to start with. (That should probably be made more clear.)Awful
@Awful Didn't see that. You're right :)Aspirant
You are given the prime factors in order, so you can generate factors in rough order by starting with few low prime factors and working up to many large factors. You are adding factors one at a time so use a binary search to find where to insert each new factor, biased towards the higher factors found so far. A little experimentation should tell you where to make the first cut for best response. For the first few factors an insertion sort would probably be best.Subcritical
@rossum: Use a heap instead of binary search, and that's the Dijkstra factor-graph thing.Awful
N
9

We can merge streams of multiples, produced so there are no duplicates in the first place.

Starting with the list [1], for each unique prime factor p, we multiply the list by p iteratively k times (where k is the multiplicity of p) to get k new lists, and merge them together with that incoming list.

In Haskell,

ordfactors n =          --   <----------------------<---<---<-----\
  foldr g [1]           -- g (19,1) (g (7,1) (g (3,2) (g (2,3) [1])))
    . reverse           -- [(19,1),(7,1),(3,2),(2,3)]              ^
    . primePowers       -- [(2,3),(3,2),(7,1),(19,1)]              |
    $ n                 -- 9576                    --->--->--->----/
       where
         g (p,k) xs = mergeAsTree 
                        . take (k+1)          -- take 3 [a,b,c,d..] = [a,b,c]
                        . iterate (map (p*))  -- iterate f x = [x, y, z,...]
                        $ xs                  --  where { y=f x; z=f y; ... }

{-  g (2,3) [1] = let a0 = [1]        
                      a1 = map (2*) a0               -- [2]
                      a2 = map (2*) a1               -- [4]
                      a3 = map (2*) a2               -- [8]
                  in mergeAsTree [ a0, a1, a2, a3 ]  -- xs2 = [1,2,4,8]

    g (3,2) xs2 = let b0 = xs2                       -- [1,2,4,8]
                      b1 = map (3*) b0               -- [3,6,12,24]
                      b2 = map (3*) b1               -- [9,18,36,72]
                  in mergeAsTree [ b0, b1, b2 ]      -- xs3 = [1,2,3,4,6,8,9,12,...] 

    g (7,1) xs3 = mergeAsTree [ xs3, map (7*) xs3 ]  -- xs4 = [1,2,3,4,6,7,8,9,...]

    g (19,1) xs4 = mergeAsTree [ xs4, map (19*) xs4 ]  
-}
mergeAsTree xs   = foldi merge [] xs    -- [a,b]     --> merge a b
                                        -- [a,b,c,d] --> merge (merge a b) (merge c d)
foldi f z []     = z
foldi f z (x:xs) = f x (foldi f z (pairs f xs))
pairs f (x:y:t)  = f x y : pairs f t
pairs _ t        = t

foldi arranges the binary merges (which presuppose that there's no duplicates in the streams being merged, for streamlined operation) in a tree-like fashion for speed; whereas foldr works in linear fashion.

primePowers n | n > 1 =           -- 9576 => [(2,3),(3,2),(7,1),(19,1)]
  map (\xs -> (head xs,length xs)) --                                ^
    . group                       -- [[2,2,2],[3,3],[7],[19]]        |
    . factorize                   -- [2,2,2,3,3,7,19]                |
    $ n                           -- 9576             --->--->--->---/

group is a standard function that groups adjacent equals in a list together, and factorize is a well-known trial-division algorithm that produces prime factors of a number in non-decreasing order:

factorize n | n > 1 = go n (2:[3,5..])   -- 9576 = 2*2*2*3*3*7*19
   where                                 -- produce prime factors only
     go n (d:t)
        | d*d > n    = [n]
        | r == 0     =  d : go q (d:t)
        | otherwise  =      go n t
            where 
              (q,r)  = quotRem n d
factorize _ = []

(.) is a functional composition operator, chaining the output of its argument function on its right as input into the one on its left. It (and $) can be read aloud as "of".

Nadaha answered 2/5, 2015 at 2:1 Comment(33)
Generate-and-mergesort, but lazy, eh? I wish I could read Haskell.Awful
@Awful so, it is "generate sorted, and merge". as all the subsequences are generated in order, it should avoid the additional log n factor; but in imperative language it might be a challenge with all the needed bookkeeping (i.e. pointers into each sequence etc.)Nadaha
How does the number of subsequences relate to d? If it's O(d), you'll get the log(d) factor in the runtime for the same reason mergesort does.Awful
@Awful the number of steps n in the algorithm is the number of unique prime factors (not the total number of divisors, which in a regular case woulds be much much higher). On each step, k new streams are merged in, where k is the multiplicity of that prime factor. So the additional factor (over the O(d) since we produce d divisors) is not log(d), but is somehow related to the number of (unique (?)) prime factors of the number.Nadaha
@WillNess: Great solution, and the detailed comments are very helpful as I also struggle to read functional code! In your last comment I think you mean "the number of merge steps in the algorithm is the number of unique prime factors". Also if any of the prime factors have multiplicity > 1, then some of the merges are not 2-way merges -- I can't pinpoint exactly what that does to the time complexity, but it may make it worse than an extra log(num_unique_prime_factors).Sofko
@Sofko thanks! in the main "loop", foldr g [1], the number of steps is the number of unique prime factors of n. These are arranged linearly. On each step, k (= multiplicity) new streams are calculated linearly (by iterate (p*)) but are merged in a tree, which has the usual advantage over linear merging (which may be noticeable if k is big). I now think it might be advantageous to sort the unique factorization putting the factors with the largest ks at the top. But what is it overall, is kinda hard to wrap my head around it immediately. Maybe some simulation would help...Nadaha
@Sofko and so, to your last point, yes, it might be O( unique_pfs * log(max_multiplicity)) or something. So it might be looking dangerously close to the dreaded log(d) factor, actually. Maybe if the unique factorization is sorted, it'll make it be better...Nadaha
could you please post working Haskell code (including any imports)?Lambda
@גלעדברקן group is from Data.List. foldi is on wikipedia (linked), merge is as usual, e.g. mergeBy (<). Or you could import Data.List.Ordered (I think foldi is called foldt, there).Nadaha
@WillNess — I am curious, how fast does this solution run on input of a number like 18401055938125660800, which has 184320 factors? Now that I've implemented a fast bucket-sort solution to replace the Quicksort-based solution I'd originally started with, I might try your method if it looks comparable in speed.Bolte
@ToddLehman 0.022 secs, compiled Haskell code, calculating length (ordfactors ...) on Win7-64, Core i7 2.5 GHz.Nadaha
@ToddLehman and for (n*120), which has 2.2x more factors (405504), the time is 0.054 secs (i.e. d^1.13 empirically, for d a number of factors; seems to agree with d log d rule more or less).Nadaha
@WillNess — Thanks! That is definitely very impressive. For the first number, it's 4x slower than the C implementation I have using bucket sort...but for the second number, it's actually 30% faster! — suggesting better asymptotic growth. (I'm running my tests on a 3.4 GHz Core i7). I am really curious now how fast your method would run in C... Maybe learning some Haskell is in my future!Bolte
@ToddLehman ah, thanks for that. Interesting. :) This kind of code is easiest to translate using something like Python's generators. Otherwise, there'll be a lot of bookkeeping to do, manually. OTOH it is certainly doable. At their core, generators are nothing more than loops with internal state anyway; you'll just have to maintain that state by yourself, in C. I might have a go at it, too. Maybe... :)Nadaha
so, your time for 1st is 0.022/4 = 0.0055; and for 2nd it's 0.054*1.3 = 0.07? then the empirical order of growth is logBase 2.2 (700/55) = d^3. this doesn't look right, compared to your plots, which show ~ d^1.1. Maybe the memory usage hits some hard limit, like you say... Hitting the wall so to speak. -- I've tested my code for (n*120*360) now; number of factors is 2.1212x, d=860160. The time is 0.127 secs, so order of growth is logBase 2.1212 (127/54) = 0.137 again (compared with 0.138 for n*120, d=405504). So it does seem to scale OK for this range.Nadaha
@WillNess — Oh, my code which factors the first number (with 184320 factors) is pure 64-bit hardware arithmetic. The one which factors the second number (with 405504 factors) is a completely separate version of the code which does 128-bit arithmetic, which has to be simulated in software (automatically by the compiler) on x86-64, so there is a large penalty when crossing the 2⁶⁴ boundary in the number being factored. For my application, I don't actually need 128-bit numbers; I was just curious, so I developed a 128-bit version for comparison.Bolte
and in C, we can use a bona fide array-based priority queue, O(1) per operation (instead of that tree-folding), to get rid of that log(d) factor, for overall, perhaps, O(d*n_unique_factors) which I hoped for, originally.Nadaha
ah, I see, didn't think about that ; Haskell has unlimited Integer type. -- so it doesn't count; your code's order of growth is ~ n^1.1 and with low constant factor, too.Nadaha
@WillNess — One last question: How fast is your Haskell code on the number 331984163921408578320113791401984000000 (which has 49996800 factors)? My 128-bit C version eats up 2.2 GB of RAM for that, and takes 19.3 seconds on my 3.4 GHz. Core i7). I'm guessing your Haskell code beats my C code by a wide margin on this number — which, if true, means that I really should try implenting your algorithm in C.Bolte
17.9 secs, memory shoots up by about 1G.Nadaha
no, scratch that. properly tested as stand-alone exe, 12.78 secs, 0.96 GB memory. (and I must point out, my code is much, much, much shorter... ;) ;) :) ).Nadaha
@WillNess — Amazeballs. Sold!Bolte
last interesting tidbit. with reverse in the foldr g [1] . reverse . primePowers replaced with sortBy (comparing snd) — which changed the factorization to [(47,1),(43,1),(41,1),(37,1),(31,1),(29,1),(23,1),(19,1),(17,1),(13,2),(11,2),(7,4),(5,6),(3,9),(2,30)] — it ran for the same time, but in 788 MB memory; but without reversing, on plain factorization [(2,30),(3,9),(5,6),(7,4),(11,2),(13,2),(17,1),(19,1),(23,1),(29,1),(31,1),(37,1),(41,1),(43,1),(47,1)] (as well as on sorted, reversed) it ran for 20.6 secs, in 333 MB! (interesting how would C code behave there...)Nadaha
@WillNess — I see why your algorithm is so good: (1) It is very cache-friendly: everything is sequential. (2) It is basically a Natural Mergesort; but perhaps more importantly, it is a temporally in situ natural mergesort; that is, you begin merging immediately, without ever having to scan for runs. (3) Runtime in the hardest cases is dominated by the final doublings from the $s$ prime factors that appear only once, performed last. Therefore, the algorithm is $O(\sum_{k=1}^s{\frac{1}{2^k}} d) = O(2d) = O(d)$ for these cases, assuming constant-time multiplies and a custom 2-way merge.Bolte
@Sofko — If prime factors with multiplicity > 1 are processed first (that is, the highest multiplicity first, then working downward), then the runtime of the algorithm is dominated by the prime factors at the end with multiplicity of 1. For example, 18401055938125660800 = 2^7 x 3^4 x 5^2 x 7^2 x 11 x 13 x 17 x 19 x 23 x 29 x 31 x 37 x 41 is dominated by the list-size doublings incurred by prime factors { 11, 13, 17, 19, 23, 29, 31, 37, 41 }, so that the non-2-way merges (which occur very early on) are dwarfed by the 9 doublings that occur later.Bolte
@ToddLehman I copied the wrong factorization in my last comment. the first one should have been [(17,1),(19,1),(23,1),(29,1),(31,1),(37,1),(41,1),(43,1),(47,1),(11,2),(13,2),(7,4),(5,6),(3,9),(2,30)]. what I copied by mistake is the reversed factors; with that it ran for same time (12.85 secs) but in 957 MB memory.Nadaha
@WillNess — Hey, I took your basic idea and ran with it to its ultimate conclusion, where both spatial complexity and runtime complexity are O(d), by implementing "phantom lists" in the merging steps (that is, the lists don't actually exist in memory at the time of the merge) and storing the factors directly into their final locations to minimize writes to memory. In C, the last example takes 0.731 secs and uses 800 MB (that's 800 MB for returning the list of factors, plus 600 bytes of overhead for accounting). I'll post some graphs soon. Thanks so much again for your insights!Bolte
@ToddLehman wow, great! yes to the phantom lists idea, thought about it too. how do you merge, through a PQ or something else? ... wait, so it runs in 600 bytes? :) when you post the graphs do ping me please. :) had you played with the order as per the multiplicity of factors? glad to hear it worked out as we both hoped.Nadaha
@WillNess — Nope, no priority queue required! It simply keeps the head element of each phantom list in a tiny array (one element per power of the prime being multiplied by), and does a k-way merge of the phantom lists. Indeed, it is significantly faster to process the high-multiplicity primes first and low-multiplicity primes last. Often, binary merges dominate the runtime. I'll post code and graphs when my data collection run finishes. The 600 bytes overhead is assuming that the caller wants the list of factors returned in memory rather than printed and discarded.Bolte
thanks. don't you get a hit in efficiency though with the k-way merge, esp. for big ks? I thought about defining some struct with the pointers and the multipliers etc, and place those structs in an array-based PQ, for the quicker merge. for 2-way (maybe even 4-way) this is of course not necessary. but for k=30...Nadaha
@WillNess — Best thing is to take the hit with the k-way merge up front, when the list is still in its infancy. That is, do the primes with relatively large exponents first, and leave the primes with low exponents for last. This way, taking the hit for the k-way merge isn't even noticeable. Although it is possible to construct numbers like n = (2x3x5x7)^9 that incur, for example, a 10-way merge at every step, d(n) for this n is still only 10,000. The only way to get really high factor counts is to have lots of unique primes, and that generally means having a lot of small exponents.Bolte
@WillNess — p.s. I just posted my C implementation, with runtime statistics and a chart showing internal memory organization of the data structure: #29993404Bolte
see also https://mcmap.net/q/89075/-two-simple-codes-to-generate-divisors-of-a-number-why-is-the-recursive-one-fasterNadaha
B
13

This answer gives a C implementation that I believe is the fastest and most memory-efficient.

Overview of algorithm. This algorithm is based on the bottom-up merge approach introduced by Will Ness in another answer, but is further simplified so that the lists being merged do not actually ever exist anywhere in memory. The head element of each list is groomed and kept in a small array, while all other elements of the lists are constructed on-the-fly as needed. This use of “phantom lists”—figments of the imagination of the running code—greatly reduces the memory footprint, as well as the volume of memory accesses, both read and write, and also improves spatial locality, which in turn significantly increases the speed of the algorithm. Factors at each level are written directly into their final resting place in the output array, in order.

The basic idea is to produce the factors using mathematical induction on the prime-power factorization. For example:

  • To produce the factors of 360, the factors of 72 are computed and then multiplied by the relevant powers of 5, in this case {1,5}.
  • To produce the factors of 72, the factors of 8 are computed and then multiplied by the relevant powers of 3, in this case {1,3,9}.
  • To produce the factors of 8, the base case 1 is multiplied by the relevant powers of 2, in this case {1,2,4,8}.

Thus, at each inductive step, a Cartesian product is calculated between sets of existing factors and sets of prime powers, and the results are reduced back to integers via multiplication.

Below is an illustration for the number 360. Shown left-to-right are memory cells; one row represents one time step. Time is represented as flowing vertically downward.

Divisors of 360

Spatial complexity. Temporary data structures to produce the factors are extremely small: only O(log₂(n)) space is used, where n is the number whose factors are being generated. For example, in the 128-bit implementation of this algorithm in C, a number such as 333,939,014,887,358,848,058,068,063,658,770,598,400 (whose base-2 logarithm is ≈127.97) allocates 5.1 GB to store the list of its 318,504,960 factors, but uses only 360 bytes of additional overhead to produce that list. At most, approximately 5 KB overhead is needed for any 128-bit number.

Runtime complexity. Runtime depends entirely on the exponents of the prime-power factorization (e.g., the prime signature). For best results, largest exponents should be processed first and smallest exponents last, so that the runtime is dominated in the final stages by low-complexity merges, which in practice often turn out to be binary merges. Asymptotic runtime is O(c(n) d(n)), where d(n) is the divisor count of n and where c(n) is some constant dependent on the prime signature of n. That is, each equivalence class associated with a prime signature has a different constant. Prime signatures dominated by small exponents have smaller constants; prime signatures dominated by large exponents have larger constants. Thus, runtime complexity is clustered by prime signature.

Graphs. Runtime performance was profiled on a 3.4 GHz. Intel Core i7 for a sampling of 66,591 values of n having d(n) factors for unique d(n) up to 160 million. The largest value of n profiled for this graph was 14,550,525,518,294,259,162,294,162,737,840,640,000, producing 159,744,000 factors in 2.35 seconds.

Total runtime

The number of sorted factors produced per second is essentially the inversion of the above. Clustering by prime signature is apparent in the data. Performance is also affected by L1, L2, and L3 cache sizes, as well as physical RAM limitations.

Factors produced per second

Source Code. Attached below is a working program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well.

/*==============================================================================

DESCRIPTION

   This is a small proof-of-concept program to test the idea of generating the
   factors of a number in ascending order using an ultra-efficient sortless
   method.


INPUT

   Input is given on the command line, either as a single argument giving the
   number to be factored or an even number of arguments giving the 2-tuples that
   comprise the prime-power factorization of the desired number.  For example,
   the number

      75600 = 2^4 x 3^3 x 5^2 x 7

   can be given by the following list of arguments:

      2 4 3 3 5 2 7 1

   Note:  If a single number is given, it will require factoring to produce its
   prime-power factorization.  Since this is just a small test program, a very
   crude factoring method is used that is extremely fast for small prime factors
   but extremely slow for large prime factors.  This is actually fine, because
   the largest factor lists occur with small prime factors anyway, and it is the
   production of large factor lists at which this program aims to be proficient.
   It is simply not interesting to be fast at producing the factor list of a
   number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
   it has only 32 factors.  Numbers with tens or hundreds of thousands of
   factors are much more interesting.


OUTPUT

   Results are written to standard output.  A list of factors in ascending order
   is produced, followed by runtime required to generate the list (not including
   time to print it).


AUTHOR

   Todd Lehman
   2015/05/10

*/

//-----------------------------------------------------------------------------
#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <math.h>
#include <assert.h>

//-----------------------------------------------------------------------------
typedef  unsigned int  uint;
typedef  uint8_t       uint8;
typedef  uint16_t      uint16;
typedef  uint32_t      uint32;
typedef  uint64_t      uint64;
typedef  __uint128_t   uint128;

#define  UINT128_MAX  (uint128)(-1)

#define  UINT128_MAX_STRLEN  39

//-----------------------------------------------------------------------------
#define  ARRAY_CAPACITY(x)  (sizeof(x) / sizeof((x)[0]))

//-----------------------------------------------------------------------------
// This structure encode a single prime-power pair for the prime-power
// factorization of numbers, for example 3 to the 4th power.

#pragma pack(push, 8)
typedef struct
{
  uint128  p;  // Prime.
  uint8    e;  // Power (exponent).
}
PrimePower;   // 24 bytes using 8-byte packing
#pragma pack(pop)

//-----------------------------------------------------------------------------
// Prime-power factorization structure.
//
// This structure is sufficient to represent the prime-power factorization of
// all 128-bit values.  The field names ω and Ω are dervied from the standard
// number theory functions ω(n) and Ω(n), which count the number of unique and
// non-unique prime factors of n, respectively.  The field name d is derived
// from the standard number theory function d(n), which counts the number of
// divisors of n, including 1 and n.
//
// The maximum possible value here of ω is 26, which occurs at
// n = 232862364358497360900063316880507363070 = 2 x 3 x 5 x 7 x 11 x 13 x 17 x
// 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x 73 x 79 x
// 83 x 89 x 97 x 101, which has 26 unique prime factors.
//
// The maximum possible value of Ω here is 127, which occurs at n = 2^127 and
// n = 2^126 x 3, both of which have 127 non-unique prime factors.
//
// The maximum possible value of d here is 318504960, which occurs at
// n = 333939014887358848058068063658770598400 = 2^9 x 3^5 x 5^2 x 7^2 x 11^2 x
// 13^2 x 17 x 19 x 23 x 29 x 31 x 37 x 41 x 43 x 47 x 53 x 59 x 61 x 67 x 71 x
// 73 x 79.
//
#pragma pack(push, 8)
typedef struct
{
  PrimePower  f[32];  // Primes and their exponents.
  uint8       ω;      // Count of prime factors without multiplicity.
  uint8       Ω;      // Count of prime factors with multiplicity.
  uint32      d;      // Count of factors of n, including 1 and n.
  uint128     n;      // Value of n on which all other fields depend.
}
PrimePowerFactorization;  // 656 bytes using 8-byte packing
#pragma pack(pop)

#define  MAX_ω  26
#define  MAX_Ω  127

//-----------------------------------------------------------------------------
// Fatal error:  print error message and abort.

void fatal_error(const char *format, ...)
{
  va_list args;
  va_start(args, format);
  vfprintf(stderr, format, args);
  exit(1);
}

//------------------------------------------------------------------------------
uint128 uint128_from_string(const char *const str)
{
  assert(str != NULL);

  uint128 n = 0;
  for (int i = 0; isdigit(str[i]); i++)
    n = (n * 10) + (uint)(str[i] - '0');

  return n;
}

//------------------------------------------------------------------------------
void uint128_to_string(const uint128 n,
                       char *const strbuf, const uint strbuflen)
{
  assert(strbuf != NULL);
  assert(strbuflen >= UINT128_MAX_STRLEN + 1);

  // Extract digits into string buffer in reverse order.
  uint128 a = n;
  char *s = strbuf;
  do { *(s++) = '0' + (uint)(a % 10); a /= 10; } while (a != 0);
  *s = '\0';

  // Reverse the order of the digits.
  uint l = strlen(strbuf);
  for (uint i = 0; i < l/2; i++)
    { char t = strbuf[i]; strbuf[i] = strbuf[l-1-i]; strbuf[l-1-i] = t; }

  // Verify result.
  assert(uint128_from_string(strbuf) == n);
}

//------------------------------------------------------------------------------
char *uint128_to_static_string(const uint128 n, const uint i)
{
  static char str[2][UINT128_MAX_STRLEN + 1];
  assert(i < ARRAY_CAPACITY(str));
  uint128_to_string(n, str[i], ARRAY_CAPACITY(str[i]));
  return str[i];
}

//------------------------------------------------------------------------------
// Compute sorted list of factors, given a prime-power factorization.

uint128 *compute_factors(const PrimePowerFactorization ppf)
{
  const uint128 n =       ppf.n;
  const uint    d = (uint)ppf.d;
  const uint    ω = (uint)ppf.ω;

  if (n == 0)
    return NULL;

  uint128 *factors = malloc((d + 1) * sizeof(*factors));
  if (!factors)
    fatal_error("Failed to allocate array of %u factors.", d);
  uint128 *const factors_end = &factors[d];


  // --- Seed the factors[] array.

  factors_end[0] = 0;   // Dummy value to simplify looping in bottleneck code.
  factors_end[-1] = 1;  // Seed value.

  if (n == 1)
    return factors;


  // --- Iterate over all prime factors.

  uint range = 1;
  for (uint i = 0; i < ω; i++)
  {
    const uint128 p = ppf.f[i].p;
    const uint    e = ppf.f[i].e;

    // --- Initialize phantom input lists and output list.
    assert(e < 128);
    assert(range < d);
    uint128 *restrict in[128];
    uint128 pe[128], f[128];
    for (uint j = 0; j <= e; j++)
    {
      in[j] = &factors[d - range];
      pe[j] = (j == 0)? 1 : pe[j-1] * p;
      f[j] = pe[j];
    }
    uint active_list_count = 1 + e;
    range *= 1 + e;
    uint128 *restrict out = &factors[d - range];

    // --- Merge phantom input lists to output list, until all input lists are
    //     extinguished.
    while (active_list_count > 0)
    {
      if (active_list_count == 1)
      {
        assert(out == in[0]);
        while (out != factors_end)
          *(out++) *= pe[0];
        in[0] = out;
      }
      else if (active_list_count == 2)
      {
        // This section of the code is the bottleneck of the entire factor-
        // producing algorithm.  Other portions need to be fast, but this
        // *really* needs to be fast; therefore, it has been highly optimized.
        // In fact, it is by far most frequently the case here that pe[0] is 1,
        // so further optimization is warranted in this case.
        uint128 f0 = f[0], f1 = f[1];
        uint128 *in0 = in[0], *in1 = in[1];
        const uint128 pe0 = pe[0], pe1 = pe[1];

        if (pe[0] == 1)
        {
          while (true)
          {
            if (f0 < f1)
              { *(out++) = f0; f0 = *(++in0);
                if (in0 == factors_end) break; }
            else
              { *(out++) = f1; f1 = *(++in1) * pe1; }
          }
        }
        else
        {
          while (true)
          {
            if (f0 < f1)
              { *(out++) = f0; f0 = *(++in0) * pe0;
                if (in0 == factors_end) break; }
            else
              { *(out++) = f1; f1 = *(++in1) * pe1; }
          }
        }

        f[0] = f0; f[1] = f1;
        in[0] = in0; in[1] = in1;
      }
      else if (active_list_count == 3)
      {
        uint128 f0 = f[0], f1 = f[1], f2 = f[2];
        uint128 *in0 = in[0], *in1 = in[1], *in2 = in[2];
        const uint128 pe0 = pe[0], pe1 = pe[1], pe2 = pe[2];

        while (true)
        {
          if (f0 < f1)
          {
            if (f0 < f2)
              { *(out++) = f0; f0 = *(++in0) * pe0;
                if (in0 == factors_end) break; }
            else
              { *(out++) = f2; f2 = *(++in2) * pe2; }
          }
          else
          {
            if (f1 < f2)
              { *(out++) = f1; f1 = *(++in1) * pe1; }
            else
              { *(out++) = f2; f2 = *(++in2) * pe2; }
          }
        }

        f[0] = f0; f[1] = f1, f[2] = f2;
        in[0] = in0; in[1] = in1, in[2] = in2;
      }
      else if (active_list_count >= 3)
      {
        while (true)
        {
          // Chose the smallest multiplier.
          uint k_min = 0;
          uint128 f_min = f[0];
          for (uint k = 0; k < active_list_count; k++)
            if (f[k] < f_min)
              { f_min = f[k]; k_min = k; }

          // Write the output factor, advance the input pointer, and
          // produce a new factor in the array f[] of list heads.
          *(out++) = f_min;
          f[k_min] = *(++in[k_min]) * pe[k_min];
          if (in[k_min] == factors_end)
            { assert(k_min == 0); break; }
        }
      }

      // --- Remove the newly emptied phantom input list.  Note that this is
      //     guaranteed *always* to be the first remaining non-empty list.
      assert(in[0] == factors_end);
      for (uint j = 1; j < active_list_count; j++)
      {
        in[j-1] = in[j];
        pe[j-1] = pe[j];
         f[j-1] =  f[j];
      }
      active_list_count -= 1;
    }

    assert(out == factors_end);
  }


  // --- Validate array of sorted factors.
  #ifndef NDEBUG
  {
    for (uint k = 0; k < d; k++)
    {
      if (factors[k] == 0)
        fatal_error("Produced a factor of 0 at index %u.", k);

      if (n % factors[k] != 0)
        fatal_error("Produced non-factor %s at index %u.",
                    uint128_to_static_string(factors[k], 0), k);

      if ((k > 0) && (factors[k-1] == factors[k]))
        fatal_error("Duplicate factor %s at index %u.",
                    uint128_to_static_string(factors[k], 0), k);

      if ((k > 0) && (factors[k-1] > factors[k]))
        fatal_error("Out-of-order factors %s and %s at indexes %u and %u.",
                    uint128_to_static_string(factors[k-1], 0),
                    uint128_to_static_string(factors[k], 1),
                    k-1, k);
    }
  }
  #endif


  return factors;
}

//------------------------------------------------------------------------------
// Print prime-power factorization of a number.

void print_ppf(const PrimePowerFactorization ppf)
{
  printf("%s = ", uint128_to_static_string(ppf.n, 0));
  if (ppf.n == 0)
  {
    printf("0");
  }
  else
  {
    for (uint i = 0; i < ppf.ω; i++)
    {
      if (i > 0)
        printf(" x ");

      printf("%s", uint128_to_static_string(ppf.f[i].p, 0));

      if (ppf.f[i].e > 1)
        printf("^%"PRIu8"", ppf.f[i].e);
    }
  }
  printf("\n");
}

//------------------------------------------------------------------------------
int compare_powers_ascending(const void *const pf1,
                             const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.e < f2.e)?  -1:
          (f1.e > f2.e)?  +1:
                           0;  // Not an error; duplicate exponents are common.
}

//------------------------------------------------------------------------------
int compare_powers_descending(const void *const pf1,
                              const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.e < f2.e)?  +1:
          (f1.e > f2.e)?  -1:
                           0;  // Not an error; duplicate exponents are common.
}

//------------------------------------------------------------------------------
int compare_primes_ascending(const void *const pf1,
                             const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.p < f2.p)?  -1:
          (f1.p > f2.p)?  +1:
                           0;  // Error; duplicate primes must never occur.
}

//------------------------------------------------------------------------------
int compare_primes_descending(const void *const pf1,
                              const void *const pf2)
{
  const PrimePower f1 = *((const PrimePower *)pf1);
  const PrimePower f2 = *((const PrimePower *)pf2);

  return  (f1.p < f2.p)?  +1:
          (f1.p > f2.p)?  -1:
                           0;  // Error; duplicate primes must never occur.
}

//------------------------------------------------------------------------------
// Sort prime-power factorization.

void sort_ppf(PrimePowerFactorization *const ppf,
              const bool primes_major,      // Best false
              const bool primes_ascending,  // Best false
              const bool powers_ascending)  // Best false
{
  int (*compare_primes)(const void *, const void *) =
    primes_ascending? compare_primes_ascending : compare_primes_descending;

  int (*compare_powers)(const void *, const void *) =
    powers_ascending? compare_powers_ascending : compare_powers_descending;

  if (primes_major)
  {
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
  }
  else
  {
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_primes);
    mergesort(ppf->f, ppf->ω, sizeof(ppf->f[0]), compare_powers);
  }
}

//------------------------------------------------------------------------------
// Compute prime-power factorization of a 128-bit value.  Note that this
// function is designed to be fast *only* for numbers with very simple
// factorizations, e.g., those that produce large factor lists.  Do not attempt
// to factor large semiprimes with this function.  (The author does know how to
// factor large numbers efficiently; however, efficient factorization is beyond
// the scope of this small test program.)

PrimePowerFactorization compute_ppf(const uint128 n)
{
  PrimePowerFactorization ppf;

  if (n == 0)
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
  }
  else if (n == 1)
  {
    ppf = (PrimePowerFactorization){ .f[0] = { .p = 1, .e = 1 },
                                     .ω = 1, .Ω = 1, .d = 1, .n = 1 };
  }
  else
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };

    uint128 m = n;
    for (uint128 p = 2; p * p <= m; p += 1 + (p > 2))
    {
      if (m % p == 0)
      {
        assert(ppf.ω <= MAX_ω);
        ppf.f[ppf.ω].p = p;
        ppf.f[ppf.ω].e = 0;
        while (m % p == 0)
          { m /= p; ppf.f[ppf.ω].e += 1; }
        ppf.d *= (1 + ppf.f[ppf.ω].e);
        ppf.Ω += ppf.f[ppf.ω].e;
        ppf.ω += 1;
      }
    }
    if (m > 1)
    {
      assert(ppf.ω <= MAX_ω);
      ppf.f[ppf.ω].p = m;
      ppf.f[ppf.ω].e = 1;
      ppf.d *= 2;
      ppf.Ω += 1;
      ppf.ω += 1;
    }
  }

  return ppf;
}

//------------------------------------------------------------------------------
// Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
// The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
// Primes must not exceed 2^128 - 1 and must not be repeated.  Exponents must
// not exceed 2^8 - 1, but can of course be repeated.  The constructed value
// must not exceed 2^128 - 1.

PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
{
  assert(pairs <= MAX_ω);

  PrimePowerFactorization ppf;

  if (pairs == 0)
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
  }
  else
  {
    ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };

    for (uint i = 0; i < pairs; i++)
    {
      ppf.f[i].p = uint128_from_string(values[(i*2)+0]);
      ppf.f[i].e = (uint8)strtoumax(values[(i*2)+1], NULL, 10);

      // Validate prime value.
      if (ppf.f[i].p < 2)  // (Ideally this would actually do a primality test.)
        fatal_error("Factor %s is invalid.",
                    uint128_to_static_string(ppf.f[i].p, 0));

      // Accumulate count of unique prime factors.
      if (ppf.ω > UINT8_MAX - 1)
        fatal_error("Small-omega overflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.ω += 1;

      // Accumulate count of total prime factors.
      if (ppf.Ω > UINT8_MAX - ppf.f[i].e)
        fatal_error("Big-omega wverflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.Ω += ppf.f[i].e;

      // Accumulate total divisor count.
      if (ppf.d > UINT32_MAX / (1 + ppf.f[i].e))
        fatal_error("Divisor count overflow at factor %s^%"PRIu8".",
                    uint128_to_static_string(ppf.f[i].p, 0), ppf.f[i].e);
      ppf.d *= (1 + ppf.f[i].e);

      // Accumulate value.
      for (uint8 k = 1; k <= ppf.f[i].e; k++)
      {
        if (ppf.n > UINT128_MAX / ppf.f[i].p)
          fatal_error("Value overflow at factor %s.",
                      uint128_to_static_string(ppf.f[i].p, 0));
        ppf.n *= ppf.f[i].p;
      }
    }
  }

  return ppf;
}

//------------------------------------------------------------------------------
// Main control.  Parse command line and produce list of factors.

int main(const int argc, const char *const argv[])
{
  bool primes_major     = false;
  bool primes_ascending = false;
  bool powers_ascending = false;

  PrimePowerFactorization ppf;


  // --- Parse prime-power sort specifier (if present).

  uint value_base = 1;
  uint value_count = (uint)argc - 1;
  if ((argc > 1) && (argv[1][0] == '-'))
  {
    static const struct
    {
      char *str; bool primes_major, primes_ascending, powers_ascending;
    }
    sort_options[] =
    {
                        // Sorting criteria:
                        // ----------------------------------------
      { "ep", 0,0,0 },  // Exponents descending, primes descending
      { "Ep", 0,0,1 },  // Exponents ascending, primes descending
      { "eP", 0,1,0 },  // Exponents descending, primes ascending
      { "EP", 0,1,1 },  // Exponents ascending, primes ascending
      { "p",  1,0,0 },  // Primes descending (exponents irrelevant)
      { "P",  1,1,0 },  // Primes ascending (exponents irrelevant)
    };

    bool valid = false;
    for (uint i = 0; i < ARRAY_CAPACITY(sort_options); i++)
    {
      if (strcmp(&argv[1][1], sort_options[i].str) == 0)
      {
        primes_major     = sort_options[i].primes_major;
        primes_ascending = sort_options[i].primes_ascending;
        powers_ascending = sort_options[i].powers_ascending;
        valid = true;
        break;
      }
    }

    if (!valid)
      fatal_error("Bad sort specifier: \"%s\"", argv[1]);

    value_base += 1;
    value_count -= 1;
  }


  // --- Prime factorization from either a number or a raw prime factorization.

  if (value_count == 1)
  {
    uint128 n = uint128_from_string(argv[value_base]);
    ppf = compute_ppf(n);
  }
  else
  {
    if (value_count % 2 != 0)
      fatal_error("Odd number of arguments (%u) given.", value_count);
    uint pairs = value_count / 2;
    ppf = parse_ppf(pairs, &argv[value_base]);
  }


  // --- Sort prime factorization by either the default or the user-overridden
  //     configuration.

  sort_ppf(&ppf, primes_major, primes_ascending, powers_ascending);
  print_ppf(ppf);


  // --- Run for (as close as possible to) a fixed amount of time, tallying the
  //     elapsed CPU time.

  uint128 iterations = 0;
  double cpu_time = 0.0;
  const double cpu_time_limit = 0.10;
  uint128 memory_usage = 0;
  while (cpu_time < cpu_time_limit)
  {
    clock_t clock_start = clock();
    uint128 *factors = compute_factors(ppf);
    clock_t clock_end = clock();
    cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;
    memory_usage = sizeof(*factors) * ppf.d;

    if (++iterations == 0) //1)
    {
      for (uint32 i = 0; i < ppf.d; i++)
        printf("%s\n", uint128_to_static_string(factors[i], 0));
    }

    if (factors) free(factors);
  }


  // --- Print the average amount of CPU time required for each iteration.

  uint memory_scale = (memory_usage >= 1e9)? 9:
                      (memory_usage >= 1e6)? 6:
                      (memory_usage >= 1e3)? 3:
                                             0;
  char *memory_units = (memory_scale == 9)? "GB":
                       (memory_scale == 6)? "MB":
                       (memory_scale == 3)? "KB":
                                            "B";

  printf("%s  %"PRIu32" factors  %.6f ms  %.3f ns/factor  %.3f %s\n",
         uint128_to_static_string(ppf.n, 0),
         ppf.d,
         cpu_time/iterations * 1e3,
         cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
         (double)memory_usage / pow(10, memory_scale),
         memory_units);

  return 0;
}
Bolte answered 12/5, 2015 at 3:39 Comment(5)
great stuff! nice. looks like in the few worst cases the production rate is same as before, and much better on average.Nadaha
@WillNess — Thanks! One other thing I noticed is that it spatial locality is improved in merging by multiplying the smaller primes last (within a given exponent group). For example, 11x7x5x3x2 results in better spatial locality during the final binary merge than 2x3x5x7x11, because multiplying a set by 2 spreads things out less than multiplying by 11. However, when doing arbitrary-size arithmetic (as opposed to fixed-size, say 32- or 64-bit), it is advantageous to keep the numbers as small as possible for as long as possible, so multiplying the largest primes last isn't necessarily bad.Bolte
@WillNess — One other crazy thing: Although multiplying the larger factors last is worse for spatial locality, it seems to be better for branch prediction! When I factor 10047560702722583040 using 2⁹ x 3² x 5 x 7 x 11 x 13 x 17 x 19 x 23 x 29 x 31 x 37 x 41 x 43, it's 11.3% faster than when I factor it using 2⁹ x 3² x 43 x 41 x 37 x 31 x 29 x 23 x 19 x 17 x 13 x 11 x 7 x 5. I looked at the temporal activity of memory reads, and indeed the powers of larger factors give more predictable branching. Funny that. Would not have guessed it, but in hindsight makes sense. Thanks for asking earlier.Bolte
The speed difference for 18444563725810051830 (whose largest prime factor is 60623) is even more profound: a 32% speedup between ascending vs. descending. I wish S.O. didn't limit answers to 30,000 characters, otherwise I'd include an in-depth analysis of this in my answer.Bolte
so the simplest is the best, after all. :)Nadaha
N
9

We can merge streams of multiples, produced so there are no duplicates in the first place.

Starting with the list [1], for each unique prime factor p, we multiply the list by p iteratively k times (where k is the multiplicity of p) to get k new lists, and merge them together with that incoming list.

In Haskell,

ordfactors n =          --   <----------------------<---<---<-----\
  foldr g [1]           -- g (19,1) (g (7,1) (g (3,2) (g (2,3) [1])))
    . reverse           -- [(19,1),(7,1),(3,2),(2,3)]              ^
    . primePowers       -- [(2,3),(3,2),(7,1),(19,1)]              |
    $ n                 -- 9576                    --->--->--->----/
       where
         g (p,k) xs = mergeAsTree 
                        . take (k+1)          -- take 3 [a,b,c,d..] = [a,b,c]
                        . iterate (map (p*))  -- iterate f x = [x, y, z,...]
                        $ xs                  --  where { y=f x; z=f y; ... }

{-  g (2,3) [1] = let a0 = [1]        
                      a1 = map (2*) a0               -- [2]
                      a2 = map (2*) a1               -- [4]
                      a3 = map (2*) a2               -- [8]
                  in mergeAsTree [ a0, a1, a2, a3 ]  -- xs2 = [1,2,4,8]

    g (3,2) xs2 = let b0 = xs2                       -- [1,2,4,8]
                      b1 = map (3*) b0               -- [3,6,12,24]
                      b2 = map (3*) b1               -- [9,18,36,72]
                  in mergeAsTree [ b0, b1, b2 ]      -- xs3 = [1,2,3,4,6,8,9,12,...] 

    g (7,1) xs3 = mergeAsTree [ xs3, map (7*) xs3 ]  -- xs4 = [1,2,3,4,6,7,8,9,...]

    g (19,1) xs4 = mergeAsTree [ xs4, map (19*) xs4 ]  
-}
mergeAsTree xs   = foldi merge [] xs    -- [a,b]     --> merge a b
                                        -- [a,b,c,d] --> merge (merge a b) (merge c d)
foldi f z []     = z
foldi f z (x:xs) = f x (foldi f z (pairs f xs))
pairs f (x:y:t)  = f x y : pairs f t
pairs _ t        = t

foldi arranges the binary merges (which presuppose that there's no duplicates in the streams being merged, for streamlined operation) in a tree-like fashion for speed; whereas foldr works in linear fashion.

primePowers n | n > 1 =           -- 9576 => [(2,3),(3,2),(7,1),(19,1)]
  map (\xs -> (head xs,length xs)) --                                ^
    . group                       -- [[2,2,2],[3,3],[7],[19]]        |
    . factorize                   -- [2,2,2,3,3,7,19]                |
    $ n                           -- 9576             --->--->--->---/

group is a standard function that groups adjacent equals in a list together, and factorize is a well-known trial-division algorithm that produces prime factors of a number in non-decreasing order:

factorize n | n > 1 = go n (2:[3,5..])   -- 9576 = 2*2*2*3*3*7*19
   where                                 -- produce prime factors only
     go n (d:t)
        | d*d > n    = [n]
        | r == 0     =  d : go q (d:t)
        | otherwise  =      go n t
            where 
              (q,r)  = quotRem n d
factorize _ = []

(.) is a functional composition operator, chaining the output of its argument function on its right as input into the one on its left. It (and $) can be read aloud as "of".

Nadaha answered 2/5, 2015 at 2:1 Comment(33)
Generate-and-mergesort, but lazy, eh? I wish I could read Haskell.Awful
@Awful so, it is "generate sorted, and merge". as all the subsequences are generated in order, it should avoid the additional log n factor; but in imperative language it might be a challenge with all the needed bookkeeping (i.e. pointers into each sequence etc.)Nadaha
How does the number of subsequences relate to d? If it's O(d), you'll get the log(d) factor in the runtime for the same reason mergesort does.Awful
@Awful the number of steps n in the algorithm is the number of unique prime factors (not the total number of divisors, which in a regular case woulds be much much higher). On each step, k new streams are merged in, where k is the multiplicity of that prime factor. So the additional factor (over the O(d) since we produce d divisors) is not log(d), but is somehow related to the number of (unique (?)) prime factors of the number.Nadaha
@WillNess: Great solution, and the detailed comments are very helpful as I also struggle to read functional code! In your last comment I think you mean "the number of merge steps in the algorithm is the number of unique prime factors". Also if any of the prime factors have multiplicity > 1, then some of the merges are not 2-way merges -- I can't pinpoint exactly what that does to the time complexity, but it may make it worse than an extra log(num_unique_prime_factors).Sofko
@Sofko thanks! in the main "loop", foldr g [1], the number of steps is the number of unique prime factors of n. These are arranged linearly. On each step, k (= multiplicity) new streams are calculated linearly (by iterate (p*)) but are merged in a tree, which has the usual advantage over linear merging (which may be noticeable if k is big). I now think it might be advantageous to sort the unique factorization putting the factors with the largest ks at the top. But what is it overall, is kinda hard to wrap my head around it immediately. Maybe some simulation would help...Nadaha
@Sofko and so, to your last point, yes, it might be O( unique_pfs * log(max_multiplicity)) or something. So it might be looking dangerously close to the dreaded log(d) factor, actually. Maybe if the unique factorization is sorted, it'll make it be better...Nadaha
could you please post working Haskell code (including any imports)?Lambda
@גלעדברקן group is from Data.List. foldi is on wikipedia (linked), merge is as usual, e.g. mergeBy (<). Or you could import Data.List.Ordered (I think foldi is called foldt, there).Nadaha
@WillNess — I am curious, how fast does this solution run on input of a number like 18401055938125660800, which has 184320 factors? Now that I've implemented a fast bucket-sort solution to replace the Quicksort-based solution I'd originally started with, I might try your method if it looks comparable in speed.Bolte
@ToddLehman 0.022 secs, compiled Haskell code, calculating length (ordfactors ...) on Win7-64, Core i7 2.5 GHz.Nadaha
@ToddLehman and for (n*120), which has 2.2x more factors (405504), the time is 0.054 secs (i.e. d^1.13 empirically, for d a number of factors; seems to agree with d log d rule more or less).Nadaha
@WillNess — Thanks! That is definitely very impressive. For the first number, it's 4x slower than the C implementation I have using bucket sort...but for the second number, it's actually 30% faster! — suggesting better asymptotic growth. (I'm running my tests on a 3.4 GHz Core i7). I am really curious now how fast your method would run in C... Maybe learning some Haskell is in my future!Bolte
@ToddLehman ah, thanks for that. Interesting. :) This kind of code is easiest to translate using something like Python's generators. Otherwise, there'll be a lot of bookkeeping to do, manually. OTOH it is certainly doable. At their core, generators are nothing more than loops with internal state anyway; you'll just have to maintain that state by yourself, in C. I might have a go at it, too. Maybe... :)Nadaha
so, your time for 1st is 0.022/4 = 0.0055; and for 2nd it's 0.054*1.3 = 0.07? then the empirical order of growth is logBase 2.2 (700/55) = d^3. this doesn't look right, compared to your plots, which show ~ d^1.1. Maybe the memory usage hits some hard limit, like you say... Hitting the wall so to speak. -- I've tested my code for (n*120*360) now; number of factors is 2.1212x, d=860160. The time is 0.127 secs, so order of growth is logBase 2.1212 (127/54) = 0.137 again (compared with 0.138 for n*120, d=405504). So it does seem to scale OK for this range.Nadaha
@WillNess — Oh, my code which factors the first number (with 184320 factors) is pure 64-bit hardware arithmetic. The one which factors the second number (with 405504 factors) is a completely separate version of the code which does 128-bit arithmetic, which has to be simulated in software (automatically by the compiler) on x86-64, so there is a large penalty when crossing the 2⁶⁴ boundary in the number being factored. For my application, I don't actually need 128-bit numbers; I was just curious, so I developed a 128-bit version for comparison.Bolte
and in C, we can use a bona fide array-based priority queue, O(1) per operation (instead of that tree-folding), to get rid of that log(d) factor, for overall, perhaps, O(d*n_unique_factors) which I hoped for, originally.Nadaha
ah, I see, didn't think about that ; Haskell has unlimited Integer type. -- so it doesn't count; your code's order of growth is ~ n^1.1 and with low constant factor, too.Nadaha
@WillNess — One last question: How fast is your Haskell code on the number 331984163921408578320113791401984000000 (which has 49996800 factors)? My 128-bit C version eats up 2.2 GB of RAM for that, and takes 19.3 seconds on my 3.4 GHz. Core i7). I'm guessing your Haskell code beats my C code by a wide margin on this number — which, if true, means that I really should try implenting your algorithm in C.Bolte
17.9 secs, memory shoots up by about 1G.Nadaha
no, scratch that. properly tested as stand-alone exe, 12.78 secs, 0.96 GB memory. (and I must point out, my code is much, much, much shorter... ;) ;) :) ).Nadaha
@WillNess — Amazeballs. Sold!Bolte
last interesting tidbit. with reverse in the foldr g [1] . reverse . primePowers replaced with sortBy (comparing snd) — which changed the factorization to [(47,1),(43,1),(41,1),(37,1),(31,1),(29,1),(23,1),(19,1),(17,1),(13,2),(11,2),(7,4),(5,6),(3,9),(2,30)] — it ran for the same time, but in 788 MB memory; but without reversing, on plain factorization [(2,30),(3,9),(5,6),(7,4),(11,2),(13,2),(17,1),(19,1),(23,1),(29,1),(31,1),(37,1),(41,1),(43,1),(47,1)] (as well as on sorted, reversed) it ran for 20.6 secs, in 333 MB! (interesting how would C code behave there...)Nadaha
@WillNess — I see why your algorithm is so good: (1) It is very cache-friendly: everything is sequential. (2) It is basically a Natural Mergesort; but perhaps more importantly, it is a temporally in situ natural mergesort; that is, you begin merging immediately, without ever having to scan for runs. (3) Runtime in the hardest cases is dominated by the final doublings from the $s$ prime factors that appear only once, performed last. Therefore, the algorithm is $O(\sum_{k=1}^s{\frac{1}{2^k}} d) = O(2d) = O(d)$ for these cases, assuming constant-time multiplies and a custom 2-way merge.Bolte
@Sofko — If prime factors with multiplicity > 1 are processed first (that is, the highest multiplicity first, then working downward), then the runtime of the algorithm is dominated by the prime factors at the end with multiplicity of 1. For example, 18401055938125660800 = 2^7 x 3^4 x 5^2 x 7^2 x 11 x 13 x 17 x 19 x 23 x 29 x 31 x 37 x 41 is dominated by the list-size doublings incurred by prime factors { 11, 13, 17, 19, 23, 29, 31, 37, 41 }, so that the non-2-way merges (which occur very early on) are dwarfed by the 9 doublings that occur later.Bolte
@ToddLehman I copied the wrong factorization in my last comment. the first one should have been [(17,1),(19,1),(23,1),(29,1),(31,1),(37,1),(41,1),(43,1),(47,1),(11,2),(13,2),(7,4),(5,6),(3,9),(2,30)]. what I copied by mistake is the reversed factors; with that it ran for same time (12.85 secs) but in 957 MB memory.Nadaha
@WillNess — Hey, I took your basic idea and ran with it to its ultimate conclusion, where both spatial complexity and runtime complexity are O(d), by implementing "phantom lists" in the merging steps (that is, the lists don't actually exist in memory at the time of the merge) and storing the factors directly into their final locations to minimize writes to memory. In C, the last example takes 0.731 secs and uses 800 MB (that's 800 MB for returning the list of factors, plus 600 bytes of overhead for accounting). I'll post some graphs soon. Thanks so much again for your insights!Bolte
@ToddLehman wow, great! yes to the phantom lists idea, thought about it too. how do you merge, through a PQ or something else? ... wait, so it runs in 600 bytes? :) when you post the graphs do ping me please. :) had you played with the order as per the multiplicity of factors? glad to hear it worked out as we both hoped.Nadaha
@WillNess — Nope, no priority queue required! It simply keeps the head element of each phantom list in a tiny array (one element per power of the prime being multiplied by), and does a k-way merge of the phantom lists. Indeed, it is significantly faster to process the high-multiplicity primes first and low-multiplicity primes last. Often, binary merges dominate the runtime. I'll post code and graphs when my data collection run finishes. The 600 bytes overhead is assuming that the caller wants the list of factors returned in memory rather than printed and discarded.Bolte
thanks. don't you get a hit in efficiency though with the k-way merge, esp. for big ks? I thought about defining some struct with the pointers and the multipliers etc, and place those structs in an array-based PQ, for the quicker merge. for 2-way (maybe even 4-way) this is of course not necessary. but for k=30...Nadaha
@WillNess — Best thing is to take the hit with the k-way merge up front, when the list is still in its infancy. That is, do the primes with relatively large exponents first, and leave the primes with low exponents for last. This way, taking the hit for the k-way merge isn't even noticeable. Although it is possible to construct numbers like n = (2x3x5x7)^9 that incur, for example, a 10-way merge at every step, d(n) for this n is still only 10,000. The only way to get really high factor counts is to have lots of unique primes, and that generally means having a lot of small exponents.Bolte
@WillNess — p.s. I just posted my C implementation, with runtime statistics and a chart showing internal memory organization of the data structure: #29993404Bolte
see also https://mcmap.net/q/89075/-two-simple-codes-to-generate-divisors-of-a-number-why-is-the-recursive-one-fasterNadaha
S
5

In short: Repeatedly pull the next-smallest factor from a heap, then push back every multiple of that factor that is still a factor of n. Use a trick to avoid duplicates arising, so that the heap size never exceeds d. The time complexity is O(kd log d), where k is the number of distinct prime factors.

The key property that we make use of is that if x and y are both factors of n with y = x*p for some factor p >= 2 -- i.e. if the prime factors of x are a proper submultiset of the prime factors of y -- then x < y. This means that it is always safe to output x before any such y is even inserted into the heap. The heap is effectively used only to compare factors where neither is a submultiset of the other.

A first attempt: Duplicates slow things down

It will be helpful to first describe an algorithm that produces the right answer, but also produces many duplicates:

  1. Set prev = NULL.
  2. Insert 1 into a heap H.
  3. Extract the top of heap t from H. If the heap is empty, stop.
  4. If t == prev then goto 3. [EDIT: Fixed]
  5. Output t.
  6. Set prev = t.
  7. For each distinct prime factor p of n:
    • If n % (t*p) == 0 (i.e. if t*p is still a factor of n), push t*p onto H.
  8. Goto 3.

The only problem with the above algorithm is that it can generate the same factor many times. For example, if n = 30, then the factor 15 will be generated as a child of the factor 5 (by multiplying by the prime factor 3), and also as a child of the factor 3 (by multiplying by 5). One way around this is to notice that any duplicates must be read out in a contiguous block when they reach the top of the heap, so you can simply check whether the top of the heap is equal to the just-extracted value, and to keep extracting and discarding it if so. But a better approach is possible:

Killing duplicates at the source

How many ways can a factor x be generated? First consider the case in which x contains no prime factors with multiplicity > 1. In that case, if it contains m distinct prime factors, then there are m-1 "parent" factors that will generate it as a "child" in the previous algorithm -- each of these parents consists of some subset of m-1 of the m prime factors, with the remaining prime factor being the one that gets added to the child. (If x has a prime factor with multiplicity > 1, then there are in fact m parents.) If we had a way of deciding on exactly one of these parents to be the "chosen one" that actually generates x as a child, and this rule resulted in a test that could be applied to each parent at the time that parent is popped off, then we could avoid ever creating any duplicates in the first place.

We can use the following rule: For any given x, choose the potential parent y that is missing the largest of x's m factors. This makes for a simple rule: A parent y produces a child x if and only if x = y*p for some p greater than or equal to any prime factor already in y. This is easy to test for: Just loop through prime factors in decreasing order, generating children for each, until we hit a prime factor that already divides y. In the previous example, the parent 3 will produce 15, but the parent 5 will not (because 3 < 5) -- so 15 indeed gets produced only once. For n = 30, the complete tree looks like:

       1
      /|\
     2 3 5
    /|  \
   6 10  15
   |
   30

Notice that each factor is generated exactly once.

The new, duplicate-free algorithm is as follows:

  1. Insert 1 into a heap H.
  2. Extract the top of heap t from H. If the heap is empty, stop.
  3. Output t.
  4. For each distinct prime factor p of n in decreasing order:
    • If n % (t*p) == 0 (i.e. if t*p is still a factor of n), push t*p onto H.
    • If t % p == 0 (i.e. if t already contains p as a factor) then stop.
  5. Goto 2.
Sofko answered 2/5, 2015 at 0:50 Comment(5)
Nice dupe-elimination scheme. I get the feeling the factor of k in the runtime bound could be improved, either with a more carefully-constructed algorithm or with a more careful complexity analysis.Awful
Also, idea I found in another discussion: cut off the factor generation at sqrt(n) instead of n, and generate factors greater than sqrt(n) by dividing n by the smaller factors.Awful
@user2357112: I think you're right about the factor of k. For a factor x, let f be its largest prime factor. When x is popped, we can make it generate at most 2 children: one containing another copy of f, and one with f removed and the next higher prime factor next(f) included (this, i.e. x/fnext(f), is also necessarily > x). To do this efficiently we need to store pairs (x, i) in the heap, where i is the *index of the largest prime factor in x.Sofko
@user2357112: Regarding cutting off at sqrt(n): I don't think that will give an asymptotic improvement, since there are at most only total_number_of_factors/2 above sqrt(n). It might improve the constant factor though. (In fact you could just fill in an array from both ends as you read each factor from the heap.)Sofko
Although source code wasn't provided in this answer, I marked it as the accepted answer because I felt its algorithmic description was very clear. I haven't implemented it yet in C, because I worry that a heap-based solution won't be able to compete with a two-dimensional bucket sort, but the heap-based algorithm outlined in this answer is very nice and elegant, to be sure.Bolte
B
3

[I'm posting this answer just as a formality for completeness. I've already chosen someone else's answer as the accepted answer.]

Overview of algorithm. In searching for the fastest way to generate an in-memory list of factors (64-bit unsigned values in my case), I settled upon a hybrid algorithm that implements a two-dimensional bucket sort, which takes advantage of the internal knowledge of the sort keys (i.e., they are just integers and can therefore be computed upon). The specific method is something closer to a “ProxMapSort” but with two levels of keys (major, minor) instead of just one. The major key is simply the base-2 logarithm of the value. The minor key is the minimal number of most significant digits of the value needed to produce a reasonable spread at the second layer of buckets. Factors are produced first into a temporary work array of unsorted factors. Next, their distribution is analyzed and an array of bucket indexes is allocated and populated. Finally, the factors are stored directly into place in the final sorted array, using insertion sort. The vast majority of buckets have only 1, 2, or 3 values. Examples are given in the source code, which is attached at the bottom of this answer.

Spatial complexity. Memory utilization is approximately 4x that of a Quicksort-based solution, but this is actually rather insignificant, as the maximum memory ever used in the worst case (for 64-bit input) is 5.5 MB, of which 4.0 MB is held for only a small handful of milliseconds.

Runtime complexity. Performance is far better than a hand-coded Quicksort-based solution: for numbers with a nontrivial number of factors, it is unformly about 2.5x times faster. On my 3.4 GHz. Intel i7, it produces the 184,320 factors of 18,401,055,938,125,660,800 in sorted order in 0.0052 seconds, or about 96 clock cycles per factor, or about 35 million factors per second.

Graphs. Memory and runtime performance were profiled for the 47,616 canonical representatives of the equivalance classes of prime signatures of numbers up to 2⁶⁴–1. These are the so-called “highly factorable numbers” in 64-bit search space.

Total runtime is ~2.5x better than a Quicksort-based solution for nontrivial factor counts, shown below on this log–log plot:

Total Runtime

The number of sorted factors produced per second is essentially the inversion of the above. Performance on a per-factor basis declines after the sweet spot of approximately 2000 factors, but not by much. Performance is affected by L1, L2, and L3 cache sizes, as well as the count of unique prime factors of the number being factored, which goes up roughly with the logarithm of the input value.

Sorted Factors Per Second

Peak memory usage is a straight line in this log–log plot, since it is proportional to the base-2 logarithm of the number of factors. Note that peak memory usage is only for a very brief period of time; short-lived work arrays are discarded within milliseconds. After the temporary arrays are discarded, what remains is the final list of factors, which is the same minimal usage as seen in the Quicksort-based solution.

Memory Usage

Source Code. Attached below is a proof-of-concept program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well.

 /*==============================================================================

 DESCRIPTION

    This is a small proof-of-concept program to test the idea of "sorting"
    factors using a form of bucket sort.  The method is essentially a 2D version
    of ProxMapSort that has tuned for vast, nonlinear distributions using two
    keys (major, minor) rather than one.  The major key is simply the floor of
    the base-2 logarithm of the value, and the minor key is derived from the most
    significant bits of the value.


 INPUT

    Input is given on the command line, either as a single argument giving the
    number to be factored or an even number of arguments giving the 2-tuples that
    comprise the prime-power factorization of the desired number.  For example,
    the number

       75600 = 2^4 x 3^3 x 5^2 x 7

    can be given by the following list of arguments:

       2 4 3 3 5 2 7 1

    Note:  If a single number is given, it will require factoring to produce its
    prime-power factorization.  Since this is just a small test program, a very
    crude factoring method is used that is extremely fast for small prime factors
    but extremely slow for large prime factors.  This is actually fine, because
    the largest factor lists occur with small prime factors anyway, and it is the
    production of large factor lists at which this program aims to be proficient.
    It is simply not interesting to be fast at producing the factor list of a
    number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
    it has only 32 factors.  Numbers with tens or hundreds of thousands of
    factors are much more interesting.


 OUTPUT

    Results are written to standard output.  A list of factors in ascending order
    is produced, followed by runtime (in microseconds) required to generate the
    list (not including time to print it).


 STATISTICS

    Bucket size statistics for the 47616 canonical representatives of the prime
    signature equivalence classes of 64-bit numbers:

    ==============================================================
    Bucket size     Total count of factored       Total count of
         b          numbers needing size b      buckets of size b
    --------------------------------------------------------------
         1               47616 (100.0%)         514306458  (76.2%)
         2               47427  (99.6%)         142959971  (21.2%)
         3               43956  (92.3%)          16679329   (2.5%)
         4               27998  (58.8%)            995458   (0.1%)
         5                6536  (13.7%)             33427  (<0.1%)
         6                 400   (0.8%)               729  (<0.1%)
         7                  12  (<0.1%)                18  (<0.1%)
    --------------------------------------------------------------
         ~               47616 (100.0%)         674974643 (100.0%)
    --------------------------------------------------------------

    Thus, no 64-bit number (of the input set) ever requires more than 7 buckets,
    and the larger the bucket size the less frequent it is.  This is highly
    desirable.  Note that although most numbers need at least 1 bucket of size 5,
    the vast majority of buckets (99.9%) are of size 1, 2, or 3, meaning that
    insertions are extremely efficient.  Therefore, the use of insertion sort
    for the buckets is clearly the right choice and is arguably optimal for
    performance.


 AUTHOR

    Todd Lehman
    2015/05/08

 */

 #include <inttypes.h>
 #include <limits.h>
 #include <stdbool.h>
 #include <stdlib.h>
 #include <stdio.h>
 #include <stdarg.h>
 #include <string.h>
 #include <time.h>
 #include <math.h>
 #include <assert.h>

 typedef  unsigned int  uint;
 typedef  uint8_t       uint8;
 typedef  uint16_t      uint16;
 typedef  uint32_t      uint32;
 typedef  uint64_t      uint64;

 #define  ARRAY_CAPACITY(x)  (sizeof(x) / sizeof((x)[0]))

 //-----------------------------------------------------------------------------
 // This structure is sufficient to represent the prime-power factorization of
 // all 64-bit values.  The field names ω and Ω are dervied from the standard
 // number theory functions ω(n) and Ω(n), which count the number of unique and
 // non-unique prime factors of n, respectively.  The field name d is derived
 // from the standard number theory function d(n), which counts the number of
 // divisors of n, including 1 and n.
 //
 // The maximum possible value here of ω is 15, which occurs for example at
 // n = 7378677391061896920 = 2^3 x 3^2 x 5 x 7 x 11 x 13 x 17 x 19 x 23 x 29
 // 31 x 37 x 41 x 43 x 47, which has 15 unique prime factors.
 //
 // The maximum possible value of Ω here is 63, which occurs for example at
 // n = 2^63 and n = 2^62 x 3, both of which have 63 non-unique prime factors.
 //
 // The maximum possible value of d here is 184320, which occurs at
 // n = 18401055938125660800 = 2^7 x 3^4 x 5^2 x 7^2 x 11 x 13 x 17 x 19 x 23 x
 // 29 x 31 x 37 x 41.
 //
 // Maximum possible exponents when exponents are sorted in decreasing order:
 //
 //    Index   Maximum   Bits   Example of n
 //    -----   -------   ----   --------------------------------------------
 //        0        63      6   (2)^63
 //        1        24      5   (2*3)^24
 //        2        13      4   (2*3*5)^13
 //        3         8      4   (2*3*5*7)^8
 //        4         5      3   (2*3*5*7*11)^5
 //        5         4      3   (2*3*5*7*11*13)^4
 //        6         3      2   (2*3*5*7*11*13*17)^3
 //        7         2      2   (2*3*5*7*11*13*17*19)^2
 //        8         2      2   (2*3*5*7*11*13*17*19*23)^2
 //        9         1      1   (2*3*5*7*11*13*17*19*23*29)^1
 //       10         1      1   (2*3*5*7*11*13*17*19*23*29*31)^1
 //       11         1      1   (2*3*5*7*11*13*17*19*23*29*31*37)^1
 //       12         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41)^1
 //       13         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41*43)^1
 //       14         1      1   (2*3*5*7*11*13*17*19*23*29*31*37*41*43*47)^1
 //    -----   -------   ----   --------------------------------------------
 //       15        63     37
 //
 #pragma pack(push, 8)
 typedef struct
 {
   uint8   e[16];  // Exponents.
   uint64  p[16];  // Primes in increasing order.
   uint8   ω;      // Count of prime factors without multiplicity.
   uint8   Ω;      // Count of prime factors with multiplicity.
   uint32  d;      // Count of factors of n, including 1 and n.
   uint64  n;      // Value of n on which all other fields of this struct depend.
 }
 PrimePowerFactorization;  // 176 bytes with 8-byte packing
 #pragma pack(pop)

 #define  MAX_ω  15
 #define  MAX_Ω  63

 //-----------------------------------------------------------------------------
 // Fatal error:  print error message and abort.

 void fatal_error(const char *format, ...)
 {
   va_list args;
   va_start(args, format);
   vfprintf(stderr, format, args);
   exit(1);
 }

 //-----------------------------------------------------------------------------
 // Compute 64-bit 2-adic integer inverse.

 uint64 uint64_inv(const uint64 x)
 {
   assert(x != 0);

   uint64 y = 1;
   for (uint i = 0; i < 6; i++)  // 6 = log2(log2(2**64)) = log2(64)
     y = y * (2 - (x * y));

   return y;
 }

 //------------------------------------------------------------------------------
 // Compute 2 to arbitrary power.  This is just a portable and abstract way to
 // write a left-shift operation.  Note that the use of the UINT64_C macro here
 // is actually required, because the result of 1U<<x is not guaranteed to be a
 // 64-bit result; on a 32-bit compiler, 1U<<32 is 0 or is undefined.

 static inline
 uint64 uint64_pow2(x)
 {
   return UINT64_C(1) << x;
 }

 //------------------------------------------------------------------------------
 // Deduce native word size (int, long, or long long) for 64-bit integers.
 // This is needed for abstracting certain compiler-specific intrinsic functions.

 #if UINT_MAX == 0xFFFFFFFFFFFFFFFFU
   #define UINT64_IS_U
 #elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
   #define UINT64_IS_UL
 #elif ULLONG_MAX == 0xFFFFFFFFFFFFFFFFULL
   #define UINT64_IS_ULL
 #else
   //error "Unable to deduce native word size of 64-bit integers."
 #endif

 //------------------------------------------------------------------------------
 // Define abstracted intrinsic function for counting leading zeros.  Note that
 // the value is well-defined for nonzero input but is compiler-specific for
 // input of zero.

 #if   defined(UINT64_IS_U) && __has_builtin(__builtin_clz)
   #define UINT64_CLZ(x) __builtin_clz(x)
 #elif defined(UINT64_IS_UL) && __has_builtin(__builtin_clzl)
   #define UINT64_CLZ(x) __builtin_clzl(x)
 #elif defined(UINT64_IS_ULL) && __has_builtin(__builtin_clzll)
   #define UINT64_CLZ(x) __builtin_clzll(x)
 #else
   #undef UINT64_CLZ
 #endif

 //------------------------------------------------------------------------------
 // Compute floor of base-2 logarithm y = log_2(x), where x > 0.  Uses fast
 // intrinsic function if available; otherwise resorts to hand-rolled method.

 static inline
 uint uint64_log2(uint64 x)
 {
   assert(x > 0);

   #if defined(UINT64_CLZ)
     return 63 - UINT64_CLZ(x);
   #else
     #define S(k) if ((x >> k) != 0) { y += k; x >>= k; }
     uint y = 0; S(32); S(16); S(8); S(4); S(2); S(1); return y;
     #undef S
   #endif
 }

 //------------------------------------------------------------------------------
 // Compute major key, given a nonzero number.  The major key is simply the
 // floor of the base-2 logarithm of the number.

 static inline
 uint major_key(const uint64 n)
 {
   assert(n > 0);
   uint k1 = uint64_log2(n);
   return k1;
 }

 //------------------------------------------------------------------------------
 // Compute minor key, given a nonzero number, its major key, k1, and the
 // bit-size b of major bucket k1.  The minor key, k2, is is computed by first
 // removing the most significant 1-bit from the number, because it adds no
 // information, and then extracting the desired number of most significant bits
 // from the remainder.  For example, given the number n=1463 and a major bucket
 // size of b=6 bits, the keys are computed as follows:
 //
 //    Step 0:  Given number              n = 0b10110110111 = 1463
 //
 //    Step 1:  Compute major key:        k1 = floor(log_2(n)) = 10
 //
 //    Step 2:  Remove high-order 1-bit:  n' = 0b0110110111 = 439
 //
 //    Step 3:  Compute minor key:        k2 = n' >> (k1 - b)
 //                                          = 0b0110110111 >> (10 - 6)
 //                                          = 0b0110110111 >> 4
 //                                          = 0b011011
 //                                          = 27

 static inline
 uint minor_key(const uint64 n, const uint k1, const uint b)
 {
   assert(n > 0); assert(k1 >= 0); assert(b > 0);
   const uint k2 = (uint)((n ^ uint64_pow2(k1)) >> (k1 - b));
   return k2;
 }

 //------------------------------------------------------------------------------
 // Raw unsorted factor.

 #pragma push(pack, 4)

 typedef struct
 {
   uint64  n;   // Value of factor.
   uint32  k1;  // Major key.
   uint32  k2;  // Minor key.
 }
 UnsortedFactor;

 #pragma pop(pack)

 //------------------------------------------------------------------------------
 // Compute sorted list of factors, given a prime-power factorization.

 static uint64 memory_usage;

 uint64 *compute_factors(const PrimePowerFactorization ppf)
 {
   memory_usage = 0;

   if (ppf.n == 0)
     return NULL;

   uint64 *sorted_factors = calloc(ppf.d, sizeof(*sorted_factors));
   if (!sorted_factors)
     fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
   memory_usage += ppf.d * sizeof(*sorted_factors);

   UnsortedFactor *unsorted_factors = malloc(ppf.d * sizeof(*unsorted_factors));
   if (!unsorted_factors)
     fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
   memory_usage += ppf.d * sizeof(*unsorted_factors);


   // These arrays are indexed by the major key of a number.
   uint32 major_counts[64];   // Counts of factors in major buckets.
   uint32 major_spans[64];    // Counts rounded up to power of 2.
   uint32 major_bits[64];     // Base-2 logarithm of bucket size.
   uint32 major_indexes[64];  // Indexes into minor array.
   memset(major_counts,  0, sizeof(major_counts));
   memset(major_spans,   0, sizeof(major_spans));
   memset(major_bits,    0, sizeof(major_bits));
   memset(major_indexes, 0, sizeof(major_indexes));


   // --- Step 1:  Produce unsorted list of factors from prime-power
   //     factorization.  At the same time, count groups of factors by their
   //     major keys.
   {
     // This array is for counting in the multi-radix number system dictated by
     // the exponents of the prime-power factorization.  An invariant is that
     // e[i] <= ppf.e[i] for all i (0 < i <ppf.ω).
     uint8 e[MAX_ω];
     for (uint i = 0; i < ppf.ω; i++)
       e[i] = 0;

     // Initialize inverse-prime-powers.  This array allows for division by
     // p[i]**e[i] extremely quickly in the main loop below.  Note that 2-adic
     // inverses are not defined for even numbers (of which 2 is the only prime),
     // so powers of 2 must be handled specially.
     uint64 pe_inv[MAX_ω];
     for (uint i = 0; i < ppf.ω; i++)
     {
       uint64 pe = 1; for (uint j = 1; j <= ppf.e[i]; j++) pe *= ppf.p[i];
       pe_inv[i] = uint64_inv(pe);
     }

     uint64 n = 1;  // Current factor accumulator.
     for (uint k = 0; k < ppf.d; k++)   // k indexes into unsorted_factors[].
     {
       //printf("unsorted_factors[%u] = %"PRIu64"   j = %u\n", k, n, j);
       assert(ppf.n % n == 0);
       unsorted_factors[k].n = n;

       uint k1 = major_key(n);
       assert(k1 < ARRAY_CAPACITY(major_counts));
       unsorted_factors[k].k1 = k1;
       major_counts[k1] += 1;

       // Increment the remainder of the multi-radix number e[].
       for (uint i = 0; i < ppf.ω; i++)
       {
         if (e[i] == ppf.e[i])  // Carrying is occurring.
         {
           if (ppf.p[i] == 2)
             n >>= ppf.e[i];  // Divide n by 2 ** ppf.e[i].
           else
             n *= pe_inv[i];  // Divide n by ppf.p[i] ** ppf.e[i].

           e[i] = 0;
         }
         else  // Carrying is not occurring.
         {
           n *= ppf.p[i];
           e[i] += 1;
           break;
         }
       }
     }
     assert(n == 1);  // n always cycles back to 1, not to ppf.n.

     assert(unsorted_factors[ppf.d-1].n == ppf.n);
   }


   // --- Step 2:  Define the major bits array, the major spans array, the major
   //     index array, and count the total spans.

   uint32 total_spans = 0;
   {
     uint32 k = 0;
     for (uint k1 = 0; k1 < ARRAY_CAPACITY(major_counts); k1++)
     {
       uint32 count = major_counts[k1];
       uint32 bits = (count <= 1)? count : uint64_log2(count - 1) + 1;
       major_bits[k1] = bits;
       major_spans[k1] = (count > 0)? (UINT32_C(1) << bits) : 0;
       major_indexes[k1] = k;
       k += major_spans[k1];
     }
     total_spans = k;
   }


   // --- Step 3:  Allocate and populate the minor counts array.  Note that it
   //     must be initialized to zero.

   uint32 *minor_counts = calloc(total_spans, sizeof(*minor_counts));
   if (!minor_counts)
     fatal_error("Failed to allocate array of %"PRIu32" counts.", total_spans);
   memory_usage += total_spans * sizeof(*minor_counts);

   for (uint k = 0; k < ppf.d; k++)
   {
     const uint64 n = unsorted_factors[k].n;
     const uint k1 = unsorted_factors[k].k1;
     const uint k2 = minor_key(n, k1, major_bits[k1]);
     assert(k2 < major_spans[k1]);
     unsorted_factors[k].k2 = k2;
     minor_counts[major_indexes[k1] + k2] += 1;
   }


   // --- Step 4:  Define the minor indexes array.
   //
   // NOTE:  Instead of allocating a separate array, the earlier-allocated array
   // of minor indexes is simply repurposed here using an alias.

   uint32 *minor_indexes = minor_counts;  // Alias the array for repurposing.

   {
     uint32 k = 0;
     for (uint i = 0; i < total_spans; i++)
     {
       uint32 count = minor_counts[i];  // This array is the same array...
       minor_indexes[i] = k;            // ...as this array.
       k += count;
     }
   }


   // --- Step 5:  Populate the sorted factors array.  Note that the array must
   //              be initialized to zero earlier because values of zero are used
   //              as sentinels in the bucket lists.

   for (uint32 i = 0; i < ppf.d; i++)
   {
     uint64 n = unsorted_factors[i].n;
     const uint k1 = unsorted_factors[i].k1;
     const uint k2 = unsorted_factors[i].k2;

     // Insert factor into bucket using insertion sort (which happens to be
     // extremely fast because we know the bucket sizes are always very small).
     uint32 k;
     for (k = minor_indexes[major_indexes[k1] + k2];
          sorted_factors[k] != 0;
          k++)
     {
       assert(k < ppf.d);
       if (sorted_factors[k] > n)
         { uint64 t = sorted_factors[k]; sorted_factors[k] = n; n = t; }
     }
     sorted_factors[k] = n;
   }


   // --- Step 6:  Validate array of sorted factors.
   {
     for (uint32 k = 1; k < ppf.d; k++)
     {
       if (sorted_factors[k] == 0)
         fatal_error("Produced a factor of 0 at index %"PRIu32".", k);

       if (ppf.n % sorted_factors[k] != 0)
         fatal_error("Produced non-factor %"PRIu64" at index %"PRIu32".",
                     sorted_factors[k], k);

       if (sorted_factors[k-1] == sorted_factors[k])
         fatal_error("Duplicate factor %"PRIu64" at index %"PRIu32".",
                     sorted_factors[k], k);

       if (sorted_factors[k-1] > sorted_factors[k])
         fatal_error("Out-of-order factors %"PRIu64" and %"PRIu64" "
                     "at indexes %"PRIu32" and %"PRIu32".",
                     sorted_factors[k-1], sorted_factors[k], k-1, k);
     }
   }


   free(minor_counts);
   free(unsorted_factors);

   return sorted_factors;
 }

 //------------------------------------------------------------------------------
 // Compute prime-power factorization of a 64-bit value.  Note that this function
 // is designed to be fast *only* for numbers with very simple factorizations,
 // e.g., those that produce large factor lists.  Do not attempt to factor
 // large semiprimes with this function.  (The author does know how to factor
 // large numbers efficiently; however, efficient factorization is beyond the
 // scope of this small test program.)

 PrimePowerFactorization compute_ppf(const uint64 n)
 {
   PrimePowerFactorization ppf;

   if (n == 0)
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
   }
   else if (n == 1)
   {
     ppf = (PrimePowerFactorization){ .p = { 1 }, .e = { 1 },
                                      .ω = 1, .Ω = 1, .d = 1, .n = 1 };
   }
   else
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };

     uint64 m = n;
     for (uint64 p = 2; p * p <= m; p += 1 + (p > 2))
     {
       if (m % p == 0)
       {
         assert(ppf.ω <= MAX_ω);
         ppf.p[ppf.ω] = p;
         ppf.e[ppf.ω] = 0;
         while (m % p == 0)
           { m /= p; ppf.e[ppf.ω] += 1; }
         ppf.d *= (1 + ppf.e[ppf.ω]);
         ppf.Ω += ppf.e[ppf.ω];
         ppf.ω += 1;
       }
     }
     if (m > 1)
     {
       assert(ppf.ω <= MAX_ω);
       ppf.p[ppf.ω] = m;
       ppf.e[ppf.ω] = 1;
       ppf.d *= 2;
       ppf.Ω += 1;
       ppf.ω += 1;
     }
   }

   return ppf;
 }

 //------------------------------------------------------------------------------
 // Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
 // The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
 // Primes must not exceed 2^64 - 1.  Exponents must not exceed 2^8 - 1.  The
 // constructed value must not exceed 2^64 - 1.

 PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
 {
   assert(pairs <= MAX_ω);

   PrimePowerFactorization ppf;

   if (pairs == 0)
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
   }
   else
   {
     ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };

     for (uint i = 0; i < pairs; i++)
     {
       ppf.p[i] = (uint64)strtoumax(values[(i*2)+0], NULL, 10);
       ppf.e[i] =  (uint8)strtoumax(values[(i*2)+1], NULL, 10);

       // Validate prime value.
       if (ppf.p[i] < 2)  // (Ideally this would actually do a primality test.)
         fatal_error("Factor %"PRIu64" is invalid.", ppf.p[i]);

       // Accumulate count of unique prime factors.
       if (ppf.ω > UINT8_MAX - 1)
         fatal_error("Small-omega overflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.ω += 1;

       // Accumulate count of total prime factors.
       if (ppf.Ω > UINT8_MAX - ppf.e[i])
         fatal_error("Big-omega wverflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.Ω += ppf.e[i];

       // Accumulate total divisor count.
       if (ppf.d > UINT32_MAX / (1 + ppf.e[i]))
         fatal_error("Divisor count overflow at factor %"PRIu64"^%"PRIu8".",
                     ppf.p[i], ppf.e[i]);
       ppf.d *= (1 + ppf.e[i]);

       // Accumulate value.
       for (uint8 k = 1; k <= ppf.e[i]; k++)
       {
         if (ppf.n > UINT64_MAX / ppf.p[i])
           fatal_error("Value overflow at factor %"PRIu64".", ppf.p[i]);
         ppf.n *= ppf.p[i];
       }
     }
   }

   return ppf;
 }

 //------------------------------------------------------------------------------
 // Main control.  Parse command line and produce list of factors.

 int main(const int argc, const char *const argv[])
 {
   PrimePowerFactorization ppf;

   uint values = (uint)argc - 1;  // argc is always guaranteed to be at least 1.

   if (values == 1)
   {
     ppf = compute_ppf((uint64)strtoumax(argv[1], NULL, 10));
   }
   else
   {
     if (values % 2 != 0)
       fatal_error("Odd number of arguments (%u) given.", values);
     uint pairs = values / 2;
     ppf = parse_ppf(pairs, &argv[1]);
   }

   // Run for (as close as possible to) a fixed amount of time, tallying the
   // elapsed CPU time.
   uint64 iterations = 0;
   double cpu_time = 0.0;
   const double cpu_time_limit = 0.05;
   while (cpu_time < cpu_time_limit)
   {
     clock_t clock_start = clock();
     uint64 *factors = compute_factors(ppf);
     clock_t clock_end = clock();
     cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;

     if (++iterations == 1)
     {
       for (uint32 i = 0; i < ppf.d; i++)
         printf("%"PRIu64"\n", factors[i]);
     }

     if (factors) free(factors);
   }

   // Print the average amount of CPU time required for each iteration.
   uint mem_scale = (memory_usage >= 1e9)? 9:
                    (memory_usage >= 1e6)? 6:
                    (memory_usage >= 1e3)? 3:
                                           0;
   char *mem_units = (mem_scale == 9)? "GB":
                     (mem_scale == 6)? "MB":
                     (mem_scale == 3)? "KB":
                                        "B";

   printf("%"PRIu64"  %"PRIu32" factors  %.6f ms  %.3f ns/factor  %.3f %s\n",
          ppf.n,
          ppf.d,
          cpu_time/iterations * 1e3,
          cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
          (double)memory_usage / pow(10, mem_scale),
          mem_units);

   return 0;
 }
Bolte answered 8/5, 2015 at 5:46 Comment(4)
your first plot seems to indicate the ~ n^1.1 orders of growth, for both of your methods, on the right end (large numbers of factors).Nadaha
@WillNess — ya, there is a slight upturn away from linear in the log–log plot. It's very difficult for me to model the runtime performance theoretically, due to both d(n) and especially ω(n) being hard to predict. The upturn may be due to L2 cache misses, but I suspect it may also be that the runtime complexity is something more like O(log(ω(n)) d(n) log(d(n))).Bolte
the linear correlation in the log-log plot, signifies the log t = k * log n equation; that means t = n^k. Your plot looks linear to me, in its right half (for the biggest values of n). If I measure the angle, I get y = 1.1 x, i.e. ~ n^1.1. (n log n can express itself as n^1.1, depending on range etc; O(n) is seen as n^1.0). For my list-based code (translation: lacking in any fine tuning) I measured it directly, as logBase (n2/n1) (t2/t1), getting n^1.13; a little worse than yours (the constant factors seem much worse). But, it is on-line, i.e. it starts producing right away.Nadaha
see en.wikipedia.org/wiki/… . ... (yes, perhaps there is a very slight up-turn, meaning, empirical growth turns from n^1.1 to something like n^1.13 ... ;) I'm guessing).Nadaha

© 2022 - 2024 — McMap. All rights reserved.