To answer this question we need to interpret in some way the assumption from the OP: "need not worry about overflow". In the larger part of this answer it is interpreted as "ignore overflows", but I start with several thoughts about the other interpretation ("use multiprecision arithmetic"). In this case the process of multiplication may be approximately split into 3 stages:
- Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from the second part of this answer may be used here.
- Multiply these numbers together to get a set of large numbers. Either the trivial (quadratic-time) algorithm or Toom–Cook/Karatsuba (sub-quadratic-time) methods may be used (a textbook sketch of Karatsuba follows this list).
- Multiply the large numbers together. Either Fürer's or the Schönhage–Strassen algorithm may be used, which gives O(N polylog N) time complexity for the whole process.
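For illustration, here is a textbook sketch of the Karatsuba step, written on top of Boost.Multiprecision big integers (which, to be clear, already implement fast multiplication on their own; the function name and the 64-bit cutoff are my choices, not part of any referenced implementation):

    #include <algorithm>
    #include <boost/multiprecision/cpp_int.hpp>

    using boost::multiprecision::cpp_int;

    // Textbook Karatsuba recursion on non-negative big integers,
    // only to illustrate the sub-quadratic idea.
    cpp_int karatsuba(const cpp_int& x, const cpp_int& y)
    {
        if (x == 0 || y == 0)
            return 0;
        const unsigned bits = std::max(msb(x), msb(y)) + 1;
        if (bits <= 64) // small enough: use the built-in multiplication
            return x * y;
        const unsigned half = bits / 2;
        // Split both numbers at the same bit position: x = x1*2^half + x0.
        cpp_int x1 = x >> half, x0 = x - (x1 << half);
        cpp_int y1 = y >> half, y0 = y - (y1 << half);
        // Three recursive multiplications instead of four.
        cpp_int z2 = karatsuba(x1, y1);
        cpp_int z0 = karatsuba(x0, y0);
        cpp_int z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0;
        return (z2 << (2 * half)) + (z1 << half) + z0;
    }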
Binary exponentiation may give some (not very significant) speed improvement, because most (if not all) of the complex multiplication algorithms mentioned here square a number faster than they multiply two unequal numbers. We could also factorize every "small" number and use binary exponentiation only for the prime factors. For evenly distributed "small" numbers this decreases the number of exponentiations by a factor of log(number_of_values) and slightly improves the balance of squarings/multiplications.
Divide and conquer is OK when the numbers are evenly distributed. Otherwise (for example, when the input array is sorted or when binary exponentiation is used) we can do better by placing all multiplicands into a priority queue, ordered (perhaps only approximately) by number length. Then we repeatedly multiply the two shortest numbers and place the result back into the queue (this process is very similar to Huffman encoding; see the sketch below). There is no need to use this queue for squaring. Also we should not use it while the numbers are not long enough.
More information on this could be found in the answer by A. Webb.
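As a sketch of this Huffman-like scheme (assuming Boost.Multiprecision for the numbers and a non-empty input of positive values; the function name is mine):

    #include <queue>
    #include <utility>
    #include <vector>
    #include <boost/multiprecision/cpp_int.hpp>

    using boost::multiprecision::cpp_int;

    // Multiply all numbers together, always combining the two shortest
    // ones first, Huffman-style.
    cpp_int product_shortest_first(std::vector<cpp_int> values)
    {
        // Order by bit length so that the shortest number is on top.
        auto longer = [](const cpp_int& a, const cpp_int& b)
        {
            return msb(a) > msb(b);
        };
        std::priority_queue<cpp_int, std::vector<cpp_int>, decltype(longer)>
            queue(longer, std::move(values));
        while (queue.size() > 1)
        {
            cpp_int a = queue.top(); queue.pop();
            cpp_int b = queue.top(); queue.pop();
            queue.push(a * b); // the result goes back into the queue
        }
        return queue.top();
    }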
If overflows may be ignored, we can multiply the numbers with linear-time or better algorithms.
A sub-linear-time algorithm is possible if the input array is sorted or if the input data is presented as a set of tuples {value, number of occurrences}. In the latter case we can perform binary exponentiation of each value and multiply the results together. The time complexity is O(C log(N/C)), where C is the number of different values in the array. (See also this answer.)
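A minimal sketch of the tuple variant (names are mine; overflow simply wraps around, per the "ignore overflows" interpretation):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Binary exponentiation (exponentiation by squaring): base^exp mod 2^32.
    unsigned bexp(unsigned base, uint64_t exp)
    {
        unsigned result = 1;
        while (exp)
        {
            if (exp & 1)
                result *= base;
            base *= base;
            exp >>= 1;
        }
        return result;
    }

    // Product of all values, given {value, number of occurrences} tuples.
    unsigned product_of_tuples(
        const std::vector<std::pair<unsigned, uint64_t>>& tuples)
    {
        unsigned result = 1;
        for (const auto& t : tuples)
            result *= bexp(t.first, t.second); // value ^ count
        return result;
    }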
If the input array is sorted, we can use binary search to find the positions where the value changes. This allows us to determine how many times each value occurs in the array. Then we can perform binary exponentiation of each value and multiply the results together, for O(C log N) time complexity. We can do better by using one-sided binary search here, which brings the time complexity down to O(C log(N/C)).
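Here is a sketch of this sorted-array variant with one-sided (galloping) search (names are mine; bexp is the helper from the previous sketch):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    unsigned bexp(unsigned base, uint64_t exp); // as in the previous sketch

    // Product of a sorted array: find the end of each run of equal values
    // with one-sided binary search, then exponentiate.
    unsigned product_of_sorted(const std::vector<unsigned>& a)
    {
        unsigned result = 1;
        size_t pos = 0;
        while (pos < a.size())
        {
            const unsigned value = a[pos];
            // Gallop: double the step until we overshoot the run...
            size_t step = 1;
            while (pos + step < a.size() && a[pos + step] == value)
                step *= 2;
            // ...then binary-search the run's end in the last interval.
            size_t lo = pos + step / 2, hi = std::min(pos + step, a.size());
            while (lo < hi)
            {
                size_t mid = lo + (hi - lo) / 2;
                if (a[mid] == value) lo = mid + 1; else hi = mid;
            }
            result *= bexp(value, lo - pos); // value ^ (run length)
            pos = lo;
        }
        return result;
    }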
But if the input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still, we can use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer), and implemented the algorithms in a low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
- `accumulate` from the standard library.
- Trivial implementation with 4 accumulators.
- Word-level parallelism for multiplication, as described in the answer by templatetypedef. With 64-bit word size this approach allows values up to 10 bits (performing only 3 multiplications instead of every 4), or it may be applied twice (as I did in the tests) for values up to 5 bits (requiring only 5 multiplications out of every 8).
- Table lookup. In the tests, 7 multiplications out of every 8 are substituted by a single table lookup (see the sketch after this list). With values larger than in these tests, the number of substituted multiplications decreases, slowing the algorithm down. Values larger than 11-12 bits make this approach useless.
- Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
- SIMD (AVX2). This implementation can use up to 8-bit values.
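The exact table-lookup kernel is in the linked sources; here is one plausible reading of the scheme, assuming the same data layout as in the tests (3-bit values, one per byte): BMI2 `pext` gathers the low 3 bits of each of 8 bytes, seven of the gathered values form a 21-bit index into a precomputed table of products, and the eighth value is multiplied in normally. The 7+1 split, the table size, and all names are my assumptions. Compile with -mbmi2.

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <immintrin.h>

    // Precompute the product of every combination of seven 3-bit values
    // (2^21 entries, 8 MB; the product of seven such values fits in 32 bits).
    static std::vector<uint32_t> make_table()
    {
        std::vector<uint32_t> table(1u << 21);
        for (uint32_t key = 0; key < table.size(); ++key)
        {
            uint32_t product = 1;
            for (int i = 0; i < 7; ++i)
                product *= (key >> (3 * i)) & 7;
            table[key] = product;
        }
        return table;
    }

    unsigned product_by_table(const uint64_t* qwords, size_t count)
    {
        static const std::vector<uint32_t> table = make_table();
        unsigned acc = 1;
        for (size_t i = 0; i < count; ++i)
        {
            // Gather the low 3 bits of each byte: 8 values, 24 bits total.
            uint64_t packed = _pext_u64(qwords[i], 0x0707070707070707);
            acc *= table[packed & 0x1fffff]; // product of seven values
            acc *= unsigned(packed >> 21);   // the eighth value
        }
        return acc;
    }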
Here are the sources for all tests on Ideone. Note that the SIMD test requires the AVX2 instruction set from Intel, and the table lookup test requires the BMI2 instruction set; other tests do not depend on any particular hardware (I hope). I ran these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level `-O2`.
Here are some more details for binary exponentiation test:
// Excerpt from the test; assumes `qwords` is the input viewed as
// uint64_t* (8 values per qword, low 3 bits of each byte), `size` is
// the number of values, acc1 = 1 and acc3 = acc5 = acc7 = 0 initially.
for (size_t i = 0; i < size / 8; i += 2)
{
    // Pack two 3-bit values per byte: bits 4..6 and bits 0..2.
    auto compr = (qwords[i] << 4) | qwords[i + 1];
    constexpr uint64_t lsb = 0x1111111111111111;
    if ((compr & lsb) != lsb) // if there is at least one even value
    {
        // Rare case: fall back to the library algorithm.
        auto b = reinterpret_cast<uint8_t*>(qwords + i);
        acc1 = accumulate(b, b + 16, acc1, multiplies<unsigned>{});
        if (!acc1) // enough factors of two accumulated: product is zero
            break;
    }
    else // all 16 values are odd: 1, 3, 5, or 7
    {
        const auto b2 = compr & 0x2222222222222222; // bit 1 set: 3 or 7
        const auto b4 = compr & 0x4444444444444444; // bit 2 set: 5 or 7
        const auto b24 = b4 & (b2 * 2);             // both bits set: 7
        const unsigned c7 = __builtin_popcountll(b24);
        acc3 += __builtin_popcountll(b2) - c7; // number of threes
        acc5 += __builtin_popcountll(b4) - c7; // number of fives
        acc7 += c7;                            // number of sevens
    }
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs the values even more densely than in the input array: two values per byte. The low-order bit of each value is handled differently: since the product is computed modulo 2^32, the accumulator is guaranteed to become zero (with result "zero") after at most 32 even values, each contributing a factor of two. Because we may stop that early, this case cannot alter performance very much, so it is handled by the simplest (library) algorithm.
Of the four remaining values, "1" is not interesting, so we only need to count occurrences of "3", "5", and "7", using bitwise manipulations and the "population count" intrinsic.
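The `bexp` helper is not shown in the excerpt; here is a plausible sketch of what it might look like (my reconstruction, assuming a compile-time base and a 32-bit wrap-around product):

    #include <cstdint>

    // Plausible reconstruction of bexp<Base>: Base^exponent by binary
    // exponentiation, with overflow ignored (the result is mod 2^32).
    template <unsigned Base>
    unsigned bexp(uint64_t exponent)
    {
        unsigned result = 1;
        unsigned base = Base;
        while (exponent)
        {
            if (exponent & 1)
                result *= base; // multiply in the current power of Base
            base *= base;       // square on every step
            exponent >>= 1;
        }
        return result;
    }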
Here are the results:
source size: 4 Mb: 400 Mb:
1. accumulate: 0.835392 ns 0.849199 ns
2. accum * 4: 0.290373 ns 0.286915 ns
3. 2 mul in 1: 0.178556 ns 0.182606 ns
4. mult table: 0.130707 ns 0.176102 ns
5. binary exp: 0.128484 ns 0.119241 ns
6. AVX2: 0.0607049 ns 0.0683234 ns
Here we can see that the `accumulate` library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand", as sketched below. The result is not particularly fast, but if we allocate 4 threads to this task, the CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
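The hand-unrolled loop looks roughly like this (a sketch with my names; remainder handling for sizes not divisible by 4 is omitted):

    #include <cstddef>
    #include <cstdint>

    // Four independent accumulators let the CPU overlap the multiplications.
    unsigned product4(const uint8_t* data, size_t size) // size % 4 == 0
    {
        unsigned a0 = 1, a1 = 1, a2 = 1, a3 = 1;
        for (size_t i = 0; i < size; i += 4)
        {
            a0 *= data[i];
            a1 *= data[i + 1];
            a2 *= data[i + 2];
            a3 *= data[i + 3];
        }
        return a0 * a1 * a2 * a3;
    }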
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs its performance does not degrade; it even improves slightly, because the exponentiation itself takes less time relative to value counting. (We'll need only 2 threads to match memory bandwidth.)
As expected, SIMD is the fastest. Its performance degrades slightly when the input array cannot fit in L3 cache, which means we are close to memory bandwidth even with a single thread. A minimal sketch of such a kernel follows.
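For reference, here is a minimal sketch of what such an AVX2 kernel can look like (not the exact test code from the sources above; it keeps eight 32-bit accumulators in one register, assumes the size is divisible by 8, and needs -mavx2):

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    // Since we only care about the product modulo 2^32,
    // per-lane wraparound is harmless.
    unsigned product_avx2(const uint8_t* data, size_t size) // size % 8 == 0
    {
        __m256i acc = _mm256_set1_epi32(1);
        for (size_t i = 0; i < size; i += 8)
        {
            // Zero-extend eight 8-bit values to eight 32-bit lanes.
            __m128i bytes = _mm_loadl_epi64(
                reinterpret_cast<const __m128i*>(data + i));
            __m256i values = _mm256_cvtepu8_epi32(bytes);
            acc = _mm256_mullo_epi32(acc, values);
        }
        // Multiply the eight lanes together (scalar tail for clarity).
        alignas(32) uint32_t lanes[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
        unsigned result = 1;
        for (uint32_t lane : lanes)
            result *= lane;
        return result;
    }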