To answer this question we need to interpret in some way the assumption from the OP: "need not worry about overflow". In the larger part of this answer it is interpreted as "ignore overflows", but I start with several thoughts about the other interpretation ("use multiprecision arithmetic"). In this case the process of multiplication may be approximately split into 3 stages:
- Multiply together small sets of small numbers to get a large set of not-so-small numbers. Some of the ideas from the second part of this answer may be used here.
- Multiply these numbers together to get a set of large numbers. Either the trivial (quadratic-time) algorithm or Toom–Cook/Karatsuba (sub-quadratic-time) methods may be used (a textbook sketch of Karatsuba follows this list).
- Multiply the large numbers together. Either Fürer's or the Schönhage–Strassen algorithm may be used, which gives O(N polylog N) time complexity for the whole process.
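For illustration, here is a textbook sketch of the Karatsuba step, written on top of Boost.Multiprecision big integers (which, to be clear, already implement fast multiplication on their own; the function name and the 64-bit cutoff are my choices, not part of any referenced implementation):

    #include <algorithm>
    #include <boost/multiprecision/cpp_int.hpp>

    using boost::multiprecision::cpp_int;

    // Textbook Karatsuba recursion on non-negative big integers,
    // only to illustrate the sub-quadratic idea.
    cpp_int karatsuba(const cpp_int& x, const cpp_int& y)
    {
        if (x == 0 || y == 0)
            return 0;
        const unsigned bits = std::max(msb(x), msb(y)) + 1;
        if (bits <= 64) // small enough: use the built-in multiplication
            return x * y;
        const unsigned half = bits / 2;
        // Split both numbers at the same bit position: x = x1*2^half + x0.
        cpp_int x1 = x >> half, x0 = x - (x1 << half);
        cpp_int y1 = y >> half, y0 = y - (y1 << half);
        // Three recursive multiplications instead of four.
        cpp_int z2 = karatsuba(x1, y1);
        cpp_int z0 = karatsuba(x0, y0);
        cpp_int z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0;
        return (z2 << (2 * half)) + (z1 << half) + z0;
    }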
Binary exponentiation may give some (not very significant) speed improvement, because most (if not all) of the complex multiplication algorithms mentioned here square a number faster than they multiply two unequal numbers. We could also factorize every "small" number and use binary exponentiation only for the prime factors. For evenly distributed "small" numbers this decreases the number of exponentiations by a factor of log(number_of_values) and slightly improves the balance of squarings/multiplications.
Divide and conquer is OK when the numbers are evenly distributed. Otherwise (for example, when the input array is sorted or when binary exponentiation is used) we can do better by placing all multiplicands into a priority queue, ordered (perhaps only approximately) by number length. Then we repeatedly multiply the two shortest numbers and place the result back into the queue (this process is very similar to Huffman encoding; see the sketch below). There is no need to use this queue for squaring. Also we should not use it while the numbers are not long enough.
More information on this could be found in the answer by A. Webb.
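As a sketch of this Huffman-like scheme (assuming Boost.Multiprecision for the numbers and a non-empty input of positive values; the function name is mine):

    #include <queue>
    #include <utility>
    #include <vector>
    #include <boost/multiprecision/cpp_int.hpp>

    using boost::multiprecision::cpp_int;

    // Multiply all numbers together, always combining the two shortest
    // ones first, Huffman-style.
    cpp_int product_shortest_first(std::vector<cpp_int> values)
    {
        // Order by bit length so that the shortest number is on top.
        auto longer = [](const cpp_int& a, const cpp_int& b)
        {
            return msb(a) > msb(b);
        };
        std::priority_queue<cpp_int, std::vector<cpp_int>, decltype(longer)>
            queue(longer, std::move(values));
        while (queue.size() > 1)
        {
            cpp_int a = queue.top(); queue.pop();
            cpp_int b = queue.top(); queue.pop();
            queue.push(a * b); // the result goes back into the queue
        }
        return queue.top();
    }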
If overflows may be ignored, we can multiply the numbers with linear-time or better algorithms.
A sub-linear-time algorithm is possible if the input array is sorted or if the input data is presented as a set of tuples {value, number of occurrences}. In the latter case we can perform binary exponentiation of each value and multiply the results together. The time complexity is O(C log(N/C)), where C is the number of different values in the array. (See also this answer.)
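A minimal sketch of the tuple variant (names are mine; overflow simply wraps around, per the "ignore overflows" interpretation):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Binary exponentiation (exponentiation by squaring): base^exp mod 2^32.
    unsigned bexp(unsigned base, uint64_t exp)
    {
        unsigned result = 1;
        while (exp)
        {
            if (exp & 1)
                result *= base;
            base *= base;
            exp >>= 1;
        }
        return result;
    }

    // Product of all values, given {value, number of occurrences} tuples.
    unsigned product_of_tuples(
        const std::vector<std::pair<unsigned, uint64_t>>& tuples)
    {
        unsigned result = 1;
        for (const auto& t : tuples)
            result *= bexp(t.first, t.second); // value ^ count
        return result;
    }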
If the input array is sorted, we can use binary search to find the positions where the value changes. This allows us to determine how many times each value occurs in the array. Then we can perform binary exponentiation of each value and multiply the results together, for O(C log N) time complexity. We can do better by using one-sided binary search here, which brings the time complexity down to O(C log(N/C)).
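Here is a sketch of this sorted-array variant with one-sided (galloping) search (names are mine; bexp is the helper from the previous sketch):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    unsigned bexp(unsigned base, uint64_t exp); // as in the previous sketch

    // Product of a sorted array: find the end of each run of equal values
    // with one-sided binary search, then exponentiate.
    unsigned product_of_sorted(const std::vector<unsigned>& a)
    {
        unsigned result = 1;
        size_t pos = 0;
        while (pos < a.size())
        {
            const unsigned value = a[pos];
            // Gallop: double the step until we overshoot the run...
            size_t step = 1;
            while (pos + step < a.size() && a[pos + step] == value)
                step *= 2;
            // ...then binary-search the run's end in the last interval.
            size_t lo = pos + step / 2, hi = std::min(pos + step, a.size());
            while (lo < hi)
            {
                size_t mid = lo + (hi - lo) / 2;
                if (a[mid] == value) lo = mid + 1; else hi = mid;
            }
            result *= bexp(value, lo - pos); // value ^ (run length)
            pos = lo;
        }
        return result;
    }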
But if the input array is not sorted, we have to inspect each element, so O(N) time complexity is the best we can do. Still, we can use parallelism (multithreading, SIMD, word-level parallelism) to obtain some speed improvement. Here I compare several such approaches.
To compare these approaches I've chosen very small (3-bit) values, which are pretty tightly packed (one value per 8-bit integer), and implemented the algorithms in a low-level language (C++11) to get easier access to bit manipulation, specific CPU instructions, and SIMD.
Here are all the algorithms:
- `accumulate` from the standard library.
- Trivial implementation with 4 accumulators.
- Word-level parallelism for multiplication, as described in the answer by templatetypedef. With 64-bit word size this approach allows values up to 10 bits (performing only 3 multiplications instead of every 4), or it may be applied twice (as I did in the tests) for values up to 5 bits (requiring only 5 multiplications out of every 8).
- Table lookup. In the tests, 7 multiplications out of every 8 are substituted by a single table lookup (see the sketch after this list). With values larger than in these tests, the number of substituted multiplications decreases, slowing the algorithm down. Values larger than 11-12 bits make this approach useless.
- Binary exponentiation (see more details below). Values larger than 4 bits make this approach useless.
- SIMD (AVX2). This implementation can use up to 8-bit values.
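The exact table-lookup kernel is in the linked sources; here is one plausible reading of the scheme, assuming the same data layout as in the tests (3-bit values, one per byte): BMI2 `pext` gathers the low 3 bits of each of 8 bytes, seven of the gathered values form a 21-bit index into a precomputed table of products, and the eighth value is multiplied in normally. The 7+1 split, the table size, and all names are my assumptions. Compile with -mbmi2.

    #include <cstddef>
    #include <cstdint>
    #include <vector>
    #include <immintrin.h>

    // Precompute the product of every combination of seven 3-bit values
    // (2^21 entries, 8 MB; the product of seven such values fits in 32 bits).
    static std::vector<uint32_t> make_table()
    {
        std::vector<uint32_t> table(1u << 21);
        for (uint32_t key = 0; key < table.size(); ++key)
        {
            uint32_t product = 1;
            for (int i = 0; i < 7; ++i)
                product *= (key >> (3 * i)) & 7;
            table[key] = product;
        }
        return table;
    }

    unsigned product_by_table(const uint64_t* qwords, size_t count)
    {
        static const std::vector<uint32_t> table = make_table();
        unsigned acc = 1;
        for (size_t i = 0; i < count; ++i)
        {
            // Gather the low 3 bits of each byte: 8 values, 24 bits total.
            uint64_t packed = _pext_u64(qwords[i], 0x0707070707070707);
            acc *= table[packed & 0x1fffff]; // product of seven values
            acc *= unsigned(packed >> 21);   // the eighth value
        }
        return acc;
    }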
Here are the sources for all tests on Ideone. Note that the SIMD test requires the AVX2 instruction set from Intel, and the table lookup test requires the BMI2 instruction set; other tests do not depend on any particular hardware (I hope). I ran these tests on 64-bit Linux, compiled with gcc 4.8.1, optimization level `-O2`.
Here are some more details for binary exponentiation test:
// Excerpt from the test; assumes `qwords` is the input viewed as
// uint64_t* (8 values per qword, low 3 bits of each byte), `size` is
// the number of values, acc1 = 1 and acc3 = acc5 = acc7 = 0 initially.
for (size_t i = 0; i < size / 8; i += 2)
{
    // Pack two 3-bit values per byte: bits 4..6 and bits 0..2.
    auto compr = (qwords[i] << 4) | qwords[i + 1];
    constexpr uint64_t lsb = 0x1111111111111111;
    if ((compr & lsb) != lsb) // if there is at least one even value
    {
        // Rare case: fall back to the library algorithm.
        auto b = reinterpret_cast<uint8_t*>(qwords + i);
        acc1 = accumulate(b, b + 16, acc1, multiplies<unsigned>{});
        if (!acc1) // enough factors of two accumulated: product is zero
            break;
    }
    else // all 16 values are odd: 1, 3, 5, or 7
    {
        const auto b2 = compr & 0x2222222222222222; // bit 1 set: 3 or 7
        const auto b4 = compr & 0x4444444444444444; // bit 2 set: 5 or 7
        const auto b24 = b4 & (b2 * 2);             // both bits set: 7
        const unsigned c7 = __builtin_popcountll(b24);
        acc3 += __builtin_popcountll(b2) - c7; // number of threes
        acc5 += __builtin_popcountll(b4) - c7; // number of fives
        acc7 += c7;                            // number of sevens
    }
}
const auto prod4 = acc1 * bexp<3>(acc3) * bexp<5>(acc5) * bexp<7>(acc7);
This code packs the values even more densely than in the input array: two values per byte. The low-order bit of each value is handled differently: since the product is computed modulo 2^32, the accumulator is guaranteed to become zero (with result "zero") after at most 32 even values, each contributing a factor of two. Because we may stop that early, this case cannot alter performance very much, so it is handled by the simplest (library) algorithm.
Of the four remaining values, "1" is not interesting, so we only need to count occurrences of "3", "5", and "7", using bitwise manipulations and the "population count" intrinsic.
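The `bexp` helper is not shown in the excerpt; here is a plausible sketch of what it might look like (my reconstruction, assuming a compile-time base and a 32-bit wrap-around product):

    #include <cstdint>

    // Plausible reconstruction of bexp<Base>: Base^exponent by binary
    // exponentiation, with overflow ignored (the result is mod 2^32).
    template <unsigned Base>
    unsigned bexp(uint64_t exponent)
    {
        unsigned result = 1;
        unsigned base = Base;
        while (exponent)
        {
            if (exponent & 1)
                result *= base; // multiply in the current power of Base
            base *= base;       // square on every step
            exponent >>= 1;
        }
        return result;
    }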
Here are the results:
source size: 4 Mb: 400 Mb:
1. accumulate: 0.835392 ns 0.849199 ns
2. accum * 4: 0.290373 ns 0.286915 ns
3. 2 mul in 1: 0.178556 ns 0.182606 ns
4. mult table: 0.130707 ns 0.176102 ns
5. binary exp: 0.128484 ns 0.119241 ns
6. AVX2: 0.0607049 ns 0.0683234 ns
Here we can see that the `accumulate` library algorithm is pretty slow: for some reason gcc could not unroll the loop and use 4 independent accumulators.
It is not too difficult to do this optimization "by hand", as sketched below. The result is not particularly fast, but if we allocate 4 threads to this task, the CPU would approximately match memory bandwidth (2 channels, DDR3-1600).
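The hand-unrolled loop looks roughly like this (a sketch with my names; remainder handling for sizes not divisible by 4 is omitted):

    #include <cstddef>
    #include <cstdint>

    // Four independent accumulators let the CPU overlap the multiplications.
    unsigned product4(const uint8_t* data, size_t size) // size % 4 == 0
    {
        unsigned a0 = 1, a1 = 1, a2 = 1, a3 = 1;
        for (size_t i = 0; i < size; i += 4)
        {
            a0 *= data[i];
            a1 *= data[i + 1];
            a2 *= data[i + 2];
            a3 *= data[i + 3];
        }
        return a0 * a1 * a2 * a3;
    }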
Word-level parallelism for multiplications is almost twice as fast. (We'll need only 3 threads to match memory bandwidth).
Table lookup is even faster. But its performance degrades when input array cannot fit in L3 cache. (We'll need 3 threads to match memory bandwidth).
Binary exponentiation has approximately the same speed. But with larger inputs its performance does not degrade; it even improves slightly, because the exponentiation itself takes less time relative to value counting. (We'll need only 2 threads to match memory bandwidth.)
As expected, SIMD is the fastest. Its performance degrades slightly when the input array cannot fit in L3 cache, which means we are close to memory bandwidth even with a single thread. A minimal sketch of such a kernel follows.
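For reference, here is a minimal sketch of what such an AVX2 kernel can look like (not the exact test code from the sources above; it keeps eight 32-bit accumulators in one register, assumes the size is divisible by 8, and needs -mavx2):

    #include <cstddef>
    #include <cstdint>
    #include <immintrin.h>

    // Since we only care about the product modulo 2^32,
    // per-lane wraparound is harmless.
    unsigned product_avx2(const uint8_t* data, size_t size) // size % 8 == 0
    {
        __m256i acc = _mm256_set1_epi32(1);
        for (size_t i = 0; i < size; i += 8)
        {
            // Zero-extend eight 8-bit values to eight 32-bit lanes.
            __m128i bytes = _mm_loadl_epi64(
                reinterpret_cast<const __m128i*>(data + i));
            __m256i values = _mm256_cvtepu8_epi32(bytes);
            acc = _mm256_mullo_epi32(acc, values);
        }
        // Multiply the eight lanes together (scalar tail for clarity).
        alignas(32) uint32_t lanes[8];
        _mm256_store_si256(reinterpret_cast<__m256i*>(lanes), acc);
        unsigned result = 1;
        for (uint32_t lane : lanes)
            result *= lane;
        return result;
    }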