Is integer multiplication really done at the same speed as addition on a modern CPU?
I often hear the claim that multiplication on modern hardware is so optimized that it is actually the same speed as addition. Is that true?

I can never get any authoritative confirmation. My own research only raises more questions. The speed tests usually show data that confuses me. Here is an example:

#include <stdio.h>
#include <sys/time.h>

// Wall-clock time in milliseconds; tv_sec is masked so the
// multiply by 1000 stays within unsigned int range.
unsigned int time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff;
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

int main() {
    // Loop 1: additions only.
    unsigned int sum = 1, T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i + (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
    // Loop 2: one multiplication per iteration.
    sum = 1;
    T = time1000();
    for (int i = 1; i < 100000000; i++) {
        sum += i * (i+1); sum++;
    }
    printf("%u %u\n", time1000() - T, sum);
}

The code above can show that multiplication is faster:

clang++ benchmark.cpp -o benchmark
./benchmark
746 1974919423
708 3830355456

But with other compilers, other compiler arguments, or differently written inner loops, the results vary so much that I cannot even get an approximate answer.

Ushas answered 17/2, 2014 at 2:23 Comment(17)
It might also be worth pointing out that your loop counters and summing logic also consume adder resources.Leak
@Mysticial: Not that this is really related to this question, but do you happen to know if there's a fast way to divide by 11 (faster than multiplying by its inverse)? I needed this a few days ago...Inflexible
@Mehrdad Are you looking for something faster than the integer multiply by invariant trick that the compiler should already be able to do?Leak
But remember that even though multiplication is slower, on modern hardware both addition and multiplication are so fast that you very rarely need to worry about it - look how hard it was for you to demonstrate it.Immeasurable
@Mysticial: Yes, I'm looking for something faster than multiplications -- i.e. bit-shifts and the like.Inflexible
It looks like your CPU can do an addition and a multiplication at the same time, but it cannot do two additions at the same time. Your first loop uses the adder twice in each iteration; your second loop uses an adder once, and a multiplier once. The first loop does two additions sequentially; your second loop does them in parallel, so what you probably see is that a multiplication is faster than two additions.Cajeput
@TimBergel: That's false. I observed the difference quite readily and easily just a few days ago, and was in fact quite surprised by it (I saved 1.8 seconds on a 9-second task, see my answer).Inflexible
@Mehrdad: yes, I suppose so, and worth worrying about if you are writing 3D rendering code or the like, but I do feel it's not too important in more run-of-the-mill programming, particularly when you consider all the other ways there are to slow down one's code...Immeasurable
@TimBergel: My situation was nothing even close to 3D rendering. It was pretty typical -- all I was doing was implementing and using a heap (basically I was reimplementing my own version of std::push_heap, etc.) so I could process items in a prioritized order. I ran out of ideas to make it faster; it took me a lot of guess-and-check debugging/profiling to figure out the cause, and I was left scratching my head when I saw an imul instruction that I didn't expect. I know it's surprising, but my task was pretty darn typical, and this pointer subtraction in the heap really was the bottleneck.Inflexible
@Mysticial: In case you're wondering, the question about division by 11 was related to something similar: I was also trying to implement a 2-3 heap (the arity alternates as 2+3+3+3), and hence I needed to divide by 11. It ended up being too slow because of the multiplications so I gave up on a 2-3 heap, but that's why I asked if you know of a way to divide by 11 quickly.Inflexible
@Mehrdad I did the math and I don't think it's possible to do better. 11 can't be represented by fewer than the sum/difference of 3 powers of two. So it will require a minimum of 2 shifts. The fastest shift on Haswell is 0.5 r-throughput. Multiplication is 1.0 r-throughput.Leak
@Mysticial: One left shift by 3 (multiply by 8), one lea (multiply by 3), and one add, right? But that's for multiplication, not division.Precession
@Mysticial: Ohh I see, good point. :( Thanks!Inflexible
Wait actually I think two LEA instructions can do it. First is times 3, second is times 8 plus first. See https://mcmap.net/q/323532/-assembly-language-using-shl-to-multiply-by-an-odd-number for more examples.Precession
@BenVoigt: That's just small multiplications though, right? Doesn't help with dividing by 11 (though interesting nevertheless).Inflexible
@Mehrdad: Yes, it's for small multiplications. When Mysticial said 3 powers of two, was that for the 11 or its reciprocal?Precession
Related: electronics.stackexchange.com/q/452181/192433Carom

Multiplication of two n-bit numbers can in fact be done in O(log n) circuit depth, just like addition.

Addition in O(log n) is done by splitting the number in half and (recursively) adding the two parts in parallel, where the upper half is solved for both the "0-carry" and "1-carry" case. Once the lower half is added, the carry is examined, and its value is used to choose between the 0-carry and 1-carry case.

Multiplication in O(log n) depth is also done through parallelization, where every sum of 3 numbers is reduced to a sum of just 2 numbers in parallel, and the sums are done in some manner like the above.
I won't explain it here, but you can find reading material on fast addition and multiplication by looking up "carry-lookahead" and "carry-save" addition.

So from a theoretical standpoint, since circuits are obviously inherently parallel (unlike software), the only reason multiplication would be slower than addition is the constant factor in front, not the asymptotic complexity.
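
For the curious, here is a minimal software model of the carry-select idea from the addition paragraph above (the function name and structure are my own, purely for illustration). The upper half is computed for both possible carry-ins, and the real carry out of the lower half selects between them; in hardware the three recursive sub-additions run in parallel, which is what gives the O(log n) depth:

#include <cstdint>
#include <cstdio>
#include <utility>

// Conditional-sum (carry-select) addition of two w-bit values.
// Returns {sum, carry_out} of a + b + cin.
std::pair<uint64_t, int> condSum(uint64_t a, uint64_t b, int cin, int w) {
    if (w == 1) {
        uint64_t t = (a & 1) + (b & 1) + cin;
        return { t & 1, int(t >> 1) };
    }
    int h = w / 2;
    uint64_t mask = (uint64_t(1) << h) - 1;
    auto lo  = condSum(a & mask, b & mask, cin, h);  // lower half
    auto hi0 = condSum(a >> h, b >> h, 0, w - h);    // upper half, 0-carry case
    auto hi1 = condSum(a >> h, b >> h, 1, w - h);    // upper half, 1-carry case
    auto hi  = lo.second ? hi1 : hi0;                // the real carry selects
    return { (hi.first << h) | lo.first, hi.second };
}

int main() {
    auto r = condSum(123456789, 987654321, 0, 32);
    printf("%llu\n", (unsigned long long)r.first);   // 1111111110
}

The recursion halves the width at each level, so a 32-bit add needs only log2(32) = 5 levels, mirroring the circuit-depth argument.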

Inflexible answered 22/7, 2014 at 11:28 Comment(7)
I hear a little bit of what you are saying, however look here, and you will see that multiplication is STILL reliant on multiple additions!!! The numbers on slide 53 are direct from AMD. You may be correct, but it is also a die size and cost of implementation versus statistical usage (versus addition alone) problem. cis.upenn.edu/~milom/cis371-Spring08/lectures/04_integer.pdfBalbo
@trumpetlicks: Confused, did I say multiplication isn't dependent on additions? I thought I myself explicitly said multiplication is dependent on additions.Inflexible
@trumpetlicks: Wha....? What are you talking about? I didn't even use the words "cycle" or "clock" or "pipeline" etc. in my answer. Nor did I ever claim multiplication is as fast as addition (again, I said the opposite). I don't know what answer you're reading that is correct or incorrect on these points, but it definitely isn't mine...Inflexible
Multiplication in O(log(N) log(log(N))) depth can be accomplished with NTT-based approaches like the one used in the Coppersmith-Winograd algorithm, which are absolutely impractical for anything less than, say, O(1000) bits.Manuscript
I think actual chip implementations use Karatsuba-like schemes which can reduce the surface*cycles from O(N^2) to O(N^1.5), always with a much higher constant prefactor. O() complexity analysis is useful, but can often be misleading.Manuscript
@paperclipoptimizer: You think this based on... what source? This answer says they use Dadda multipliers, which are similar to but slightly faster than Wallace multipliers, which have O(log n) depth.Inflexible
@user541686: Perhaps I interpreted the question slightly differently. From the theoretical standpoint a multiplication is just a convolution with some extra carry steps. This can indeed be carried out in O(log N) depth with a similar prefactor to carry-lookahead addition, but only at O(N^2) transistor count. This is afaik too expensive/impractical for, say, 64-bit (and even lower) integer multiplication in most systems, which is why many chips compromise by using Karatsuba and other fast short-convolution techniques at the top level, and shorter-length integer multipliers, which might use Dadda.Manuscript

Integer multiplication will be slower.

Agner Fog's instruction tables show that when using 32-bit integer registers, Haswell's ADD/SUB take 0.25–1 cycles (depending on how well pipelined your instructions are) while MUL takes 2–4 cycles. Floating-point is the other way around: ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles.
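
One rough way to see the latency/throughput distinction from software (a sketch, not a rigorous benchmark) is to compare a dependent multiply chain, which is latency-bound, against several independent chains, which are throughput-bound. Compile with optimization (e.g. -O2) and check the asm, since a compiler may still transform loops like these:

#include <cstdint>
#include <cstdio>
#include <chrono>

int main() {
    constexpr int N = 100000000;
    using clk = std::chrono::steady_clock;
    volatile uint32_t seed = 3;   // opaque to the optimizer

    // Latency-bound: every multiply waits on the previous result,
    // so this runs at roughly one multiply per IMUL latency.
    uint32_t x = seed;
    auto t0 = clk::now();
    for (int i = 0; i < N; i++) x *= 5;
    auto t1 = clk::now();

    // Throughput-bound: four independent chains can overlap in the
    // pipeline, approaching IMUL's one-per-cycle issue rate.
    uint32_t a = seed, b = seed, c = seed, d = seed;
    auto t2 = clk::now();
    for (int i = 0; i < N / 4; i++) { a *= 5; b *= 5; c *= 5; d *= 5; }
    auto t3 = clk::now();

    auto ms = [](clk::duration dur) {
        return std::chrono::duration<double, std::milli>(dur).count();
    };
    printf("dependent:   %.0f ms (%u)\n", ms(t1 - t0), x);
    printf("independent: %.0f ms (%u)\n", ms(t3 - t2), a ^ b ^ c ^ d);
}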

Kinematograph answered 17/2, 2014 at 2:29 Comment(9)
"Who told you that?" -- To be fair, perhaps someone who mistook it for FP add/mul. Perhaps an understandable mistake.Abduction
Also, "while MUL takes 2–4 cycles" -- While this is true, that's only the latency. MUL most likely has a throughput of (at least) one instruction per cycle. At least this is the case on Sandy Bridge, and I doubt Haswell does worse. ADDSS/MULSS also have throughputs of one instruction per cycle on Sandy Bridge (but latencies of 3 and 5, resp.).Abduction
@Abduction that's what Agner lists as throughput of 32-bit MUL/IMUL using only registers. Interestingly, he lists the 64-bit variant as 1 cycle, not 2 -- I wonder if it's a typo.Kinematograph
@Cory Nelson, I upvoted you, but that link to Agner Fog's page doesn't say a single thing about # of clock cycles does it?Carom
@user1271772 the "Latency" column defines the minimum latency, in clock cycles, of each instruction. Note that pipelining means that instructions can overlap so this does not indicate throughput.Kinematograph
@CoryNelson, what I mean is, that link that you gave, just contains a bunch of URLs to other pages. It doesn't seem to say anything about latency or clock cycles!Carom
@user1271772: Agner Fog's instruction tables are available in PDF or spreadsheet form, and the microarch PDF is really useful for understanding what the instruction tables mean. I usually link to agner.org/optimize, not agner.org/optimize/instruction_tables.pdf, when citing data from it. (Although uops.info has even more detailed test results these days, especially for multi-uop instructions with different latencies from different inputs to the output.)Denunciation
@Dolda2000: one-operand mul (32x32 => 64-bit) does have a throughput of one per 2 cycles on Haswell, and costs 3 uops. (2 uops for 64x64=>128-bit, I guess not needing an extra uop to split up the low 64 into two 32-bit halves). The problem is that's not the normal way to multiply. Non-widening multiply is done with imul reg,reg/mem or imul reg, reg/mem, immediate, which is always 1 uop, 1c throughput on Intel SnB-family and on Zen, for any operand-size, only writing one output register.Denunciation
Interesting that you found this a year after I wrote my comments here @PeterCordes. I happened to be the same guy that offered the 200 point bounty for the long-integer multiplication Hot Network Question on Matter Modeling Stack Exchange! Now I also see that you commented on my HNQ on Electrical Engineering: electronics.stackexchange.com/q/452181/192433 (which actually arose from this Stack Overflow question).Carom

This is an even more complex question than simply multiplication versus addition. In reality the answer will most likely NEVER be yes. Multiplication, electronically, is a much more complicated circuit. Most of the reason why is that multiplication is the act of a multiplication step followed by an addition step; remember what it was like to multiply decimal numbers by hand, before calculators.

The other thing to remember is that multiplication will take longer or shorter depending on the architecture of the processor you are running it on. This may or may not be simply company-specific. While an AMD will most likely be different from an Intel, even an Intel i7 may be different from a Core 2 (within the same generation), and certainly different between generations (especially the farther back you go).

In all TECHNICALITY, if multiplies were the only thing you were doing (without looping, counting, etc.), multiplies would be 2 to (as I've seen on PPC architectures) 35 times slower. This is more an exercise in understanding your architecture and electronics.

In addition: It should be noted that a processor COULD be built for which ALL operations, including a multiply, take a single clock. What this processor would have to do is get rid of all pipelining and slow the clock so that the HW latency of any op's circuit is less than or equal to the latency PROVIDED by the clock timing.

To do this would get rid of the inherent performance gains we are able to get when adding pipelining to a processor. Pipelining is the idea of taking a task and breaking it down into smaller sub-tasks that can be performed much more quickly. By storing and forwarding the results of each sub-task between sub-tasks, we can now run a faster clock rate that only needs to allow for the longest latency of the sub-tasks, and not for the overarching task as a whole.

Picture of time through a multiply:

 |------------------------------------------------------|     Non-Pipelined


 |--Step 1--|--Step 2--|--Step 3--|--Step 4--|--Step 5--| Pipelined

In the above diagram, the non-pipelined circuit takes 50 units of time. In the pipelined version, we have split the 50 units into 5 steps, each taking 10 units of time, with a store step in between. It is EXTREMELY important to note that in the pipelined example, each of the steps can work completely on its own and in parallel. For an operation to be completed, it must move through all 5 steps in order, but another operation of the same kind, with its own operands, can be in step 2 while others are in steps 1, 3, 4, and 5.

With all of this being said, the pipelined approach allows us to fill the unit continuously, one operation per clock cycle, and to get a result out on each clock cycle, IF we are able to order our operations such that we can perform all of one operation before we switch to another; the only timing hit we take is the number of clocks originally necessary to get the FIRST operation out of the pipeline.
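
A toy calculation of what the diagram implies (illustrative numbers only, matching the 5 steps of 10 units above):

#include <cstdio>

int main() {
    // A 50-unit operation split into 5 pipeline stages of 10 units each.
    const int stageTime = 10, stages = 5, ops = 1000;
    int nonPipelined = ops * stages * stageTime;        // 50 units per op, serial
    int pipelined    = (stages + ops - 1) * stageTime;  // fill once, then one result per clock
    printf("non-pipelined: %d units\n", nonPipelined);  // 50000
    printf("pipelined:     %d units\n", pipelined);     // 10040
}

Once the pipeline is full, the fill cost of the first operation is amortized over all the rest.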

Mystical brings up another good point. It is also important to look at the architecture from a more systems-level perspective. It is true that the newer Haswell architecture was built to improve the floating-point multiply performance within the processor. For this reason, at the system level, it was architected to allow multiple multiplies to occur simultaneously, versus an add, which can only happen once per system clock.

All of this can be summed up as follows:

  1. Each architecture is different from a lower level HW perspective as well as from a system perspective
  2. FUNCTIONALLY, a multiply will always take more time than an add because it combines a true multiply along with a true addition step.
  3. Understand the architecture you are trying to run your code on, and find the right balance between readability and getting truly the best performance from that architecture.
Balbo answered 17/2, 2014 at 2:34 Comment(19)
It's worth noting that on Haswell, the FP multiply throughput is double that of FP add. That's because both ports 0 and 1 can be used for multiply, but only port 1 can be used for addition. That said, you can cheat with fused multiply-adds since both ports can do them.Leak
@Leak - Good point, also another reason why my answer is more general to talk about architecture of the machine :-) Great Note.Balbo
@Leak - should note though that the architecture supporting 2 FP multipliers versus a single adder does not mean that a single multiply is faster than a single add. Again, this is HW system-level architecture, not a direct performance measure of a single add versus a single multiply.Balbo
It is understood that multiplication is slower than addition, so why do we take unit cost for both in algorithm analysis?Ingamar
"It is true that the newer Haswell architectures was built to better the Floating Point multiply performance within the processor. For this reason as the System level, it was architected to allow multiple multiplies to occur in simultaneity versus an add which can only happen once per system clock." How many multiplies can a single Haswell thread do simultaneously, and how many adds can it do simultaneously?Carom
@Leak Did you notice that your comment became this question? I also don't know if you've had a chance to think about my question in my comment from August 2019. (No pressure if you are very busy, I was just very curious.)Carom
@ShajeelAfzal They use the same cost because algorithms are generally analyzed with big-O or something similar and both addition and multiplication of two CPU words have a constant upper bound on the amount of time they take: the slowest two inputs to add or to multiply together take a certain amount of time. Two constant time things are both considered O(1). The operations would have different results if you used bigintegers because unlike CPU words which only go so high, bigintegers have no upper limit and so you can always choose a pair of larger numbers that take longer than the last pair.Creditor
@ChaiT.Rex You are not technically correct with your big O argument. IF we look at the electronics of an add, and at a single bit in/out with a carry bit, then how those single bit adders get put together to make a multi-bit wide adder, it follows that addition is actually an O(n) algorithm. n is the number of bits wide of the 2 multi-bit inputs. The reason it LOOKS like O(1) to YOU is that it happens in (essentially) 1 clock cycle (as software sees it).Balbo
@ChaiT.Rex If we were to simply provide a 1-bit adder, it would be able to have a FAR shorter clock cycle. A longer clock is what allows the electrons to propagate through the width of the circuit and "settle". If a clock cycle that is shorter than the propagation-delay "settling time" of the multi-bit adder were to be used, the addition algorithm in hardware would give unstable results. Another way to think about this algorithmically is to analogize it with building a multi-precision library using the available hardware primitives.Balbo
@Balbo The word size of a given CPU is constant, which means that it has a constant upper bound in the amount of time adding two words takes, which means that it's O(1), as just about every function with a constant upper bound is O(1). We're not talking here about how a CPU might have had a larger adder than it does, we're talking about the actual adder in the CPU in front of us. There are no addition circuits in common use that are unbounded in the number of bits they can add together in one operation. As I pointed out, biginteger arithmetic is O(n) because it has no upper bound.Creditor
@ChaiT.Rex you misunderstand BigO notation. It is algorithmic in its determination, and not based on the simple fact that processor manufactures have arbitrarily decided to hone in on 32 or 64 bit. If BigO weren’t applicable, then there wouldn’t be a need to have differing precision operations. Constant op time does not equate to BigO(1), that would be the same as saying that by alway pushing a constant input size to an n^2 algorithm having a constant time being considered an O(1) algorithm!Balbo
@Balbo I'm not saying a single specific input pair is O(1). I'm saying that every single possible pair of words put into the addition instruction on a CPU will have a certain time they take. Out of those times, there will be a maximum time, so it's O(1). That's why computer science courses teach that a for loop doing a single word-sized addition iterating O(n) times is O(n) rather than O(n²), which it would be if you did O(n) iterations of an O(n) algorithm. Big integer arithmetic is also O(n), but bigint addition does O(n) word-sized additions and it's not O(n²) either.Creditor
@ChaiT.Rex I dare you to learn VHDL or Verilog and code an adder yourself, because what you'll find is that the adder for a given bit size can be written with a for loop using bit width as the termination. I argued that your CS class is wrong. Yes, we often look at an algorithm we develop as BigO(whatever), completely forgetting that other routines we USE may also be of a BigO(>1); that does not make those routines (same really as a hardware adder) BigO(1) just because we choose to ignore it. You actually could write the adder in SW starting with a single bit also.Balbo
@ChaiT.Rex in general, we ANALYZE algorithms that use additions as if the add were constant time. It is important to note, however, that constant time is NOT necessarily the same as BigO(1). I interviewed someone once who tried to tell me that using massively parallel GPUs turned an n^3 algorithm into an n^2 one. That was completely incorrect. The algo IS the algo. Again, BigO is an analysis of the algo, not the platform it's running on!!!Balbo
Constant time is O(1), as you can, by definition of big O, have |f(x)| ≤ Mg(x) for all x ≥ x₀ where f(x) is the time function, g(x) is 1, M is the maximum possible time (which is a constant upper bound), and x₀ = 0. When that happens, f(x) ∈ O(g(x)), so in this case the time function is in O(1). See the formal definition of big O. I don't see what someone making an unrelated error has to do with this.Creditor
@ChaiT.Rex I believe you are misinterpreting the definition (from your own words). F(x) is not the time function, it is the "function to be estimated". G(x) is not 1; M, yes, is the constant upper bound time for the addition of a single bit, not 32 of them. G(x) is n, and thus M times the number of bits in the addition equates to M*n, which is the total time it takes to perform f(x) (addition in this case). Apologies you didn't catch the correlation of my story. I really do implore you to write a multi-bit adder for yourself from a single bit. The for loop will be a clue. :-)Balbo
@Balbo We're talking about time complexity and we're giving it a big O class, so yes, f is definitely the time function. There are only two functions there, f and g. g is the one inside the big O. f is the one that is being checked to be in O(g(x)) or not. M is any arbitrary constant I choose that happens to work, and I chose the maximum time for a word-sized addition because it works. All word-sized additions take that amount of time or less, so the inequality holds. I already know how to design and connect together full adders. I don't think this will go anywhere, so I'll stop here.Creditor
@ChaiT.Rex At the end of the day, you and I will have to agree to disagree, but I will leave you with a couple of links, because I really think that you are choosing to manipulate the meaning of BigO to suit your perception. 1. iq.opengenus.org/time-complexity-of-addition Be careful here because she switches between decimal and binary addition without stating it. bing.com/…Balbo
@ChaiT.Rex I should also redirect you back to your original definition (straight from Wikipedia), where below it quite literally lists binary addition using ripple carry as a defining O(n) algorithm. Look in the section called "Orders of common functions".Balbo

Intel since Haswell has

  • add performance of 4/clock throughput, 1 cycle latency. (Any operand-size)
  • imul performance of 1/clock throughput, 3 cycle latency. (Any operand-size)

Ryzen is similar. Bulldozer-family has much lower integer throughput and a not-fully-pipelined multiplier, which is extra slow for 64-bit operand-size multiply. See https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info

But a good compiler could auto-vectorize your loops. (SIMD-integer multiply throughput and latency are both worse than SIMD-integer add). Or simply constant-propagate through them to just print out the answer! Clang really does know the closed-form Gauss's formula for sum(i=0..n) and can recognize some loops that do that.


You forgot to enable optimization, so both loops bottleneck on ALU + store/reload latency from keeping sum in memory between each of the sum += independent stuff and sum++ statements. See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for more about just how bad the resulting asm is, and why that's the case. clang++ defaults to -O0 (debug mode: keep variables in memory where a debugger can modify them between any C++ statements).

Store-forwarding latency on a modern x86 like Sandybridge-family (including Haswell and Skylake) is about 3 to 5 cycles, depending on timing of the reload. So with a 1-cycle latency ALU add in there, too, you're looking at about two 6-cycle latency steps in the critical path for this loop. (Plenty to hide all the store / reload and calculation based on i, and the loop-counter update).

See also Adding a redundant assignment speeds up code when compiled without optimization for another no-optimization benchmark. In that one, store-forwarding latency is actually reduced by having more independent work in the loop, delaying the reload attempt.


Modern x86 CPUs have 1/clock multiply throughput so even with optimization you wouldn't see a throughput bottleneck from it. Or on Bulldozer-family, not fully pipelined with 1 per 2-clock throughput.

More likely you'd bottleneck on the front-end work of getting all the work issued every cycle.

lea does allow very efficient copy-and-add, doing i + i + 1 with a single instruction. But really, a good compiler would see that the loop only uses 2*i and optimize it to increment by 2: a strength reduction that does repeated addition by 2 instead of having to shift inside the loop.

And of course with optimization the extra sum++ can just fold into the sum += stuff where stuff already includes a constant. Not so with the multiply.
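
As a sketch of that strength reduction at the source level (my own illustration of the transformation, not actual compiler output): the addition loop's i + (i+1) plus the sum++ collapse into adding 2*i + 2, a term that simply advances by 2 each iteration:

#include <stdio.h>

int main() {
    // Equivalent to: sum += i + (i+1); sum++;  since i + (i+1) + 1 == 2*i + 2.
    unsigned int sum = 1;
    unsigned int term = 4;               // value of 2*i + 2 at i == 1
    for (int i = 1; i < 100000000; i++) {
        sum += term;                     // one add, no shift or multiply
        term += 2;                       // strength reduction: advance by 2
    }
    printf("%u\n", sum);                 // same final sum (mod 2^32) as the question's addition loop
}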

Denunciation answered 10/8, 2019 at 10:14 Comment(1)
(Update: Ice Lake added a 2nd store port and widened the front-end to 5-wide. Alder Lake P-cores adds a 5th integer ALU port, and a 3rd load port, with the front-end being 6 uops wide. As uops.info shows, 5 add per clock, plus room for one more uop like a load or store)Denunciation

I came to this thread to get an idea of what modern processors are doing in regard to integer math and the number of cycles required to do it. I worked on this problem of speeding up 32-bit integer multiplies and divides on the 65c816 processor in the 1990s. Using the method below, I was able to triple the speed of the standard math libraries available in the ORCA/M compilers at the time.

So the idea that multiplies are faster than adds is simply not the case (except rarely), but like people said, it depends upon how the architecture is implemented. If enough steps can be performed between clock cycles, yes, a multiply could effectively be the same speed as an add in terms of the clock, but there would be a lot of wasted time. In that case it would be nice to have an instruction that performs multiple (dependent) adds/subtracts given one instruction and multiple values. One can dream.

On the 65c816 processor, there were no multiply or divide instructions. Mult and Div were done with shifts and adds.
To perform a 16 bit add, you would do the following:

LDA $0000 - loaded a value into the Accumulator (5 cycles)
ADC $0002 - add with carry  (5 cycles)
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
15 cycles total for an add

If dealing with a call from C, you would have the additional overhead of pushing and pulling values off the stack. Creating routines that would do two multiplies at once would save overhead, for example.

The traditional way of doing the multiply is shifts and adds through the entire value of one number. Each time the carry became a one as the value was shifted left, you needed to add the other value again. This required a test of each bit and a shift of the result.

I replaced that with a lookup table of 256 items so that the carry bits would not need to be checked. It was also possible to determine overflow before doing the multiply, so as not to waste time. (On a modern processor this could be done in parallel, but I don't know if they do this in the hardware.) Given two 32-bit numbers and prescreened overflow, one of the multipliers is always 16 bits or less, thus one would only need to run through 8-bit multiplies once or twice to perform the entire 32-bit multiply. The result of this was multiplies that were 3 times as fast.

The speed of the 16-bit multiplies ranged from 12 cycles to about 37 cycles:

multiply by 2  (0000 0010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles).
ASL  - shift left (2 cycles).
STA $0004 - store the value in the Accumulator back to memory (5 cycles).
12 cycles plus call overhead.

multiply by (0101 1010)
LDA $0000 - loaded a value into the Accumulator  (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
ASL  - shift left (2 cycles) 
ADC $0000 - add with carry for next bit (5 cycles) 
ASL  - shift left (2 cycles) 
STA $0004 - store the value in the Accumulator back to memory (5 cycles)
37 cycles plus call overhead
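
For reference, a C-level model of the same shift-and-add algorithm the ASL/ADC sequences above implement (illustrative only; it walks the multiplier's bits instead of hard-coding them):

#include <stdio.h>

// Classic shift-and-add multiplication: for each set bit of b,
// add the correspondingly shifted a into the result.
unsigned int shiftAddMul(unsigned int a, unsigned int b) {
    unsigned int result = 0;
    while (b != 0) {
        if (b & 1)         // current multiplier bit is set
            result += a;   // add the shifted multiplicand
        a <<= 1;           // move a to the next bit position
        b >>= 1;
    }
    return result;
}

int main() {
    printf("%u\n", shiftAddMul(123, 90));  // 11070 == 123 * 0b01011010
}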

Since the data bus of the Apple IIgs, for which this was written, was only 8 bits wide, loading a 16-bit value required 5 cycles: one extra for the pointer, and one extra cycle for the second byte.

Breaking down the LDA $0000: the instruction opcode takes 1 cycle (it is an 8-bit value), the 16-bit operand takes two cycles to load, and the 16-bit value at the memory location takes two cycles to load because of the 8-bit data bus.

Modern processors would be able to do this faster because they have a 32 bit data bus at worst. In the processor logic itself the system of gates would have no additional delay at all compared to the data bus delay since the whole value would get loaded at once.

To do the complete 32 bit multiply, you would need to do the above twice and add the results together to get the final answer. The modern processors should be able to do the two in parallel and add the results for the answer. Combined with the overflow precheck done in parallel, it would minimize the time required to do the multiply.

Anyway, it is readily apparent that multiplies require significantly more effort than an add. How many steps of the operation can be processed between CPU clock edges determines how many clock cycles are required. If the clock is slow enough, the adds will appear to be the same speed as a multiply.

Regards, Ken

Notecase answered 18/11, 2019 at 22:35 Comment(1)
Any estimation on a modern CPU (e.g. Intel's i7) how many cycles an ADD or MUL can take?Analyzer

A multiplication requires a final step of an addition of, at minimum, the same size as the number, so it will take longer than an addition. In decimal:

      123
    x 112
    -----
      246   --+
     123      | matrix generation
    123     --+
    -----
    13776   <- final addition

The same applies in binary, with a more elaborate reduction of the matrix.

That said, reasons why they may take the same amount of time:

  1. To simplify the pipelined architecture, all regular instructions can be designed to take the same number of cycles (exceptions are memory moves, for instance, which depend on how long it takes to talk to external memory).
  2. Since the adder for the final step of the multiplier is just like the adder for an add instruction... why not use the same adder by skipping the matrix generation and reduction? If they use the same adder, then obviously they will take the same amount of time.

Of course, there are more complex architectures where this is not the case, and you might obtain completely different values. You also have architectures that execute several instructions in parallel when they don't depend on each other, and then you are a bit at the mercy of your compiler... and of the operating system.

To run this test rigorously you would have to run it in assembly and without an operating system; otherwise there are too many variables.

Starryeyed answered 28/5, 2014 at 16:39 Comment(0)

Even if it were, that would mostly tell us what restriction the clock puts on our hardware. We can't clock higher because of heat (?), but during one clock period a signal could pass through the gates of very many ADD circuits, while a single ADD instruction would only utilize one of them. So while both may at some point take equally many clock cycles, not all of the propagation time for the signals is utilized.

If we could clock higher, we could definitely make ADD faster, probably by several orders of magnitude.

Precontract answered 9/2, 2017 at 7:0 Comment(0)

This really depends on your machine. Of course, integer multiplication is quite complex compared to addition, but quite a few AMD CPUs can execute a multiplication in a single cycle. That is just as fast as addition.

Other CPUs take three or four cycles to do a multiplication, which is a bit slower than addition. But it's nowhere near the performance penalty you had to suffer ten years ago (back then a 32-bit multiplication could take thirty-something cycles on some CPUs).

So, yes, multiplication is in the same speed class nowadays, but no, it's still not exactly as fast as addition on all CPUs.

Cletus answered 9/5, 2014 at 4:48 Comment(2)
It's not in the same speed class, because the clock has hit its speed limit. What that means is that the CPU is idle much more of the time during one clock cycle if you do an ADD: as transistors shrank but the max clock frequency stayed the same, the signal could pass through more gates in one clock.Precontract
@Precontract How long it takes the hardware gates to come up with the correct result is entirely irrelevant to the software: The result cannot take effect before the next clock cycle starts. Also, you can bet on the chip designers not to build adders that are significantly faster than they must be to fulfill the requirements of the intended clock frequency: That would mean using precious chip real estate for pointlessly complex carry-look-ahead generators. These circuits are needed because even today it's not possible to ripple a carry through 64 bits in a single cycle.Cletus

Even on ARM (known for its high efficiency and small, clean design), integer multiplications take 3-7 cycles, while integer additions take 1 cycle.

However, an add/shift trick is often used to multiply integers by constants faster than the multiply instruction can calculate the answer.

The reason this works well on ARM is that ARM has a "barrel shifter", which allows many instructions to shift or rotate one of their arguments by 1-31 bits at zero cost, i.e. x = a + b and x = a + (b << s) take exactly the same amount of time.

Utilizing this processor feature, let's say you want to calculate a * 15. Then since 15 = 1111 (base 2), the following pseudocode (translated into ARM assembly) would implement the multiplication:

a_times_3 = a + (a << 1)                  // a * (0011 (base 2))
a_times_15 = a_times_3 + (a_times_3 << 2) // a * (0011 (base 2) + 1100 (base 2))

Similarly you could multiply by 13 = 1101 (base 2) using either of the following:

a_times_5 = a + (a << 2)
a_times_13 = a_times_5 + (a << 3)

or:

a_times_3 = a + (a << 1)
a_times_15 = a_times_3 + (a_times_3 << 2)
a_times_13 = a_times_15 - (a << 1)

The first snippet is obviously faster in this case, but sometimes subtraction helps when translating a constant multiplication into add/shift combinations.
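
Translated into C++ just to make the identities concrete and checkable (the function names are my own):

#include <cstdio>

unsigned times15(unsigned a) {
    unsigned a3 = a + (a << 1);  // a * 3
    return a3 + (a3 << 2);       // a*3 + a*12 == a*15
}

unsigned times13(unsigned a) {
    unsigned a5 = a + (a << 2);  // a * 5
    return a5 + (a << 3);        // a*5 + a*8 == a*13
}

int main() {
    printf("%u %u\n", times15(7), times13(7));  // 105 91
}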

This multiplication trick was used heavily in the ARM assembly coding community in the late 80s, on the Acorn Archimedes and Acorn RISC PC (the origin of the ARM processor). Back then, a lot of ARM assembly was written by hand, since squeezing every last cycle out of the processor was important. Coders in the ARM demoscene developed many techniques like this for speeding up code, most of which are probably lost to history now that almost no assembly code is written by hand anymore. Compilers probably incorporate many tricks like this, but I'm sure there are many more that never made the transition from "black art optimization" to compiler implementation.

You can of course write explicit add/shift multiplication code like this in any compiled language, and the code may or may not run faster than a straight multiplication once compiled.

x86_64 may also benefit from this multiplication trick for small constants, although I don't believe shifting is zero-cost on the x86_64 ISA, in either the Intel or AMD implementations (x86_64 probably takes one extra cycle for each integer shift or rotate).

Ogham answered 5/9, 2020 at 16:57 Comment(0)

There are lots of good answers here about your main question, but I just wanted to point out that your code is not a good way to measure operation performance. For starters, modern CPUs adjust frequencies all the time, so you should use rdtsc to count the actual number of cycles instead of elapsed microseconds. But more importantly, your code has artificial dependency chains, unnecessary control logic, and iterators that will make your measurement an odd mix of latency and throughput, plus some constant terms added for no reason. To really measure throughput you should significantly unroll the loop and also accumulate several partial sums in parallel (more sums than there are steps in the add/mul CPU pipelines).
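
A sketch of what such a measurement might look like (assumptions: x86 with GCC/Clang's __rdtsc from <x86intrin.h>; rdtsc counts reference cycles rather than core clock cycles, and an optimizing compiler may still vectorize or fold the loop, so inspect the generated asm before trusting the numbers):

#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86)

int main() {
    constexpr int N = 100000000;
    // Four independent partial sums let adds from different chains
    // overlap in the pipeline, so this measures throughput, not latency.
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < N; i += 4) {             // unrolled by 4
        s0 += i; s1 += i + 1; s2 += i + 2; s3 += i + 3;
    }
    uint64_t t1 = __rdtsc();
    printf("%.2f ref-cycles/add (%u)\n",
           double(t1 - t0) / N, s0 + s1 + s2 + s3);
}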

Manuscript answered 7/6, 2021 at 12:47 Comment(0)

No it's not, and in fact it's noticeably slower (which translated into a 15% performance hit for the particular real-world program I was running).

I realized this myself when asking this question just a few days ago here.

Inflexible answered 17/2, 2014 at 2:31 Comment(3)
On the other hand yours was 64-bit multiplication, used to emulate 32-bit divisionPrecession
@BenVoigt: Wait, is there a "more" 32-bit kind of 32-bit multiplication available than the one the compiler was using?Inflexible
There are multiplies that yield 32-bit results, but I'm not sure of any that take arbitrary 32-bit operands and have a 32-bit result.Precession

Since the other answers deal with real, present-day devices -- which are bound to change and improve as time passes -- I thought we could look at the question from the theoretical side.

Proposition: When implemented in logic gates, using the usual algorithms, an integer multiplication circuit is O(log N) times slower than an addition circuit, where N is the number of bits in a word.

Proof: The time for a combinatorial circuit to stabilise is proportional to the depth of the longest sequence of logic gates from any input to any output. So we must show that a gradeschool multiply circuit is O(log N) times deeper than an addition circuit.

Addition is normally implemented as a half adder followed by N-1 full adders, with the carry bits chained from one adder to the next. This circuit clearly has depth O(N). (This circuit can be optimized in many ways, but the worst case performance will always be O(N) unless absurdly large lookup tables are used.)

To multiply A by B, we first need to multiply each bit of A with each bit of B. Each bitwise multiply is simply an AND gate. There are N^2 bitwise multiplications to perform, hence N^2 AND gates -- but all of them can execute in parallel, for a circuit depth of 1. This solves the multiplication phase of the gradeschool algorithm, leaving just the addition phase.

In the addition phase, we can combine the partial products using an inverted binary tree-shaped circuit to do many of the additions in parallel. The tree will be (log N) nodes deep, and at each node, we will be adding together two numbers with O(N) bits. This means each node can be implemented with an adder of depth O(N), giving a total circuit depth of O(N log N). QED.
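
A toy tabulation of the depth estimates from this proof (my own illustration; units are gate delays, assuming the ripple-carry adders used above):

#include <cstdio>
#include <cmath>

int main() {
    // Addition: O(N) ripple-carry depth.
    // Multiplication: 1 AND layer plus log2(N) tree levels,
    // each level an O(N)-deep adder.
    for (int n = 8; n <= 64; n *= 2) {
        double addDepth = n;
        double mulDepth = 1 + std::ceil(std::log2(n)) * n;
        printf("N=%2d: add ~%3.0f, mul ~%4.0f, ratio ~%.1f\n",
               n, addDepth, mulDepth, mulDepth / addDepth);
    }
}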

Mandymandych answered 9/5, 2014 at 1:52 Comment(2)
-1, the implementation outlined here is painfully naive - this is not the "usual" algorithm.Incrassate
en.wikipedia.org/wiki/Binary_multiplier#Implementations mentions some modern implementation options, like en.wikipedia.org/wiki/Wallace_tree for example. And no, you don't use O(N) ripple-carry addition in a modern x86-64 ALU! That would be 64 full-adder delays in the critical path for 1-cycle integer add. en.wikipedia.org/wiki/Carry-lookahead_adder is one way around this problem, or there's also carry-select. Fun fact: Pentium 4 really did use 16-bit ripple-carry adders at very high clock speed.Denunciation
