What is the computational cost of floating point Multiplication vs Addition?
Ignoring all the other issues like memory transfer, etc.

I'm looking for some measure of the "cost", which I guess I would quantify as the expected number of bit flips, for multiplying two random floating point (say 32-bit) numbers, vs the cost for adding.

I guess there may be some important issues (like whether the numbers have the same exponent, etc) that may be worth considering.

Edit: To clarify, I'm interested in the amount of energy required to perform these operations, rather than the time or amount of hardware, which is why I think "expected number of bit flips" is the quantity of interest. I think this is a well-defined question, and there is certainly some "expected number of bit flips" required by a given algorithm to perform floating point multiplication... And I'm looking for the minimum over all algorithms.

Edit 2: Thanks all for responding. The most relevant response I got was from njuffa, who referenced Mark Horowitz's estimates (see page 33). A more up-to-date paper by Horowitz reports slightly different numbers:

Float32 Mult: 3.7 pJ
Float32 Add:  0.9 pJ
Int32 Mult:   3.1 pJ
Int32 Add:    0.1 pJ
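
Taking those figures at face value, a float32 multiply works out to roughly 3.7 / 0.9 ≈ 4× the cost of a float32 add, while an int32 multiply is roughly 3.1 / 0.1 ≈ 30× an int32 add.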
Fragile answered 27/9, 2016 at 8:6 Comment(7)
I think you would need to be more specific - e.g., hardware ALU implementations vs. asymptotic complexity of 'unlimited' silicon resources. You could look for resources on how floating-point is implemented on FPGAs. It's probably a good starting point. But as is, this question is probably a little too broad in scope.Eve
I see how these would matter if one were looking at time/space complexity, but I would assume that if one is simply measuring the amount of bit-flips required to perform the algorithm, that there is a clear-cut number out there.Fragile
The relative energy consumption of addition vs multiplication depends on the bit-width of the operands. For 32-bit floating-point operations, Mark Horowitz of Stanford gives numbers of 0.9 pJ for an addition versus 4 pJ for a multiplication (on a now outdated 45 nm process, but I believe the general relationship still holds on modern 14 nm processes).Missus
You may also want to consider that many modern floating-point processing units are focusing on fused multiply-add (FMA) units, rather than separate floating-point adders and multipliers. Due to the retention of the full (not rounded, not truncated) product prior to the addition, this creates new challenges for power efficient designs.Missus
Ah, thanks. For that info. I'm really more interested in the fundamental limits rather than how it is implemented on common processors.Fragile
@Fragile You specifically asked for the "amount of energy required to perform these operations", and I am afraid that can only be determined based on physical implementations that we know how to build today or in the near future (e.g. graphene instead of silicon), or accurate simulations thereof. It is not clear (to me, anyhow) what you mean by "fundamental limits". Is this question possibly motivated by an XY problem?Missus
I'm interested in the relative energy use of multiplication vs addition, rather than the absolute number of pJ for each operation in silicon. I figure that, if we ignore hardware-specific issues (capacitance of lines, etc), this should be proportional to the relative numbers of bit-flips required by these operations.Fragile

On modern processors, floating point multiplication is generally slightly more expensive than addition (which is one reason why compilers will typically replace 2*x by x+x).
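
As a minimal sketch for checking this yourself (file and function names are just for illustration), compile something like the following with `gcc -O2 -S twice.c` and inspect the assembly; many compilers emit an add for both functions, since doubling by multiplication and by addition produce identical results for every input:

    /* Check the 2*x -> x+x strength reduction with your own compiler. */
    float twice_mul(float x) { return 2.0f * x; }
    float twice_add(float x) { return x + x; }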

On x86 and x86_64, floating point operations are almost always done using SSE instructions (ADDSS, MULSS, etc.), for which addition and multiplication are constant time with no "early outs" (which makes pipelining easier).

The actual relative cost is more difficult to quantify, and will depend on a lot of things. The canonical reference here is Agner Fog's "Lists of instruction latencies, throughputs and micro-operation breakdowns": http://www.agner.org/optimize/

One rough heuristic which I've heard (though don't have any references for) is that multiplication takes roughly 50% longer.
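
If you want a rough feel for this on your own machine, here is a crude microbenchmark sketch (plain ISO C; the iteration count and constants are arbitrary). It times a dependent chain of additions against a dependent chain of multiplications, so it measures latency rather than throughput. Compile with `-O2` and without `-ffast-math` so the chains are not reassociated; Agner Fog's tables remain the authoritative reference.

    #include <stdio.h>
    #include <time.h>

    #define ITERS 200000000L

    int main(void) {
        volatile float seed = 1.0000001f; /* volatile hides the start value from the optimizer */
        float x;
        clock_t t0, t1;

        x = seed;
        t0 = clock();
        for (long i = 0; i < ITERS; ++i)
            x = x + 1e-9f;                /* dependent chain of float additions */
        t1 = clock();
        printf("add chain: %.2f s (x = %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, x);

        x = seed;
        t0 = clock();
        for (long i = 0; i < ITERS; ++i)
            x = x * 1.0000001f;           /* dependent chain of float multiplications */
        t1 = clock();
        printf("mul chain: %.2f s (x = %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, x);

        return 0;
    }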

Situation answered 27/9, 2016 at 8:51 Comment(10)
Not quite. AFAIK, on many (most?) x86 (32 bit) systems, FP math is done on the FPU. On most, if not all, x86_64 systems, it is done using SSE.Florentinaflorentine
I'm not entirely sure that a modern compiler would replace x*2 with x+x... without something like gcc's -ffast-math option. I might be wrong; there are too many FP edge cases for me to remember.Eve
@RudyVelthuis I think this depends on your compiler and options: Clang seems to use SSE on both, GCC seems to prefer the FPU for some reason.Situation
@BrettHale It shouldn't require a -ffast-math flag, as it's a perfectly valid replacement (both give the same answers for all possible values).Situation
@SimonByrne - I'll take your word for it - like I said, I wasn't sure. I don't disagree about the logic of addition vs multiplication cost. I don't think this really answers anything but x86-64 micro-architectural questions. Though the question is probably too broad anyway.Eve
On Windows, still the most used 32 bit OS, it is the FPU, by most compilers.Florentinaflorentine
@SimonByrne Interestingly, there is one place where it does make a difference: -0 + -0 = -0, but -0 * 2 = +0. So replacing one with the other would violate IEEE-754 (something GCC won't do unless you give it permission to).Rotator
What compiler are you using? -0.0*2 should give -0.0Situation
from the IEEE spec: "When neither the inputs nor result are NaN, the sign of a product or quotient is the exclusive OR of the operands’ signs"Situation
@SimonByrne Huh, you're right... I remembered the zero-sign rules as being the same between multiplication and addition. The actual behavior makes several kinds more sense, of course.Rotator
