x86_64: is IMUL faster than 2x SHL + 2x ADD?
Asked Answered
C

1

8

When looking at the assembly produced by Visual Studio (2015U2) in /O2 (release) mode I saw that this 'hand-optimized' piece of C code is translated back into a multiplication:

int64_t calc(int64_t a) {
  return (a << 6) + (a << 16) - a;
}

Assembly:

  imul        rdx,qword ptr [a],1003Fh  

So I was wondering if that is really faster than doing it the way it is written, something like:

  mov         rbx,qword ptr [a]  
  mov         rax,rbx  
  shl         rax,6  
  mov         rcx,rbx  
  shl         rcx,10h  
  add         rax,rcx  
  sub         rax,rbx  

I was always under the impression that multiplication is always slower than a few shifts/adds? Is that no longer the case with modern Intel x86_64 processors?

Colon answered 20/6, 2016 at 14:30 Comment(4)
IMUL is typically the order of 3-4 latency, throughput 1 per clock. So yeah, probably faster. Also note at least your last 3 instructions depend on each other so that by itself would produce an equivalent dependency chain.Lauritz
It would have been good in the old days: the 8086 IMUL took around 100 clocks, depending on addressing mode, register size etc.Tincture
If you assume the load from memory is free (since you'd have to wait for it anyway) and the register-register movs can be eliminated, the shl can execute in parallel, then ideally, the shifting code would be about as fast as the multiply. However, it gunks up a lot of functional units and it's much more i-cache.Nunciata
@EOF: latency yes, throughput no (because the shift version bottlenecks on total uop throughput, if not on execution ports). Interesting point that with zero-latency mov and 2 execution ports that can run shift imm, the shift version is still only 3c latency.Mileage
M
19

That's right, modern x86 CPUs (especially Intel) have very high performance multipliers.
imul r, r/m and imul r, r/m, imm are both 3 cycle latency, one per 1c throughput on Intel SnB-family and AMD Ryzen, even for 64bit operand size.

On AMD Bulldozer-family, it's 4c or 6c latency, and one per 2c or one per 4c throughput. (The slower times for 64-bit operand-size).

Data from Agner Fog's instruction tables. See also other stuff in the tag wiki.


The transistor budget in modern CPUs is pretty huge, and allows for the amount of hardware parallelism needed to do a 64 bit multiply with such low latency. (It takes a lot of adders to make a large fast multiplier. How modern X86 processors actually compute multiplications?).

Being limited by power budget, not transistor budget, means that having dedicated hardware for many different functions is possible, as long as they can't all be switching at the same time (https://en.wikipedia.org/wiki/Dark_silicon). e.g. you can't saturate the pext/pdep unit, the integer multiplier, and the vector FMA units all at once on an Intel CPU, because many of them are on the same execution ports.

Fun fact: imul r64 is also 3c, so you can get a full 64*64 => 128b multiply result in 3 cycles. imul r32 is 4c latency and an extra uop, though. My guess is that the extra uop / cycle is splitting the 64bit result from the regular 64bit multiplier into two 32bit halves.


Compilers typically optimize for latency, and typically don't know how to optimize short independent dependency chains for throughput vs. long loop-carried dependency chains that bottleneck on latency.

gcc and clang3.8 and later use up to two LEA instructions instead of imul r, r/m, imm. I think gcc will use imul if the alternative is 3 or more instructions (not including mov), though.

That's a reasonable tuning choice, since a 3 instruction dep chain would be the same length as an imul on Intel. Using two 1-cycle instructions spends an extra uop to shorten the latency by 1 cycle.

clang3.7 and earlier tends to favour imul except for multipliers that only require a single LEA or shift. So clang very recently changed to optimizing for latency instead of throughput for multiplies by small constants. (Or maybe for other reasons, like not competing with other things that are only on the same port as the multiplier.)

e.g. this code on the Godbolt compiler explorer:

int foo (int a) { return a * 63; }
    # gcc 6.1 -O3 -march=haswell (and clang actually does the same here)
    mov     eax, edi  # tmp91, a
    sal     eax, 6    # tmp91,
    sub     eax, edi  # tmp92, a
    ret

clang3.8 and later makes the same code.

Mileage answered 20/6, 2016 at 14:36 Comment(6)
Does it mean then that an imul, while being faster, potentially consumes more energy than 4 ALU ops (since a multiplier activates a lot more transistors)?Colon
@rustyx: IDK. Tracking 4 ALU uops through the pipeline and forwarding their results to each other takes more energy than tracking one, and that may well cost more than operating the multiplier! Out-of-order execution is expensive. I wish I had some idea what the numbers were, so I could guess at the answer, but I haven't done any power optimization (other than obvious stuff like using 128bit AVX vector ops when I don't need the upper 128 of the result).Mileage
@rustyx: BTW, a high-latency multiplier using fewer transistors might still have a similar energy per multiply cost, since it still has to add all the partial sums. It just takes more cycles. (This is a massive over-simplification, and might actually be wrong. I'm not a gate-logic guy, and didn't really re-read those links in detail). A low-latency multiplier makes it tempting to use it more often, though (e.g. instead of two lea instructions to multiply by 10).Mileage
Do you have a source for CPUs using Wallace trees for their multipliers?Fiberglass
@Randomblue: no, I don't usually get into that level of detail of HW design. I don't know why Bulldozer has 4c latency / 2c throughput multiply vs. Ryzen and Intel's 3c latency / 1c throughput imul r32,r32, just that those are the performance numbers. IDK which microarchitectures use which techniques. Wikipedia mentions some other options, and has some links.Mileage
@Randomblue: How modern X86 processors actually compute multiplications? says all modern CPU ALUs use Dadda TreesMileage

© 2022 - 2024 — McMap. All rights reserved.