x86 Multiplication with 3: IMUL vs SHL + ADD
Asked Answered
L

1

1

I developed a program in x86-64 assembly which needs to iterate many times through the same operation:

IMUL rdx, 3   # rdx is always different

However, I need to make the runtime faster, so I thought of an optimization for that specific line from above:

MOV rcx, rdx
SHL rdx, 1
ADD rdx, rcx

Now I ask you guys: would this modification improve the runtime of the program (less clocks), or should I stick with the IMUL command?

Loser answered 19/7, 2019 at 10:22 Comment(9)
It doesn't work, add rdx, rdx just doubles rdx again instead of adding the original rdx, and shl rdx, 2 multiplies by 4 already, but how about lea rdx, [rdx + rdx * 2]?Chanticleer
I have seen this solution many times using lea, but I didn't manage to fully understand it. Would it be possible to clarify it? Also, would it be and improvement in performance?Loser
lea evaluates the address and then stores that in the destination, it's not a memory access despite looking like one.Chanticleer
Thanks! I also updated the question, but I guess now that is even less optimal. I will take a look on the explanation.Loser
Do you mean IMUL rdx, rdx, 3? There is no IMUL instruction with only one register and an immediate.Cephalothorax
Indeed, there is. Take a look into the developer's manual: google.com/url?sa=t&source=web&rct=j&url=https://… page 3-443Loser
@AndreasAbel : You are correct there is no 2 operand form that takes an immediate but what many assemblers do is generate the 3 operand form from the 2 operands. So imul rdx, 3 is encoded as imul rdx,rdx, 3 (the instruction would be encoded as 48 6b d2 03)Slaphappy
Are you trying to improve latency or throughput? Which microarchitecture are you targeting?Cephalothorax
The latency is what I am trying to improve. I am working on Intel Architecture, 64bitLoser
T
10

Both are terrible compared to lea rdx, [rdx + rdx*2], using a scaled-index addressing mode to get a total of *3, which is why compilers will always use LEA if you ask them to compile a function like

long foo(long x){ return x * 3; } (https://godbolt.org/z/6p4ynV)


LEA is a way to feed arbitrary numbers through x86 addressing modes without using the result for a load or store, just putting it in a register. Using LEA on values that aren't addresses / pointers?


On all modern x86 CPUs, LEA is a single uop. The only question is how much better it is than the alternatives. imul is also 1 uop, but mov+shl+add is 3 for the front-end. (This is true across all mainstream and low-power Intel/AMD that are still relevant. See https://agner.org/optimize/) 64-bit imul is extra slow on some older microarchitectures, like Bulldozer-family and Silvermont/Goldmont, or especially older Atom.

On AMD CPUs (Bulldozer/Ryzen), it has a scaled index so it's a "complex" LEA and has 2 cycle latency (vs. 3 for imul on Ryzen, or much worse on Bulldozer-family where 64-bit imul is slower and not fully pipelined). On Ryzen this LEA still has 2-per-clock throughput.

On Intel CPUs, it only has 2 components (one +), so it's a "simple" LEA with 1 cycle latency and can run with 2-per-clock throughput. So about the same cost as one shl instruction, but runs on different ports.

(Or on Ice Lake, 4-per-clock since they added LEA units to the other 2 integer ALU ports. So it's exactly as cheap as one add on Ice Lake.)


You'd only want mov ; shl ; sub or add when your multiplier was 2^n +- 1 for n > 3. Then it is worth considering imul for a tradeoff between latency and front-end throughput cost.

By shifting the original register, even CPUs without mov-elimination (before IvyBridge and Ryzen) can run your mov/shl/add sequence with 2 cycle latency critical path length.


Also related: C++ code for testing the Collatz conjecture faster than hand-written assembly - why? has some details about a problem with *3 vs. optimizing with LEA.

Other related:

Throe answered 19/7, 2019 at 14:0 Comment(2)
Thank you for the detailed answer! One more small question: In this project I also use the IDIV command, but for 2 reasons: to extract the number rounded to the smaller integer and to extract modulo with rdx. Are there any possible optimizations, or the command should be used as it is?Loser
@CosminAprodu: Again, ask a compiler. Yes, multiplicative inverses are much faster than div. Why does GCC use multiplication by a strange number in implementing integer division?. Given the quotient, you have to multiply and subtract to get the remainder. Look at compiler output to see how it's done (especially for signed integers where sign-handling is tricky).Throe

© 2022 - 2024 — McMap. All rights reserved.