Both are terrible compared to lea rdx, [rdx + rdx*2]
, using a scaled-index addressing mode to get a total of *3
, which is why compilers will always use LEA if you ask them to compile a function like
long foo(long x){ return x * 3; }
(https://godbolt.org/z/6p4ynV)
LEA is a way to feed arbitrary numbers through x86 addressing modes without using the result for a load or store, just putting it in a register. Using LEA on values that aren't addresses / pointers?
On all modern x86 CPUs, LEA is a single uop. The only question is how much better it is than the alternatives. imul
is also 1 uop, but mov+shl+add is 3 for the front-end. (This is true across all mainstream and low-power Intel/AMD that are still relevant. See https://agner.org/optimize/) 64-bit imul
is extra slow on some older microarchitectures, like Bulldozer-family and Silvermont/Goldmont, or especially older Atom.
On AMD CPUs (Bulldozer/Ryzen), it has a scaled index so it's a "complex" LEA and has 2 cycle latency (vs. 3 for imul
on Ryzen, or much worse on Bulldozer-family where 64-bit imul
is slower and not fully pipelined). On Ryzen this LEA still has 2-per-clock throughput.
On Intel CPUs, it only has 2 components (one +
), so it's a "simple" LEA with 1 cycle latency and can run with 2-per-clock throughput. So about the same cost as one shl
instruction, but runs on different ports.
(Or on Ice Lake, 4-per-clock since they added LEA units to the other 2 integer ALU ports. So it's exactly as cheap as one add
on Ice Lake.)
You'd only want mov
; shl
; sub
or add
when your multiplier was 2^n +- 1 for n > 3
. Then it is worth considering imul
for a tradeoff between latency and front-end throughput cost.
By shifting the original register, even CPUs without mov
-elimination (before IvyBridge and Ryzen) can run your mov/shl/add sequence with 2 cycle latency critical path length.
Also related: C++ code for testing the Collatz conjecture faster than hand-written assembly - why? has some details about a problem with *3
vs. optimizing with LEA.
Other related:
add rdx, rdx
just doublesrdx
again instead of adding the originalrdx
, andshl rdx, 2
multiplies by 4 already, but how aboutlea rdx, [rdx + rdx * 2]
? – Chanticleerlea
, but I didn't manage to fully understand it. Would it be possible to clarify it? Also, would it be and improvement in performance? – Loserlea
evaluates the address and then stores that in the destination, it's not a memory access despite looking like one. – ChanticleerIMUL rdx, rdx, 3
? There is noIMUL
instruction with only one register and an immediate. – Cephalothoraximul rdx, 3
is encoded asimul rdx,rdx, 3
(the instruction would be encoded as48 6b d2 03
) – Slaphappy