`lea r32, [reg+disp8]` is 3 bytes, vs. `mov r32, imm32` being 5 bytes. See Tips for golfing in x86/x64 machine code and Nate's answer. x86 is unfortunately missing a `mov reg, sign_extended_imm8`. All else equal (or nearly equal), smaller code size is usually better, especially in "cold" code that might have to come from legacy decode. (And also for I-cache / iTLB footprint reasons.)
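For concreteness, here's how the encodings compare (a NASM sketch; the register and constant choices are arbitrary examples, byte listings as reported by `nasm -l`):

```asm
bits 64

lea  ecx, [rdx + 2]     ; 8D 4A 02        ; 3 bytes: opcode + ModRM + disp8
mov  ecx, 2             ; B9 02 00 00 00  ; 5 bytes: mov has no sign-extended
                        ;                 ; imm8 form, even for tiny constants
```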
Cool, I didn't realize any compilers were using this code-size optimization for materializing constants in registers. Nice job, MSVC. GCC and Clang should be doing this too, at least with `-Os`, and probably even at `-O2`/`-O3`; there will be some cases where it's not a win, but I expect it's good on average on most CPUs.
GCC/clang `-Oz` use `push imm8`/`pop reg` for code-size optimization even at significant cost to performance (Godbolt). That's also 3 bytes, but much less efficient.
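A sketch of that `-Oz` trick (made-up constant; the byte counts are the point):

```asm
bits 64

push 2                  ; 6A 02  ; 2 bytes, but it's a store uop
pop  rcx                ; 59     ; 1 byte; a load that has to wait for store
                        ;        ; forwarding (roughly 4-5 cycles) and that
                        ;        ; touches stack memory
; pop r8 would be 41 58 (REX.B), so this trick also favors legacy registers
```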
Intel since Ice Lake has 4/clock `lea` (with simple addressing modes), and Zen has always had that. Skylake and earlier only had 2/clock LEA throughput, but still only 1 cycle latency. (https://uops.info/)
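"Simple" here means at most two components; a 3-component `lea` is slower on Intel. A sketch, with numbers for Skylake per https://uops.info/:

```asm
bits 64

lea eax, [rdx + 4]           ; simple: 1c latency; 2/clock SKL, 4/clock ICL
lea eax, [rdx + rcx]         ; still simple: base + index, no displacement
lea eax, [rdx + rcx*4 + 4]   ; complex: 3c latency, 1/clock (port 1 only) on SKL
```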
I have the feeling that the version using `lea` would be slower, since it adds a dependency between the `lea` instructions and the `mov` instruction.
All 3 `lea`s read the `mov`-immediate result from RDX, so there's good instruction-level parallelism, not a serial chain of dependencies. And the `mov` starts a new dependency chain (writing a register with an immediate doesn't depend on the old value), so it can execute as early as the cycle after the front-end issues it.
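Concretely, the pattern under discussion looks something like this (a sketch using the Windows x64 argument registers as in the question; the constants and the `callee` name are invented):

```asm
bits 64
extern callee            ; hypothetical tail-call target

mov  edx, 1000           ; BA E8 03 00 00 ; 2nd arg; starts a fresh dep chain
lea  r8d, [rdx + 1]      ; 44 8D 42 01    ; 3rd arg; reads only RDX
lea  r9d, [rdx + 2]      ; 44 8D 4A 02    ; 4th arg; independent of the other LEAs
lea  ecx, [rdx - 1]      ; 8D 4A FF       ; 1st arg; also depends only on the mov
jmp  callee              ; all three LEAs can execute in the same cycle
```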
By the time instructions after the `jmp` that read the results are in the pipeline, the `lea`s can already have executed, if there are any spare cycles on the execution units they're scheduled to. (Or, if there's lots of independent work in the pipeline and we're just bottlenecked on back-end ALU throughput, then the instructions in the tail-called function wouldn't get a cycle on an execution unit either, unless one was a load instead of ALU, or on an execution port that wasn't busy. But then `mov`-immediate would have had the same problem, just waiting for ALU execution-port throughput rather than latency.)
(uops are scheduled oldest-ready first, so under normal conditions where the front-end is fairly far ahead of the oldest instructions being executed, independent work like this can usually find a gap.)
If any of the instructions using these constants combine them with data coming from older instructions, it's very likely that the latency of materializing the constants will be a non-issue. I think it's very unlikely that the extra latency before R8/R9/RCX are ready would end up costing cycles on a modern out-of-order exec x86.
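For instance (hypothetical surrounding code): if the consumer of ECX also depends on a load from earlier work, the load's latency dominates and the extra `lea` cycle is invisible:

```asm
bits 64

mov  edx, 1000           ; constants materialize early...
lea  ecx, [rdx + 1]
mov  eax, [rsi]          ; ...while a load from older work is still in flight
add  eax, ecx            ; critical path is the load (~4-5 cycles on current
                         ; CPUs), not the mov->lea chain (1+1 = 2 cycles)
```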
It's a little odd that it put the `lea` for ECX last, though; many functions look at their first arg first, so you'd want that to be the `mov`-immediate or the first `lea`. All three `lea`s can execute in parallel, but the last ones might get issued by the front-end a cycle later. And with oldest-ready-first scheduling, if any get scheduled to the same port (because the number of uops waiting for each of the other ports is higher) then they'll have a resource conflict and have to take turns.
I wonder if the compiler's algorithm was to pick a middle value, to make it more likely that all the values are in range of `[reg+disp8]` compact addressing modes. (Hopefully it also prefers to pick a "legacy" register so REX prefixes can be minimized; if it had picked R8, all three LEAs would have needed a REX prefix.)
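In byte terms (same sketch registers as above):

```asm
bits 64

; with the mov targeting EDX (a legacy register), only the LEAs that
; write R8/R9 pay for a REX prefix:
lea  ecx, [rdx + 1]      ; 8D 4A 01       ; 3 bytes, no REX
lea  r8d, [rdx + 1]      ; 44 8D 42 01    ; 4 bytes, REX.R for the destination

; if the mov had targeted R8 instead, every derived LEA would need REX.B
; for the base register, even with a legacy destination:
lea  ecx, [r8 + 1]       ; 41 8D 48 01    ; 4 bytes
```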
If execution-port pressure is fairly even, they might not all get scheduled to different ports when issuing in the same cycle. See x86_64 haswell instruction scheduled on already used port instead of unused one for details on how Haswell schedules multiple uops in the same cycle. So this could create a resource conflict, making one of the `lea` results not ready until 2 cycles after the `mov` result was ready. (2 cycles where that port was free, if there are even older uops in the ROB that just had some gaps.)
So that's not very definitive, but my intuition is that this won't be a problem in practice. I'd guess (and hope) that MSVC developers profiled it on some existing codebases and didn't find any serious performance regressions, and hopefully found some minor overall speedups on average.