Advantage of using LEA over MOV for passing parameters in Assembly compiled from C++
I am experimenting with the way parameters are passed to a function when compiling C++ code. I tried to compile the following C++ code using the x64 msvc 19.35/latest compiler to see the resulting assembly:

#include <cstdint>

void f(std::uint32_t, std::uint32_t, std::uint32_t, std::uint32_t);

void test()
{
    f(1, 2, 3, 4);
}

and got this result:

void test(void) PROC
        mov     edx, 2
        lea     r9d, QWORD PTR [rdx+2]
        lea     r8d, QWORD PTR [rdx+1]
        lea     ecx, QWORD PTR [rdx-1]
        jmp     void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP

Result on godbolt.org

What I do not understand is why the compiler chose to use lea instead of a simple mov for this example. I understand the mechanics of lea and how it results in the correct value in each register, but I would have expected something more straightforward, like:

void test(void) PROC
        mov     ecx, 1
        mov     edx, 2
        mov     r8d, 3
        mov     r9d, 4
        jmp     void f(unsigned int,unsigned int,unsigned int,unsigned int)
void test(void) ENDP

Moreover, from my limited understanding of how modern CPUs work, I have the feeling that the version using lea would be slower, since it adds a dependency between the lea instructions and the mov instruction.

clang and gcc both give the result I expect, i.e., 4x mov.

Floria answered 29/7, 2023 at 23:33 Comment(2)
Hint: assemble them, generate a listing file, and count the bytes. – Grotesquery
Decreasing the binary size would make sense. I'll try that. – Arresting

MSVC's code is smaller than the naive mov approach. (But as you point out, because of the dependency, it may potentially be slower; you would have to test that.)

     1                                          bits 64
     2 00000000 BA02000000                      mov     edx, 2
     3 00000005 448D4A02                        lea     r9d, QWORD [rdx+2]
     4 00000009 448D4201                        lea     r8d, QWORD [rdx+1]
     5 0000000D 8D4AFF                          lea     ecx, QWORD [rdx-1]
     6                                  
     7 00000010 B901000000                      mov     ecx, 1
     8 00000015 BA02000000                      mov     edx, 2
     9 0000001A 41B803000000                    mov     r8d, 3
    10 00000020 41B904000000                    mov     r9d, 4

mov ecx, 1 is 5 bytes: one byte for the opcode B8-BF which also encodes the register, and 4 bytes for the 32-bit immediate. In particular, unlike for some arithmetic instructions, there is no option for mov to encode a smaller immediate with fewer bytes using zero- or sign-extension.
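
To spell out that encoding from the listing above (NASM syntax, bytes in comments):

        mov     ecx, 1                  ; B9 01 00 00 00
        ; B9          = opcode B8+rd with rd = 1, which selects ecx
        ; 01 00 00 00 = the constant 1 as a little-endian imm32, always 4 bytes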

lea ecx, [rdx-1] is 3 bytes. One byte for the opcode; one MOD R/M byte which encodes the destination register ecx and the base register rdx for the effective address of the memory operand; and (here is the key) one byte for an 8-bit sign-extended displacement.
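
Decoding those 3 bytes from the listing:

        lea     ecx, [rdx-1]            ; 8D 4A FF
        ; 8D = the lea opcode
        ; 4A = ModRM 01 001 010: mod=01 (disp8 follows), reg=001 (ecx), r/m=010 (rdx)
        ; FF = the displacement -1 as a sign-extended 8-bit value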

The instructions using r8 and r9 need one extra byte for a REX prefix, but that's true for both mov and lea, so it's a wash.
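
For example, lea r8d, [rdx+1] from the listing assembles as:

        lea     r8d, [rdx+1]            ; 44 8D 42 01
        ; 44       = REX prefix with R=1, extending the ModRM reg field to select r8d
        ; 8D 42 01 = the same opcode/ModRM/disp8 pattern as the ecx case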

Grotesquery answered 29/7, 2023 at 23:53 Comment(0)

lea r32, [reg+disp8] is 3 bytes, vs. mov r32, imm32 being 5 bytes.
See Tips for golfing in x86/x64 machine code and Nate's answer.

x86 is unfortunately missing a mov reg, sign_extended_imm8. All else equal (or nearly equal), smaller code size is usually better, especially in "cold" code that might have to come from legacy decode. (And also for I-cache / iTLB footprint reasons.)
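
For contrast, most ALU instructions do have a sign-extended-imm8 form (opcode 83), which is exactly what mov is missing; a quick size comparison (NASM syntax, encodings in comments):

        add     ecx, 1                  ; 83 C1 01       : 3 bytes, sign-extended imm8
        mov     ecx, 1                  ; B9 01 00 00 00 : 5 bytes, imm32 is the only form
        lea     ecx, [rdx-1]            ; 8D 4A FF       : 3 bytes, the disp8 workaround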


Cool, I didn't realize any compilers were using this code-size optimization for materializing constants in registers. Nice job, MSVC. GCC and Clang should be doing this, too, at least with -Os. Probably even for -O2/-O3; there will be some cases where it's not a win but I expect it's good on average on most CPUs.

GCC/clang -Oz use push imm8/pop reg for code-size optimization even at significant cost to performance; Godbolt. That's also 3 bytes, but much less efficient.
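
For example, materializing a 4 that way looks like this (NASM syntax; 3 bytes total, but it bounces through memory with a store and a reload):

        push    4                       ; 6A 04 : push sign-extended imm8
        pop     rcx                     ; 59    : 1-byte pop into a legacy register

(Popping into r8 or r9 would need a REX prefix, making it 4 bytes.)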

Intel has had 4/clock lea throughput (with simple addressing modes) since Ice Lake, and Zen has always had that. Skylake and earlier managed only 2/clock LEA throughput, but still with just 1 cycle of latency. (https://uops.info/)


I have the feeling that the version using lea would be slower since it adds a dependency between the lea instructions and the mov instruction.

All 3 leas read the mov-immediate result from RDX, so there's good instruction-level parallelism, not a serial chain of dependencies. And the mov that writes RDX starts a new dependency chain (it has no inputs), so it can execute as early as the cycle after the front-end issues it.
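
Annotating MSVC's sequence makes the dependency shape clear: a one-level fan-out from the mov, not a serial chain:

        mov     edx, 2                  ; no inputs: starts a fresh dependency chain
        lea     r9d, [rdx+2]            ; depends only on the mov
        lea     r8d, [rdx+1]            ; depends only on the mov, not on the other leas
        lea     ecx, [rdx-1]            ; depends only on the mov

Once EDX is ready, all three leas are ready at once, ports permitting.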

By the time instructions after the jmp that read the results are in the pipeline, the leas can already have executed if there are any spare cycles on the execution units they're scheduled to. (Or if there's lots of independent work in the pipeline and we're just bottlenecked on back-end ALU throughput, then the instructions in the tailcalled function wouldn't get a cycle on an execution unit either. Unless maybe it was a load instead of ALU, or an execution port that wasn't busy... But then mov-imm would have had the same problem, just waiting for ALU execution port throughput, not latency.)

(uops are scheduled oldest-ready first, so under normal conditions where the front-end is fairly far ahead of the oldest instructions being executed, independent work like this can usually find a gap.)

If any of the instructions using these constants combine them with data coming from older instructions, it's very likely that the latency of materializing the constants will be a non-issue. I think it's very unlikely that the extra latency before R8/R9/RCX are ready would end up costing cycles on a modern out-of-order exec x86.

It's a little odd that it put the lea for ECX last, though; many functions look at their first arg first, so you'd want that to be the mov-immediate or the first lea. All three leas can execute in parallel, but the last ones might be issued by the front-end a cycle later. And with oldest-ready-first scheduling, if any get scheduled to the same port (because the number of uops waiting for every other port is high), they'll have a resource conflict and have to take turns.

I wonder if the compiler's algorithm was to pick a middle value to make it more likely that all the values were in range of [reg+disp8] compact addressing modes. (Hopefully it also prefers to pick a "legacy" register so REX prefixes can be minimized; if it had picked R8, all three LEAs would have needed a REX.)

If execution-port pressure is fairly even, they might not all get scheduled to different ports when issuing in the same cycle. See x86_64 haswell instruction scheduled on already used port instead of unused one for details on how Haswell schedules multiple uops in the same cycle. So this could create a resource conflict, making one of the lea results not ready until 2 cycles after the mov result was ready. (2 cycles where that port was free, if there are even older uops in the ROB that just had some gaps.)

So that's not very definitive, but my intuition is that this won't be a problem in practice. I'd guess (and hope) that MSVC developers profiled it on some existing codebases and didn't find any serious performance regressions, and hopefully found some minor overall speedups on average.

Portfolio answered 30/7, 2023 at 1:33 Comment(2)
Re: picking a middle value: would it be better to zero ecx with xor and set the rest relative to that? Sure, xor changes flags, but are there disadvantages if a tail call is performed after that anyway? – Lessielessing
@AlexGuteniev: If one of the desired values is zero, yes, that strongly favours picking that one for xor-zeroing instead of a mov, saving code-size (and a back-end uop on Intel Sandybridge-family CPUs). In this case, none of the desired values were zero. I should have mentioned it anyway for other cases, though. – Portfolio
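
A hypothetical sketch of that idea, for a call whose values do include zero, e.g. f(0, 1, 2, 3) (NASM syntax; byte counts in comments):

        xor     ecx, ecx                ; 2 bytes; dependency-breaking, eliminated at rename on SnB-family
        lea     edx, [rcx+1]            ; 3 bytes
        lea     r8d, [rcx+2]            ; 4 bytes (REX prefix for r8d)
        lea     r9d, [rcx+3]            ; 4 bytes (REX prefix for r9d)

That's 13 bytes versus 22 for four mov r32, imm32 instructions. And clobbering FLAGS before the tail call is harmless: neither the Windows x64 nor the System V calling convention passes anything in FLAGS.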
