Why does this function push RAX to the stack as the first operation?

In the assembly of the C++ source below, why is RAX pushed to the stack?

RAX, as I understand it from the ABI, could contain anything from the calling function. But we save it here, and then later move the stack back by 8 bytes. So the RAX on the stack is, I think, only relevant for the std::__throw_bad_function_call() operation ... ?

The code:

#include <functional> 

void f(std::function<void()> a) 
{
  a(); 
}

Output, from gcc.godbolt.org, using Clang 3.7.1 -O3:

f(std::function<void ()>):                  # @f(std::function<void ()>)
        push    rax
        cmp     qword ptr [rdi + 16], 0
        je      .LBB0_1
        add     rsp, 8
        jmp     qword ptr [rdi + 24]    # TAILCALL
.LBB0_1:
        call    std::__throw_bad_function_call()

I'm sure the reason is obvious, but I'm struggling to figure it out.

Here's a tailcall without the std::function<void()> wrapper for comparison:

void g(void(*a)())
{
  a(); 
}

The output is trivial:

g(void (*)()):             # @g(void (*)())
        jmp     rdi        # TAILCALL
Fidget answered 12/6, 2016 at 11:39 Comment(0)

The 64-bit System V ABI requires that the stack be aligned to 16 bytes before a call instruction.

call pushes an 8-byte return address on the stack, which breaks the alignment, so the compiler needs to do something to align the stack again to a multiple of 16 before the next call.

(The ABI design choice of requiring alignment before a call instead of after it has the minor advantage that, if any args were passed on the stack, the first stack arg is 16B-aligned.)

Pushing a don't-care value works well, and can be more efficient than sub rsp, 8 on CPUs with a stack engine. (See the comments).
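
To make the bookkeeping concrete, here is a sketch of the alignment arithmetic (my annotation, not compiler output):

# in f's caller, just before `call f`:        RSP % 16 == 0   (ABI requirement)
# `call f` pushes the 8-byte return address:  RSP % 16 == 8   (state on entry to f)
        push    rax                           # RSP -= 8  ->  RSP % 16 == 0 again
# any later `call` (e.g. call std::__throw_bad_function_call()) now sees an aligned stack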

Prettypretty answered 12/6, 2016 at 12:53 Comment(11)
Ah - I was thinking about it wrong and not really believing the fact that RAX was a garbage value :) Mystery solved! – Fidget
Actually it's the other way around. Before the call the stack must be aligned, so after the call it is unaligned and must be realigned. – Larva
I have to concur with @Dani. When control is transferred via the call to function f, RSP is already misaligned by 8 because the return address was placed on the stack. It was aligned to a 16-byte boundary just before control was transferred to f. Possibly you meant that after push rax the stack is once again aligned to a 16-byte boundary. The code actually adds 8 back to RSP when the branch isn't taken, just before the JMP. It is inefficient code to get this job done. – Weasel
I wonder why not use sub rsp, 8 so there's no unnecessary memory write. – Blankly
@Blankly I could see it if it were being optimized for space over speed. – Weasel
@Gene: sub rsp, 8 requires an extra uop for the stack-engine to sync its offset value of rsp with the value in the out-of-order core. So on modern Intel CPUs (but not AMD), it's actually more efficient to do one push of garbage than to manually modify rsp by just 8. (The stack engine is what makes it possible for push / pop to be single-uop instructions, instead of needing an extra uop to modify rsp.) See Agner Fog's microarch.pdf for the details. The stack engine was new in Pentium-M, but AMD has it too thanks to their patent-sharing agreement. – Berniecebernier
...continued for Gene and @MichaelPetch: the memory write will almost always hit in cache, so that's not really a concern other than store-port bandwidth. IIRC, clang uses this trick but gcc doesn't (not even with -mtune=haswell or something). – Berniecebernier
@PeterCordes: On GCC it seems to change between versions. GCC 5.3 (and some earlier versions) seems to use push rax in the generated code (for this case) and then in 6.1 it uses sub. I still don't quite understand why CLANG just doesn't do the stack alignment before the call std::__throw_bad_function_call() rather than doing a push/pop in the path more likely to be executed. – Weasel
@MichaelPetch: oh that's funky. Only in this std::function case, though, not in simple cases like return foo() + 1;. And -mtune=intel makes it use LEA to modify rsp. Maybe that comes from tuning for Atom, which is an Intel CPU. (And gcc's tuning for Atom uses LEA for everything possible; IDK if that's actually good.) – Berniecebernier
Peter, don't you think you should have let the OP update his own question? Basically there is nothing left of his answer. Might have been better to produce a new answer? It isn't like this OP has gone AWOL. It's only been 12 hours since he answered. I would have waited to see if they would have updated it themselves (giving some time) based on comments. – Weasel
@MichaelPetch: you mean this answer? Yeah, I guess it's only been 12 hours since Dani's comment pointing out the error. BeniBela: feel free to roll back and write it in your own words. I can post the extra commentary I added about efficiency and ABI-design as a separate answer if you want. I made the edit since the answer was fundamentally correct already, but just got the details wrong. I got impatient :P – Berniecebernier

The reason push rax is there is to align the stack back to a 16-byte boundary, to conform to the 64-bit System V ABI in the case where the je .LBB0_1 branch is taken. The value placed on the stack isn't relevant. Another way would have been to subtract 8 from RSP with sub rsp, 8. The ABI states the alignment requirement this way:

The end of the input argument area shall be aligned on a 16 (32, if __m256 is passed on stack) byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 (32) when control is transferred to the function entry point. The stack pointer, %rsp, always points to the end of the latest allocated stack frame.

Prior to the call to function f, the stack was 16-byte aligned per the calling convention. After control was transferred via a CALL to f, the return address was placed on the stack, misaligning the stack by 8. push rax is a simple way of subtracting 8 from RSP and realigning it again. If the branch is taken to call std::__throw_bad_function_call(), the stack will be properly aligned for that call to work.

In the case where the comparison falls through, the stack will appear just as it did at function entry once the add rsp, 8 instruction is executed. The return address of the CALLER to function f will now be back at the top of the stack and the stack will be misaligned by 8 again. This is what we want because a TAIL CALL is being made with jmp qword ptr [rdi + 24] to transfer control to the function a. This will JMP to the function, not CALL it. When function a does a RET, it will return directly back to the function that called f.
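
To make the two paths easier to follow, here is the same Clang output again with the stack alignment (RSP mod 16) noted at each step; the annotations are mine, not part of the compiler output:

f(std::function<void ()>):                  # entry: RSP % 16 == 8 (caller's CALL pushed the return address)
        push    rax                         # RSP % 16 == 0
        cmp     qword ptr [rdi + 16], 0
        je      .LBB0_1
        add     rsp, 8                      # RSP % 16 == 8 again: same layout as at entry
        jmp     qword ptr [rdi + 24]        # TAILCALL: a starts exactly as if f's caller had called it directly
.LBB0_1:
        call    std::__throw_bad_function_call()    # made with RSP % 16 == 0, as the ABI requires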

At a higher optimization level, I would have expected the compiler to be smart enough to do the comparison and let it fall through directly to the JMP. The code at label .LBB0_1 could then align the stack to a 16-byte boundary so that call std::__throw_bad_function_call() works properly.


As @CodyGray pointed out, if you use GCC (not CLANG) with an optimization level of -O2 or higher, the code produced does seem more reasonable. GCC 6.1 output from Godbolt is:

f(std::function<void ()>):
        cmp     QWORD PTR [rdi+16], 0     # MEM[(bool (*<T5fc5>) (union _Any_data &, const union _Any_data &, _Manager_operation) *)a_2(D) + 16B],
        je      .L7 #,
        jmp     [QWORD PTR [rdi+24]]      # MEM[(const struct function *)a_2(D)]._M_invoker
.L7:
        sub     rsp, 8    #,
        call    std::__throw_bad_function_call()        #

This code is more in line with what I would have expected. In this case it would appear that GCC's optimizer handles this code generation better than CLANG's.

Weasel answered 12/6, 2016 at 13:21 Comment(3)
Indeed, what you describe in the last paragraph is exactly what GCC does at either -O2 or -O3. Clang and ICC both align the stack at the top of the function. This is one of those rare cases where GCC's optimizer seems to be more effective than Clang's. – Maressa
@CodyGray Now that I have had coffee, I did toss it on Godbolt and you are correct: GCC looks to generate better code in this situation. I amended my answer to reflect that finding. That also confirmed my comment about how I would have expected it to be optimized. – Weasel
@daniel: your other edit to an answer to this question has been declined as you are changing the answer drastically with the edit. If you have your own answer to provide, then please feel free to do so, but you should avoid such drastic changes unless the answer is a community wiki type answer. – Shirleenshirlene

In other cases, clang typically fixes up the stack before returning with a pop rcx.

Using push has an upside for efficiency in code-size (push is only 1 byte vs. 4 bytes for sub rsp, 8), and also in uops on Intel CPUs. (No need for a stack-sync uop, which you'd get if you access rsp directly because the call that brought us to the top of the current function makes the stack engine "dirty").
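
For the code-size point, the encodings compare like this (machine-code bytes shown as comments; easy to verify with an assembler):

    push    rax          ; 50            = 1 byte
    sub     rsp, 8       ; 48 83 EC 08   = 4 bytes
    pop     rcx          ; 59            = 1 byte
    add     rsp, 8       ; 48 83 C4 08   = 4 bytes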

This long and rambling answer discusses the worst-case performance risks of using push rax / pop rcx for aligning the stack, and whether or not rax and rcx are good choices of register. (Sorry for making this so long.)

(TL;DR: looks good; the possible downside is usually small, and the upside in the common case makes this worth it. Partial-register stalls could be a problem on Core2/Nehalem if al or ax are "dirty", though. No other 64-bit-capable CPU has big problems (because they don't rename partial regs, or merge efficiently), and 32-bit code needs more than 1 extra push to align the stack by 16 for another call unless it was already saving/restoring some call-preserved regs for its own use.)


Using push rax instead of sub rsp, 8 introduces a dependency on the old value of rax, so you'd think it might slow things down if the value of rax is the result of a long-latency dependency chain (and/or a cache miss).

e.g. the caller might have done something slow with rax that's unrelated to the function args, like var = table[ x % y ]; var2 = foo(x);

# example caller that leaves RAX not-ready for a long time

mov   rdi, rax              ; prepare function arg

div   rbx                   ; very high latency
mov   rax, [table + rdx]    ; rax = table[ value % something ], may miss in cache
mov   [rsp + 24], rax       ; spill the result.

call  foo                   ; foo uses push rax to align the stack

Fortunately out-of-order execution will do a good job here.

The push doesn't make the value of rsp dependent on rax. (It's either handled by the stack engine, or on very old CPUs push decodes to multiple uops, one of which updates rsp independently of the uops that store rax. Micro-fusion of the store-address and store-data uops lets push be a single fused-domain uop, even though stores always take 2 unfused-domain uops.)

As long as nothing depends on the output of push rax / pop rcx, it's not a problem for out-of-order execution. If push rax has to wait because rax isn't ready, it won't cause the ROB (ReOrder Buffer) to fill up and eventually block the execution of later independent instructions. The ROB would fill up even without the push, because the instruction that's slow to produce rax, and whatever instruction in the caller consumes rax before the call, are even older and can't retire either until rax is ready. Retirement has to happen in-order in case of exceptions / interrupts.

(I don't think a cache-miss load can retire before the load completes, leaving just a load-buffer entry. But even if it could, it wouldn't make sense to produce a result in a call-clobbered register without reading it with another instruction before making a call. The caller's instruction that consumes rax definitely can't execute/retire until our push can do the same.)

When rax does become ready, push can execute and retire in a couple cycles, allowing later instructions (which were already executed out of order) to also retire. The store-address uop will have already executed, and I assume the store-data uop can complete in a cycle or two after being dispatched to the store port. Stores can retire as soon as the data is written to the store buffer. Commit to L1D happens after retirement, when the store is known to be non-speculative.

So even in the worst case, where the instruction that produces rax was so slow that it led to the ROB filling up with independent instructions that are mostly already executed and ready to retire, having to execute push rax only causes a couple extra cycles of delay before independent instructions after it can retire. (And some of the caller's instructions will retire first, making a bit of room in the ROB even before our push retires.)


A push rax that has to wait will tie up some other microarchitectural resources, leaving one fewer entry for finding parallelism between other later instructions. (An add rsp,8 that could execute would only be consuming a ROB entry, and not much else.)

It will use up one entry in the out-of-order scheduler (aka Reservation Station / RS). The store-address uop can execute as soon as there's a free cycle, so only the store-data uop will be left. The pop rcx uop's load address is ready, so it should dispatch to a load port and execute. (When the pop load executes, it finds that its address matches the incomplete push store in the store buffer (aka memory order buffer), so it sets up the store-forwarding which will happen after the store-data uop executes. This probably consumes a load buffer entry.)

Even an old CPU like Nehalem has a 36-entry RS, vs. 54 in Sandybridge, or 97 in Skylake. Keeping 1 entry occupied for longer than usual in rare cases is nothing to worry about. The alternative of executing two uops (stack-sync + sub) is worse.

(off topic)
The ROB is larger than the RS, 128 (Nehalem), 168 (Sandybridge), 224 (Skylake). (It holds fused-domain uops from issue to retirement, vs. the RS holding unfused-domain uops from issue to execution). At 4 uops per clock max frontend throughput, that's over 50 cycles of delay-hiding on Skylake. (Older uarches are less likely to sustain 4 uops per clock for as long...)

ROB size determines the out-of-order window for hiding a slow independent operation. (Unless register-file size limits are a smaller limit). RS size determines the out-of-order window for finding parallelism between two separate dependency chains. (e.g. consider a 200 uop loop body where every iteration is independent, but within each iteration it's one long dependency chain without much instruction-level parallelism (e.g. a[i] = complex_function(b[i])). Skylake's ROB can hold more than 1 iteration, but we can't get uops from the next iteration into the RS until we're within 97 uops of the end of the current one. If the dep chain wasn't so much larger than RS size, uops from 2 iterations could be in flight most of the time.)


There are cases where push rax / pop rcx can be more dangerous:

The caller of this function knows that rcx is call-clobbered, so won't read the value. But it might have a false dependency on rcx after we return, like bsf rcx, rax / jnz or test eax,eax / setz cl. Recent Intel CPUs don't rename low8 partial registers anymore, so setcc cl has a false dep on rcx. bsf actually leaves its destination unmodified if the source is 0, even though Intel documents it as an undefined value. AMD documents the leave-unmodified behaviour.

The false dependency could create a loop-carried dep chain. On the other hand, a false dependency can do that anyway, if our function wrote rcx with instructions dependent on its inputs.
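
As a concrete sketch of that scenario (hypothetical caller code, not from any compiler; it assumes f uses the push rax / pop rcx pattern discussed here, on a CPU that doesn't rename CL separately from RCX):

    call    f             ; f's final pop rcx wrote RCX (store-forwarded from its push rax)
    test    eax, eax
    setz    cl            ; writes only CL, so it must merge into the old RCX:
                          ; a false dependency on whatever f's pop rcx loaded,
                          ; even though the caller never reads that value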

It would be worse to use push rbx/pop rbx to save/restore a call-preserved register that we weren't going to use. The caller likely would read it after we return, and we'd have introduced a store-forwarding latency into the caller's dependency chain for that register. (Also, it's maybe more likely that rbx would be written right before the call, since anything the caller wanted to keep across the call would be moved to call-preserved registers like rbx and rbp.)


On CPUs with partial-register stalls (Intel pre-Sandybridge), reading rax with push could cause a stall of 2-3 cycles on Core2 / Nehalem if the caller had done something like setcc al before the call. Sandybridge doesn't stall while inserting a merging uop, and Haswell and later don't rename low8 registers separately from rax at all.
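
A sketch of that situation (hypothetical caller code; the stall only applies to Intel CPUs before Sandybridge):

    cmp     rdi, rsi
    setb    al            ; writes only AL, leaving RAX partially written ("dirty")
    call    f             ; f's push rax reads the full RAX: Core2 / Nehalem stall
                          ; for 2-3 cycles while merging AL into the rest of RAX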

It would be nice to push a register that was less likely to have had its low8 used. If compilers tried to avoid REX prefixes for code-size reasons, they'd avoid dil and sil, so rdi and rsi would be less likely to have partial-register issues. But unfortunately gcc and clang don't seem to favour using dl or cl as 8-bit scratch registers, using dil or sil even in tiny functions where nothing else is using rdx or rcx. (Although lack of low8 renaming in some CPUs means that setcc cl has a false dependency on the old rcx, so setcc dil is safer if the flag-setting was dependent on the function arg in rdi.)

pop rcx at the end "cleans" rcx of any partial-register stuff, since cl is used for shift counts and functions do sometimes write just cl even when they could have written ecx instead. (IIRC I've seen clang do this. gcc more strongly favours 32-bit and 64-bit operand sizes to avoid partial-register issues.)


push rdi would probably be a good choice in a lot of cases, since the rest of the function also reads rdi, so introducing another instruction dependent on it wouldn't hurt. It does stop out-of-order execution from getting the push out of the way if rax is ready before rdi, though.


Another potential downside is using cycles on the load/store ports. But they are unlikely to be saturated, and the alternative is uops for the ALU ports. With the extra stack-sync uop on Intel CPUs that you'd get from sub rsp, 8, that would be 2 ALU uops at the top of the function.

Berniecebernier answered 22/8, 2017 at 17:34 Comment(0)
