Why is GCC pushing an extra return address on the stack?

Asked 5/8, 2016 at 4:27 Answered 2/6, 2019 at 4:42

Solved assembly gcc x86 memory-alignment stack-memory

I am currently learning the basics of assembly and came across something odd when looking at the instructions generated by GCC(6.1.1).

Here is the source:

#include <stdio.h>

int foo(int x, int y){
    return x*y;
}

int main(){
    int a = 5;
    int b = foo(a, 0xF00D);
    printf("0x%X\n", b);
    return 0;
}

Command used to compile: gcc -m32 -g test.c -o test

When examining the functions in GDB I get this:

(gdb) set disassembly-flavor intel
(gdb) disas main
Dump of assembler code for function main:
   0x080483f7 <+0>:     lea    ecx,[esp+0x4]
   0x080483fb <+4>:     and    esp,0xfffffff0
   0x080483fe <+7>:     push   DWORD PTR [ecx-0x4]
   0x08048401 <+10>:    push   ebp
   0x08048402 <+11>:    mov    ebp,esp
   0x08048404 <+13>:    push   ecx
   0x08048405 <+14>:    sub    esp,0x14
   0x08048408 <+17>:    mov    DWORD PTR [ebp-0xc],0x5
   0x0804840f <+24>:    push   0xf00d
   0x08048414 <+29>:    push   DWORD PTR [ebp-0xc]
   0x08048417 <+32>:    call   0x80483eb <foo>
   0x0804841c <+37>:    add    esp,0x8
   0x0804841f <+40>:    mov    DWORD PTR [ebp-0x10],eax
   0x08048422 <+43>:    sub    esp,0x8
   0x08048425 <+46>:    push   DWORD PTR [ebp-0x10]
   0x08048428 <+49>:    push   0x80484d0
   0x0804842d <+54>:    call   0x80482c0 <printf@plt>
   0x08048432 <+59>:    add    esp,0x10
   0x08048435 <+62>:    mov    eax,0x0
   0x0804843a <+67>:    mov    ecx,DWORD PTR [ebp-0x4]
   0x0804843d <+70>:    leave  
   0x0804843e <+71>:    lea    esp,[ecx-0x4]
   0x08048441 <+74>:    ret    
End of assembler dump.
(gdb) disas foo
Dump of assembler code for function foo:
   0x080483eb <+0>:     push   ebp
   0x080483ec <+1>:     mov    ebp,esp
   0x080483ee <+3>:     mov    eax,DWORD PTR [ebp+0x8]
   0x080483f1 <+6>:     imul   eax,DWORD PTR [ebp+0xc]
   0x080483f5 <+10>:    pop    ebp
   0x080483f6 <+11>:    ret    
End of assembler dump.

The part that confuses me is what it is trying to do with the stack. From my understanding this is what it does:

It takes a reference to some memory address 4 bytes higher in the stack which from my knowledge should be the variables passed to main since esp currently pointed to the return address in memory.
It aligns the stack to a 0 boundary for performance reasons.
It pushes onto the new stack area ecx+4 which should translate to pushing the address we are suppose to be returning to on the stack.
It pushes the old frame pointer onto the stack and sets up the new one.
It pushes ecx (which is still pointing to would should be an argument to main) onto the stack.

Then the program does what it should and begins the process of returning:

It restores ecx by using a -0x4 offset on ebp which should access the first local variable.
It executes the leave instruction which really just sets esp to ebp and then pops ebp from the stack.

So now the next thing on the stack is the return address and the esp and ebp registers should be back to what they need to be to return right?

Well evidently not because the next thing it does is load esp with ecx-0x4 which since ecx is still pointing to that variable passed to main should put it at the address of return address on the stack.

This works just fine but raises the question: why did it bother to put the return address onto the stack in step 3 since it returned the stack to the original position at the end just before actually returning from the function?

Sustainer answered 5/8, 2016 at 4:27 Comment(6)

You should enable optimizations and compile with gcc -m32 -O -Wall -S -fverbose-asm test.c then look inside the generated test.s – Jehanna 5/8, 2016 at 4:36

Here is what it generated (pastebin.com/raw/1ZdhPLf6). From what I can tell it still has the extra return address. – Sustainer 5/8, 2016 at 4:44

Read also more about x86 calling conventions and ABI. They may dictate the way a call is done. – Jehanna 5/8, 2016 at 4:50

It's probably just so that debuggers can trace the stack back past main. – Illicit 5/8, 2016 at 5:16

@RossRidge: clang doesn't do it (at least in this case). Of course, clang tends to be less conservative / more trusting of the ABI. Does my theory about copying the return address to just above the saved ebp like a normal stack frame make sense? – Jett 5/8, 2016 at 7:16

@PeterCordes You can't reliably unwind the stack by following the chain of saved EBP values, since it's not part of the ABI, so it would only be useful stack traces. Accordingly I don't think this is being done for ABI reasons, just for debugging. – Illicit 5/8, 2016 at 7:35

Update: gcc8 simplifies this at least for normal use-cases (-fomit-frame-pointer, and no alloca or C99 VLAs that require variable-size allocation). Perhaps motivated by increasing usage of AVX leading to more functions wanting a 32-byte aligned local or array.

Except for main in 32-bit code, then it still does the full return address+frame-pointer backtrace-friendly version even with -O3 -fomit-frame-pointer. https://gcc.godbolt.org/z/6cehMP774

Also, probably a duplicate of What's up with gcc weird stack manipulation when it wants extra stack alignment?

This complicated prologue is fine if it only ever runs a couple times (e.g. at the start of main in 32-bit code), but the more it appears the more worthwhile it is to optimize it. GCC sometimes still over-aligns the stack in functions where all >16-byte aligned objects are optimized into registers, which is a missed optimization already but less bad when the stack alignment is cheaper.

gcc makes some clunky code when aligning the stack within a function, even with optimization enabled. I have a possible theory (see below) on why gcc might be copying the return address to just above where it saves ebp to make a stack frame (and yes, I agree that's what gcc is doing). It doesn't look necessary in this function, and clang doesn't do anything like that.

Besides that, the nonsense with ecx is probably just gcc not optimizing away unneeded parts of its align-the-stack boilerplate. (The pre-alignment value of esp is needed to reference args on the stack, so it makes sense that it puts the address of the first would-be arg into a register).

You see the same thing with optimization in 32-bit code (where gcc makes a main that doesn't assume 16B stack alignment, even though the current version of the ABI requires that at process startup, and the CRT code that calls main either aligns the stack itself or preserves the initial alignment provided by the kernel, I forget). You also see this in functions that align the stack to more than 16B (e.g. functions that use __m256 types, sometimes even if they never spill them to the stack. Or functions with an array declared with C++11 alignas(32), or any other way of requesting alignment.) In 64-bit code, gcc always seems to use r10 for this, not rcx.

There's nothing required for ABI compliance about the way gcc does it, because clang does something much simpler.

I added an aligned variable (with volatile as a simple way to force the compiler to actually reserve aligned space for it on the stack, instead of optimizing it away). I put your code on the Godbolt compiler explorer, to look at the asm with -O3. I see the same behaviour from gcc 4.9, 5.3, and 6.1, but different behaviour with clang.

int main(){
    __attribute__((aligned(32))) volatile int v = 1;
    return 0;
}

Clang3.8's -O3 -m32 output is functionally identical to its -m64 output. Note that -O3 enables -fomit-frame-pointer, but some functions make stack frames anyway.

    push    ebp
    mov     ebp, esp                # make a stack frame *before* aligning, so ebp-relative addressing can only access stack args, not aligned locals.
    and     esp, -32
    sub     esp, 32                 # esp is 32B aligned with 32 or 48B above esp reserved (depending on incoming alignment)
    mov     dword ptr [esp], 1      # store v
    xor     eax, eax                # return 0
    mov     esp, ebp                # leave
    pop     ebp
    ret

gcc's output is nearly the same between -m32 and -m64, but it puts v in the red-zone with -m64 so the -m32 output has two extra instructions:

    # gcc 6.1 -m32 -O3 -fverbose-asm.  Most of gcc's comment lines are empty.  I guess that means it has no idea why it's emitting those insns :P
    lea     ecx, [esp+4]      #,   get a pointer to where the first arg would be
    and     esp, -32  #,          align
    xor     eax, eax  #           return 0
    push    DWORD PTR [ecx-4]       #  No clue WTF this is for; this looks batshit insane, but happens even in 64bit mode.
    push    ebp     #             make a stackframe, even though -fomit-frame-pointer is on by default and we can already restore the original esp from ecx (unlike clang)
    mov     ebp, esp  #,
    push    ecx     #             save the old esp value (even though this function doesn't clobber ecx...)
    sub     esp, 52   #,          reserve space for v  (not present with -m64)
    mov     DWORD PTR [ebp-56], 1     # v,
    add     esp, 52   #,          unreserve (not present with -m64)
    pop     ecx       #           restore ecx (even though nothing clobbered it)
    pop     ebp       #           at least it knows it can just pop instead of `leave`
    lea     esp, [ecx-4]      #,  restore pre-alignment esp
    ret

It seems that gcc wants to make its stack frame (with push ebp) after aligning the stack. I guess that makes sense, so it can reference locals relative to ebp. Otherwise it would have to use esp-relative addressing, if it wanted aligned locals.

My theory on why gcc does this:

The extra copy of the return address after aligning but before pushing ebp means that the return address is copied to the expected place relative to the saved ebp value (and the value that will be in ebp when child functions are called). So this does potentially help code that wants to unwind the stack by following the linked list of stack frames, and looking at return-addresses to find out what function is involved.

I'm not sure whether this matters with modern stack-unwind info that allows stack-unwinding (backtraces / exception handling) with -fomit-frame-pointer. (It's metadata in the .eh_frame section. This is what the .cfi_* directives around every modification to esp are for.) I should look at what clang does when it has to align the stack in a non-leaf function.

The original value of esp would be needed inside the function to reference function args on the stack. I think gcc doesn't know how to optimize away unneeded parts of its align-the-stack method. (e.g. out main doesn't look at its args (and is declared not to take any))

This kind of code-gen is typical of what you see in a function that needs to align the stack; it's not extra weird because of using a volatile with automatic storage.

Jett answered 5/8, 2016 at 6:58 Comment(6)

The only advantage of aligning the stack the way GCC does it now that I can see is that it would allow the elimination of the frame pointer. With the normal stack alignment code, it's treated as variable length stack allocation forcing the use of frame pointer. With GCC's new code (4.8 didn't do this) the alignment is essentially done outside the function's stack frame. Since GCC isn't actually omitting the frame pointer I don't see what the point of this change is supposed to be. – Illicit 5/8, 2016 at 7:55

Thanks for the detailed answer! – Sustainer 5/8, 2016 at 13:30

-mpreferred-stack-boundary will help in eliminating the lea esp,[ecx-0x4] part. – Revenge 7/6, 2018 at 8:24

@sudhackar: That's not safe. It would make gcc not maintain the 16-byte alignment required by the i386 System V ABI (changed a few years ago). Now 16 bytes isn't just a good idea, it's the law, and functions are allowed to segfault if called with an under-aligned stack (e.g. with movaps to the stack without an and esp, -16 first). Since gcc only does this in main, and when over-alignment is required (e.g. for AVX2/AVX512), it's only harmful in cases where you actually need alignment + a couple extra instructions total for your whole program. – Jett 7/6, 2018 at 8:55

@PeterCordes yes, but by the question I felt that he's trying to learn how C translates to asm. Such artifacts only confuse people doing this the first time. – Revenge 7/6, 2018 at 9:36

@sudhackar: ah I see your point. Easier solution: don't call your function main so it doesn't get any of the extra "special sauce" compilers add. (e.g. ICC will add code to set up its default -ffast-math). See How to remove "noise" from GCC/clang assembly output? for more stuff about how C compiles simple functions. And BTW, I did mention in this answer that gcc -m64 doesn't do extra stack alignment in main, so you could just look at normal 64-bit code. (Except its calling convention is more complex, and the red-zone can be confusing.) – Jett 7/6, 2018 at 9:45

GCC copies the return address in order to create a normal looking stack frame that debuggers can walk through following chained saved frame pointer (EBP) values. Though part of the reason why GCC generates code like this is to handle the worst case of the function also having a variable length stack allocation, like can happen when a variable length array or alloca() is used.

Normally when code is compiled without optimization (or with the -fno-omit-frame-pointer option) the compiler creates a stack frame that includes a link back to the previous stack frame using the saved frame pointer value of the caller. Normally the compiler saves the previous frame pointer value as the first thing on the stack after the return address and then sets the frame pointer to point to this location on the stack. When all the functions in a program do this then the frame pointer register becomes a pointer to a linked list of stack frames, one that can be traced back all the way to the program's startup code. The return addresses in each frame show which function each frame belongs to.

However instead of saving the previous frame pointer, the first thing GCC does in a function that needs to align the stack is to preform that alignment, putting an unknown number padding bytes after the return address. So in order to create what looks like a normal stack frame, it copies the return address after those padding bytes and then saves the previous frame pointer. The problem with is that it's not really necessary copy the return address like this, as demonstrated by Clang and shown in Peter Cordes' answer. Like Clang, GCC could instead have immediately saved the previous frame pointer value (EBP) and then aligned the stack.

Essentially what both compilers do is create a split stack frame, one split in two by the the alignment padding created to align the stack. The top part, above the padding, is where the locale variables are stored. The bottom part, below the padding, is where the incoming arguments can be found. Clang uses ESP to access the top part, and EBP to access the bottom part. GCC uses EBP to access the bottom part, and uses the saved ECX value from the prologue on the stack to access the top part. In both cases EBP points to what looks like a normal stack frame, though only GCC's EBP can be used to access the function's local variable like with a normal frame.

So in the normal case Clang's strategy is clearly better, there's no need to copy the return address, and there's no need save an extra value (the ECX value) on stack. However in the case where the compiler needs to both align the stack and allocate something with variable size, an extra value does need to be stored somewhere. Since the variable allocation means that the stack pointer no longer has a fixed offset to the local variables, it can't be used access them anymore. There needs to be two separate values stored somewhere, one that points at the top part of the split frame and one that points at the bottom part.

If you look the code Clang generates when compiling a function that both requires aligning the stack and has a variable length allocation you'll see that it allocates a register that effectively becomes a second frame pointer, one that points to top part of the split frame. GCC doesn't need to this because its already using the EBP to point to the top part. Clang continues to use the EBP to point to the bottom part, while GCC uses the saved ECX value.

Clang isn't perfect here though, since it also allocates another register to restore the stack to the value it had before the variable length allocation when it goes out of scope. In many cases though this isn't necessary and the register used as the second frame pointer could be used instead to restore the stack.

GCC's strategy seems to be based on the desire to have a single set of boiler plate prologue and epilogue code sequences that that can be used for all functions that need stack alignment. It also avoids allocating any registers for the lifetime of the function, although the saved ECX value can be used directly from ECX if it hasn't been clobbered yet. I suspect that generating more flexible code like Clang does would difficult given how GCC generates function prologue and epilogue code.

(However, when generating 64-bit x86 code, GCC 8 and later do use a simpler prologue for functions that need to over-align the stack, if they don't need any variable length stack allocations. It's more like Clang's strategy.)

Illicit answered 2/6, 2019 at 4:42 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

My theory on why gcc does this:

Recommended topics

Hot tags