Why does GCC allocate more stack memory than needed?
Asked Answered
E

1

8

I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:

long call_proc() {
    long  x1 = 1;
    int   x2 = 2;
    short x3 = 3;
    char  x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}

The book gives the assembly code generated by GCC:

long call_proc()
call_proc:
    ; Set up arguments to proc
    subq    $32, %rsp           ; Allocate 32-byte stack frame
    movq    $1, 24(%rsp)        ; Store 1 in &x1
    movl    $2, 20(%rsp)        ; Store 2 in &x2
    movw    $3, 18(%rsp)        ; Store 3 in &x3
    movb    $4, 17(%rsp)        ; Store 4 in &x4
    leaq    17(%rsp), %rax      ; Create &x4
    movq    %rax, 8(%rsp)       ; Store &x4 as argument 8
    movl    $4, (%rsp)          ; Store 4 as argument 7
    leaq    18(%rsp), %r9       ; Pass &x3 as argument 6
    movl    $3, %r8d            ; Pass 3 as argument 5
    leaq    20(%rsp), %rcx      ; Pass &x2 as argument 4
    movl    $2, %edx            ; Pass 2 as argument 3
    leaq    24(%rsp), %rsi      ; Pass &x1 as argument 2
    movl    $1, %edi            ; Pass 1 as argument 1
    ; Call proc
    call    proc
    ; Retrieve changes to memory
    movslq  20(%rsp), %rdx      ; Get x2 and convert to long
    addq    24(%rsp), %rdx      ; Compute x1+x2
    movswl  18(%rsp), %eax      ; Get x3 and convert to int
    movsbl  17(%rsp), %ecx      ; Get x4 and convert to int
    subl    %ecx, %eax          ; Compute x3-x4
    cltq                        ; Convert to long
    imulq   %rdx, %rax          ; Compute (x1+x2) * (x3-x4)
    addq    $32, %rsp           ; Deallocate stack frame
    ret                         ; Return

I can understand this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc and the last 16 bytes hold 4 local variables.

Then I tested this code on GCC 11.2, using the optimization flag -Og, and got this assembly code:

call_proc():
        subq    $24, %rsp
        movq    $1, 8(%rsp)
        movl    $2, 4(%rsp)
        movw    $3, 2(%rsp)
        movb    $4, 1(%rsp)
        leaq    1(%rsp), %rax
        pushq   %rax
        pushq   $4
        leaq    18(%rsp), %r9
        movl    $3, %r8d
        leaq    20(%rsp), %rcx
        movl    $2, %edx
        leaq    24(%rsp), %rsi
        movl    $1, %edi
        call    proc(long, long*, int, int*, short, short*, char, char*)
        movslq  20(%rsp), %rax
        addq    24(%rsp), %rax
        movswl  18(%rsp), %edx
        movsbl  17(%rsp), %ecx
        subl    %ecx, %edx
        movslq  %edx, %rdx
        imulq   %rdx, %rax
        addq    $40, %rsp
        ret

I noticed that gcc first allocated 24 bytes for 4 local variables. Then it uses pushq to add 2 arguments to the stack, so the final code uses addq $40, %rsp to free stack space.

Compared to the code in the book, GCC allocates 8 more bytes of space here, and it doesn't seem to use the extra space. Why does it need the extra space?

Ellita answered 2/2, 2022 at 9:33 Comment(7)
Looks like the GCC version used with Compiler Explorer is pretty consistent with +40 in all versions, but my Ubuntu 20.04 GCC 9.3.0 does pushq %rbx; ...; addq $32, %rsp; pop %rbx instead...Truculent
@klutt Actually I'd like to explore why the compiler allocates an extra 8 bytes of space.Ellita
Are you sure that the assembly output described in the book is accurate?Miki
The code in your book violates the ABI, not maintaining 16-byte stack alignment before the call. Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment? If it's from CS:APP global edition, it's because some incompetent people the publisher hired wrote bad practice problems, including falsely claiming their crap was actual compiler output when it wouldn't even assemble. The other portions of the book are still good, though.Conceivable
@PeterCordes I just read the corresponding code in the North American version of the book according to the link you gave, and it is exactly the same as the code in my book. I also looked up the errata and there is no information for this code. But the code in the book does violate the 16-byte alignment requirement. Thanks for your answer.Ellita
Strange, might want to report this to the authors so they can add it to the errata for the North American edition if you're sure it's the same there. I looked again more carefully at the book code, and yeah it's not making any other changes to RSP anywhere that would account for it an even multiple of 8. (Avoiding push was the default in older GCC. In modern GCC, you can enable it manually with -maccumulate-outgoing-args: godbolt.org/z/zbqs1MchE shows GCC just using one sub $40, %rsp.)Conceivable
Reopened because there is confusion about the original code that the dups don't resolve.Balikpapan
C
5

(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)

GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). Why does the x86-64 / AMD64 System V ABI mandate a 16 byte stack alignment?

What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.

So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.

Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.

One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend with RAX).


CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.


Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.

Conceivable answered 2/2, 2022 at 9:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.