Compilers often choose mov to store args instead of push, if there's enough space already allocated (e.g. with a sub esp, 0x10 earlier in the function like you suggested).
Here's an example:
int f1(int);
int f2(int,int);
int foo(int a) {
    f1(2);
    f2(3,4);
    return f1(a);
}
compiled by clang6.0 -O3 -march=haswell on Godbolt:
sub esp, 12 # reserve space to realign stack by 16
mov dword ptr [esp], 2 # store arg
call f1(int)
# reuse the same arg-passing space for the next function
mov dword ptr [esp + 4], 4
mov dword ptr [esp], 3
call f2(int, int)
add esp, 12
# now ESP is pointing to our own arg
jmp f1(int) # TAILCALL
clang's code-gen would have been even better with sub esp,8 / push 2, with the rest of the function unchanged, i.e. let push grow the stack, because it has smaller code-size than mov (especially mov-immediate) and performance is not worse (because we're about to call, which also uses the stack engine). See "What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once?" for more details.
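To put numbers on that suggestion, here is a minimal sketch that writes out the machine-code bytes for the setup before the first call, both as clang emitted it and with the sub esp,8 / push 2 alternative. The byte arrays are hand-written on the assumption that an assembler picks the usual shortest 32-bit encodings, and the function names are hypothetical helpers for this illustration only.

```c
#include <stddef.h>

/* Hand-written 32-bit encodings (assumed to match what an assembler emits). */
static const unsigned char sub_esp_12[] = {0x83, 0xEC, 0x0C};                         /* sub esp, 12 */
static const unsigned char mov_arg_2[]  = {0xC7, 0x04, 0x24, 0x02, 0x00, 0x00, 0x00}; /* mov dword ptr [esp], 2 */
static const unsigned char sub_esp_8[]  = {0x83, 0xEC, 0x08};                         /* sub esp, 8 */
static const unsigned char push_2[]     = {0x6A, 0x02};                               /* push 2 */

/* Bytes clang spends before the first call: reserve 12, then store the arg. */
size_t mov_setup_bytes(void)  { return sizeof sub_esp_12 + sizeof mov_arg_2; }

/* Bytes for the alternative: reserve 8, then let push reserve the last 4 and store. */
size_t push_setup_bytes(void) { return sizeof sub_esp_8 + sizeof push_2; }

/* Total distance ESP moves down before the call, read from the imm8 operands. */
int esp_drop_mov(void)  { return sub_esp_12[2]; }
int esp_drop_push(void) { return sub_esp_8[2] + 4; }   /* push moves ESP by 4 */
```

Under those assumed encodings the mov version takes 10 bytes and the push version 5, while both leave ESP 12 bytes lower, so the later mov stores and the add esp, 12 would not need to change.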
I also included GCC output in the Godbolt link, with and without -maccumulate-outgoing-args, which defers clearing the stack until the end of the function.
By default (without -maccumulate-outgoing-args) gcc does let ESP bounce around, and even uses 2x pop to clear 2 args from the stack. (Avoiding a stack-sync uop, at the cost of 2 useless loads that hit in L1d cache.) With 3 or more args to clear, gcc uses add esp, 4*N. I suspect that reusing the arg-passing space with mov stores instead of add esp / push would sometimes be a win for overall performance, especially with registers instead of immediates. (push imm8 is much more compact than mov imm32.)
foo(int): # gcc7.3 -O3 -m32 output
push ebx
sub esp, 20
mov ebx, DWORD PTR [esp+28] # load the arg even though we never need it in a register
push 2 # first function arg
call f1(int)
pop eax
pop edx # clear the stack
push 4
push 3 # and write the next two args
call f2(int, int)
mov DWORD PTR [esp+32], ebx # store `a` back where it already was
add esp, 24
pop ebx
jmp f1(int) # and tailcall
With -maccumulate-outgoing-args, the output is basically like clang, but gcc still saves/restores ebx and keeps a in it before doing a tailcall.
Note that having ESP bounce around requires extra metadata in .eh_frame for stack unwinding. Jan Hubicka writes in 2014:
There are still pros and cons of arg accumulation. I did quite extensive
testing on AMD chips and found it performance neutral. On 32bit code it saves
about 4% of code but with frame pointer disabled it expands unwind info quite a
lot, so resulting binary is about 8% bigger. (This is also current default for -Os)
So that's a 4% code-size saving (in bytes; this matters for L1i cache footprint) from using push for args and, at least typically, clearing them off the stack after each call. I think there's a happy medium here: gcc could use more push without using just push/pop.
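For the tiny foo in this answer the byte counts can be tallied by hand. The sketch below is a rough tally of the clang and gcc listings shown earlier, assuming the shortest standard 32-bit encodings and rel32 forms for call and jmp. The per-instruction sizes are my assumption, not assembler output, and the struct and function names are hypothetical.

```c
#include <stddef.h>

/* One instruction: its text, its assumed encoded size in bytes, and whether it
   exists only to save, restore, or use ebx in the gcc output. */
struct insn {
    const char *text;
    size_t bytes;
    int ebx_only;
};

static const struct insn clang_foo[] = {
    {"sub esp, 12",              3, 0},
    {"mov dword ptr [esp], 2",   7, 0},
    {"call f1",                  5, 0},
    {"mov dword ptr [esp+4], 4", 8, 0},
    {"mov dword ptr [esp], 3",   7, 0},
    {"call f2",                  5, 0},
    {"add esp, 12",              3, 0},
    {"jmp f1",                   5, 0},
};

static const struct insn gcc_foo[] = {
    {"push ebx",                 1, 1},
    {"sub esp, 20",              3, 0},
    {"mov ebx, [esp+28]",        4, 1},
    {"push 2",                   2, 0},
    {"call f1",                  5, 0},
    {"pop eax",                  1, 0},
    {"pop edx",                  1, 0},
    {"push 4",                   2, 0},
    {"push 3",                   2, 0},
    {"call f2",                  5, 0},
    {"mov [esp+32], ebx",        4, 1},
    {"add esp, 24",              3, 0},
    {"pop ebx",                  1, 1},
    {"jmp f1",                   5, 0},
};

/* Sum the assumed sizes, optionally skipping the ebx-only instructions. */
size_t total_bytes(const struct insn *code, size_t count, int skip_ebx_only) {
    size_t sum = 0;
    for (size_t i = 0; i < count; i++) {
        if (skip_ebx_only && code[i].ebx_only) continue;
        sum += code[i].bytes;
    }
    return sum;
}

size_t clang_total(void) {
    return total_bytes(clang_foo, sizeof clang_foo / sizeof clang_foo[0], 0);
}
size_t gcc_total(void) {
    return total_bytes(gcc_foo, sizeof gcc_foo / sizeof gcc_foo[0], 0);
}
size_t gcc_total_without_ebx(void) {
    return total_bytes(gcc_foo, sizeof gcc_foo / sizeof gcc_foo[0], 1);
}
```

With these sizes the push-based gcc listing comes to 39 bytes against 43 for clang's mov-based one, even though 10 of gcc's bytes go to the unneeded ebx handling. Dropping those would leave 29, assuming the sub/add constants are adjusted to keep alignment without changing their encoded size.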
There's a confounding effect of maintaining 16-byte stack alignment before call, which is required by the current version of the i386 System V ABI. In 32-bit mode, it used to just be a gcc default to maintain -mpreferred-stack-boundary=4 (i.e. 1<<4). I think you can still use -mpreferred-stack-boundary=2 to violate the ABI and make code that only cares about 4B alignment for ESP.
I didn't try this on Godbolt, but you could.
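The alignment bookkeeping explains the otherwise odd constants in both listings. As a minimal sketch, the helper below replays the ESP adjustments from each listing and checks ESP at every call, assuming ESP+4 is 16-byte aligned on entry to foo (the caller aligned ESP, then call pushed a 4-byte return address). The step encoding and all names are hypothetical, and the last sequence is an invented counter-example, not compiler output.

```c
#include <stddef.h>

enum op { OP_PUSH, OP_POP, OP_SUB_ESP, OP_ADD_ESP, OP_CALL };

struct step {
    enum op op;
    int bytes;   /* only meaningful for OP_SUB_ESP / OP_ADD_ESP */
};

/* Returns 1 if ESP is a multiple of `boundary` at every call, else 0.
   ESP is tracked relative to an aligned address; it starts at -4 because
   the return address has already been pushed when foo begins. */
int aligned_at_every_call(const struct step *steps, size_t count, int boundary) {
    int esp = -4;
    for (size_t i = 0; i < count; i++) {
        switch (steps[i].op) {
        case OP_PUSH:    esp -= 4; break;
        case OP_POP:     esp += 4; break;
        case OP_SUB_ESP: esp -= steps[i].bytes; break;
        case OP_ADD_ESP: esp += steps[i].bytes; break;
        case OP_CALL:
            if (esp % boundary != 0) return 0;
            break;
        }
    }
    return 1;
}

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* clang: sub esp,12 / call / call / add esp,12 (mov stores do not move ESP) */
static const struct step clang_steps[] = {
    {OP_SUB_ESP, 12}, {OP_CALL, 0}, {OP_CALL, 0}, {OP_ADD_ESP, 12},
};

/* gcc: push ebx / sub esp,20 / push / call / pop / pop / push / push / call */
static const struct step gcc_steps[] = {
    {OP_PUSH, 0}, {OP_SUB_ESP, 20}, {OP_PUSH, 0}, {OP_CALL, 0},
    {OP_POP, 0}, {OP_POP, 0}, {OP_PUSH, 0}, {OP_PUSH, 0}, {OP_CALL, 0},
};

/* Hypothetical: push one arg and call with no padding at all. */
static const struct step bare_steps[] = { {OP_PUSH, 0}, {OP_CALL, 0} };

int clang_is_aligned(int boundary) {
    return aligned_at_every_call(clang_steps, COUNT(clang_steps), boundary);
}
int gcc_is_aligned(int boundary) {
    return aligned_at_every_call(gcc_steps, COUNT(gcc_steps), boundary);
}
int bare_push_is_aligned(int boundary) {
    return aligned_at_every_call(bare_steps, COUNT(bare_steps), boundary);
}
```

Both real listings pass the 16-byte check. The bare push / call sequence fails it but passes with a 4-byte boundary, which is the kind of code that only caring about 4B alignment would permit.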
[...] ret 0x10, the ret instruction will adjust the esp register. So check the subroutine machine code, which kind of ret it does use. EDIT: Or if this is book without code of subroutine, then pay attention to the calling convention definition, it may be defined as to use the ret imm16 to adjust the stack in the subroutine. I can't recall which platform exactly does use this one (some windows?), but I really don't like it, luckily on linux the calling convention is different, so I don't care. – Trovillion

cdecl means caller clean up. Callee clean up is called stdcall or pascal (depending on order). – Heckelphone

[...] cdecl, the callee ends with a normal ret. That is the 32-bit Linux calling convention, probably compiled by gcc with -maccumulate-outgoing-args (which defers clearing the stack until the end of the function), and avoids using push even for the initial growth. This used to be a good thing, but is now just a waste of code-size and instructions. – Demetri