What C/C++ compiler can use push/pop instructions for creating local variables, instead of just adjusting esp once?

I believe push/pop instructions will result in more compact code, and maybe will even run slightly faster. This requires disabling stack frames as well, though.

To check this, I will need to either rewrite a large enough program in assembly by hand (to compare them), or to install and study a few other compilers (to see if they have an option for this, and to compare the results).

Here is the forum topic about this and similar problems.

In short, I want to understand which code is better. Code like this:

sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c

or code like this:

push eax
push ecx
push edx
...
add esp, c

What compiler can produce the second kind of code? They usually produce some variation of the first one.
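
For example, a tiny C function (a made-up illustration, not the code this post originally had) where the compiler has to spill three values to the stack:

int use3(int *p, int *q, int *r);   /* some external function */

void example(void) {
    int x = 1, y = 2, z = 3;        /* taking their addresses forces stack slots */
    use3(&x, &y, &z);
}

A compiler could materialize x, y, and z with three push instructions instead of a sub esp followed by three mov stores.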

Aged answered 26/3, 2018 at 6:42 Comment(25)
Your belief (that push or pop instructions are better) could be wrong. Did you actually benchmark? BTW, what compiler flags have you used? Did you enable optimizations (which ones, how)? Also, that could depend upon calling conventions and ABIs. – Lizzettelizzie
To check this, I will need to either rewrite a large enough program in assembly by hand (to compare them), or to install and study a few other compilers (to see if they have an option for this, and to compare the results). – Aged
Don't comment your own question, but do edit your question to improve it. – Lizzettelizzie
@BasileStarynkevitch Compiler flags are standard: Optimize for Speed and Disable Stack Frame. I don't know the VS2005 compiler flags well enough. Oh, also I changed the entry point to void main() instead of int WinMain(int,int,int,int), trying to cut the runtime out. For some reason it saves the edi register before using it; I'm not sure if the OS really needs it after exit. – Aged
Show some minimal reproducible example in your question, the exact compilation command, and the actual benchmark results. Without them your question is unclear and too broad (so meaningless). Be sure to have a useful benchmark (at least one second of CPU time), and run it several times. I think that your belief (that push is better) is wrong, or incredibly naive. Lastly, cache misses matter much more. – Lizzettelizzie
The added code is not minimal, and it is Windows-specific. Consider writing a simpler program (with a loop repeated many times, so that the run time is more than half a second) that uses standard C and is repeatable on other systems and compilers (e.g. my Linux/Debian desktop). – Lizzettelizzie
Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, which gums up a super-scalar core. Changing the register just once and then using MOV gives much better odds for parallelizing and re-ordering options. – Vansickle
@HansPassant Thanks. Does it matter for single-threaded programs? Can I get more info on that? – Aged
You're using a compiler that doesn't know anything about 99% of the CPUs currently in PCs, and you are wondering about micro-efficiencies? That makes zero sense. Any modern CPU is highly aware of the existence of the stack, and any modern compiler is highly aware of how modern CPUs actually handle the stack behind the scenes. In particular, instruction reordering will be a pain if ESP isn't stable. – Shipload
It does; this parallelization is for one instruction stream and is unrelated to threading. A processor has multiple execution units that are capable of executing the same instruction. Haswell, for example, has 4 units that can do a comparison at the same time. The Intel processor manuals are a good resource; there is one dedicated to optimization (Intel 64 and IA-32 Architectures Optimization Reference Manual). Pro tip: avoid praising Windows or VS in the [C++] tag, it upsets a lot of Linux programmers. The [visual-c++] tag is an alternative. – Vansickle
@HansPassant Hm, you mean data interdependency? There is an issue if I modify esp and then access [esp+8] at the very next step? But that's not a problem if I fit everything I need in the registers, and use this [esp+something] addressing rarely enough? – Aged
@Alefun999Sss - If you believe that your compiler doesn't generate good enough code, one apparent option would be to upgrade to something 10+ years newer. And if you code for a 32-bit "runs everywhere" program, it seems odd to try to micro-optimize when you don't know the exact target system. – Degenerate
"(works on every potato in the world)" --> most potatoes (billions per year) in 2018 are embedded processors that are neither x86 nor Windows. – Escent
I love this kind of talk, but visit the forum link if you want to continue. – Aged
@HansPassant Did you forget about the stack engine, which resolved the dependency problems coming from push and pop? It has been there since Sandy Bridge! – Jenine
@fuz: The stack engine was new in Pentium-M, so even the OP's decade-old compiler may know that push/pop are efficient on newer CPUs. But compilers typically lag behind CPUs in what they tune for. (This is why you ideally want to be using a compiler newer than your hardware, in general.) – Panthea
@Alefun999Sss: See this answer about how a single Haswell CPU core finds/exploits instruction-level parallelism in a single thread, while maintaining the illusion of running the instructions in program order (i.e. superscalar out-of-order execution), with a CPU block diagram and links to more details. – Panthea
@PeterCordes The hardware return stack was new in Pentium-M. The stack engine was new in Sandy Bridge. Please re-read the post. – Jenine
@fuz: The stack-pointer renaming engine is still present in Sandybridge, having been introduced in Pentium-M. In Agner's instruction tables, Pentium-M decodes pop reg to a single uop which runs only on the load port, so it definitely has a stack engine. Pentium II/III decode it to 2 uops: 1 ALU and 1 load. I was pretty sure it was P-M, and mostly dug up that link for a description of what the stack engine did (for other readers like Hans Passant who don't know about it). I shouldn't have implied that that link showed it was new in P-M. – Panthea
@fuz: I edited my answer on #36632076 to include that example and explain it better. So now it does show that the stack engine was new in P-M :) – Panthea
@BasileStarynkevitch: I think there's a real and interesting question here, buried under all the cruft. Any one of the examples from my answer would be much better than the random selection of C functions in the question. It was easy to think of some cases myself where compilers could do a better job. (I haven't tried to benchmark any real cases; it's mostly a minor code-size effect, which is hard to microbenchmark. Hard to test without first teaching a compiler how to do this.) – Panthea
There is nothing in your code which needs any form of micro-optimization. – Manville
I deleted that code; it only confuses instead of helping. I only added it because some guy was asking for an example, even though this applies to any code ever. And for a benchmark, even though I told him that this requires rewriting a compiler to properly test. – Aged
A proper example would be a C function, the actual compiler asm output, and your suggested compiler asm output, not just a bunch of random C functions without any discussion of how they compiled! And no, it doesn't apply to every function. Some don't need to spill anything to the stack, and only use registers. – Panthea
There is still huge room for improvement in the question: how exactly are you suggesting compilers should use PUSH and POP? I answered with the missed optimization of using PUSH for the initial creation of locals, but how / when are you suggesting they use POP? And why would you need to disable stack frames? This actually works better in some cases if you make EBP / RBP a frame pointer. I upvoted only because it doesn't deserve to be -7; it's still a poor question which doesn't explain what it's asking very well. – Panthea

You're right, push is a minor missed optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly some performance, to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.

But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!

Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, which gums up a super-scalar core. (Hans Passant)

This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.
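
For example, here's the kind of save/restore code compilers already emit (a sketch; which register gets saved depends on the function):

    push    rbx             # 1 byte; no ALU uop on CPUs with a stack engine
    ...                     # function body that clobbers rbx
    pop     rbx             # 1 byte
    ret

versus the mov-based equivalent they avoid on modern CPUs:

    sub     rsp, 8          # 4 bytes
    mov     [rsp], rbx      # 4 bytes, and an explicit ALU op on RSP above
    ...
    mov     rbx, [rsp]
    add     rsp, 8
    ret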

Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)

It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.

On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)

It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store / load bandwidth to L1d.


int extfunc(int *,int *);

void foo() {
    int a=1, b=2;
    extfunc(&a, &b);
}

compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many different missed optimizations and silly code-gen (see my comments in the C source pointing out some of them):

 # compiled for the x86-64 System V calling convention: 
 # integer args in rdi, rsi  (,rdx, rcx, r8, r9)
    push    rax               # clang / ICC ALREADY use push instead of sub rsp,8
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1      # 6 bytes: opcode + modrm + imm32
    mov     rsi, rsp                # special case for lea rsi, [rsp + 0]
    mov     dword ptr [rsi], 2
    call    extfunc(int*, int*)
    pop     rax                     # and POP instead of add rsp,8
    ret

And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention which reserves shadow space instead of having a red-zone).
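
A sketch of why MSVC's frame is bigger (the sizes here are illustrative, not MSVC's actual output): on Windows x64, a caller must reserve 32 bytes of shadow space above the return address for any function it calls, on top of its own locals.

    sub     rsp, 56           # e.g. locals + 32B shadow space + alignment padding
    ...
    call    extfunc
    ...
    add     rsp, 56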

clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.

But anyway, more optimal would be:

    push    2                       # only 2 bytes
    lea     rdi, [rsp + 4]
    mov     dword ptr [rdi], 1
    mov     rsi, rsp                # special case for lea rsi, [rsp + 0]
    call    extfunc(int*, int*)
      # ... later accesses would use [rsp] and [rsp+4] if needed, not pop
    pop     rax                     # alternative to add rsp,8
    ret

The push is an 8-byte store, and we overlap half of it. This is not a problem: CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem, and in fact glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), to load everything then store everything without caring about whether or not there's overlap.
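
As a sketch of that technique (my own simplified version, not glibc's actual code; only valid for 8 <= n <= 16):

#include <string.h>
#include <stdint.h>

/* copy 8..16 bytes using two possibly-overlapping 8-byte accesses */
static void copy_8_to_16(char *dst, const char *src, size_t n) {
    uint64_t first, last;
    memcpy(&first, src, 8);              /* load both halves first... */
    memcpy(&last, src + n - 8, 8);
    memcpy(dst, &first, 8);              /* ...then store both; overlap is harmless */
    memcpy(dst + n - 8, &last, 8);
}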

Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.

We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)

I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely. (The call pushed an 8-byte return address, so on function entry RSP is 8 bytes away from a 16-byte boundary; one 8-byte push restores alignment.)

I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)
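
A sketch of what such merging would cost, versus the push / mov version (register choices are arbitrary; assume the two 32-bit values live in edi and esi):

    # merging in a register: three extra instructions just to save one store
    mov     eax, edi        # zero-extend the low half into rax
    shl     rsi, 32         # move the high half into place (clobbers rsi)
    or      rax, rsi
    push    rax

    # versus the simple 2-store version:
    push    rdi             # low 4 bytes are the value; upper 4 are garbage
    mov     [rsp+4], esi    # overwrite the garbage upper half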

If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.
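
For instance (a sketch, assuming a 40-byte frame where only two 8-byte values are known at function entry):

    push    rsi             # two useful stores that also allocate space
    push    rdi
    sub     rsp, 24         # reserve the remaining 24 bytes with one sub,
                            # not three dummy pushes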

We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)


An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to reserve 8 + 16*1 = 24 bytes, so we could do

bar3:
    push   3
    push   2               # don't interleave mov in here; extra stack-sync uops
    push   1
    mov    rdi, rsp
    lea    rsi, [rsp+8]
    lea    rdx, [rdi+16]         # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
    call   extfunc3(int*,int*,int*)
    add    rsp, 24
    ret

This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign-extended imm8 form of mov.

push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)
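
For reference, the machine-code bytes behind those size claims (64-bit mode; no REX prefixes needed for these):

    6A 02                       push 2                    ; 2 bytes: opcode + imm8
    C7 44 24 08 01 00 00 00     mov dword ptr [rsp+8], 1  ; 8 bytes: SIB byte needed
    C7 45 08 01 00 00 00        mov dword ptr [rbp+8], 1  ; 7 bytes: RBP base needs no SIB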

One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)

However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted, we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.


A tiny array of long, with non-constant contents

int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
    long arr[] = {a,b,c};
    ext_longarr(arr);
}

gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:

longarr_arg(long, long, long):                     # @longarr_arg(long, long, long)
    sub     rsp, 24
    mov     rax, rsp                 # this is clang being silly
    mov     qword ptr [rax], rdi     # it could have used [rsp] for the first store at least,
    mov     qword ptr [rax + 8], rsi   # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
    mov     qword ptr [rax + 16], rdx
    mov     rdi, rax
    call    ext_longarr(long*)
    add     rsp, 24
    ret

But it could have stored an array of the args like this:

longarr_arg_handtuned:
    push    rdx
    push    rsi
    push    rdi                 # leave stack 16B-aligned
    mov     rdi, rsp
    call    ext_longarr(long*)
    add     rsp, 24
    ret

With more args, we start to get more noticeable benefits especially in code-size when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.


Reasons for not building stack frames gradually

(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.

But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.

This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.

Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.

If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red zone, with a different offset).
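
Spelled out, that AMD-friendly pattern might look like this (a sketch; the offset changes because each push moves RSP down by 8):

    push    rbx                   # useful store, RSP -= 8
    ...
    mov     [rsp-32], eax         # spill into the red zone below RSP
    ...
    push    rcx                   # RSP -= 8; the earlier spill is now at [rsp-24]
    imul    ecx, [rsp-24], 12345  # reload the spill from what's still the red zone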

Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.

"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.

Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack layout choices goes.

Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.

Panthea answered 27/3, 2018 at 1:27 Comment(10)
Have you benchmarked to assess your claim that your "more optimal" code is really faster? You could get surprises. Notice that the x86-64 ABI passes several arguments through registers (not on the stack by push-ing them), and there is a reason for that. – Lizzettelizzie
@BasileStarynkevitch: Of course it's faster to keep things in registers. I'm only talking about replacing the mov insns you would have used. I haven't benchmarked push myself (or this way of using it), but I have benchmarked using pop to iterate over an array for code-golf Fibonacci (1000-digit extended-precision add). It's faster than lodsd, which is 2 uops on Skylake, and perf counters show only the expected occasional extra stack-sync uop when the internal offset in the stack engine overflows. – Panthea
There is pretty solid evidence to back up Agner Fog's numbers and micro-arch guide, which I'm basing my reasoning on. I did mention in the answer that too much mixing of push and [rsp+x] addressing modes will cause extra stack-sync uops. I'm not suggesting using pop as part of this, only doing the first stores to the stack frame using push as far as is worth it. Inside loops you'd certainly just use mov, not pop / push. – Panthea
Thank you, now I don't have to check every compiler in the world. – Aged
That's quite extensive work you've done here, @peter. Is it original or have you already done that research previously? – License
@YSC: I didn't have to look up any of the performance background details (except to find links to put in the question, since unfortunately x86 performance details are not well known, and people often don't realize that the old stuff they've read is no longer current), but yeah, I just read the OP's forum-thread link to figure out WTF they were talking about (the question is terrible), and came up with the examples where it would help. – Panthea
@BasileStarynkevitch: I made some edits to improve this answer and justify why push is not a performance risk (when used just at the start of a function like I'm suggesting, not using pop inside). Compilers already make significant use of them, and have tuning options not to use them, but choose to use them for modern CPUs. Even more convincing: ICC and clang already use a dummy push instead of sub rsp,8, so in my first example I'm simply replacing a dummy register with an imm8 and doing one less store! Obviously there could be a hidden perf pothole, but I've tested overlap – Panthea
@BasileStarynkevitch: in 32-bit mode, gcc already defaults to generating code with lots of push/pop inside the function body. See #49514207. That's not evidence that it's actually optimal, though! But Jan Hubicka made that the default after testing that it was at least perf-neutral on AMD, and made smaller code-size (by 4%). The main tradeoff is larger .eh_frame sections (by 8%). – Panthea
Utilizing this would also do away with any practical need for the SysV ABI red zone, I believe. If pushes with the stack engine undisturbed are as time-wise cheap as movs to the red zone, and my benchmark says they are (plus there's the rather large code-shrinkage benefit), then the red zone has no point besides being annoying in some contexts. – Upbear
@PetrSkocik: The red zone would still have some uses, like for scratch arrays, e.g. for something like strspn where you make a bool array of accept/reject characters as you loop over the accept-set string, and then use it while looping over the other string. (Not a perfect example; if you use a 256-byte array you still need to reserve some space, although you could use a 32-byte bitmap.) Or for multiple zero-initialized variables, where one or two 32-byte vmovdqu ymm stores can zero more efficiently than 4 to 8 qword pushes. – Panthea

This requires disabling stack frames as well though.

It doesn't, actually. Simple stack-frame initialisation can use either enter or push ebp \ mov ebp, esp \ sub esp, x (or instead of the sub, a lea esp, [ebp - x] can be used). Instead of, or in addition to, these, values can be pushed onto the stack to initialise the variables, or any random register can be pushed just to move the stack pointer without initialising to any particular value.

Here's an example (for 16-bit 8086 real/V 86 Mode) from one of my projects: https://bitbucket.org/ecm/symsnip/src/ce8591f72993fa6040296f168c15f3ad42193c14/binsrch.asm#lines-1465

save_slice_farpointer:
[...]
.main:
[...]
    lframe near
    lpar word,  segment
    lpar word,  offset
    lpar word,  index
    lenter
    lvar word,  orig_cx
     push cx
    mov cx, SYMMAIN_index_size
    lvar word,  index_size
     push cx
    lvar dword, start_pointer
     push word [sym_storage.main.start + 2]
     push word [sym_storage.main.start]

The lenter macro sets up (in this case) only push bp \ mov bp, sp and then lvar sets up numeric defs for offsets (from bp) to variables in the stack frame. Instead of subtracting from sp, I initialise the variables by pushing into their respective stack slots (which also reserves the stack space needed).
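
Roughly, that expands to something like this (my sketch of the effective code, not the macros' literal output):

    push bp                                  ; lenter: link the stack frame
    mov  bp, sp
    push cx                                  ; lvar word, orig_cx: slot at [bp - 2]
    mov  cx, SYMMAIN_index_size
    push cx                                  ; lvar word, index_size: slot at [bp - 4]
    push word [sym_storage.main.start + 2]   ; lvar dword, start_pointer:
    push word [sym_storage.main.start]       ;  dword slot at [bp - 8]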

Aerostatic answered 23/7, 2019 at 22:2 Comment(0)
