Tiny C Compiler's generated code emits extra (unnecessary?) NOPs and JMPs

Asked 11/2, 2018 at 23:47 Answered 12/2, 2018 at 6:42

Solved c assembly x86 compiler-optimization tcc

Can someone explain why this code:

#include <stdio.h>

int main()
{
  return 0;
}

when compiled with tcc using tcc code.c produces this asm:

00401000  |.  55               PUSH EBP
00401001  |.  89E5             MOV EBP,ESP
00401003  |.  81EC 00000000    SUB ESP,0
00401009  |.  90               NOP
0040100A  |.  B8 00000000      MOV EAX,0
0040100F  |.  E9 00000000      JMP fmt_vuln1.00401014
00401014  |.  C9               LEAVE
00401015  |.  C3               RETN

I guess that

00401009  |.  90   NOP

is maybe there for some memory alignment, but what about

0040100F  |.  E9 00000000     JMP fmt_vuln1.00401014
00401014  |.  C9              LEAVE

I mean why would compiler insert this near jump that jumps to the next instruction, LEAVE would execute anyway?

I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.

Wives answered 11/2, 2018 at 23:47 Comment(15)

It's actually the opposite of optimization, as you didn't turn it on. EDIT: apparently you can't control optimization with tcc. Ugh. Well, it's supposed to be tiny, not generating efficient code :) – Spectrograph 11/2, 2018 at 23:49

TCC isn't (and shouldn't be) used for highly optimized building. It doesn't do much optimizations. – Natalya 11/2, 2018 at 23:53

@Spectrograph well it would be tinier without it and all the unecessary 00 bytes at the end.. – Wives 11/2, 2018 at 23:54

@Someprogrammerdude i know, i'm just curious what's the point of this – Wives 11/2, 2018 at 23:56

It is tcc that is tiny and fast, not the code it produces. Tcc is good for compiling things which will not be executed enough times to make it worth spending time on generating good code. – Vendible 12/2, 2018 at 0:8

@MichaelPetch Win10 x64 – Wives 12/2, 2018 at 0:20

Strange I don't see this kind of code generated on Windows – Tease 12/2, 2018 at 0:22

Ok i examined it and saw that both .text and .data sections are 0x200 bytes long, so i guess i answered my 2nd question about all the extra 00s – Wives 12/2, 2018 at 0:23

@MichaelPetch that's weird, all my tcc programs have that jump-leave thing at functions epilogues – Wives 12/2, 2018 at 0:25

@MichaelPetch v0.9.26 – Wives 12/2, 2018 at 0:35

I installed 0.9.26 and see what you are referring to. Seems the jmp has been eliminated in 0.9.27 – Tease 12/2, 2018 at 0:41

is there win binary? – Wives 12/2, 2018 at 0:44

main() in general is not a function you should examine for this kind of thing, as compilers can/will look for it and add extra stuff. For these tests use some other non-standard name like test or fun or foo or bar. And tcc is not a compiler to really be looking at output of other than is it functional, not is it efficient. – Levona 12/2, 2018 at 1:0

did the code produced functionally match the high level language? – Levona 12/2, 2018 at 1:0

tcc and optimization are contradictory. Use an optimizing compiler (e.g. gcc -O2 with GCC) if you want optimizations – Commissionaire 12/2, 2018 at 6:39

Superfluous JMP before the Function Epilogue

The JMP at the bottom that goes to the next statement, this was fixed in a commit. Version 0.9.27 of TCC resolves this issue:

When 'return' is the last statement of the top-level block (very common and often recommended case) jump is not needed.

As for the reason it existed in the first place? The idea is that each function has a possible common exit point. If there is a block of code with a return in it at the bottom, the JMP goes to a common exit point where stack cleanup is done and the ret is executed. Originally the code generator also emitted the JMP instruction erroneously at the end of the function too if it appeared just before the final } (closing brace). The fix checks to see if there is a return statement followed by a closing brace at the top level of the function. If there is, the JMP is omitted

An example of code that has a return at a lower scope before a closing brace:

int main(int argc, char *argv[])
{
  if (argc == 3) {
      argc++;
      return argc;
  }
  argc += 3;
  return argc;
}

The generated code looks like:

  401000:       55                      push   ebp
  401001:       89 e5                   mov    ebp,esp
  401003:       81 ec 00 00 00 00       sub    esp,0x0
  401009:       90                      nop
  40100a:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
  40100d:       83 f8 03                cmp    eax,0x3
  401010:       0f 85 11 00 00 00       jne    0x401027
  401016:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
  401019:       89 c1                   mov    ecx,eax
  40101b:       40                      inc    eax
  40101c:       89 45 08                mov    DWORD PTR [ebp+0x8],eax
  40101f:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  ; Jump to common function exit point. This is the `return argc` inside the if statement
  401022:       e9 11 00 00 00          jmp    0x401038

  401027:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]
  40102a:       83 c0 03                add    eax,0x3
  40102d:       89 45 08                mov    DWORD PTR [ebp+0x8],eax
  401030:       8b 45 08                mov    eax,DWORD PTR [ebp+0x8]

  ; Jump to common function exit point. This is the `return argc` at end of the function 
  401033:       e9 00 00 00 00          jmp    0x401038

  ; Common function exit point
  401038:       c9                      leave
  401039:       c3                      ret

In versions prior to 0.9.27 the return argc inside the if statement would jump to a common exit point (function epilogue). As well the return argc at the bottom of the function also jumps to the same common exit point of the function. The problem is that the common exit point for the function happens to be right after the top level return argcso the side effect is an extra JMP that happens to be to the next instruction.

NOP after Function Prologue

The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (Programs that are in Portable Executable format) TCC has two types of prologues. If the local stack space required < 4096 (smaller than a single page) then you see this kind of code generated:

401000:       55                      push   ebp
401001:       89 e5                   mov    ebp,esp
401003:       81 ec 00 00 00 00       sub    esp,0x0

The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction changes to coincide with the amount of stack space needed for local variables. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:

char somearray[4096]

and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:

401000:       b8 00 10 00 00          mov    eax,0x1000
401005:       e8 d6 00 00 00          call   0x4010e0

TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single pass compiler. The compiler doesn't know how much stack space a function will use until after the function is processed. To get around not knowing this ahead of time, TCC pre-allocates 10 bytes for the prologue to fit the largest method. Anything shorter is padded to 10 bytes.

In the case where stack space needed < 4096 bytes the instructions used total 9 bytes. The NOP is used to pad the prologue to 10 bytes. For the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.

Tease answered 12/2, 2018 at 2:24 Comment(3)

Might be worth summarizing the comments in your answer: TCC is not an optimizing compiler. The JMP is not the only silly thing here, almost all the rest of the instructions here are ridiculous. e.g. it doesn't even look for the peephole optimization of xor-zeroing, not to mention a useless sub esp,0 and even a NOP! Obviously it doesn't need to make / destroy a stack frame, but there are valid arguments for that even in optimized code. – Rosenwald 12/2, 2018 at 4:21

@PeterCordes I could discuss all those things but that is all secondary to the JMP instruction which was asked for.If you wish to summarize the other stuff in another answer be my guest. TCC does in fact do some level of optimization. Not usually on instructions but it will attempt to inline code in the same module among other things. To say it doesn't optimize at all would be incorrect. – Tease 12/2, 2018 at 4:24

Ok, thanks. I haven't looked at TCC or its asm output other than this question. – Rosenwald 12/2, 2018 at 4:25

TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.

This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly come from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other function need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.

(In fact, TCC is truly one pass, laying out machine code as it does through the C statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)

Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?

The JMP (as you can see from @MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.

There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)

@MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.

An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instruction. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.

main:
    xor eax,eax
    ret

Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)

The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.

However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).

Rosenwald answered 12/2, 2018 at 6:42 Comment(7)

The NOP is not for alignment. – Tease 12/2, 2018 at 6:45

@MichaelPetch: I didn't say it was. I just said that alignment is not a plausible explanation for it being there, since the question was speculating about that as a reason. edited to make it cleared that I don't think TCC intended it for alignment purposes either. – Rosenwald 12/2, 2018 at 6:46

Yes, but you happen to mention possible uses for the NOP. There is a late addition to my answer that explains the reason why that NOP is there. TCC assumes that the function prologue is maximum 10 bytes (it doesn't know how much space is needed until after the function is parsed). There are two function prologues.One is 9 bytes, the other 10 (one is for the case where < 4096 bytes are needed and the other for >= 4096). The 9 byte version uses a NOP to pade to 10 bytes. And yes I am the upvote. Just pointing out why that NOP is there in TCC. – Tease 12/2, 2018 at 6:49

With Win32 PE/Win64PE when you need >= 4096 (a page) of stack space you need to touch each stack page one at a time (in order) to prevent the stack guard pages causing a fault. – Tease 12/2, 2018 at 6:51

The reason for the 10 bytes is to account for the largest needed stack prologue so the space for the prologue instructions can be pre-allocated before the actual function is parsed. TCC is single pass so it doesn't know until after how much stack space is needed and which method to allocate space will be used. 10 bytes is the maximum size for function prologue code. Any method shorter than 10 bytes is padded out with NOP. The compiler/code gen goes back after and updates the prologue with the required method. – Tease 12/2, 2018 at 7:5

@MichaelPetch: ah, single-pass compilation explains it. Thanks. – Rosenwald 12/2, 2018 at 7:8

I'm surprised now at the down vote. There is reasonable info in general in this answer. – Tease 12/2, 2018 at 7:42

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Superfluous JMP before the Function Epilogue

NOP after Function Prologue

Recommended topics

Hot tags