TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly comes from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other functions need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one pass, laying out machine code as it goes through the C source statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 127 bytes of stack space, the most a sign-extended imm8 can encode. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
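To make the "remember one spot and fill it in later" idea concrete, here's a minimal sketch of one-pass emission with a backpatched imm32. The CodeBuf type and the emit8 / patch_le32 / compile_function names are hypothetical, made up for illustration; this is not TCC's actual code, and the "function body" is just a hard-coded mov eax,0 / leave / ret.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical output buffer; not TCC's real data structure. */
typedef struct {
    uint8_t bytes[64];
    size_t  len;
} CodeBuf;

/* Append one byte of machine code. */
void emit8(CodeBuf *cb, uint8_t b) {
    assert(cb->len < sizeof cb->bytes);
    cb->bytes[cb->len++] = b;
}

/* Overwrite 4 already-emitted bytes with a little-endian 32-bit value. */
void patch_le32(CodeBuf *cb, size_t at, uint32_t v) {
    assert(at + 4 <= cb->len);
    for (int i = 0; i < 4; i++)
        cb->bytes[at + i] = (uint8_t)(v >> (8 * i));
}

/* Emit a toy function in one pass. frame_size stands in for a value that
   is only known once the whole body has been seen.
   Returns the offset of the imm32 that was patched. */
size_t compile_function(CodeBuf *cb, uint32_t frame_size) {
    emit8(cb, 0x55);                      /* push ebp */
    emit8(cb, 0x89); emit8(cb, 0xE5);     /* mov ebp, esp */
    emit8(cb, 0x81); emit8(cb, 0xEC);     /* sub esp, imm32 */
    size_t hole = cb->len;                /* the one spot to remember */
    for (int i = 0; i < 4; i++)
        emit8(cb, 0x00);                  /* placeholder imm32 */

    emit8(cb, 0xB8);                      /* mov eax, 0 */
    for (int i = 0; i < 4; i++)
        emit8(cb, 0x00);
    emit8(cb, 0xC9);                      /* leave */
    emit8(cb, 0xC3);                      /* ret */

    /* End of function: the frame size is known now, so fill in the hole. */
    patch_le32(cb, hole, frame_size);
    return hole;
}
```

Calling compile_function with frame_size 0 still produces the full 6-byte sub esp, imm32: by the time the size is known, everything after it has already been laid out, so removing it would mean shifting bytes and fixing up offsets, which is exactly the work a one-pass design avoids.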
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from @MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
@MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
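Here's a rough sketch of why a fixed-size prologue slot leads to that NOP. The byte encodings of the individual instructions are standard x86, but the 10-byte probe sequence shown (mov eax, imm32 followed by a call to a probe helper that is assumed to set up the frame as well) is my reading of how the alternate prologue fits in 10 bytes, not something quoted from TCC's source. PROLOGUE_SIZE, fill_prologue and the >= 4096 threshold are hypothetical names / choices for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { PROLOGUE_SIZE = 10 };  /* hypothetical name for the fixed slot size */

/* Normal frame setup: 9 bytes of real work plus a 1-byte NOP as padding. */
const uint8_t normal_prologue[] = {
    0x55,                               /* push ebp */
    0x89, 0xE5,                         /* mov ebp, esp */
    0x81, 0xEC, 0x00, 0x00, 0x00, 0x00, /* sub esp, imm32 */
    0x90                                /* nop (pads 9 bytes up to 10) */
};

/* Assumed stack-probe variant: 5 + 5 bytes, no padding needed. */
const uint8_t probe_prologue[] = {
    0xB8, 0x00, 0x00, 0x00, 0x00,       /* mov eax, imm32 (frame size) */
    0xE8, 0x00, 0x00, 0x00, 0x00        /* call rel32 (left for a relocation) */
};

_Static_assert(sizeof normal_prologue == PROLOGUE_SIZE, "normal slot size");
_Static_assert(sizeof probe_prologue == PROLOGUE_SIZE, "probe slot size");

/* Fill the slot reserved at the top of the function once the frame size
   is known. Returns 1 if the probe variant was chosen, else 0.
   The one-page threshold comparison is an assumption. */
int fill_prologue(uint8_t slot[PROLOGUE_SIZE], uint32_t frame_size) {
    int probe = frame_size >= 4096;
    const uint8_t *tmpl = probe ? probe_prologue : normal_prologue;
    size_t imm_at = probe ? 1 : 5;      /* where the imm32 lives in each form */

    memcpy(slot, tmpl, PROLOGUE_SIZE);
    for (int i = 0; i < 4; i++)
        slot[imm_at + i] = (uint8_t)(frame_size >> (8 * i));
    return probe;
}
```

Either template drops into the same 10-byte hole, so the compiler never has to move the function body after the fact. The cost is that the common case carries a NOP it doesn't need.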
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instructions. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
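For a sense of how small this particular peephole is, here's a sketch that rewrites the 5-byte mov r32, 0 encoding into the 2-byte xor r32, r32. peephole_zero is a hypothetical helper written for this answer, not code from any real compiler, and it assumes 32-bit mode encodings with no prefixes. A real compiler also has to know that EFLAGS is dead at that point, because xor writes flags and mov doesn't; this sketch leaves that check to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* If insn is "mov r32, 0" (opcode B8+r followed by an all-zero imm32),
   write the equivalent "xor r32, r32" into out and return its length (2).
   Returns 0 if the pattern doesn't match.
   Caller must ensure flags are not live: xor clobbers EFLAGS. */
size_t peephole_zero(const uint8_t *insn, size_t len, uint8_t out[2]) {
    if (len < 5 || insn[0] < 0xB8 || insn[0] > 0xBF)
        return 0;                               /* not mov r32, imm32 */
    if (insn[1] | insn[2] | insn[3] | insn[4])
        return 0;                               /* immediate isn't zero */

    uint8_t r = (uint8_t)(insn[0] - 0xB8);      /* register number 0..7 */
    out[0] = 0x31;                              /* xor r/m32, r32 */
    out[1] = (uint8_t)(0xC0 | (r << 3) | r);    /* ModRM: mod=11, reg=r, rm=r */
    return 2;
}
```

Fed the bytes B8 00 00 00 00 (mov eax, 0), this produces 31 C0 (xor eax, eax), saving 3 bytes. TCC skips even this kind of local rewrite in favour of emitting code as fast as possible.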
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).
tcc. Ugh. Well, it's supposed to be tiny, not generating efficient code :) – Spectrograph
jmp has been eliminated in 0.9.27 – Tease
tcc and optimization are contradictory. Use an optimizing compiler (e.g. gcc -O2 with GCC) if you want optimizations – Commissionaire