On Linux, gcc and clang pad with 0x90 (NOP) to align functions. (Even the linker does this when linking .o files with sections of uneven size.)
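As a concrete sketch (NASM syntax; the function names and bodies are made up), this is the effect the compiler's alignment directives have on the gap between two functions:

```nasm
section .text

global func_a
func_a:
    mov  eax, 1
    ret                 ; func_a probably doesn't end at a 16-byte boundary

align 16, nop           ; fill the gap up to the next 16-byte boundary with 0x90 NOPs,
                        ; like the padding gcc/clang emit between functions
global func_b
func_b:
    mov  eax, 2
    ret
```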
There's usually no particular advantage, except maybe when the CPU has no branch prediction available for the RET instruction at the end of a function. In that case, NOPs don't get the CPU started on anything that takes time to recover from once the correct branch target is discovered.
The last instruction of a function might not be a RET; it might be an indirect JMP (e.g. a tail-call through a function pointer). In that case, branch prediction is more likely to fail. (CALL/RET pairs are specially predicted by a return-address stack. Note that RET is an indirect JMP in disguise: it's basically a jmp [rsp] plus an add rsp, 8 executed as a single instruction, without modifying FLAGS; see also What is the x86 "ret" instruction equivalent to?)
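A minimal sketch (NASM syntax; the symbol fp, the labels, and the bodies are all hypothetical) of the two kinds of function endings being compared:

```nasm
section .data
fp:     dq  normal_return     ; a function-pointer variable (contents made up)

section .text
tailcall_through_fp:          ; e.g. compiled from:  return fp(x);
    mov  rax, [rel fp]        ; load the function pointer
    jmp  rax                  ; indirect JMP: predicted from the BTB,
                              ; not the return-address stack

normal_return:
    xor  eax, eax
    ret                       ; RET is predicted by the return-address stack;
                              ; conceptually it's "jmp [rsp]" plus "add rsp, 8"
                              ; (FLAGS untouched), executed as one instruction
```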
The default prediction for an indirect JMP or CALL (when no Branch Target Buffer prediction is available) is to jump to the next instruction. (Apparently making no prediction and stalling until the correct target is known is either not an option, or the default prediction is usable enough for jump tables.)
If the default prediction leads to speculatively executing something that the CPU can't abort easily, like an FP sqrt or maybe something microcoded, this increases the branch-misprediction penalty. It's even worse if the speculatively executed instruction causes a TLB miss (triggering a hardware page walk) or otherwise pollutes the cache.
An instruction like INT 3 that just generates an exception can't have any of these problems: exceptions aren't taken until the instruction retires, so a speculatively-reached INT 3 doesn't start anything the CPU has to wait for or recover from. IIRC, it's recommended to place something like that after an indirect JMP if the next-instruction default prediction isn't useful.
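For illustration, here's a sketch (NASM syntax; the labels are made up, and the index in rax is assumed to be range-checked already) of a jump-table dispatch with an INT 3 parked on the fall-through path:

```nasm
section .text
dispatch:                     ; rax = case index, assumed already range-checked
    lea  rcx, [rel table]     ; base address of the jump table
    jmp  qword [rcx + rax*8]  ; indirect JMP through the table
    int3                      ; never architecturally executed; reached only by
                              ; the fall-through default prediction, and cheap
                              ; to throw away when the real target is known

case0: mov  eax, 10
       ret
case1: mov  eax, 20
       ret

section .rodata
table:  dq  case0, case1      ; 64-bit pointers to the case blocks
```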
With random garbage between functions, even pre-decoding the 16-byte block of machine code that includes the RET could slow down. Modern CPUs decode in parallel in groups of 4 instructions, so they can't react to a RET until the bytes after it have already been decoded as part of the same group. (This is different from speculative execution.) It could be useful to avoid slow-to-decode length-changing prefixes in the bytes after an unconditional branch (like RET), since an LCP stall there might delay decoding of the branch itself. (I'm not 100% sure this can happen on real CPUs; it's hard to measure, since you'd need a microbenchmark where the uop cache doesn't work and pre-decode is the bottleneck rather than the regular decoders.)
LCP stalls only affect Intel CPUs: AMD marks instruction boundaries in its L1 instruction cache and decodes in larger groups. (Intel instead uses a decoded-uop cache to get high front-end throughput without the power cost of actually decoding every iteration of a loop.)
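As a sketch of what a length-changing prefix looks like (NASM syntax; the operands are arbitrary), the classic case is a 16-bit immediate that needs a 66h operand-size prefix, which changes the length the pre-decoders initially assume:

```nasm
bits 64
section .text
lcp_examples:
    mov  word [rdi], 0x1234   ; 66 C7 07 34 12: imm16 with a 66h prefix -> LCP stall on Intel
    add  ax, 0x1234           ; 66 05 34 12: also an LCP case (66h prefix + imm16)
    mov  dword [rdi], 0x1234  ; imm32, no 66h prefix -> no LCP stall
    ret
```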
Note that in Intel CPUs, instruction-length finding happens in an earlier stage than actual decoding. For example, the Sandybridge front-end looks like this:

[diagram of the Sandybridge front-end pipeline]

(Diagram copied from David Kanter's Haswell write-up; I linked to his Sandybridge write-up, though. They're both excellent.)
See also Agner Fog's microarch pdf, and more links in the x86 tag wiki, for the details on what I described in this answer (and much more).
dec / jnz (do{}while(--i)) instead of dec / jg (do{}while(--i > 0)). I guess it would be "safer" to write code that might still work if a bit flipped in the counter, but apparently it's not necessary. (And of course, a flipped bit inside an out-of-order execution CPU is unlikely to simply flip a bit in the architectural state; more likely you'll get something more weird.) – Bilection
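For reference, a sketch (NASM syntax; the iteration count and labels are made up) of the loop idiom that comment is talking about:

```nasm
section .text
count_down:
    mov  ecx, 100             ; i = 100
.loop_top:
    ; ... loop body would go here ...
    dec  ecx                  ; --i: sets ZF when ecx hits zero (CF unaffected)
    jnz  .loop_top            ; do{}while(--i): loop while the counter is nonzero
    ret

    ; the do{}while(--i > 0) form would instead end with:
    ;   dec  ecx
    ;   jg   .loop_top        ; signed "greater than zero" check (ZF, SF, OF)
```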