Why Do Compilers Insert INT3 Instructions Between Subroutines?
While debugging some software I noticed INT3 instructions are inserted in between subroutines in many cases.

This is an example.

I assume these are not technically inserted 'between' functions but after them, in order to halt execution if a subroutine fails to execute retn at the end for whatever reason.

Are my assumptions correct? If not, what is the purpose of these instructions?

Nigro answered 19/10, 2016 at 9:40 Comment(2)
It most certainly is empty space between functions. A very basic x86 optimization is to have functions start at an address that is a multiple of 16. If you have to come up with some byte value to fill up the gaps, then 0xCC is by far the best choice: it catches the corner case of a program jumping into oblivion. – Endospore
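A minimal numeric sketch of that 16-byte alignment (the addresses here are made up for illustration):

```python
def align_up(addr: int, align: int = 16) -> int:
    """Round addr up to the next multiple of align (align must be a power of two)."""
    return (addr + align - 1) & ~(align - 1)

# Hypothetical layout: one function's code ends at 0x401013, so the next
# function starts at the next 16-byte boundary, 0x401020, and the 13-byte
# gap in between is filled with 0xCC (INT3) bytes.
end_of_func = 0x401013
next_func = align_up(end_of_func)
padding = bytes([0xCC]) * (next_func - end_of_func)

print(hex(next_func), len(padding))  # 0x401020 13
```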
It sometimes amazes me how billions of tiny transistors can work so reliably that it's safe to write loop conditions like dec / jnz (do{}while(--i)) instead of dec / jg (do{}while(--i > 0)). I guess it would be "safer" to write code that might still work if a bit flipped in the counter, but apparently it's not necessary. (And of course, a flipped bit inside an out-of-order execution CPU is unlikely to simply flip a bit in the architectural state; more likely you'll get something more weird.) – Bilection

On Linux, gcc and clang pad with 0x90 (NOP) to align functions. (Even the linker does this, when linking .o with sections of uneven size).

There's not usually any particular advantage, except maybe when the CPU has no branch-prediction for the RET instruction at the end of a function. In that case, NOP doesn't get the CPU started on anything that takes time to recover from when the correct branch target is discovered.


The last instruction of a function might not be a RET; it might be an indirect JMP (e.g. tail-call through a function pointer). In that case, branch prediction is more likely to fail. (CALL/RET pairs are specially predicted by a return stack. Note that RET is an indirect JMP in disguise; it's basically a jmp [rsp] and an add rsp, 8 (without modifying FLAGS), see also What is the x86 "ret" instruction equivalent to?).
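The pop-and-jump equivalence of RET can be modeled in a few lines (a toy model only, not real CPU behavior; one list slot stands in for 8 bytes of stack):

```python
def ret(stack, rsp):
    """Toy model of x86-64 RET: load the return address from [rsp],
    then free that stack slot (the 'add rsp, 8' half; one slot = 8 bytes here)."""
    target = stack[rsp]      # like jmp [rsp]: the jump target comes off the stack
    return target, rsp + 1   # new rip, new rsp

# A CALL would have pushed the return address before jumping:
stack = [0x401337]           # return address left by a hypothetical CALL
rip, rsp = ret(stack, 0)
print(hex(rip), rsp)         # 0x401337 1
```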

The default prediction for an indirect JMP or CALL (when no Branch Target Buffer prediction is available) is to jump to the next instruction. (Apparently making no prediction and stalling until the correct target is known is either not an option, or the default prediction is usable enough for jump tables.)

If the default prediction leads to speculatively executing something that the CPU can't abort easily, like an FP sqrt or maybe something microcoded, this increases the branch misprediction penalty. Even worse if the speculatively-executed instruction causes a TLB miss, triggering a hardware pagewalk, or otherwise pollutes the cache.

An instruction like INT 3 that just generates an exception can't have any of these problems. The CPU won't try to execute the INT before it should, so nothing bad will happen. IIRC, it's recommended to place something like that after an indirect JMP if the next-instruction default-prediction isn't useful.


With random garbage between functions, even pre-decoding the 16B block of machine code that includes the RET could slow down. Modern CPUs decode in parallel in groups of 4 instructions, so they can't detect a RET until after following instructions are already decoded. (This is different from speculative execution). It could be useful to avoid slow-to-decode Length-Changing-Prefixes in the bytes after an unconditional branch (like RET), since that might delay decoding of the branch. (I'm not 100% sure this can happen on real CPUs; it's hard to measure since you'd need to create a microbenchmark where the uop cache doesn't work and pre-decode is the bottleneck, not the regular decoders.)

LCP stalls only affect Intel CPUs: AMD marks instruction boundaries in their L1 cache, and decodes in larger groups. (Intel uses a decoded-uop cache to get high throughput without the power cost of actually decoding every time in a loop.)

Note that in Intel CPUs, instruction-length finding happens in an earlier stage than actual decoding. For example, the Sandybridge frontend looks like this:

David Kanter's SnB writeup

(Diagram copied from David Kanter's Haswell write-up. I linked to his Sandybridge writeup, though. They're both excellent.)

See also Agner Fog's microarch pdf, and more links in the x86 tag wiki, for the details on what I described in this answer (and much more).

Bilection answered 19/10, 2016 at 11:24 Comment(10)
You'd hope that speculative execution stops at the INT 3, so it blocks speculative execution of the next function. Executing the prologue of the next function is probably mostly harmless, but a waste. – Luby
@Peter Cordes, along a similar topic, why does branch prediction continue to decode instructions after an unconditional branching instruction is encountered (i.e. RET, JMP)? Many times you'd expect these bytes to be garbage and padding, and thus a waste of cycles. – Landgrabber
@byteptr: That's not branch prediction, that's just parallel decoding. Instruction-length marking happens in blocks of 16B in Intel CPUs, before any of the real decoders look at the block and detect an unconditional branch. – Bilection
@MSalters: Yes, IIRC INT is a serializing instruction, or at least will stop speculative execution. – Bilection
UD2 and INT3 are probably similar in stopping speculation. Agner Fog doesn't have much to say on this (that I could find), but from the Intel Optimization Reference Manual: "Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path." – Riviera
BTW, does this speculation only happen from the decoders? Once the branch has been decoded and cached in the uop cache, wouldn't it also have a (non-default) prediction? Also, I should add that UD2 is 2 bytes and INT3 is a single byte. – Riviera
@Olsonist: I expect that it's possible for an indirect branch to still be hot in the uop cache but have had its target-prediction data evicted from the BTB by other branches that are exactly 4k away or something. My guess is that Intel's manual should have said "stop the processor from speculating down the fall-through path", not "decoding". Thanks for digging up that quote, that's exactly what I was talking about. – Bilection
Thanks. BTW, the best info I can find on BTB size and ways is from Godbolt's "Inside the Ivy Bridge and Haswell BTB": 4096 entries, 4 ways × 1024 sets. – Riviera
In your ret-effects example you used add rsp, but add would have an effect on the arithmetic flags. May I suggest lea? – Ogee
@ecm: I'd rather just write "without modifying FLAGS" instead of using a more obfuscated lea rsp, [rsp+8] that will take beginners longer to understand. But sure, I sympathize with add seeming sloppy to some readers, so I also linked a Q&A about actually emulating ret. – Bilection

Incorrect assumptions.

They're padding between functions, not after. And a CPU that randomly decides to skip instructions is broken and should be thrown away.

The reason for INT 3 is twofold. First, it's a single-byte instruction, which means you can use it even if there's just a single byte of space; the vast majority of instructions are unsuitable because they're too long. Second, it's the "debug break" instruction, which means a debugger can catch any attempt to execute the code between functions. Such an attempt is usually not caused by ignoring retn, but by simpler mistakes such as calling through an uninitialized function pointer.
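The single-byte property matters because a gap can be as small as one byte; a sketch with made-up machine-code bytes:

```python
# Hypothetical raw code bytes: a function ending in RET (0xC3), then a
# one-byte gap before the next function's "push rbp" (0x55). Only a
# single-byte instruction like INT3 (0xCC) can fill such a gap; any
# multi-byte filler instruction simply wouldn't fit.
INT3 = 0xCC
code = bytes([0xC3, INT3, 0x55])  # ret | int3 padding | push rbp
padding = code[1:-1]

print(len(padding), padding.hex())  # 1 cc
```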

Luby answered 19/10, 2016 at 10:00 Comment(2)
Since it should never be executed, in theory 0x00 would be fine. But in practice, CPUs just decode bytes as x86 instructions without knowing where function boundaries are, so having the padding between functions be valid instructions that won't slow the CPU down while decoding (or speculatively executing) is also an advantage. But good point that INT3 causes an early / noisy failure in the rare case where an indirect jump or corrupted return address takes you into the padding; that's probably usually better than silently falling into the next function with NOP padding (as is typical on Linux). – Bilection
Minor comment: int3 is one byte, int 3 is two bytes. The two instructions behave slightly differently. – Yaws

I can add that an int3 instruction after ret is used to mitigate a speculative-execution cache side-channel vulnerability called straight-line speculation (SLS).

Here is an article about SLS mitigation in Linux kernel: Blocking straight-line speculation — eventually.

Conversazione answered 22/12, 2023 at 14:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.