Yes, partly because compilers do stack allocation for the whole function once in prologue/epilogue, not moving the stack pointer around as they enter/leave block scopes.
> and each inlined call to inlineme() would need its own buffer.
No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.
Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.
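As a minimal sketch of the situation being discussed (the buffer size and the body of inlineme() here are made up for illustration), two back-to-back inlined calls each have a local array whose lifetime ends before the next one begins, so a compiler is free to give both the same stack slot. Whether a particular compiler and optimization level actually does that is something you'd have to check in the generated asm.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the function from the question. */
static inline int inlineme(int seed) {
    char buf[256];                      /* the "large" local array */
    memset(buf, seed, sizeof buf);      /* fill every byte with seed */
    int sum = 0;
    for (size_t i = 0; i < sizeof buf; i++)
        sum += buf[i];
    return sum;                         /* buf is dead once we return */
}

int caller(void) {
    int a = inlineme(1);   /* first instance of buf lives only during this call */
    int b = inlineme(2);   /* second instance can occupy the same 256 bytes */
    return a + b;
}
```

Only the two int results are live across the calls, not the arrays, which is why nothing forces the compiler to keep two copies around at once.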
> I don't see why that would be a concern for inlining. Can you give an example of how functions that require a lot of stack would be problematic to inline?
A real example of a problem it could create (which compiler heuristics mostly avoid):
Inlining `if (rare_special_case) use_much_stack()` into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even for correctness if you recurse deep enough to actually overflow the stack.
(Especially in a constrained environment like Linux kernel stacks, typically 8 KiB or 16 KiB per thread, up from 4 KiB on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4 KiB stacks so the kernel didn't have to find 2 contiguous physical pages per task.)
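To put rough numbers on the recursion problem, here's a back-of-the-envelope sketch. The frame sizes are assumptions picked for illustration (a 64-byte lean frame vs. the same frame plus a 1 KiB buffer pulled in by inlining), and it ignores whatever stack the callers above have already used.

```c
#include <stddef.h>

/* How many recursive frames fit in a fixed-size stack (integer division).
 * Both arguments are in bytes; the values used below are hypothetical. */
size_t max_depth(size_t stack_bytes, size_t frame_bytes) {
    return stack_bytes / frame_bytes;
}
```

With a 16 KiB stack, max_depth(16 * 1024, 64) gives 256 levels for the lean frame, while max_depth(16 * 1024, 64 + 1024) gives only 15 once every frame carries the buffer, even though the buffer is only touched in the rare case.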
Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and `alloca`). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.
If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)
I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into a fast path with a call to the slow path, and use `__attribute__((noinline))` on the bigger slow-path function to make sure it doesn't inline.
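A minimal sketch of that fast-path / slow-path split, assuming GNU C (gcc or clang) for the attribute syntax. The function names, the 4 KiB scratch buffer, and the "rare case" condition are all hypothetical, not taken from the kernel.

```c
#include <stdio.h>
#include <string.h>

/* Slow path: needs a big scratch buffer. noinline keeps that buffer out of
 * the caller's frame, so it only occupies stack while this call is running. */
__attribute__((noinline))
int handle_rare_case(int depth) {
    char scratch[4096];
    snprintf(scratch, sizeof scratch, "rare case at depth %d", depth);
    return (int)strlen(scratch);
}

/* Fast path: recursive, with a small frame of its own.
 * Adds 1 per level, except at rare_at where the slow path's result is used. */
int walk(int depth, int rare_at) {
    if (depth <= 0)
        return 0;
    if (depth == rare_at)   /* the rare special case */
        return handle_rare_case(depth) + walk(depth - 1, rare_at);
    return 1 + walk(depth - 1, rare_at);
}
```

If handle_rare_case() were inlined into walk(), a compiler that allocates everything in the prologue would reserve the 4 KiB at every recursion level, not just the one level that hits the rare case.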
In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).
If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception handler machinery to know whether to load just R12, or R13 as well (for example) from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The `.eh_frame` section in ELF executables / libraries is bloated enough as is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even though passing an exception back up through such a function would cause breakage.)
You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use `push` even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)
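From C source, the VLA exception mentioned earlier is the easiest way to see a stack-pointer adjustment tied to a conditional block. This is a sketch with a made-up function. As far as I know, gcc and clang allocate a VLA when execution reaches its declaration and release it when the enclosing block ends, but that's an implementation detail, and VLAs are an optional feature since C11.

```c
/* Sum of the first n squares, computed two ways.
 * The scratch table only exists on the use_table path. */
long sum_squares(int n, int use_table) {
    long total = 0;
    if (use_table && n > 0) {
        long table[n];                  /* VLA: sized at run time, block scope */
        for (int i = 0; i < n; i++)
            table[i] = (long)(i + 1) * (i + 1);
        for (int i = 0; i < n; i++)
            total += table[i];
    } else {
        for (int i = 1; i <= n; i++)    /* no scratch storage needed */
            total += (long)i * i;
    }
    return total;
}
```

A fixed-size `long table[1024]` in the same block would normally just be folded into the function's single up-front allocation, which is the behaviour the rest of this answer is about.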
I'm not sure exactly why compilers can't / don't want to be smarter about allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to even look for this kind of optimization.
Related: Raymond Chen posted a blog article about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another article about red zones.)
[...] inlineme a bunch of times, each inlined instance can have its own block scope, so only 1 such array needs to reside on the stack at once. And whether or not the function is inlined, that array needs to be put on the stack. – Aymara
[...] fna() { inlineme(); fnb(); } then fnb() { inlineme(); fnc(); } and each inlined call to inlineme() would need its own buffer. – Holton