Does stack space required by a function affect inlining decisions in C/C++?

Asked 14/2, 2019 at 15:56 Answered 15/2, 2019 at 2:51

Solved c++c gcc inline micro-optimization

Would a large amount of stack space required by a function prevent it from being inlined? Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?

int inlineme(int args) {
  char svar[10000];

  return stringyfunc(args, svar);
}

I'm more concerned about gcc, but icc and llvm would also be nice to know.

I know this isn't ideal, but I'm very curious. The code is probable also pretty bad on cache too.

Holton answered 14/2, 2019 at 15:56 Comment(6)

I don't see why that would be a concern for inlineing. Can you give an example of how functions that require a lot of stack would be problematic to inline? If you called inlineme a bunch of times, each inlined instance can have it's own block scope, so only 1 such array needs to reside on the stack at once. And whether or not the function is inlined, that array needs to be put on the stack. – Aymara 14/2, 2019 at 15:59

If it was a commonly used function (such as helping with an output buffer), and there could be many functions on the call stack that used it, the stack could grow rapidly. I can't fund much info on gcc inlining decisions / how to help the optimizer in this regard. – Holton 14/2, 2019 at 16:2

fna() { inlineme(); fnb(); } then fnb() { inlineme(); fnc(); } and each inlined call to inlineme() would need its own buffer. – Holton 14/2, 2019 at 16:3

Does block scoping dealloc the space (i,e adjust the stack pointer)? I thought all the space to the func is always allocated on entry and while destructors may be called or variables lose scoping, the stack isn't shrunk. – Holton 14/2, 2019 at 16:5

@FrançoisAndrieux: turned those comments into an answer. It doesn't directly answer the question, but discussing how compilers allocate stack is certainly relevant. – Piercy 15/2, 2019 at 2:52

This has been one of the most productive SO questions I've ever asked (bonus: not a single - your doing it wrong/over optimizing answer either!) – Holton 16/2, 2019 at 6:29

Yes, the decision to inline or not depends on the complexity of the function, its stack and registers usage and the context in which the call is made. The rules are compiler- and target platform-dependent. Always check the generated assembly when performance matters.

Compare this version with a 10000-char array not being inlined (GCC 8.2, x64, -O2):

inline int inlineme(int args) {
  char svar[10000];

  return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}

Generated assembly:

inlineme(int):
        sub     rsp, 10008
        mov     rsi, rsp
        call    stringyfunc(int, char*)
        add     rsp, 10008
        ret
test(int):
        jmp     inlineme(int)

with this one with a much smaller 10-char array, which is inlined:

inline int inlineme(int args) {
  char svar[10];

  return stringyfunc(args, svar);
}

int test(int x) {
    return inlineme(x);
}

Generated assembly:

test(int):
        sub     rsp, 24
        lea     rsi, [rsp+6]
        call    stringyfunc(int, char*)
        add     rsp, 24
        ret

Dithionite answered 14/2, 2019 at 16:7 Comment(4)

1- thanks. 2- How does godbolt demangle everything and pretty print the as wtih the .text and `\\` checkboxes? – Holton 14/2, 2019 at 16:23

@JasonN Because Godbolt's Compiler Explorer is a masterpiece. But more seriously, you can always check out the source code or look through his blog. – Aymara 14/2, 2019 at 16:31

Out of interest for this particular case gcc inlines when the array is 256 bytes or less and clang always inlines regardless of size – Machutte 14/2, 2019 at 17:7

visual studio inlines up to 8128 bytes – Machutte 14/2, 2019 at 17:13

Such as if I had a 10k automatic buffer on the stack, would that make the function less likely to be inlined?

Not necessarily in general. In fact, inline expansion can sometimes reduce stack space usage due to not having to set up space for function arguments.

Expanding a "wide" call into a single frame which calls other "wide" functions can be a problem though, and unless the optimiser guards against that separately, it may have to avoid expansion of "wide" functions in general.

In case of recursion: Most likely yes.

An example of LLVM source:

if (IsCallerRecursive &&
         AllocatedSize > InlineConstants::TotalAllocaSizeRecursiveCaller) {
       InlineResult IR = "recursive and allocates too much stack space";

From GCC source:

For stack growth limits we always base the growth in stack usage of the callers. We want to prevent applications from segfaulting on stack overflow when functions with huge stack frames gets inlined.

Controlling the limit, from GCC manual:

--param name=value

large-function-growth

Specifies maximal growth of large function caused by inlining in percents. For example, parameter value 100 limits large function growth to 2.0 times the original size.

large-stack-frame

The limit specifying large stack frames. While inlining the algorithm is trying to not grow past this limit too much.

large-stack-frame-growth

Specifies maximal growth of large stack frames caused by inlining in percents. For example, parameter value 1000 limits large stack frame growth to 11 times the original size.

Irenairene answered 14/2, 2019 at 16:7 Comment(3)

The calls I'm worried about are not recursive, but they appear in many functions on the call stack. I'm not worried about blowing the stack so much as this function not being inlined because soon we would be talking about 100k and larger stack frames. And I love LLVM source code. Some of the cleanest I've ever seen. – Holton 14/2, 2019 at 16:11

thanks for the gcc manual references. I glanced at that, but I didn't know if iit was referring to stack or code size. I assumed code size. – Holton 14/2, 2019 at 16:31

@JasonN insns limit is for instructions, growth is for stack. You can see the usage in source – Irenairene 14/2, 2019 at 16:35

Yes, partly because compilers do stack allocation for the whole function once in prologue/epilogue, not moving the stack pointer around as they enter/leave block scopes.

and each inlined call to inlineme() would need its own buffer.

No, I'm pretty sure compilers are smart enough to reuse the same stack space for different instances of the same function, because only one instance of that C variable can ever be in-scope at once.

Optimization after inlining can merge some of the operations of the inline function into calling code, but I think it would be rare for the compiler to end up with 2 versions of the array it wanted to keep around simultaneously.

I don't see why that would be a concern for inlineing. Can you give an example of how functions that require a lot of stack would be problematic to inline?

A real example of a problem it could create (which compiler heuristics mostly avoid):

Inlining if (rare_special_case) use_much_stack() into a recursive function that otherwise doesn't use much stack would be an obvious problem for performance (more cache and TLB misses), and even correctness if you recurse deep enough to actually overflow the stack.

(Especially in a constrained environment like Linux kernel stacks, typically 8kiB or 16kiB per thread, up from 4k on 32-bit platforms in older Linux versions. https://elinux.org/Kernel_Small_Stacks has some info and historical quotes about trying to get away with 4k stacks so the kernel didn't have to find 2 contiguous physical pages per task).

Compilers normally make functions allocate all the stack space they'll ever need up front (except for VLAs and alloca). Inlining an error-handling or special-case handling function instead of calling it in the rare case where it's needed will put a large stack allocation (and often save/restore of more call-preserved registers) in the main prologue/epilogue, where it affects the fast path, too. Especially if the fast path didn't make any other function calls.

If you don't inline the handler, that stack space will never be used if there aren't errors (or the special case didn't happen). So the fast-path can be faster, with fewer push/pop instructions and not allocating any big buffers before going on to call another function. (Even if the function itself isn't actually recursive, having this happen in multiple functions in a deep call tree could waste a lot of stack.)

I've read that the Linux kernel does manually do this optimization in a few key places where gcc's inlining heuristics make an unwanted decision to inline: break a function up into fast-path with a call to the slow path, and use __attribute__((noinline)) on the bigger slow-path function to make sure it doesn't inline.

In some cases not doing a separate allocation inside a conditional block is a missed optimization, but more stack-pointer manipulation makes stack unwinding metadata to support exceptions (and backtraces) more bloated (especially saving/restoring of call-preserved registers that stack unwinding for exceptions has to restore).

If you were doing a save and/or allocate inside a conditional block before running some common code that's reached either way (with another branch to decide which registers to restore in the epilogue), then there'd be no way for the exception handler machinery to know whether to load just R12, or R13 as well (for example) from where this function saved them, without some kind of insanely complicated metadata format that could signal a register or memory location to be tested for some condition. The .eh_frame section in ELF executables / libraries is bloated enough as is! (It's non-optional, BTW. The x86-64 System V ABI (for example) requires it even in code that doesn't support exceptions, or in C. In some ways that's good, because it means backtraces usually work, even passing an exception back up through a function would cause breakage.)

You can definitely adjust the stack pointer inside a conditional block, though. Code compiled for 32-bit x86 (with crappy stack-args calling conventions) can and does use push even inside conditional branches. So as long as you clean up the stack before leaving the block that allocated space, it's doable. That's not saving/restoring registers, just moving the stack pointer. (In functions built without a frame pointer, the unwind metadata has to record all such changes, because the stack pointer is the only reference for finding saved registers and the return address.)

I'm not sure exactly what the details are on why compiler can't / don't want to be smarter allocating large extra stack space only inside a block that uses it. Probably a good part of the problem is that their internals just aren't set up to be able to even look for this kind of optimization.

Related: Raymond Chen posted a blog about the PowerPC calling convention, and how there are specific requirements on function prologues / epilogues that make stack unwinding work. (And the rules imply / require the existence of a red zone below the stack pointer that's safe from async clobber. A few other calling conventions use red zones, like x86-64 System V, but Windows x64 doesn't. Raymond posted another blog about red zones)

Piercy answered 15/2, 2019 at 2:51 Comment(2)

The legality of "goto"ing into or out of a block biases toward keeping the stack pointer at the same place throughout the entire function. Otherwise, every jump could potentially require a stack pointer adjustment. This is a real pain for conditional branches. Instead of "branch to label2 if equal" you'd have to do "branch to temp if not equal; adjust stack pointer; jump to label2; temp:". (That said, there's a technique known as "shrink-wrapping" which does adjust the stack dynamically through the lifetime of a function.) – Legate 15/2, 2019 at 3:9

@RaymondChen: yes, I was only picturing doing block-scoped alloc/dealloc for blocks that have a single entry and exit point. (e.g. an if() body that didn't get merged with something else.) This can be more than 1 basic block, but the compiler knows what's going on and could in theory look for a place to alloc and a place to dealloc that avoided problems. Thanks for the reminder on the Shrink Wrap name for the optimization of delaying the prologue. reviews.llvm.org/D9210 presents a shrink-wrap optimization pass with an AArch64 example. – Piercy 15/2, 2019 at 3:13

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

A real example of a problem it could create (which compiler heuristics mostly avoid):

Recommended topics

Hot tags