Considering that you're trying solely to optimize for speed, what are good heuristics for deciding whether to inline a function or not? Obviously code size should be important, but are there any other factors typically used when (say) gcc or icc is determining whether to inline a function call? Has there been any significant academic work in the area?
Wikipedia has a few paragraphs about this, with some links at the bottom:
- In addition to memory size and cache issues, another consideration is register pressure. From the compiler's point of view "the added variables from the inlined procedure may consume additional registers, and in an area where register pressure is already high this may force spilling, which causes additional RAM accesses."
Languages with JIT compilers and runtime class loading have other tradeoffs since the virtual methods aren't known statically, yet the JIT can collect runtime profiling information, such as method call frequency:
Design, Implementation, and Evaluation of Optimizations in a Just-in-Time Compiler (for Java) talks about method inlining of static methods and dynamically loaded classes and its improvements on performance.
Practicing JUDO: Java Under Dynamic Optimizations claims that their "inlining policy is based on the code size and profiling information. If the execution frequency of a method entry is below a certain threshold, the method is then not inlined because it is regarded as a cold method. To avoid code explosion, we do not inline a method with a bytecode size of more than 25 bytes. . . . To avoid inlining along a deep call chain, inlining stops when the accumulated inlined bytecode size along the call chain exceeds 40 bytes." Although they have runtime profiling information (method call frequency) they are still careful to avoid inlining large functions or chains of functions to prevent bloat.
A search on Google Scholar reveals a number of papers, such as
- The effect of code expanding optimizations on instruction cache design
- Function Inlining under Code Size Constraints for Embedded Processors
A search on Google Books reveals quite a number of books with papers or chapters about function inlining in various contexts.
The Compiler Design Handbook: Optimizations and Machine Code Generation has a chapter about Statisical and Machine Learning Techniques in Compiler Design, with heuristics to set various parameters, profiling the results. This chapter references the Vaswani et al paper Microarchitecture Sensitive Empirical Models for Compiler Optimizations where they propose "the use of empirical modeling techniques for building microarchitecture sensitive models for compiler optimizations".
(Some other books talk about inling from the programmer's point of view, such as C++ for Game Programmers, which talks about the dangers of inlining functions too often and the differences between inlining and macros. Compilers often ignore the programmer's inline requests if they can determine that they would do more harm than good; this can be overridden with macros as a last resort.)
A function call implies some additional code (the function prologue, where the new stack frame is set up, and the function epilogue, where it's cleaned up). If your compiler sees that the function code is small in comparison to the prologue and epilogue, it can decide it's not worth it to make an actual call, and will inline the function.
The only benefit I see of calling a function instead of inlining it are size-related. I guess inlining a function then unrolling a loop can result in a significant size increase.
as far as I have saw, function size is the only factor compilers used to determine inline. However if you do profile guided optimization (PGO), i believe compiler is able to use other variables, such as number of calls/call setup time.
In .NET is is mostly based on size. Measure the size of the parent function and child function in compiled bytes. Then measure the size of the combined function. If the combined function is smaller, then inlining is a good idea.
The reason for this is to make it possible to shove as much code into the CPU's cache as possible. Cache misses are far more expensive than function calls in modern CPUs.
© 2022 - 2024 — McMap. All rights reserved.