It is possible that the confusion about why inlining can hurt i-cache hit rate or cause thrashing lies in the difference between static instruction count and dynamic instruction count. Inlining (almost always) reduces the latter but often increases the former.
Let us briefly examine those concepts.
Static Instruction Count
Static instruction count for some execution trace is the number of unique instructions0 that appear in the binary image. Basically, you just count the instruction lines in an assembly dump. The following snippet of x86 code has a static instruction count of 5 (the .top: line is a label which doesn't translate to anything in the binary):
mov ecx, 10     ; loop counter
mov eax, 0      ; accumulator
.top:
add eax, ecx
dec ecx
jnz .top        ; loop until ecx reaches zero
The static instruction count is mostly important for binary size and caching considerations.
Static instruction count may also be referred to simply as "code size" and I'll sometimes use that term below.
Dynamic Instruction Count
The dynamic instruction count, on the other hand, depends on the actual runtime behavior and is the number of instructions executed. The same static instruction can be counted multiple times due to loops and other branches, and some instructions included in the static count may never execute at all and so don't count in the dynamic case. The snippet above has a dynamic instruction count of 2 + 30 = 32: the first two instructions are executed once, and then the loop executes 10 times with 3 instructions each iteration.
As a very rough approximation, dynamic instruction count is primarily important for runtime performance.
The Tradeoff
Many optimizations such as loop unrolling, function cloning, vectorization and so on increase code size (static instruction count) in order to improve runtime performance (often strongly correlated with dynamic instruction count).
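As a concrete sketch, unrolling the earlier loop by a factor of two duplicates the loop body (a real compiler would also have to handle trip counts that aren't a multiple of the unroll factor):

mov ecx, 10
mov eax, 0
.top:
add eax, ecx
dec ecx
add eax, ecx    ; second copy of the loop body
dec ecx
jnz .top

The static count grows from 5 to 7 instructions, but the dynamic count drops from 2 + 30 = 32 to 2 + 5*5 = 27, because only 5 jnz branches execute instead of 10.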
Inlining is also such an optimization, although with the twist that for some call sites inlining reduces both dynamic and static instruction count.
How does inlining affect the instruction cache hit rate?
The article mentioned too much inlining, and the basic idea here is that a lot of inlining increases the code footprint by increasing the working set's static instruction count while usually reducing its dynamic instruction count. Since a typical instruction cache1 caches static instructions, a larger static footprint means increased cache pressure and often results in a worse cache hit rate.
The increased static instruction count occurs because inlining essentially duplicates the function body at each call site. So rather than one copy of the function body and a few instructions to call the function N times, you end up with N copies of the function body.
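To make the duplication concrete, here's a sketch (the function square and the two call sites are made up for illustration). Out of line, there is one copy of the body and a call at each site:

square:
imul eax, eax   ; the function body: eax = eax * eax
ret

mov eax, [a]
call square     ; call site 1
mov eax, [b]
call square     ; call site 2

After inlining, every call site carries its own copy of the body:

mov eax, [a]
imul eax, eax   ; copy of the body at site 1
mov eax, [b]
imul eax, eax   ; copy of the body at site 2

With a body this small, inlining actually shrinks the code (see footnote 2 below); the bloat appears when the body is much larger than the call sequence, in which case the static count grows roughly N-fold.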
Now this is a rather naive model of how inlining works, since after inlining it may be the case that further optimizations can be done in the context of a particular call site, which may dramatically reduce the size of the inlined code. In the case of very small inlined functions or a large amount of subsequent optimization, the resulting code may be even smaller after inlining, since the remaining code (if any) may be smaller than the overhead involved in calling the function2.
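As a further sketch of that call-site-specific optimization (the constant is hypothetical, reusing the square function from above): if the argument is a known constant, inlining lets the compiler fold the whole body away:

mov eax, 10
call square     ; out of line: two instructions here, plus the body elsewhere

mov eax, 100    ; after inlining square and constant folding: one instruction

Here the inlined-and-optimized code is strictly smaller than the original call sequence.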
Still, the basic idea remains: too much inlining can bloat the code in the binary image.
The way the i-cache works depends on the static instruction count for some execution, or more specifically the number of instruction cache lines touched in the binary image, which is mostly a direct function of the static instruction count. That is, the i-cache caches regions of the binary image, so the more regions there are and the larger they are, the larger the cache footprint, even if the dynamic instruction count happens to be lower.
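As a rough worked example (all numbers invented for illustration): with 64-byte cache lines, a 200-byte function body inlined at 40 call sites touches about 200 * 40 / 64 = 125 i-cache lines, versus roughly 200 / 64 ≈ 4 lines for a single out-of-line copy. On a 32 KiB, 512-line L1 i-cache, that's about a quarter of the entire cache spent on one function.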
How does inlining increase the size of the binary executable file?
It's exactly the same principle as the i-cache case above: a larger static footprint means that more distinct pages need to be paged in, potentially leading to more pressure on the VM system. Now we usually measure code size in megabytes, while memory on servers, desktops, etc. is usually measured in gigabytes, so it's highly unlikely that excessive inlining will meaningfully contribute to thrashing on such systems. It could perhaps be a concern on much smaller or embedded systems (although the latter often don't have an MMU at all).
0 Here unique refers, for example, to the IP of the instruction, not to the actual value of the encoded instruction. You might find inc eax in multiple places in the binary, but each is unique in this sense since it occurs at a different location.
1 There are exceptions, such as some types of trace caches.
2 On x86, the necessary overhead is pretty much just the call instruction. Depending on the call site, there may also be other overhead, such as shuffling values into the correct registers to adhere to the ABI, and spilling caller-saved registers. More generally, there may be a large cost to a function call simply because the compiler has to reset many of its assumptions across a function call, such as the state of memory.
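A sketch of what that per-site overhead might look like under the SysV x86-64 calling convention (the function f and the register choices are illustrative):

mov edi, ebx        ; shuffle the argument into the register the ABI demands
mov [rsp+8], ecx    ; spill a caller-saved value (into previously reserved stack space) that is live across the call
call f
mov ecx, [rsp+8]    ; reload it after the call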