Inlining and Instruction Cache Hit Rates and Thrashing

In this article, https://www.geeksforgeeks.org/inline-functions-cpp/, it states that the disadvantages of inlining are:

3) Too much inlining can also reduce your instruction cache hit rate, thus reducing the speed of instruction fetch from that of cache memory to that of primary memory.

How does inlining affect the instruction cache hit rate?

6) Inline functions might cause thrashing because inlining might increase size of the binary executable file. Thrashing in memory causes performance of computer to degrade.

How does inlining increase the size of the binary executable file? Is it just that it increases the code base length? Moreover, it is not clear to me why having a larger binary executable file would cause thrashing, as the two don't seem linked.

Hokkaido answered 17/3, 2018 at 9:30 Comment(9)
Why the downvote? It's a perfectly reasonable question.Hokkaido
Not my downvote. But SO caters to rather specific problems with solutions and in general does not provide "I do not understand this, explain it to me" kind of answers. You got some from people who are kind and got a downvote from someone who probably thinks you could have put some more effort into researching this topic yourself before posting a question.Stephniestepladder
Thrashing of swap space / virtual memory due to massive code-size is not plausible on modern desktop/laptop computers except with an extremely unreasonable amount of inlining, producing an executable that was hundreds or thousands of megabytes, or if the whole system was compiled with a crazy compiler so all programs were huge, and you had several running at once, and you "only" had a few GiB of physical RAM.Counterchange
@Jurassic - The downvote might be because geeksforgeeks is known to contain a lot of crap, and is not a good place to learn from. In this case some of the info looks like quotes from the docs for Turbo-C++, 25 years ago. Proper use of inlining makes the code smaller and faster.Historicism
@BoPersson That might be a fair comment, but geeksforgeeks is an easy resource to find for the unknowingHokkaido
@PatrickArtner I have edited the question to make it better. However, I think the discussion on this page shows this is clearly a non-trivial topic to consider/find problematic.Hokkaido
@Jurassic - I am not the DV (I UVed) but in general you have a crowd of people who hate implementation or performance related questions based on "the standard doesn't say" or "you don't need to know that" or "just write simple code and the compiler will sort it out" or "it's implementation specific" or "the compiler is smarter than you" or "does it need to be fast" or whatever. A large number of these folk hang out on the C/C++ tags and you get instant DVs if you ask that type of question there. One tip is just to leave those tags off, and instead include other tags.Calistacalisthenics
performance or x86 (if you're on x86) are good ones for low level questions like this. You can always add the language tags later once the potential for knee-jerk DVs has passed.Calistacalisthenics
@Calistacalisthenics changedHokkaido

It is possible that the confusion about why inlining can hurt i-cache hit rate or cause thrashing lies in the difference between static instruction count and dynamic instruction count. Inlining (almost always) reduces the latter but often increases the former.

Let us briefly examine those concepts.

Static Instruction Count

Static instruction count for some execution trace is the number of unique instructions0 that appear in the binary image. Basically, you just count the instruction lines in an assembly dump. The following snippet of x86 code has a static instruction count of 5 (the .top: line is a label which doesn't translate to anything in the binary):

  mov ecx, 10
  mov eax, 0
.top:
  add eax, ecx
  dec ecx
  jnz .top

The static instruction count is mostly important for binary size and caching considerations.

Static instruction count may also be referred to simply as "code size" and I'll sometimes use that term below.

Dynamic Instruction Count

The dynamic instruction count, on the other hand, depends on the actual runtime behavior and is the number of instructions executed. The same static instruction can be counted multiple times due to loops and other branches, and some instructions included in the static count may never execute at all and so don't count in the dynamic case. The same snippet as above has a dynamic instruction count of 2 + 30 = 32: the first two instructions are executed once, and then the loop executes 10 times with 3 instructions each iteration.

As a very rough approximation, dynamic instruction count is primarily important for runtime performance.
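
For concreteness, here is plausible C++ source for the same loop (a hypothetical reconstruction; an optimizing compiler may emit different code). Its static instruction count stays fixed while its dynamic count grows with n:

  // Sums n + (n-1) + ... + 1, mirroring the asm snippet above when n = 10.
  int sum_down(int n) {
      int total = 0;
      for (int i = n; i != 0; --i)
          total += i;   // executed n times: the dynamic count scales with n
      return total;
  }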

The Tradeoff

Many optimizations such as loop unrolling, function cloning, vectorization and so on increase code size (static instruction count) in order to improve runtime performance (often strongly correlated with dynamic instruction count).

Inlining is also such an optimization, although with the twist that for some call sites inlining reduces both dynamic and static instruction count.
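
As a minimal sketch of such a call site (hypothetical code, not from the article), consider a trivial accessor whose body is smaller than the call sequence needed to reach it:

  struct Point {
      int x, y;
      int get_x() const { return x; }   // body compiles to a single load
  };

  int use(const Point& p) {
      // Out of line, this costs a call/ret pair plus ABI register
      // shuffling; inlined, it's just the load. Both static and dynamic
      // instruction counts shrink at this call site.
      return p.get_x() + 1;
  }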

How does inlining affect the instruction cache hit rate?

The article mentioned too much inlining, and the basic idea here is that a lot of inlining increases the code footprint by increasing the working set's static instruction count while usually reducing its dynamic instruction count. Since a typical instruction cache1 caches static instructions, a larger static footprint means increased cache pressure and often results in a worse cache hit rate.

The increased static instruction count occurs because inlining essentially duplicates the function body at each call site. So rather than one copy of the function body and a few instructions to call it from each of the N call sites, you end up with N copies of the function body.

Now this is a rather naive model of how inlining works, since after inlining it may be the case that further optimizations can be done in the context of a particular call site, which may dramatically reduce the size of the inlined code. In the case of very small inlined functions or a large amount of subsequent optimization, the resulting code may even be smaller after inlining, since the remaining code (if any) may be smaller than the overhead involved in calling the function2.
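
For example (a hypothetical sketch), constant propagation after inlining can delete the inlined body entirely:

  static int square(int x) { return x * x; }

  int nine() {
      // Out of line this is a call plus a multiply; after inlining,
      // square(3) constant-folds to 9 -- less code than the call itself.
      return square(3);
  }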

Still, the basic idea remains: too much inlining can bloat the code in the binary image.

The i-cache behavior depends on the static instruction count for some execution, or more specifically on the number of instruction cache lines touched in the binary image, which is largely a fairly direct function of the static instruction count. That is, the i-cache caches regions of the binary image, so the more regions there are and the larger they are, the larger the cache footprint, even if the dynamic instruction count happens to be lower.

How does inlining increase the size of the binary executable file?

It's exactly the same principle as the i-cache case above: a larger static footprint means that more distinct pages need to be paged in, potentially leading to more pressure on the VM system. Now we usually measure code sizes in megabytes, while memory on servers, desktops, etc. is usually measured in gigabytes, so it's highly unlikely that excessive inlining will meaningfully contribute to thrashing on such systems. It could perhaps be a concern on much smaller or embedded systems (although the latter often don't have an MMU at all).


0 Here unique refers, for example, to the IP of the instruction, not to the actual value of the encoded instruction. You might find inc eax in multiple places in the binary, but each is unique in this sense since it occurs at a different location.

1 There are exceptions, such as some types of trace caches.

2 On x86, the necessary overhead is pretty much just the call instruction. Depending on the call site, there may also be other overhead, such as shuffling values into the correct registers to adhere to the ABI, and spilling caller-saved registers. More generally, there may be a large cost to a function call simply because the compiler has to reset many of its assumptions across a function call, such as the state of memory.

Calistacalisthenics answered 18/3, 2018 at 17:52 Comment(0)

Let's say you have a function that's 100 instructions long, and it takes 10 instructions to call it each time it's called.

That means for 10 calls it uses up 100 + 10 * 10 = 200 instructions in the binary.

Now let's say it's inlined everywhere it's used. That uses up 100 * 10 = 1000 instructions in your binary.

So for point 3, this means that it will take significantly more space in the instruction cache (different invocations of an inline function are not 'shared' in the i-cache).

And for point 6, your total binary size is now bigger, and a bigger binary size can lead to thrashing.
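
You can make the contrast concrete with GCC/Clang attributes (a sketch; the tiny body below just stands in for the hypothetical 100-instruction function, and the attributes are compiler extensions):

  __attribute__((noinline)) static int called(int x) {
      return x * 7 + 3;   // imagine ~100 instructions here
  }

  __attribute__((always_inline)) static inline int inlined(int x) {
      return x * 7 + 3;   // same body, stamped out at each call site
  }

  int three_sites(int a, int b, int c) {
      // `called` emits three short call instructions sharing one body;
      // `inlined` duplicates the body at all three call sites.
      return called(a) + called(b) + called(c)
           + inlined(a) + inlined(b) + inlined(c);
  }

Disassembling the object file (e.g. with objdump -d) shows the size difference directly.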

Felicidadfelicie answered 17/3, 2018 at 9:36 Comment(3)
These calculations are not correct. When a function gets inlined, its code gets optimized together with the code of the function in which it's inlined. So the resulting total number of instructions can be very different from the sum of the numbers of instructions of the individual functions.Signatory
I don't see why a bigger binary size would lead to thrashing. It doesn't even seem related, as I don't see how binary size and memory management are linked.Hokkaido
The binary size here is related also to i-cache working set, or static instruction footprint. Basically your i-cache has a limited size so executing more total code (as measured in cache lines used) means that you'll have more pressure on your various cache levels. This occurs because inlining is fundamentally about duplicating code. Thrashing is the same thing except "i-cache" is replaced by "page cache" or "virtual memory/swap" or something like that. I don't find thrashing to disk very likely based on the relatively small sizes we are talking about though.Calistacalisthenics

If compilers inlined everything they could, most functions would be gigantic. (In the most extreme case, every function in your program would be inlined into main, leaving you with one gigantic main function that calls only library functions.)

Imagine if everything was a macro instead of a function, so it fully expanded everywhere you used it. This is the source-level version of inlining.
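
In C++ terms (a hypothetical sketch):

  // Every use of the macro duplicates the expression unconditionally --
  // exactly what unlimited inlining would do to function bodies.
  #define SQUARE_MACRO(x) ((x) * (x))

  // An inline function leaves the duplicate-or-call decision to the
  // compiler's heuristics, per call site.
  static inline int square_fn(int x) { return x * x; }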


Most functions have multiple call-sites. The code-size to call a function scales a bit with the number of args, but is generally pretty small compared to a medium to large function. So inlining a large function at all of its call sites will increase total code size, reducing I-cache hit rates.

But these days it's common practice to write lots of small wrapper / helper functions, especially in C++. The code for a stand-alone version of a small function is often not much bigger than the code necessary to call it, especially when you include the side effects of a function call (like clobbering registers). Inlining small functions can often save code size, especially when further optimizations become possible after inlining (e.g. the function computes some of the same stuff that code outside the function also computes, so CSE is possible).
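
A sketch of that CSE point (hypothetical code):

  static inline int length_sq(int x, int y) { return x * x + y * y; }

  int doubled(int x, int y) {
      // As two opaque calls this is two call sequences; once both are
      // inlined, common-subexpression elimination computes x*x + y*y
      // once and doubles it -- less code than either call.
      return length_sq(x, y) + length_sq(x, y);
  }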

So for a compiler, the decision of whether to inline into any specific call site should be based on the size of the called function, and maybe on whether it's called inside a loop. (Optimizing away the call/ret overhead is more valuable if the call site runs more often.) Profile-guided optimization can help a compiler make better decisions, by "spending" more code-size on hot functions and saving code-size in cold functions (e.g. many functions only run once over the lifetime of the program, while a few hot ones take most of the time).
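
Short of full PGO (e.g. GCC/Clang's -fprofile-generate / -fprofile-use flags), the hot and cold attributes give the compiler the same kind of hint by hand (a sketch with made-up functions):

  #include <cstdio>

  __attribute__((cold)) void report_error(const char* msg) {
      std::fprintf(stderr, "error: %s\n", msg);  // rarely runs: optimized for size
  }

  __attribute__((hot)) int fast_path(int x) {
      return x * 2 + 1;  // hot: spending code size on inlining here pays off
  }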

If compilers didn't have good heuristics for when to inline, or you override them to be way too aggressive, then yes, I-cache misses would be the result.

But modern compilers do have good inlining heuristics, and usually this makes programs significantly faster but only a bit larger. The article you read is talking about why there need to be limits.


The above code-size reasoning should make it obvious that executable size increases, because it doesn't shrink the data any. Many functions will still have a stand-alone copy in the executable as well as inlined (and optimized) copies at various call sites.

There are a few factors that mitigate the I-cache hit rate problem. Better locality (from not jumping around as much) lets code prefetch do a better job. Many programs spend most of their time in a small part of their total code, which usually still fits in I-cache after a bit of inlining.

But larger programs (like Firefox or GCC) have lots of code, and call the same functions from many call sites in large "hot" loops. Too much inlining bloating the total code size of each hot loop would hurt I-cache hit rates for them.

Thrashing in memory causes performance of computer to degrade.

https://en.wikipedia.org/wiki/Thrashing_(computer_science)

On modern computers with multiple GiB of RAM, thrashing of virtual memory (paging) is not plausible unless every program on the system was compiled with extremely aggressive inlining. These days most memory is taken up by data, not code (especially pixmaps in a computer running a GUI), so code would have to explode by a few orders of magnitude to start to make a real difference in overall memory pressure.

Thrashing the I-cache is pretty much the same thing as having lots of I-cache misses. But it would be possible to go beyond that into thrashing the larger unified caches (L2 and L3) that cache code + data.

Counterchange answered 18/3, 2018 at 13:57 Comment(1)
Inlining can also increase reuse distance and reduce use frequency, reducing the likelihood code will be in shared levels of cache. This is somewhat independent of capacity effects; as noted data use can dominate capacity. (Sharing across multiple threads or even processes with shared libraries would tend to increase this effect.)Telamon

Generally speaking, inlining tends to increase the emitted code size due to call sites being replaced with larger pieces of code. Consequently, more memory space may be required to hold the code, which may cause thrashing. I'll discuss this in a little more detail.

How does inlining affect the instruction cache hit rate?

The impact that inlining can have on performance is very difficult to statically characterize in general without actually running the code and measuring its performance.

Yes, inlining may impact the code size and typically makes the emitted native code larger. Let's consider the following cases:

  • The code executed within a particular period of time fits within a particular level of the memory hierarchy (say L1I) in both cases (with or without inlining). So performance with respect to that particular level will not change.
  • The code executed within a particular period of time fits within a particular level of the memory hierarchy in the case of no inlining, but doesn't fit with inlining. The impact this can have on performance depends on the locality of the executed code. Essentially, if the hottest pieces of code still fit within that level of the hierarchy, then the miss ratio at that level might increase only slightly. Features of modern processors such as speculative execution, out-of-order execution, and prefetching can hide or reduce the penalty of the additional misses. It's important to note that inlining improves the locality of code, which can result in a net positive impact on performance despite the increased code size. This is particularly true when the code inlined at a call site is frequently executed. Partial inlining techniques have been developed to inline only the parts of a function that are deemed hot (see the sketch after this list).
  • The code executed within a particular period of time doesn't fit within a particular level of the memory hierarchy in both cases. So performance with respect to that particular level will not change.
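
A hand-rolled sketch of that partial-inlining idea (hypothetical function names): keep the hot early-out inline and push the cold bulk of the work out of line.

  __attribute__((noinline)) static int slow_path(int x);  // big, cold body

  static inline int lookup(int x) {
      if (x >= 0 && x < 16)
          return x * x;      // hot and tiny: cheap to duplicate at call sites
      return slow_path(x);   // cold: one shared out-of-line copy
  }

  __attribute__((noinline)) static int slow_path(int x) {
      // stand-in for a large, rarely executed body
      return x < 0 ? -x : x;
  }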

Moreover, it is not clear to me why having a larger binary executable file would cause thrashing as the two dont seem linked.

Consider the main memory level on a resource-constrained system. Even a mere 5% increase in code size can cause thrashing at main memory, which can result in significant performance degradation. On resource-rich systems (desktops, workstations, servers), thrashing usually only occurs at the caches, when the total size of the hot instructions is too large to fit within one or more of them.

Signatory answered 17/3, 2018 at 17:31 Comment(2)
Yeah, downvoting should be accompanied by a reason.Hokkaido
@Calistacalisthenics I did want to know why exactly increased code size generally reduces the i-cache hit rate.Hokkaido
