Why don't compilers inline everything, analyze the result, and then generate their own optimized functions?
Problem summary

I'm trying to build an in-memory database with transactions. I'm hitting bottlenecks at the compiler level, specifically the inlining limit. I don't have much knowledge about compilers, and I would like to know why some things are done the way they are.

The goal

The absolute priority of my in-memory database is time performance. It must be super fast; this is the goal. I want to have everything in shared memory. Every database access is a direct access to memory. To resolve race conditions and transactions, spinlocks are implemented at the memory level.

Expected result

When I have this pseudo code:

var garage = DB.GetGarage(123);
var car = DB.CreateCar();
car.Color = 2;
garage.ConcurrentBag.Add(car);

For the implementations of all these (automatically generated) methods, GetGarage, CreateCar, ConcurrentBag.Add, I have inlining enabled. Why? I found out that it is faster.

But the reason it is faster is probably not the overhead of calling the function. It seems that when a method is inlined, the compiler can figure out better machine code than when it is not. In other words, these methods have common parts which the compiler can probably simplify, but only if they are inlined; otherwise it cannot simplify them.

Now I'm coming to the crux of the problem. I make everything inline and expect the compiler to inline everything.

Actual result

Well, that is not how it works. The compiler may not inline, and let's be specific here:

.NET C# - AggressiveInlining - The attribute might cause implementation limits to be encountered that will result in slower generated code.

C99, C++ - inline - C++ and C99, but not its predecessors K&R C and C89, have support for inline functions, though with different semantics. In both cases, inline does not force inlining; the compiler is free to choose not to inline the function at all, or only in some cases.

What have I tried?

I have tried this with C# and reached the limit. Past that limit, nothing was inlined anymore. I think something similar would happen with GCC:

max-inline-insns-single: Several parameters control the tree inliner used in gcc. This number sets the maximum number of instructions (counted in GCC's internal representation) in a single function that the tree inliner will consider for inlining. This only affects functions declared inline and methods implemented in a class declaration (C++). The default value is 500.

At least with GCC I can choose the limit; with .NET I cannot.

Why don't compilers inline everything, analyze the result, and then generate their own optimized functions?

I would really like feedback on this question, please. I know that not everything can be inlined (for example, recursive calls). Let's skip the cases where this is not possible at all.

I also know that inlining is not guaranteed to give better performance. But I think that all the issues mentioned in the link could be negated by compiler-generated functions.

I also know that when inlining, several factors are considered:

We will measure the quality of inlining along three axes: the time spent generating code (aka throughput, abbreviated as TP), the time spent executing the code (aka code quality -- CQ), and the size of the generated code (CS).

I think that the reason compilers don't do this could be the time spent generating the code. But what if I don't care? Okay, I don't want to wait a year, but I can wait a day if I get 20% faster code.

What do you think about it? Is there any compiler for any programming language which can do this (via some flags, or something like that)?

EDIT: According to @RaymondChen (see comments) it is similar to 'inline everything and then have another step to de-inline things':

De-inlining (also known as "common subexpression elimination") is something compilers already do.

But according to my research, CSE doesn't involve generating a new function, but rather the reuse of saved data:

Common-subexpression elimination is a transformation that removes the recomputations of common subexpressions and replaces them with uses of saved data.

I cannot find anything about compiler generated functions other than some related to C++ class constructors, destructors and operators. So, I'm still looking for an answer and hope somebody can provide some sources.

@RaymondChen also mentions:

finding de-inlining opportunities becomes harder the bigger the code being analyzed. The number of things to check grows (naively) as the fourth power of the code size. The time required for a large program will probably exceed your human lifetime, and the compiler itself will run out of memory long before then.

This could be a good answer to my question, but it is also something I'm struggling to accept. If finding de-inlining opportunities in a large program would take a compiler more than a human lifetime, how is it possible that I, as a human, can do it myself in a reasonable time, just by looking at the (high-level, not machine) code and refactoring?

I understand that some tasks (pattern recognition, language translation, etc.) are really hard for computers to do. But today we have neural networks. Would it be possible to use a neural network for such a thing as finding de-inlining opportunities?

@PeterCordes mentions:

(in real-world compilers which don't try to re-factor straight-line code back into functions or loops)

I'm again asking: why? I'm sure the compiler can figure out better functions than me. Why does the compiler just take my functions and at most optimize or inline them, but never (except for C++ constructors, destructors, etc.) generate new ones?

Hour answered 13/10, 2023 at 15:29 Comment(13)
Suppose a function is used in 1000 places. For example, malloc. And each of those places is in functions that are themselves used in 1000 places. For example, std::vector::vector(). You now need to inline a million copies of the malloc function. Also, your main function would probably now have several megabytes of local variables, since it has to hold all the variables from all the potential code paths due to inlining, even if most of them are never executed.Trolley
Okay, that's a reason why not to inline everything. But what if the compiler then reduces the code by generating its own malloc function? Or it could maybe generate a few different variants of malloc based on the context where they are used.Hour
Sure, but that's still a million customized versions of malloc that are being generated. Programs would explode in size, memory usage would skyrocket, and you lose locality and branch prediction.Trolley
Why million? The compiler should identify the common parts and only create the functions when needed. Maybe I should write it in description, sorry for that.Hour
Oh, I see, so inline everything and then have another step to de-inline things. De-inlining (also known as "common subexpression elimination") is something compilers already do, but finding de-inlining opportunities becomes harder the bigger the code being analyzed. The number of things to check grows (naively) as the fourth power of the code size. The time required for a large program will probably exceed your human lifetime, and the compiler itself will run out of memory long before then.Trolley
Thanks for explanation, didn't know about that. I'm going to read about it.Hour
Just for the record since it didn't get mentioned explicitly, instruction-cache locality is the main performance reason for limits on inlining (in real-world compilers which don't try to re-factor straight-line code back into functions or loops). Modern desktops/servers have lots of RAM, but L1i and L2 caches are still small. But also total size of executables. Having shared libraries share pages across processes is also useful (for system-wide memory pressure), and wouldn't work if even library functions participated in this inline-everything.Lustig
Not an answer to your actual question, but an attempt to address your stated goal ("speed"). If inlining really achieves so much, this is a hint that your implementation is not optimal: e.g. as you said, you might have common code in different funcs. You want long streaks of code with almost no jumps (not even ifs). So perhaps replace var garage = etc. etc. with a single function garage = DB.AddNewColouredCarToGarage(123, 2);, and have a couple dozen Add[New]SomethingToSomethingElse() functions. Macros can be used to generate code more efficiently, sort of a manual inlining.Swash
"How is that possible that I as a human can do it by myself in reasonable time, just by looking at the (high level, not machine) code and refactoring." You're saying that if you're given a single million-line function (where everything has been inlined), you can optimally de-inline it in a reasonable amount of time?Trolley
I think even if the functions were inlined (but still high level) I would find the common parts. When I get spaghetti code from beginners, I often turn it into a class-and-method architecture. Whether that's optimal is debatable. Of course, a million lines is too much for one person. But if Windows has 50 million lines, why couldn't the same team de-inline 1 million lines?Hour
"he can maybe generate few different variants of malloc functions based on the context where they are used" compilers already do this, cloning functions for instance if they are called with a constant argument. It can also happen that a function is partially inlined, with the rest ending up as a separate new function. Compilers use a lot of heuristics to determine if inlining is likely to enable worthwhile simplifications, and profiling data can also help a lot.Dietary
Implementing what you ask is hard, and compilers already do some of that (I try to end my functions/switch cases with the same code sequence as others if possible, and it gets "merged"). You can also read the thesis titled "Reducing Code Size with Function Merging" by Rodrigo Caetano de Oliveira Rocha to get an idea of the state of the art.Starchy
Inlining all functions would typically result in a significant increase of the executable's size. This can harm performance due to the increasing pressure on the CPU and memory systems, such as cache misses and page faults. It increases the compile time, too, as it makes the intermediate representation much larger and slower to process. Furthermore, many optimizations do not scale well with code size, further exacerbating compile times. The benefit from improved optimization opportunities might not always outweigh these costs.Gillen

You suggest that the compiler should inline everything, and then de-inline certain functions (possibly ones it synthesized itself).

But what criteria should it use for de-inlining? It doesn't know what your data look like at runtime, or which paths you want to optimize for.

The compiler can tell when your inlined super function exceeds the instruction cache size, but it doesn't have much information about which code can be extracted again without slowing everything down.

The information needed to do this is:

  • Knowing which paths need to be fast and which don't

This knowledge can come from your deep understanding of the problem domain, and you can communicate it by manually splitting code into inline and non-inline functions, annotating with [[likely]], [[unlikely]], [[gnu::noinline]], [[gnu::flatten]], [[gnu::always_inline]], [[gnu::hot]], [[gnu::cold]], etc.

Or it can come from running a profiled build with actual data and then using profile-guided optimization.


Just to clarify a technical detail that was picked up in comments:

instructions are data, and all data access latencies are heavily affected by cache behaviour. See this question for lots of detail, but briefly, inlining a function in, say, two places means you have 2 copies of the code.

  • benefits:
    1. both inline copies avoid some function call overhead
    2. both inline copies can be more aggressively optimized for their specific calling context
  • costs:
    1. both copies occupy different memory addresses, increasing the opportunities for a cache miss when executing the function
    2. equivalently, if the function code is in the cache, it must have evicted something else, and that will cause a cache miss elsewhere.

In a non-trivial program, we'd expect at least some functions to be better off not inlined. There's no way for the compiler to know for certain which ones, though.

Concord answered 21/10, 2023 at 15:7 Comment(5)
Or it can come from running a profiled build with actual data and then using profile-guided optimization. - But note that current compilers don't attempt to factor straight-line code into loops and functions, even with PGO. (Perhaps because it's computationally expensive to look for such things, and/or because most builds don't use PGO, so less effort from compiler devs has gone into things that could only work well with PGO data. I think computational expense is a big part of it, though, because even JITs, which effectively always have profiling data if they want it, don't do it.)Lustig
Thank you for your answer. I understand that it would be helpful to know paths that need to be fast. But is it really necessary? What if every path should be as fast as possible? Can the compiler optimize all paths equally? I mean inline everything, de-inline only when necessary (due to cache limit) and with the least possible impact on all paths equally.Hour
@user1576055: What if every path should be as fast as possible? - then the compiler can't make useful choices for branch layout, like making the fast path a straight line (no taken branches). (That's what [[likely]] vs. [[unlikely]] hints are for, in the absence of PGO; without those compilers do use heuristics to guess because it does matter.) There are also size vs. speed tradeoffs. If the useful (hot) parts of the code are spread out by code that never runs most of the time, I-cache footprint and iTLB footprint will both be worse.Lustig
@user1576055: Many real-world programs have a total code size larger than 32 KiB of machine code when you include libraries, and uop caches are even smaller. And context switches are a thing in real life. The first 32KiB isn't "free" even for small programs. Smaller is still better.Lustig
Ok, now I understand it is important to know hot paths (didn't know about uop cache, interesting). What I still don't understand is that even if the compiler knows hot paths, for example .NET JIT compiler, it relies on user defined functions when the inline limit is reached. As you mentioned compilers don't attempt to factor straight-line code into loops and functions.Hour

I'm going to summarize an answer, although I won't mark it as accepted because I'm still looking for a better one.

Compilers don't inline everything because of L1/L2/L3 cache sizes, which are still small these days. My L1 is 960kB (Core i7-12700K), not much for a large inlined program. Loading code from outside the cache slows down the program and negates the benefits of inlining.

@PeterCordes: Just for the record since it didn't get mentioned explicitly, instruction-cache locality is the main performance reason for limits on inlining ... Modern desktops/servers have lots of RAM, but L1i and L2 caches are still small.

Compilers don't inline everything and then analyze the result, because analyzing a large inlined program and looking for common parts is very time-consuming. It remains an open question whether AI could be used for that.

@RaymondChen: The number of things to check grows (naively) as the fourth power of the code size.

Compilers don't generate their own optimized functions, because they rely on the programmer knowing which paths need to be fast and which don't.

@Useless: This knowledge can come from your deep understanding of the problem domain, and you can communicate it by manually splitting code into inline and non-inline functions

I'll allow myself a little reflection at the end. Every program can be written in an infinite number of ways. If you think about it, a function is just a separation of code. Adding a function instead of inlining doesn't change the logic of the code. You can add (de-inline) as many functions as you like; if they are equivalent to the inlined code, the logical result doesn't change.

What changes is the performance. Today's compilers rely on programmers to design functions, even though functions are not needed* for application logic**. Functions are needed for performance reasons, and compilers have a very limited ability to refactor them (due to time) or inline them (due to cache size). But instead of adopting often poorly designed functions, I think a better solution is needed.

*They are needed to make code readable and maintainable, but this is something we don't care about on machine code level.

**With some exceptions, for example recursion and library invocation.

Hour answered 22/10, 2023 at 19:48 Comment(2)
"My L1 is 960kB" - That's not a meaningful number. Each P-core has its own private 32 KiB of L1i cache, and 48 KiB of L1d cache. Unless each thread runs totally non-overlapping code and touches different data, adding them up across cores makes no sense. It makes sense that marketing does that to get bigger numbers, but it's useless for understanding footprint thresholds, especially for code, but also in most cases for data. A useful way to express it could be L1i = 32K x 12, L1d = 48K x 8 (P-cores) + 64K x 4 (E-cores). en.wikichip.org/wiki/intel/microarchitectures/alder_lakeLustig
(Actually I think that's an error in the Alder Lake wikichip page. en.wikichip.org/wiki/intel/microarchitectures/gracemont says E cores bumped up the I-cache size to 64K, with D-cache still being 32 K in the E-core, same as Tremont. And anandtech.com/show/16881/… agrees with that.)Lustig

While inlining is a powerful tool for optimizing programs, as it unlocks additional optimizations at the call site, it is still a trade-off because it increases program size.

This increased program size may have a negative impact on performance, as it can reduce locality. Locality is important because memory access is much quicker when data is close together in memory.

Some compilers even have flags you can set to optimize for code size or for performance.

TL;DR: inlining is not a one-size-fits-all solution.

Dunne answered 16/10, 2023 at 3:38 Comment(8)
This question is explicitly asking about factoring the code back into separate asm functions after inlining everything. If the compiler made good choices (which would be very hard / computationally expensive), the OP is expecting it could get similar code size to current builds with real compilers, but perhaps with better choices for where the function boundaries are than the human programmer made. This is possible in theory, but I think the obstacle is that it probably requires AI or something to do in a reasonable amount of compile time.Lustig
Simply inlining everything might increase program size so much it might not even fit on disk. AI might do a better job than current compilers at guessing what to inline. Even then there's still a trade-off so there's no one size fits all.Dunne
Yes, that's why the OP is proposing that a compiler could identify similar chunks of code and turn those into functions, regardless of what was a separate function or not in the source. Doing that until the code footprint is back down to a reasonable size is hypothetically possible. It's not something current compilers do. I'd suggest reading the whole question before answering it. (Also discussion in comments, where the OP responded, if that helps clarify what they were thinking.)Lustig
(This idea of the compiler re-factoring back into functions it invents is only introduced in the second half of the question, but has been in the title the whole time. So you're only answering the first half of the question, which as you say has a pretty obvious answer, that code-size would be a problem, including for the CPU's code cache.)Lustig
As @PeterCordes mentions, I would also like to have an answer for the second part of the question. Maybe compilers could generate these functions with respect to L1/2/3 cache size, so locality would also be optimized.Hour
@user1576055: That would require some profiling data for typical runs, otherwise the compiler wouldn't know which code is actually called during normal executions. (Otherwise it would have to guess, e.g. with heuristics to identify the error-handling side of if() statements. If it can see all the way to an abort() call or something on most of those paths of execution, then maybe that becomes easier, but error-handling that just leads to printing a warning and continuing might be rare, too.)Lustig
@user1576055: I think this might be an answer: the programmer has information about typical use cases / inputs that isn't embodied in the source code. So your idea might only be viable even in theory for profile-guided optimization, separate from the computational complexity that would need to be tamed.Lustig
Well, this is actually possible automatically: .NET has a feature called dynamic PGO. Basically, the first time some code is run, an unoptimized but quick-to-JIT version is used. If the runtime sees this code is used more often, it can decide to further optimize it, taking runtime data into account. In .NET 8 this will be on by default.Dunne
