Is there a reason why not to use link-time optimization (LTO)?

Asked 19/5, 2014 at 11:24 Answered 15/11, 2021 at 5:17

Solved c++c performance compilation compiler-optimization

GCC, MSVC, LLVM, and probably other toolchains have support for link-time (whole program) optimization to allow optimization of calls among compilation units.

Is there a reason not to enable this option when compiling production software?

Footling answered 19/5, 2014 at 11:24 Comment(14)

See Why not always use compiler optimization?. The answers there are equally applicable here. – Gentilis 19/5, 2014 at 11:43

AFAIK, lto for gcc makes your executable bigger and incompatible with ld, ld is able to handle your compiled object because a plugin for ld from the gcc project does indeed exists, but this kind of optimizations are "not standard" according to the linker viewpoint. this general idea about an compiled object that is not packed as the other ones that are "standard" can possibly lead to all kinds of problems . – Nonjoinder 19/5, 2014 at 11:51

@Gentilis He asks "when compiling production software" so most of the answers there doesn't apply. – Tablecloth 19/5, 2014 at 11:52

@user2485710: Do you have documentation for incompatibility with ld? What I read in the current gcc docs (gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) and in a somewhat old wiki (gcc.gnu.org/wiki/LinkTimeOptimization) either says nothing about ld incompatibilities (gcc docs) or explicitly states compatibility (wiki). Judging from the mode of lto operation, namely having additional information in the object files, my guess would be that the object files maintain compatibility. – Xylotomous 19/5, 2014 at 12:5

The same answer as for skipping other optimization efforts -- it's not worth it. – Xylotomous 19/5, 2014 at 12:8

The question surely was not realated to debug builds but to one that should be fast, stable and given to customers. The compile time shouldn't be an issue here. – Footling 19/5, 2014 at 12:24

@PeterSchneider you keep looking in the gcc docs, you should take a look into the gnu binutils documentation ( ld is part of binutils, not gcc ), anyway the existence of a plugin alone for the linker confirms that this is a non-standard approach; I don't know if they plan to integrate this into the main trunk or make both gcc and binutils LTO capable by default, but certainly it's not something that is only about gcc . follow the documentation and the development about ld if you want to know more. – Nonjoinder 19/5, 2014 at 12:25

@user2485710: As you surely know gcc is actually a driver for the different stages needed to produce a desired output (e.g. a preprocessed file, an object file, an executable, a library). The wiki states "Despite the "link time" name, LTO does not need to use any special linker features. The basic mechanism needed is the detection of GIMPLE sections inside object files. This is currently implemented in collect2. Therefore, LTO will work on any linker already supported by GCC." It's just that this page is from 2009 so I'm still interested in specific recent documentation. – Xylotomous 19/5, 2014 at 13:1

Enabling -O2 makes a difference of ca. +5 seconds on a 10 minute build here. Enabling LTO makes a difference of ca +3 minutes, and sometimes ld runs out of address space. This is a good reason to always compile with -O2 (so the executables that you debug are binary-identical with the ones you'll ship!) and not to use LTO until it is mature enough (which includes acceptable speed). Your mileage may vary. – Checkerboard 19/5, 2014 at 13:23

@Damon: If your code is used by a million+ users, your 3 minutes saving do not justify their daily millisecond wait. – Tequila 19/5, 2014 at 14:20

@Checkerboard Thank you for mentioning specific resource requirements of LTO. For bigger projects there surely should be dedicated build station powerful enough to do all work. – Footling 19/5, 2014 at 14:21

@MSalters: Yes certainly, especially if you only do LTO for the release build. But my point is that stuff like -O2 is so cheap (and works fine with the debugger) that you can debug the same, identical binary that you'll ship without even knowing a difference, since you never use anything different. LTO during development is a noticeable extra cost (time = cost). OTOH, a LTOed "release" binary, would be different from the one you've been debugging (possibly exhibiting some UB that you have in your code, or some compiler or linker bug). – Checkerboard 19/5, 2014 at 14:37

@Damon: The release build is not the build I've been debugging, but the build which survived testing. Test gets a separate build anyhow, installed on a clean machine (so I know the install package isn't missing any dependencies). – Tequila 19/5, 2014 at 14:41

I believe enabling link-time optimisation caused this error: https://mcmap.net/q/16703/-loop_apply-o-file-not-recognized-file-format-not-recognized – Missend 7/9, 2017 at 20:49

I assume that by "production software" you mean software that you ship to the customers / goes into production. The answers at Why not always use compiler optimization? (kindly pointed out by Mankarse) mostly apply to situations in which you want to debug your code (so the software is still in the development phase -- not in production).

6 years have passed since I wrote this answer, and an update is necessary. Back in 2014, the issues were:

Link time optimization occasionally introduced subtle bugs, see for example Link-time optimization for the kernel. I assume this is less of an issue as of 2020. Safeguard against these kinds of compiler and linker bugs: Have appropriate tests to check the correctness of your software that you are about to ship.
Increased compile time. There are claims that the situation has significantly improved since 2014, for example thanks to slim objects.
Large memory usage. This post claims that the situation has drastically improved in recent years, thanks to partitioning.

As of 2020, I would try to use LTO by default on any of my projects.

Tablecloth answered 19/5, 2014 at 12:8 Comment(14)

I agree with such answer. I also have no clue why not to use LTO by default. Thanks for confirmation. – Footling 19/5, 2014 at 12:24

@Honza: Probably because it tends to use massive amounts of resources. Try compiling Chromium, Firefox, or LibreOffice with LTO... (FYI: At least one of them is not even compilable on 32-bit machines with GNU ld, even without LTO, simply because the working set does not fit in virtual address space!) – Idiomorphic 19/5, 2014 at 12:47

@Footling I am glad you find my answer useful. I use LTO as the default option in my projects and it gives a significant performance boost to my programs (up to 2.5x). That's a pretty good deal given that I only have to pass -flto and that's it. :) – Tablecloth 19/5, 2014 at 13:13

May introduce? Unless the compiler is broken, it won't. May uncover? Sure. As can any other optimization of broken code. – Gilson 14/10, 2018 at 17:42

Developers are impatient and LTO is slow. That's why – Lawana 24/5, 2020 at 11:21

@Gilson You do realize that the answer was written in 2014, right? At the time, the implementation of LTO was still somewhat buggy; see also the article I linked to. – Tablecloth 24/5, 2020 at 11:37

@Lawana In my experience, developers don't have to wait for the compilation of the release build to finish. Building the release version should be part of the release process or the CI/CD pipeline. Even if LTO is slow, it should not matter to the developers as they are not waiting for it. Long release build times should not block them in their daily work. – Tablecloth 24/5, 2020 at 11:43

@Ali: Both gcc and llvm are prone to make assumptions about program behavior which will often hold true, but have no justification in the Standard. For example, if they can see that if an object is written with some bit pattern, and some other action will cause the storage to be written with that same bit pattern, they will sometimes treat the latter write as setting the Effective Type of the object to that of the first write. LTO expands the number of cases where those compilers may make such unjustifiable "optimizations", and the output of an LTO build shouldn't be trusted without testing. – Tanaka 23/3, 2021 at 19:40

@Tanaka Agreed. However, there is no evidence that the kind of things you describe would happen in any of my projects; LTO does not affect the program behavior (at least not in the tested cases). Thanks for the info, I find it interesting! – Tablecloth 23/3, 2021 at 20:37

@Ali: Such things are unlikely to adversely affect program behavior, but I'd rather have a compiler which is designed to avoid optimizations that might adversely avoid program behavior, than one which attempts to perform unsound optimizations which won't "usually" affect program behavior, but may occasionally do so in ways that cannot be predicted nor reasoned about. – Tanaka 23/3, 2021 at 21:5

@Tanaka I have run into situations in which compiler optimizations caused nasty bugs, but those had nothing to do with LTO. For example, the compiler optimizations messed up the floating point arithmetic, which in turn caused an infinite loop. The code was correct but the compiler was trying to be overly clever. Based on your comment, LTO is probably not for you. – Tablecloth 23/3, 2021 at 21:40

@Ali: LTO would be fine if performed by a compiler whose optimization philosophy seeks to maximize the range of tasks that can be accomplished with reasonable efficiently and 100% reliability, rather than seeking to maximize the efficiency with which a vaguely-designed subset of tasks may be performed. – Tanaka 23/3, 2021 at 21:44

@Tanaka Agreed. However, I am afraid LTO is not the only optimization like that, sadly. The one that I mentioned concerns floating-point computations. What you write is true for those too. :( – Tablecloth 23/3, 2021 at 21:52

@Ali: Floating-point is IMHO both over-specified and underspecified. The Standard shouldn't impose judgments about what floating-point semantics implementations must provide, but should provide more means by which implementations can indicate what semantic guarantees they do provide, allowing programs to refuse to run on implementations whose guarantees are insufficient to meet requirements, but then assume that implementations will uphold all the tested guarantees. – Tanaka 23/3, 2021 at 21:58

This recent question raises another possible (but rather specific) case in which LTO may have undesirable effects: if the code in question is instrumented for timing, and separate compilation units have been used to try to preserve the relative ordering of the instrumented and instrumenting statements, then LTO has a good chance of destroying the necessary ordering.

I did say it was specific.

Debbi answered 14/6, 2016 at 9:53 Comment(0)

If you have well written code, it should only be advantageous. You may hit a compiler/linker bug, but this goes for all types of optimisation, this is rare.

Biggest downside is it drastically increases link time.

Cellule answered 8/11, 2018 at 13:37 Comment(10)

Why does it increase compile time? Isn't it the case that the compiler stops compilation at a certain point (it generates some internal representation of the code, and puts this into the object file instead of the fully compiled code), so it should be faster instead? – Fruitage 8/11, 2018 at 13:46

Because the compiler must now create the GIMPLE bytecode as well as the object file so the linker has enough information to optimise. Creating this GIMPLE bytecode has overhead. – Cellule 8/11, 2018 at 15:16

As far as I know, when using LTO, the compiler generates only the bytecode, i.e., no processor specific assembly is emitted. So it should be faster. – Fruitage 8/11, 2018 at 17:12

The GIMPLE is part of the object file alright gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html – Cellule 8/11, 2018 at 17:50

It has additional compile time overhead on any codebase if you time it – Cellule 8/11, 2018 at 17:50

The assembly is also emitted – Cellule 8/11, 2018 at 17:51

Wait your right just GIMPLE, regardless there is significant overhead if you time just the compile time alone, without the linking – Cellule 8/11, 2018 at 17:52

Here are the compilation times for a project: with LTO: 88 sec. without LTO: 230 sec. A 2.6x difference. If I turn off optimization, the difference is much less, but still there, 1.3x difference. This result is not surprising, as I've said, with LTO, the compiler has (much) less work to do. – Fruitage 8/11, 2018 at 21:32

I'll change the answer to just link time, maybe just my codebase – Cellule 9/11, 2018 at 9:0

Link times for LTO are MUCH slower. – Entomology 22/5, 2019 at 18:5

Apart from to this,

Consider a typical example from embedded system,

void function1(void) { /*Do something*/} //located at address 0x1000 
void function2(void) { /*Do something*/} //located at address 0x1100
void function3(void) { /*Do something*/} //located at address 0x1200

With predefined addressed functions can be called through relative addresses like below,

 (*0x1000)(); //expected to call function2
 (*0x1100)(); //expected to call function2
 (*0x1200)(); //expected to call function3

LTO can lead to unexpected behavior.

updated:

In automotive embedded SW development,Multiple parts of SW are compiled and flashed on to a separate sections. Boot-loader, Application/s, Application-Configurations are independently flash-able units. Boot-loader has special capabilities to update Application and Application-configuration. At every power-on cycle boot-loader ensures the SW application and application-configuration's compatibility and consistence via Hard-coded location for SW-Versions and CRC and many more parameters. Linker-definition files are used to hard-code the variable location and some function location.

Shcherbakov answered 30/5, 2019 at 8:49 Comment(3)

This is an interesting comment because LTO could potentially cause the linker to inline small and rarely used functions. I tested a slightly different example with GCC 9.2.1 and Clang 8.0.0 on Fedora and it worked fine. The only difference was that I used an array of function pointers: ``` typedef int FUNC(); FUNC *ptr[3] = {func1, func2, func3}; return (*ptr)() + (*(ptr+1))() + (*(ptr+2))(); ``` – Barbabra 29/10, 2019 at 9:26

You used a linker script to make sure those symbols end up at those addresses? And you call from elsewhere using hard-coded absolute addresses, instead of symbols? That seems inefficient for most ISAs. Or are you talking about function pointers? – Delegate 15/5, 2022 at 13:2

@PeterCordes: Yes, functions and certain constant variables such as sw versions, CRC are placed at a fixed location. Bootloader will ensure the compatibility of the sw before jumping to appropriate section. Linker definition files are used to place the symbols at fixed location. – Shcherbakov 16/5, 2022 at 6:17

Given that the code is implemented correctly, then link time optimization should not have any impact on the functionality. However, there are scenarios where not 100% correct code will typically just work without link time optimization, but with link time optimization the incorrect code will stop working. There are similar situations when switching to higher optimization levels, like, from -O2 to -O3 with gcc.

That is, depending on your specific context (like, age of the code base, size of the code base, depth of tests, are you starting your project or are you close to final release, ...) you would have to judge the risk of such a change.

One scenario where link-time-optimization can lead to unexpected behavior for wrong code is the following:

Imagine you have two source files read.c and client.c which you compile into separate object files. In the file read.c there is a function read that does nothing else than reading from a specific memory address. The content at this address, however, should be marked as volatile, but unfortunately that was forgotten. From client.c the function read is called several times from the same function. Since read only performs one single read from the address and there is no optimization beyond the boundaries of the read function, read will always when called access the respective memory location. Consequently, every time when read is called from client.c, the code in client.c gets a freshly read value from the address, just as if volatile had been used.

Now, with link-time-optimization, the tiny function read from read.c is likely to be inlined whereever it is called from client.c. Due to the missing volatile, the compiler will now realize that the code reads several times from the same address, and may therefore optimize away the memory accesses. Consequently, the code starts to behave differently.

Rumpf answered 27/8, 2019 at 17:7 Comment(1)

Another more relevant issue is code which is non-portable but correct when processed by implementations that, as a form of "conforming language extension", specify their behavior in more situations than mandated by the Standard. – Tanaka 23/3, 2021 at 22:1

Rather than mandating that all implementations support the semantics necessary to accomplish all tasks, the Standard allows implementations intended to be suitable for various tasks to extend the language by defining semantics in corner cases beyond those mandated by the C Standard, in ways that would be useful for those tasks.

An extremely popular extension of this form is to specify that cross-module function calls will be processed in a fashion consistent with the platform's Application Binary Interface without regard for whether the C Standard would require such treatment.

Thus, if one makes a cross-module call to a function like:

uint32_t read_uint32_bits(void *p)
{
  return *(uint32_t*)p;
}

the generated code would read the bit pattern in a 32-bit chunk of storage at address p, and interpret it as a uint32_t value using the platform's native 32-bit integer format, without regard for how that chunk of storage came to hold that bit pattern. Likewise, if a compiler were given something like:

uint32_t read_uint32_bits(void *p);
uint32_t f1bits, f2bits;
void test(void)
{
  float f;
  f = 1.0f;
  f1bits = read_uint32_bits(&f);
  f = 2.0f;
  f2bits = read_uint32_bits(&f);
}

the compiler would reserve storage for f on the stack, store the bit pattern for 1.0f to that storage, call read_uint32_bits and store the returned value, store the bit pattern for 2.0f to that storage, call read_uint32_bits and store that returned value.

The Standard provides no syntax to indicate that the called function might read the storage whose address it receives using type uint32_t, nor to indicate that the pointer the function was given might have been written using type float, because implementations intended for low-level programming already extended the language to supported such semantics without using special syntax.

Unfortunately, adding in Link Time Optimization will break any code that relies upon that popular extension. Some people may view such code as broken, but if one recognizes the Spirit of C principle "Don't prevent programmers from doing what needs to be done", the Standard's failure to mandate support for a popular extension cannot be viewed as intending to deprecate its usage if the Standard fails to provide any reasonable alternative.

Tanaka answered 22/3, 2021 at 22:0 Comment(14)

How is this relevant? Type punning is a C language feature completely unrelated to LTO. – Litman 4/8, 2021 at 19:9

@MattF.: In the absence of LTO, abstract and physical machine states will be synchronized whenever execution crosses compilation-unit boundaries. If code stores a value to a 64-bit unsigned long and passes its address as a void* to a function in a different compilation unit that casts it to a 64-bit unsigned long long* and dereferences it, then unless the implementation uses LTO behavior would be defined in terms of the platform ABI without regard for whether the called function accesses storage using the same type as the caller. – Tanaka 4/8, 2021 at 19:17

@MattF.: Basically, my point is that the Committees saw no need for the Standard to let programmers demand that compilers do things that programmers might need them to do, but which they'd have no way of avoiding doing, but then compilers were changed so that compilers could avoid such things without regard for whether programmers might need them. – Tanaka 4/8, 2021 at 19:20

would be defined in terms of the platform ABI without regard for whether the called function accesses storage using the same type as the caller.

That's true regardless of LTO. By definition a pointer cast reinterprets the type regardless of its actual data. – Litman 4/8, 2021 at 21:3

@MattF.: If a compiler can see that a function only writes to pointers of type unsigned long long, and never dereferences any pointers of type unsigned long, it may refrain from synchronizing the abstract and physical values of objects of type unsigned long before/after calling the function, thus breaking any code that would rely upon the operations on type unsigned long being processed according to the platform ABI. – Tanaka 4/8, 2021 at 21:14

Are you referring to optimizations of the value that the pointer points to? Or of optimizations relating to whether or not the value could have changed? I'm pretty sure the compiler will be conservative (and actually check the value after calling) if you pass pointers to functions that are not part of the translation unit. – Litman 4/8, 2021 at 21:19

@MattF.: Compilers will be conservative when code calls functions that are not part of the translation unit, when they're not using LTO. The problem is that LTO effectively causes functions outside the translation unit to be treated as though they were inside it. – Tanaka 4/8, 2021 at 21:25

...in which case the linker will know that it's not a problem to make that optimization. I still don't see your point. – Litman 4/8, 2021 at 21:25

@MattF.: The optimization will yield behavior which is contrary to the platform ABI. The fact that the linker views that as "not a problem" is the problem. – Tanaka 4/8, 2021 at 21:36

A pointer is a pointer. How the pointer is interpreted by the software is not part of the ABI. The most information an ABI could include about pointers would be their size when pushed to the stack (4 or 8 bytes depending on 32- or 64-bit). The type of a pointer is certainly part of the API, but the ABI couldn't care less. – Litman 5/8, 2021 at 3:24

@MattF.: On a platform where long and long long are both stored using the platform's natural 64-bit representation, if a calling function writes storage using a long*, a called function increments the storage using a long long*, and the calling function reads it back using a long*, a compiler that respects platform ABI conventions without regard for whether the C Standard requires it to do so will treat the operations using long long* as affecting the same storage as those using long* even though the C Standard would allow the calling code to cache its values elsewhere... – Tanaka 5/8, 2021 at 14:38

...such as in registers, if the called function doesn't use long* or a character pointer to access the storage. The maintainers of clang and gcc view such caching, which would be allowed by the C Standard but not the ABI, as being one of the purposes of LTO, and thus regard any program which is incompatible with such treatment as "broken". – Tanaka 5/8, 2021 at 14:41

a compiler ... will treat the operations using long long* as affecting the same storage as those using long* because they can be (and in your example are) the same pointer, therefore by definition they affect the same storage when one is modified. – Litman 5/8, 2021 at 18:27

@MattF.: The Standard grants compilers permission to behave in arbitrary fashion if a pointer of type long* is used to read storage written using a pointer of type long long*, even on platforms where both types happen to have the same representation, and the effect of writing one type and reading the other would be defined by the platform ABI. When LTO is enabled, compilers like clang and gcc are designed to exploit this permission even in cases where no individual compilation unit ever uses more than one type to access the storage. – Tanaka 5/8, 2021 at 19:6

LTO could also reveal edge-case bugs in code-signing algorithms. Consider a code-signing algorithm based on certain expectations about the TEXT portion of some object or module. Now LTO optimizes the TEXT portion away, or inlines stuff into it in a way the code-signing algorithm was not designed to handle. Worst case scenario, it only affects one particular distribution pipeline but not another, due to a subtle difference in which encryption algorithm was used on each pipeline. Good luck figuring out why the app won't launch when distributed from pipeline A but not B.

Lightfoot answered 15/11, 2021 at 5:17 Comment(0)

-1

LTO support is buggy and LTO related issues has lowest priority for compiler developers. For example: mingw-w64-x86_64-gcc-10.2.0-5 works fine with lto, mingw-w64-x86_64-gcc-10.2.0-6 segfauls with bogus address. We have just noticed that windows CI stopped working.

Please refer the following issue as an example.

Rebut answered 6/3, 2021 at 10:21 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

updated:

Recommended topics

Hot tags