GCC, MSVC, LLVM, and probably other toolchains have support for link-time (whole program) optimization to allow optimization of calls among compilation units.
Is there a reason not to enable this option when compiling production software?
I assume that by "production software" you mean software that you ship to the customers / goes into production. The answers at Why not always use compiler optimization? (kindly pointed out by Mankarse) mostly apply to situations in which you want to debug your code (so the software is still in the development phase -- not in production).
Six years have passed since I wrote this answer, and an update is necessary: the issues that existed back in 2014 have since been addressed. As of 2020, I would try to use LTO by default on any of my projects.
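With GCC or Clang that means passing the flag to both the compile and the link steps; a minimal sketch (file names are illustrative):

gcc -O2 -flto -c a.c
gcc -O2 -flto -c b.c
gcc -O2 -flto -o prog a.o b.o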
Just -flto and that's it. :) – Tablecloth

This recent question raises another possible (but rather specific) case in which LTO may have undesirable effects: if the code in question is instrumented for timing, and separate compilation units have been used to try to preserve the relative ordering of the instrumented and instrumenting statements, then LTO has a good chance of destroying the necessary ordering.
I did say it was specific.
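A hedged sketch of the kind of setup that comment describes (the function names and the use of clock_gettime are my assumptions, not taken from the question):

/* timer.c -- kept in its own compilation unit so that, without LTO,
   calls into it act as barriers the optimizer will not move code across */
#include <time.h>

static struct timespec t0;

void timer_start(void)
{
    clock_gettime(CLOCK_MONOTONIC, &t0);
}

double timer_stop(void)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* bench.c */
void timer_start(void);
double timer_stop(void);
void work(void);

double measure(void)
{
    /* Compiled separately, work() must stay between the two opaque calls.
       With LTO, all three may be inlined and partially reordered, so the
       measured interval may no longer bracket the work. */
    timer_start();
    work();
    return timer_stop();
}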
If you have well-written code, it should only be advantageous. You may hit a compiler/linker bug, but this goes for all types of optimisation and is rare.

The biggest downside is that it drastically increases link time.
Apart from this, consider a typical example from an embedded system:

void function1(void) { /* Do something */ } // located at address 0x1000
void function2(void) { /* Do something */ } // located at address 0x1100
void function3(void) { /* Do something */ } // located at address 0x1200

With the addresses fixed in advance, the functions can be called through those addresses, like below:

((void (*)(void))0x1000)(); // expected to call function1
((void (*)(void))0x1100)(); // expected to call function2
((void (*)(void))0x1200)(); // expected to call function3

Because LTO may inline, merge, or relocate these functions, such calls through hard-coded addresses can lead to unexpected behavior.
In automotive embedded SW development, multiple parts of the SW are compiled and flashed onto separate sections. Boot-loader, application(s), and application configuration are independently flashable units. The boot-loader has special capabilities to update the application and the application configuration. At every power-on cycle, the boot-loader verifies the compatibility and consistency of the SW application and the application configuration via hard-coded locations for SW versions, CRCs, and many more parameters. Linker definition files are used to hard-code the locations of those variables and of some functions.
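A sketch of the pattern (the section name, the GCC-style attribute syntax, and the values are my assumptions, not taken from the answer):

/* app_header.c -- data the boot-loader expects at a fixed flash address */
__attribute__((section(".app_header"), used))
const unsigned long app_header[2] = {
    0x00010203ul, /* SW version word, made-up layout */
    0xDEADBEEFul  /* CRC placeholder, typically patched in post-build */
};

A matching entry in the linker definition file pins .app_header to the agreed address; the used attribute asks the toolchain not to discard the object even if nothing in the application references it, which matters precisely when whole-program optimization gets aggressive.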
Given that the code is implemented correctly, link time optimization should not have any impact on the functionality. However, there are scenarios where code that is not 100% correct will typically just work without link time optimization, but stop working once link time optimization is enabled. There are similar situations when switching to higher optimization levels, like from -O2 to -O3 with gcc.
That is, depending on your specific context (like, age of the code base, size of the code base, depth of tests, are you starting your project or are you close to final release, ...) you would have to judge the risk of such a change.
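As one hedged illustration of such "not 100% correct" code (my example, not the answer's): signed integer overflow is undefined behavior, and code relying on wraparound often appears to work until the optimizer exploits the UB:

#include <limits.h>
#include <stdio.h>

/* Relies on signed wraparound, which is undefined behavior in C. */
static int no_overflow(int x)
{
    return x + 1 > x; /* an optimizer may fold this to 1 for every x */
}

int main(void)
{
    /* Typically prints 0 at -O0 (wraparound happens to occur) and
       1 at -O2/-O3 (the comparison is folded away). */
    printf("%d\n", no_overflow(INT_MAX));
    return 0;
}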
One scenario where link-time optimization can lead to unexpected behavior for wrong code is the following:

Imagine you have two source files, read.c and client.c, which you compile into separate object files. In read.c there is a function read() that does nothing else than read from a specific memory address. The content at this address should be marked volatile, but unfortunately that was forgotten. From client.c the function read() is called several times from the same function. Since read() performs only a single read from the address, and there is no optimization beyond the boundaries of the read() function, read() will access the respective memory location every time it is called. Consequently, every time read() is called from client.c, the code in client.c gets a freshly read value from the address, just as if volatile had been used.

Now, with link-time optimization, the tiny function read() from read.c is likely to be inlined wherever it is called from client.c. Due to the missing volatile, the compiler will then realize that the code reads several times from the same address, and may therefore optimize away the memory accesses. Consequently, the code starts to behave differently.
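A minimal sketch of that scenario (the register address and the polling loop are my additions for illustration):

/* read.c */
#include <stdint.h>

#define REG_ADDR 0x40000000u /* made-up hardware register address */

uint32_t read(void)
{
    /* BUG: should be '*(volatile uint32_t *)REG_ADDR' */
    return *(uint32_t *)REG_ADDR;
}

/* client.c */
#include <stdint.h>

uint32_t read(void);

void wait_until_ready(void)
{
    /* Compiled separately, each iteration performs a fresh load.
       With LTO, read() may be inlined and the repeated loads folded
       into one, so the loop can spin forever on a stale value. */
    while (read() != 1u)
        ;
}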
Rather than mandating that all implementations support the semantics necessary to accomplish all tasks, the Standard allows implementations intended to be suitable for various tasks to extend the language by defining semantics in corner cases beyond those mandated by the C Standard, in ways that would be useful for those tasks.
An extremely popular extension of this form is to specify that cross-module function calls will be processed in a fashion consistent with the platform's Application Binary Interface without regard for whether the C Standard would require such treatment.
Thus, if one makes a cross-module call to a function like:

uint32_t read_uint32_bits(void *p)
{
    return *(uint32_t*)p;
}

the generated code would read the bit pattern in a 32-bit chunk of storage at address p, and interpret it as a uint32_t value using the platform's native 32-bit integer format, without regard for how that chunk of storage came to hold that bit pattern. Likewise, if a compiler were given something like:
uint32_t read_uint32_bits(void *p);
uint32_t f1bits, f2bits;

void test(void)
{
    float f;
    f = 1.0f;
    f1bits = read_uint32_bits(&f);
    f = 2.0f;
    f2bits = read_uint32_bits(&f);
}
the compiler would reserve storage for f on the stack, store the bit pattern for 1.0f to that storage, call read_uint32_bits and store the returned value, store the bit pattern for 2.0f to that storage, call read_uint32_bits and store that returned value.
The Standard provides no syntax to indicate that the called function might read the storage whose address it receives using type uint32_t, nor to indicate that the pointer the function was given might have been written using type float, because implementations intended for low-level programming already extended the language to support such semantics without using special syntax.
Unfortunately, adding in Link Time Optimization will break any code that relies upon that popular extension. Some people may view such code as broken, but if one recognizes the Spirit of C principle "Don't prevent programmers from doing what needs to be done", the Standard's failure to mandate support for a popular extension cannot be viewed as intending to deprecate its usage if the Standard fails to provide any reasonable alternative.
If code creates an object of type unsigned long and passes its address as a void* to a function in a different compilation unit that casts it to a 64-bit unsigned long long* and dereferences it, then unless the implementation uses LTO, behavior would be defined in terms of the platform ABI without regard for whether the called function accesses storage using the same type as the caller. – Tanaka
"would be defined in terms of the platform ABI without regard for whether the called function accesses storage using the same type as the caller": that's true regardless of LTO. By definition a pointer cast reinterprets the type regardless of its actual data. – Litman
With LTO, if a compiler can see that the called function only dereferences pointers of type unsigned long long, and never dereferences any pointers of type unsigned long, it may refrain from synchronizing the abstract and physical values of objects of type unsigned long before/after calling the function, thus breaking any code that would rely upon the operations on type unsigned long being processed according to the platform ABI. – Tanaka
On a platform where long and long long are both stored using the platform's natural 64-bit representation, if a calling function writes storage using a long*, a called function increments the storage using a long long*, and the calling function reads it back using a long*, a compiler that respects platform ABI conventions without regard for whether the C Standard requires it to do so will treat the operations using long long* as affecting the same storage as those using long*, even though the C Standard would allow the calling code to cache its values elsewhere... – Tanaka
...even when the code only ever uses a long* or a character pointer to access the storage. The maintainers of clang and gcc view such caching, which would be allowed by the C Standard but not the ABI, as being one of the purposes of LTO, and thus regard any program which is incompatible with such treatment as "broken". – Tanaka
Tanaka a compiler ... will treat the operations using long long* as affecting the same storage as those using long*
because they can be (and in your example are) the same pointer, therefore by definition they affect the same storage when one is modified. –
The C Standard does not require a compiler to account for the possibility that a long* is used to read storage written using a pointer of type long long*, even on platforms where both types happen to have the same representation, and the effect of writing one type and reading the other would be defined by the platform ABI. When LTO is enabled, compilers like clang and gcc are designed to exploit this permission even in cases where no individual compilation unit ever uses more than one type to access the storage. – Tanaka
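A minimal sketch of the cross-module pattern these comments are debating (file and function names are mine):

/* bump.c */
void bump(void *p)
{
    /* Accesses the caller's storage using a different type than the
       caller used. The platform ABI makes this work when the call
       crosses object-file boundaries; the C Standard does not
       guarantee it, and LTO may break it. */
    *(long long *)p += 1;
}

/* main.c */
#include <stdio.h>

void bump(void *p);

int main(void)
{
    long n = 41; /* assumes long and long long share a 64-bit representation */
    bump(&n);    /* writes the storage through a long long* */
    printf("%ld\n", n); /* with LTO, a compiler is allowed to still print 41 */
    return 0;
}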
LTO could also reveal edge-case bugs in code-signing algorithms. Consider a code-signing algorithm based on certain expectations about the TEXT portion of some object or module. Now LTO optimizes the TEXT portion away, or inlines stuff into it in a way the code-signing algorithm was not designed to handle. Worst-case scenario, it only affects one particular distribution pipeline but not another, due to a subtle difference in which encryption algorithm was used on each pipeline. Good luck figuring out why the app won't launch when distributed from pipeline A but not B.
LTO support is buggy, and LTO-related issues have the lowest priority for compiler developers. For example: mingw-w64-x86_64-gcc-10.2.0-5 works fine with LTO, while mingw-w64-x86_64-gcc-10.2.0-6 segfaults with a bogus address. We only noticed because our Windows CI stopped working. Please refer to the following issue as an example.
-O2 makes a difference of ca. +5 seconds on a 10-minute build here. Enabling LTO makes a difference of ca. +3 minutes, and sometimes ld runs out of address space. This is a good reason to always compile with -O2 (so the executables that you debug are binary-identical with the ones you'll ship!) and not to use LTO until it is mature enough (which includes acceptable speed). Your mileage may vary. – Checkerboard

-O2 is so cheap (and works fine with the debugger) that you can debug the same, identical binary that you'll ship without even knowing a difference, since you never use anything different. LTO during development is a noticeable extra cost (time = cost). OTOH, an LTOed "release" binary would be different from the one you've been debugging (possibly exhibiting some UB that you have in your code, or some compiler or linker bug). – Checkerboard