An LTO build of a rather large shared library (many template instantiations) takes quite a long time (>10 min). I know a few things about the library and could specify some kind of "blacklist" of object files that do not need to be analyzed together (for example, because there are no calls among them that should be inlined), or I could specify groups of object files that should be analyzed together. Is this possible somehow (without splitting up the library)?
There is a little-used feature of `ld` called `-r`/`--relocatable` that can be used to combine multiple object files into one, which can later be linked into the final product. If you can get LTO to happen at this step, but not later, you get the kind of "partial" LTO you're looking for.
Sadly, `ld -r` won't work; it just combines all the LTO information to be processed later. But invoking it via the gcc driver (`gcc -r`) seems to work:
a.c
int a() { return 42; }
b.c
int a(void); int b() { return a(); }
c.c
int b(void); int c() { return b(); }
d.c
int c(void); int main() { return c(); }
$ gcc -O3 -flto -c [a-d].c
$ gcc -O3 -r -nostdlib a.o b.o -o g1.o
$ gcc -O3 -r -nostdlib c.o d.o -o g2.o
$ gcc -O3 -fno-lto g1.o g2.o
$ objdump -d a.out
...
00000000000004f0 <main>:
4f0: e9 1b 01 00 00 jmpq 610 <b>
...
0000000000000610 <b>:
610: b8 2a 00 00 00 mov $0x2a,%eax
615: c3 retq
...
So `main()` got optimized to `return b();`, and `b()` got optimized to `return 42;`, but there were no interprocedural optimizations between the two groups.
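The same recipe can be wrapped in a small script when there are more groups. This is a minimal sketch rather than a tested build system: the helpers `run` and `link_group` and the `RUN` switch are hypothetical names, and the files are the ones from the example above. By default it only prints the commands; set `RUN=1` to execute them. GCC 9 and later also document `-flinker-output=nolto-rel`, which forces code generation during an incremental link; adding it to `REL_FLAGS` may be necessary if `gcc -r` otherwise just merges the LTO bytecode.

```shell
#!/bin/sh
# Sketch of a grouped ("partial") LTO build using the gcc driver's -r mode.
# Hypothetical helpers: run, link_group. RUN=1 executes the commands;
# otherwise they are only printed.
CC=${CC:-gcc}
OPT=${OPT:--O3}
# Flags for the per-group relocatable link, as in the answer above.
# Assumption: on GCC >= 9 you may need to append -flinker-output=nolto-rel.
REL_FLAGS=${REL_FLAGS:--r -nostdlib}

run() {
    echo "$*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

# link_group OUT.o IN.o... : optimize only these objects together.
link_group() {
    out=$1
    shift
    # $OPT and $REL_FLAGS are intentionally unquoted so they split into words.
    run "$CC" $OPT $REL_FLAGS "$@" -o "$out"
}

run "$CC" $OPT -flto -c a.c b.c c.c d.c   # LTO bytecode for every source file
link_group g1.o a.o b.o                   # group 1: a + b
link_group g2.o c.o d.o                   # group 2: c + d
run "$CC" $OPT -fno-lto g1.o g2.o         # final link without further LTO
```

Keeping the grouping in one place makes it easy to move an object file from one group to another and compare build times.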
Assume that you want to optimize `a.c` and `b.c` together as one group and `c.c` and `d.c` as another group. You can use the `-combine` GCC switch as follows:
$ gcc -O3 -c -combine a.c b.c -o group1.o
$ gcc -O3 -c -combine c.c d.c -o group2.o
Note that you don't need to use LTO because the `-combine` switch combines multiple source code files before optimizing the code.
Edit: `-combine` is only supported for C code (and only in older GCC releases; the option was removed in GCC 4.6). An alternative way to achieve this is to use the `#include` directive as follows:
// file group1.cpp
#include "a.cpp"
#include "b.cpp"
// file group2.cpp
#include "c.cpp"
#include "d.cpp"
Then they can be compiled without using LTO as follows:
g++ -O3 group1.cpp group2.cpp
This effectively emulates grouped or partial LTO.
However, it's not clear whether this technique or the one proposed in the other answer compiles faster. The code may also not be optimized in exactly the same way, so the performance of the resulting code should be compared for each technique before choosing one.
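The group files for the `#include` approach can also be generated rather than written by hand. The following is a minimal sketch: `make_group` is a hypothetical helper, and the file names are the ones used in the answer. One caveat to keep in mind is that all sources of a group end up in a single translation unit, so names with internal linkage (`static` functions, anonymous namespaces) that are defined in more than one of the included files may clash.

```shell
#!/bin/sh
# Sketch: generate one "group" translation unit per set of sources.
# make_group is a hypothetical helper, not part of any tool.

# make_group GROUP.cpp SRC.cpp... : write a file that #includes each source.
make_group() {
    group=$1
    shift
    : > "$group"                      # create or truncate the group file
    for src in "$@"; do
        printf '#include "%s"\n' "$src" >> "$group"
    done
}

make_group group1.cpp a.cpp b.cpp
make_group group2.cpp c.cpp d.cpp

# Compile as in the answer (printed only, so the sketch runs without sources):
echo "g++ -O3 group1.cpp group2.cpp"
```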
`-combine` only works for C code, not for C++. – Fornax
`ld -r` works with LTO though. – Kutchins
`ld -r` works with LTO, but not really the way OP wants. It seems to just concatenate the LTO information, and the final link will do LTO over the whole program anyway. On the other hand, `gcc -r -nostdlib` seems to do what OP wants. – Pointsman
What does `gcc -r -nostdlib` do? Does it produce a native object file? Does it work with LTO? – Kutchins
`gcc -r` is supposed to be a wrapper for `ld -r`, I believe, in the same sense that you can do your final link with `gcc` instead of `ld`. The `-nostdlib` is so `gcc` doesn't get confused and pass a bunch of `-lc -lm -lgcc_s` stuff to `ld`. shitwefoundout.com/wiki/Combining_object_files – Pointsman
`gcc -r` will LTO-optimize all the input source files and produce an optimized native object file? Because this is exactly what OP is looking for. – Kutchins
`gcc -r` would be an ideal solution I think. – Kutchins
[…] `gcc -r` not work for some reason. – Fornax
You can exclude an object file from the link-time optimization process completely by just building it without `-flto`.
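As a rough illustration of mixing LTO and non-LTO objects in one link, the sketch below prints the commands for a shared library in which two files take part in LTO and one does not. The file names (`hot1.c`, `hot2.c`, `cold.c`, `libfoo.so`) and the helper `build_cmds` are hypothetical; the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: keep one object out of LTO by compiling it without -flto.
# File names (hot1.c, hot2.c, cold.c, libfoo.so) are hypothetical.
CC=${CC:-gcc}

build_cmds() {
    echo "$CC -O3 -fPIC -flto -c hot1.c hot2.c"   # these objects carry LTO bytecode
    echo "$CC -O3 -fPIC -c cold.c"                 # plain native object, no LTO bytecode
    echo "$CC -O3 -flto -shared hot1.o hot2.o cold.o -o libfoo.so"
}

build_cmds
```

At the final link, `cold.o` is treated as an ordinary object, so nothing is inlined into or out of it across file boundaries.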
[…] LTO while developing and only turn it on for a release candidate? – Mucro
[…] `-flto=8` (or whatever number, or `-flto=jobserver`) to get some parallelism? – Acetyl
[…] `-flto=40` :) The operation of LTO is described here: gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html There are three phases: LGEN, WPA, LTRANS. WPA partitions the code, and LTRANS then runs in parallel on the partitions. I can see around 15 threads running during the LTRANS phase, but it should be more. I would need to explicitly guide the partitioning of WPA to change that. – Fornax
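GCC documents a few options that influence how WPA partitions the program, which may be worth trying when fewer LTRANS jobs run than expected. The sketch below only prints candidate link lines: `-flto-partition=` and the `--param` names `lto-partitions` and `lto-min-partition` are documented GCC options, but the object names, the library name, the helper `lto_link`, and the numeric values are assumptions for illustration, and whether any of them helps for a given library is untested here.

```shell
#!/bin/sh
# Sketch: print link commands that try different WPA partitioning settings.
# Object and library names are hypothetical; the numbers are illustrative only.
CC=${CC:-gcc}
OBJS="a.o b.o c.o d.o"

# lto_link EXTRA_FLAGS... : print an LTO link line with the given extra flags.
lto_link() {
    echo "$CC -O3 -shared -flto=40 $* $OBJS -o libfoo.so"
}

lto_link                                  # default (balanced) partitioning
lto_link -flto-partition=1to1             # partitions mirror the original files
lto_link --param lto-partitions=64        # ask WPA for more partitions
lto_link --param lto-min-partition=1000   # allow smaller partitions
```

Timing each variant on the real library is the only reliable way to tell which setting, if any, improves parallelism in the LTRANS phase.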