gcc LTO: Limit scope of optimization

3

6

An LTO build of a fairly large shared library (many template instantiations) takes a long time (more than 10 minutes). I know a few things about the library, so I could specify a kind of "blacklist" of object files that do not need to be analyzed together (because there are no calls among them that should be inlined), or I could specify groups of object files that should be analyzed together. Is this possible somehow, without splitting up the library?

Fornax answered 27/2, 2018 at 22:5 Comment(4)
You could just not build with LTO while developing and only turn it on for a release candidate? - Mucro
Repeated local builds are also necessary when analyzing and fixing performance problems. - Fornax
I am not sure you would win much. Did you try -flto=8 (or whatever number or -flto=jobserver) to get some parallelism? - Acetyl
I'm already using -flto=40 :) The operation of LTO is described here: gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html There are three phases: LGEN, WPA, LTRANS. WPA partitions the code, and LTRANS then runs in parallel on the partitions. I can see around 15 threads running during the LTRANS phase, but it should be more. I would need to explicitly guide the partitioning of WPA to change that. - Fornax
5

There is a little-used feature of ld, -r/--relocatable, that combines multiple object files into one, which can later be linked into the final product. If LTO can be made to happen at this step, but not later, you get the kind of "partial" LTO you're looking for.

Sadly ld -r won't work; it just combines all the LTO information to be processed later. But invoking it via the gcc driver (gcc -r) seems to work:

a.c

int a() {
    return 42;
}

b.c

int a(void);

int b() {
    return a();
}

c.c

int b(void);

int c() {
    return b();
}

d.c

int c(void);

int main() {
    return c();
}

$ gcc -O3 -flto -c [a-d].c
$ gcc -O3 -r -nostdlib a.o b.o -o g1.o
$ gcc -O3 -r -nostdlib c.o d.o -o g2.o
$ gcc -O3 -fno-lto g1.o g2.o
$ objdump -d a.out
...
00000000000004f0 <main>:
 4f0:   e9 1b 01 00 00          jmpq   610 <b>
...
0000000000000610 <b>:
 610:   b8 2a 00 00 00          mov    $0x2a,%eax
 615:   c3                      retq   
...

So main() was optimized to return b(); (c() was inlined within the second group), and b() was optimized to return 42; (a() was inlined within the first group), but no interprocedural optimization happened across the two groups.
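
The same grouping can be wrapped in a small shell helper, so that each group is turned into an ordinary object file before the final link. This is only a sketch: the function name lto_group and the CC override are assumptions made for illustration, and it relies on the gcc -r -nostdlib behaviour observed above, which may differ between GCC versions.

```shell
#!/bin/sh
# Sketch: run LTO on one group of objects and emit a single relocatable object.
# lto_group is a hypothetical helper name; CC defaults to gcc.
# Usage: lto_group OUTPUT.o INPUT.o...
lto_group() (
    out=$1
    shift
    # -r requests a relocatable link; -nostdlib keeps the driver from
    # adding the standard libraries and startup files to it.
    "${CC:-gcc}" -O3 -r -nostdlib "$@" -o "$out"
)

# Example, mirroring the commands above:
#   gcc -O3 -flto -c a.c b.c c.c d.c
#   lto_group g1.o a.o b.o
#   lto_group g2.o c.o d.o
#   gcc -O3 -fno-lto g1.o g2.o
```

The final link keeps -fno-lto, as in the commands above, so that the groups are not optimized against each other again.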

Pointsman answered 7/3, 2018 at 21:18 Comment(1)
Thanks for posting this. I've updated my answer as well. - Kutchins
3

Assume that you want to optimize a.c and b.c together as one group and c.c and d.c as another group. You can use the -combine GCC switch as follows:

$ gcc -O3 -c -combine a.c b.c -o group1.o
$ gcc -O3 -c -combine c.c d.c -o group2.o

Note that you don't need to use LTO because the -combine switch combines multiple source code files before optimizing the code.

Edit

-combine only ever supported C code, and GCC 4.6 and later no longer accept it. An alternative way to achieve this is to use the #include directive as follows:

// file group1.cpp
#include "a.cpp"
#include "b.cpp"

// file group2.cpp
#include "c.cpp"
#include "d.cpp"

Then they can be compiled without using LTO as follows:

g++ -O3 group1.cpp group2.cpp

This effectively emulates grouped or partial LTO.

However, it's not clear whether this technique or the one proposed in the other answer compiles faster. The code may also not be optimized in exactly the same way, so the performance of the resulting binaries should be compared before choosing a technique.
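
If the groups change often, the group files could be generated rather than written by hand. The following is a minimal sketch, assuming a hypothetical helper named make_group. It only writes the #include lines and does not check whether the sources can be combined safely: file-local names such as static functions or anonymous-namespace entities may collide once they share a translation unit.

```shell
#!/bin/sh
# Sketch: write a "unity" source file that #includes every file in a group.
# make_group is a hypothetical helper name.
# Usage: make_group OUTPUT.cpp INPUT.cpp...
make_group() (
    out=$1
    shift
    for src in "$@"; do
        printf '#include "%s"\n' "$src"
    done > "$out"
)

# Example:
#   make_group group1.cpp a.cpp b.cpp
#   make_group group2.cpp c.cpp d.cpp
#   g++ -O3 group1.cpp group2.cpp
```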

Kutchins answered 6/3, 2018 at 17:46 Comment(9)
Thanks for the suggestion. As far as I understand, -combine only works for C code, not for C++. - Fornax
@MartinRichtarsky I just checked the manual, you're right. I missed that bit, sorry. Check this alternative technique. Otherwise, I'm not sure whether ld -r works with LTO though. - Kutchins
ld -r works with LTO but not really the way OP wants. It seems to just concatenate the LTO information, and the final link will do LTO over the whole program anyway. On the other hand, gcc -r -nostdlib seems to do what OP wants. - Pointsman
@TavianBarnes What does gcc -r -nostdlib do? Does it produce a native object file? Does it work with LTO? - Kutchins
gcc -r is supposed to be a wrapper for ld -r I believe, in the same sense that you can do your final link with gcc instead of ld. The -nostdlib is so gcc doesn't get confused and pass a bunch of -lc -lm -lgcc_s stuff to ld. shitwefoundout.com/wiki/Combining_object_files - Pointsman
@TavianBarnes So you're saying that gcc -r will LTO-optimize all the input source files and produce an optimized native object file? Because this is exactly what OP is looking for. - Kutchins
It seems like it, yes. I tested with four files in two groups like you, and checked the disassembly to see that inlining had occurred between {a,b}.o, and between {c,d}.o, but not across the groups. - Pointsman
@TavianBarnes Well, if that worked, then you should post an answer and get the bounty. gcc -r would be an ideal solution I think. - Kutchins
@HadiBrais Thanks for the answer. I think this will work in principle, but I fear it will cause problems when compiling some of our source files together. But it will be an alternative should gcc -r not work for some reason. - Fornax
0

You can exclude an object file from the link-time optimization process completely by building it without -flto.
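
As a rough sketch of how that could look in a build script: the helper below compiles one file and adds -flto unless the file is listed in an exclusion variable. The names compile_one and NO_LTO are assumptions made for illustration, not GCC features.

```shell
#!/bin/sh
# Sketch: compile one file, adding -flto unless it is listed in NO_LTO.
# compile_one and NO_LTO are hypothetical names; CC defaults to gcc.
# Usage: NO_LTO="c.c d.c" compile_one FILE.c
compile_one() (
    src=$1
    lto=-flto
    for skip in ${NO_LTO:-}; do
        if [ "$skip" = "$src" ]; then
            lto=
        fi
    done
    # ${lto:+...} expands to nothing when the file is excluded.
    "${CC:-gcc}" -O3 ${lto:+"$lto"} -c "$src"
)
```

Objects built this way can still be passed to the final gcc -O3 -flto link together with the LTO objects. They are linked as ordinary machine code and take no part in the interprocedural analysis.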

Towandatoward answered 5/3, 2018 at 16:40 Comment(2)
I do not want to fully exclude files, I just want to guide the optimizer with annotations like "optimize these object files together". - Fornax
AFAIK, you can't do that. You either build an object file with the structures for LTO (i.e. some gcc bytecode) alongside the regular machine code, or without them. - Towandatoward
