Is optimisation level -O3 dangerous in g++?

I have heard from various sources (though mostly from a colleague of mine) that compiling with an optimisation level of -O3 in g++ is somehow 'dangerous' and should be avoided in general unless proven to be necessary.

Is this true, and if so, why? Should I just be sticking to -O2?

Equilateral answered 18/7, 2012 at 16:29 Comment(8)
It's only dangerous if you're relying on undefined behaviour. And even then I'd be surprised if it was the optimisation level that messed something up.Drusie
It adds "-finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize, -ftree-partial-pre and -fipa-cp-clone options" gcc.gnu.org/onlinedocs/gcc/…Betweenwhiles
The compiler is still constrained to produce a program that behaves "as if" it compiled your code exactly. I don't know that -O3 is considered particularly buggy? I think perhaps it can make undefined behavior "worse" as it may do weird and wonderful things based on certain assumptions, but that would be your own fault. So generally, I'd say it's fine.Inion
It's true that higher optimizations levels are more prone to compiler bugs. I've hit a few cases myself, but in general they are still pretty rare.Jefferson
-O2 turns on -fstrict-aliasing, and if your code survives that then it'll probably survive other optimizations, since that's one that people get wrong over and over. That said, -fpredictive-commoning is only in -O3, and enabling that might enable bugs in your code caused by incorrect assumptions about concurrency. The less wrong your code is, the less dangerous optimization is ;-)Reliable
Oh, and let's not forget that gcc now has -Ofast which enables even more optimizations (that partly seem to depend on stricter handling of the language rules)Kauffmann
@PlasmaHH, I don't think "stricter" is a good description of -Ofast, it turns off IEEE-compliant handling of NaNs for exampleInappetence
@JonathanWakely: indeed, it's not a good word. As a non-native speaker, translation sometimes fails me. I can't really come up with a good word.Kauffmann
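
To illustrate the -fstrict-aliasing point raised in the comments above, here is a minimal sketch (not from the thread; the names are invented) of the kind of code that -O2 can legitimately break:

#include <cstdio>

// Classic strict-aliasing violation (invented example): under -fstrict-aliasing
// (enabled at -O2 and above) the compiler may assume a write through a float*
// cannot modify an int object, so it can fold `return g` to 2.
int g = 1;

int observe(float* p)
{
    g = 2;
    *p = 3.0f;   // undefined behaviour when p actually points at g
    return g;    // may still return 2 when optimized
}

int main()
{
    // Forcing the aliasing with reinterpret_cast is exactly what the rule forbids;
    // the well-defined way to reinterpret bytes is std::memcpy.
    std::printf("%d\n", observe(reinterpret_cast<float*>(&g)));
}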

In the early days of gcc (2.8 etc.), in the times of egcs, and with Red Hat's 2.96, -O3 was sometimes quite buggy. But that is over a decade ago, and -O3 is not much different from the other optimization levels in terms of bugginess.

It does, however, tend to reveal cases where people rely on undefined behavior, because it leans more heavily on the rules, and especially the corner cases, of the language(s).
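
As an invented illustration of that point (not the answer author's code): an "overflow check" that relies on signed wrap-around may appear to work at -O0 yet be folded away at higher optimization levels, since signed overflow is undefined.

#include <climits>
#include <cstdio>

// Signed overflow is undefined, so the optimizer may assume x + 1 > x always
// holds and compile this function down to `return false`.
bool will_wrap(int x)
{
    return x + 1 < x;   // intended as an overflow check, but overflowing is UB
}

int main()
{
    std::printf("%d\n", will_wrap(INT_MAX));   // may print 1 at -O0 and 0 at -O2/-O3
}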

As a personal note: I have been running production software in the financial sector for many years now with -O3, and I have not yet encountered a bug that would not also have been there had I used -O2.

By popular demand, here is an addition:

-O3, and especially additional flags like -funroll-loops (not enabled by -O3), can sometimes lead to more machine code being generated. Under certain circumstances (e.g. on a CPU with an exceptionally small L1 instruction cache) this can cause a slowdown, because all the code of e.g. some inner loop no longer fits into L1I. Generally gcc tries quite hard not to generate so much code, but since it usually optimizes for the generic case, this can happen. Options especially prone to this (like loop unrolling) are normally not included in -O3 and are marked accordingly in the manpage. As such it is generally a good idea to use -O3 for generating fast code, and only fall back to -O2 or -Os (which tries to optimize for code size) when appropriate (e.g. when a profiler indicates L1I misses).
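
As a rough way to check the code-size side of that trade-off (a sketch only; the file names are invented), you can compare the object code each level produces:

g++ -O2 -c hot_loop.cpp -o hot_loop_O2.o
g++ -O3 -c hot_loop.cpp -o hot_loop_O3.o
g++ -Os -c hot_loop.cpp -o hot_loop_Os.o
size hot_loop_O2.o hot_loop_O3.o hot_loop_Os.o   # compare the .text columns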

If you want to take optimization to the extreme, you can tweak the costs associated with certain optimizations in gcc via --param. Additionally, note that gcc now has the ability to attach attributes to functions that control optimization settings just for those functions, so when you find you have a problem with -O3 in one function (or want to try out special flags for just that function), you don't need to compile the whole file or even the whole project with -O2.
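
For example (a minimal sketch; the function names are invented), GCC's optimize attribute lets you override the level for a single function:

// Compile the surrounding file with -O3, but treat this one function as if
// it were built with -O2 (e.g. while chasing a suspected -O3 problem).
__attribute__((optimize("O2")))
void suspicious_function()
{
    // ...
}

// Or request extra flags for just one hot function.
__attribute__((optimize("O3", "unroll-loops")))
void hot_inner_loop()
{
    // ...
}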

On the other hand, it seems that care must be taken when using -Ofast, whose documentation states:

-Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard compliant programs.

which makes me conclude that -O3 is intended to be fully standards compliant.
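
A small invented example of the kind of standards deviation -Ofast permits: -Ofast implies -ffast-math, which includes -ffinite-math-only, so the compiler may assume NaN never occurs and fold a self-comparison check away.

#include <cmath>
#include <cstdio>

// Reliable under -O3; under -Ofast the compiler may turn this into `return false`.
bool is_nan(double x)
{
    return x != x;
}

int main()
{
    std::printf("%d\n", is_nan(std::nan("")));   // may print 1 with -O3 and 0 with -Ofast
}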

Kauffmann answered 18/7, 2012 at 16:40 Comment(11)
I do something like the opposite: I always use -Os or -O2 (sometimes -O2 generates a smaller executable). After profiling I use -O3 on the parts of the code that take more execution time, and that alone can give up to 20% more speed.Szechwan
@DarioOO: It all depends on your goals, I never care about binary sizes.Kauffmann
I do that for speed. -O3 makes things slower most of the time. I don't know exactly why; I suspect it pollutes the instruction cache.Szechwan
@DarioOO I feel like pleading "code bloat" is a popular thing to do, but I almost never see it backed with benchmarks. It depends a lot on architecture, but every time I see published benchmarks (e.g. phoronix.com/…) it shows O3 being faster in the vast majority of cases. I've seen the profiling and careful analysis required to prove that code bloat was actually an issue, and it usually only happens for people that embrace templates in an extreme way.Villosity
@NirFriedman: It tends to become an issue when the inlining cost model of the compiler has bugs, or when you optimize for a totally different target than the one you run on. Interestingly this applies to all optimization levels...Kauffmann
#28875825 has an example of code that is slower with -O3. I never got around to writing up an answer explaining that -O3 decided the branch was unpredictable and used cmov, leading to slower behaviour when the branch is predictable. I have also seen -O3 make giant bloated code on rare occasions (gcc 5.2), but -O2 doesn't do auto-vectorization, so you're potentially losing out on a lot of perf when compiling for x86 or other targets with fast int/FP vector instructions.Octosyllabic
@PeterCordes: So gcc has bugs. I hope someone reported them.Kauffmann
@PlasmaHH: the using-cmov issue would be hard to fix for the general case. Usually you haven't just sorted your data, so when gcc is trying to decide if a branch is predictable or not, static analysis looking for calls to std::sort functions is unlikely to help. Using something like #110210 would help, or maybe write the source to take advantage of the sorted-ness: scan until you see >=128, then start summing. As for the bloated code, yeah I intend to get around to reporting it. :POctosyllabic
It’s now over a decade ago (though not when you wrote it), but in the GCC 3.4 era common practice (and off-the-record insider advice) was to ship with O2, since O3 bugs were more common and often deprioritized, e.g., gcc.gnu.org/bugzilla/show_bug.cgi?id=23870.Bein
Upvoted for redhat 2.96 :'( Sadly, even more recent redhat patches are buggy.Opener
PlasmaHH and @PeterCordes, I used a lot of information from here in my answer to a related question, but I'd appreciate it if you could write an alternative answer or correct me if I've got anything wrong, or if you have anything to add!Dorman

In my somewhat checkered experience, applying -O3 to an entire program almost always makes it slower (relative to -O2), because it turns on aggressive loop unrolling and inlining that make the program no longer fit in the instruction cache. For larger programs, this can also be true for -O2 relative to -Os!

The intended use pattern for -O3 is that, after profiling your program, you manually apply it to a small handful of files containing critical inner loops that actually benefit from these aggressive space-for-speed tradeoffs. Newer versions of GCC have a profile-guided optimization mode that can (IIUC) selectively apply the -O3 optimizations to hot functions, effectively automating this process.
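
A sketch of that profile-guided workflow with g++ (the source file name and workload are invented; exact behaviour depends on the GCC version):

g++ -O2 -fprofile-generate myprog.cpp -o myprog   # instrumented build
./myprog typical-workload                         # run it to collect a profile
g++ -O2 -fprofile-use myprog.cpp -o myprog        # rebuild guided by the profile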

Vintager answered 14/11, 2013 at 18:46 Comment(2)
"almost always"? Make it "50-50", and we'll have a deal ;-).Vladi
gcc -O3 hasn't included -funroll-loops for a long time, because of the problem you point out with unrolling loops other than very hot ones. -O3 does still include auto-vectorization, but GCC 12 added that to -O2. (That was a few releases after changes that mean it usually doesn't create as much bloat in loop prologues to deal with alignment, although trip counts that might not be a multiple of the vector width can still be very bloated, especially for a uint8_t loop with AVX-512.)Octosyllabic

Yes, O3 is buggier. I'm a compiler developer and I've identified clear and obvious gcc bugs caused by O3 generating buggy SIMD assembly instructions when building my own software. From what I've seen, most production software ships with O2, which means O3 gets less attention with respect to testing and bug fixes.

Think of it this way: O3 adds more transformations on top of O2, which adds more transformations on top of O1. Statistically speaking, more transformations means more bugs. That's true for any compiler.

Quartus answered 25/8, 2016 at 20:17 Comment(1)
This is too simple. When -O3 entirely removes unreachable code, any optimization made by -O2 or -O1 in the unreachable code becomes irrelevant. This shows that optimizations do not stack as assumed here. Also, "when building my own software" is a clear warning signal. -O3 will apply optimizations based on the behavior of the code being compiled. Any Undefined Behavior in the code can trigger unexpected optimizations. But Undefined Behavior means that the compiler cannot be wrong.Dealer

The -O3 option turns on more expensive optimizations, such as function inlining, in addition to all the optimizations of the lower levels ‘-O2’ and ‘-O1’. The ‘-O3’ optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower.
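
To see exactly what your g++ version enables at each level (a quick check, not something from this answer; the output file names are arbitrary), you can diff the optimizer flag listings:

g++ -Q --help=optimizers -O2 > o2.txt
g++ -Q --help=optimizers -O3 > o3.txt
diff o2.txt o3.txt   # shows which passes -O3 adds on top of -O2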

Kenrick answered 18/7, 2012 at 18:12 Comment(6)
I understand that some "apparent optimizations" might make a program slower, but do you have a source that claims that GCC -O3 has made a program slower?Febrifugal
@MooingDuck: While I can not cite a source, I remember running into such a case with some older AMD processors that had quite a small L1I cache (~10k instructions). I am sure google has more for the interested, but especially options like loop unrolling are not part of O3, and those increase sizes a lot. -Os is the one for when you want to make executable smallest. Even -O2 can increase code size. A nice tool to play with the outcome of different optimization levels is the gcc explorer.Kauffmann
@PlasmaHH: Actually, a tiny cache size is something a compiler could screw up, good point. That's a really good example. Please put it in the answer.Febrifugal
@Kauffmann Pentium III had 16KB code cache. AMD's K6 and above actually had 32KB instruction cache. P4's started with around 96KB worth. Core I7 actually has a 32KB L1 code cache. Instruction decoders are strong nowadays, so your L3 is good enough to fall back on for almost any loop.Hypoderm
You'll see an enormous performance increase any time there is a function called in a loop and it can do significant common subexpression elimination and hoisting of unnecessary recalculation out of the function to before the loop.Hypoderm
AMD processors do not have L3 (as far as I know; maybe that changed recently)Szechwan

Some time ago I experienced a problem when using optimization with g++. The problem was related to a PCI card, where the registers (for commands and data) were represented by a memory address. My driver mapped the physical address to a pointer within the application and gave it to the called process, which worked with it like this:

unsigned int * pciMemory;
askDriverForMapping( & pciMemory );   // maps the card's registers into the process
...
// every write below targets the same memory-mapped register
pciMemory[ 0 ] = someCommandIdx;
pciMemory[ 0 ] = someCommandLength;
for ( int i = 0; i < sizeof( someCommand ); i++ )
    pciMemory[ 0 ] = someCommand[ i ];

The card didn't act as expected. When I looked at the assembly code, I understood that the compiler wrote only the last element of someCommand into pciMemory, omitting all the preceding writes.

In conclusion: be careful and attentive with optimization.
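
A minimal sketch of the fix discussed in the comments below (the names are hypothetical, mirroring the snippet above): qualifying the mapped pointer as volatile makes every store an observable side effect, so none of the writes may be elided or reordered away.

#include <cstddef>

// Hypothetical mapped register, standing in for pciMemory above; volatile tells
// the compiler that each store must actually be performed.
volatile unsigned int* pciMemory;   // assumed to be set up by the driver mapping

void sendCommand(unsigned int commandIdx, unsigned int commandLength,
                 const unsigned int* command, std::size_t wordCount)
{
    pciMemory[0] = commandIdx;       // none of these stores may be dropped or
    pciMemory[0] = commandLength;    // merged, even though they all hit the
    for (std::size_t i = 0; i < wordCount; ++i)
        pciMemory[0] = command[i];   // same address
}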

Ayurveda answered 25/3, 2013 at 15:1 Comment(7)
But the point here is that your program simply has undefined behaviour; the optimiser did nothing wrong. In particular you need to declare pciMemory as volatile.Uncouth
@KonradRudolph yes, of course, but when I said "some time ago" I was playing it down. It was about 10 years ago, and I didn't know about volatile back then. BTW, why is it UB?Ayurveda
It’s actually not UB but the compiler is within its right to omit all but the last writes to pciMemory because all other writes provably have no effect. For the optimiser that’s awesome because it can remove many useless and time-consuming instructions.Uncouth
I found this in the standard (after 10+ years): "A volatile declaration may be used to describe an object corresponding to a memory-mapped input/output port or an object accessed by an asynchronously interrupting function. Actions on objects so declared shall not be ‘‘optimized out’’ by an implementation or reordered except as permitted by the rules for evaluating expressions."Ayurveda
@Ayurveda Somewhat off-topic, but how do you know that your device has taken the command before you send a new command?Lowboy
@Lowboy I saw it from the device's behaviour, but it was a great quest.Ayurveda
@user877329: PCI cards have a control wire that they can use to extend memory cycles, effectively stalling the bus until they are complete. Some boards use such control signals to avoid the need for software polling, while others rely upon software polling since stalling the bus may block programs running on other cores from accessing other cards.Ingenue
