Why not always use compiler optimization?

10

43

One of the questions that I asked some time ago had undefined behavior, so compiler optimization was actually causing the program to break.

But if there is no undefined behavior in your code, then is there ever a reason not to use compiler optimization? I understand that sometimes, for debugging purposes, one might not want optimized code (please correct me if I am wrong). Other than that, on production code, why not always use compiler optimization?

Also, is there ever a reason to use, say, -O instead of -O2 or -O3?

Tasty answered 22/10, 2011 at 5:23 Comment(4)
Undefined behaviour = broken program. The compiler is no longer at fault at that point, whether or not it optimizes. Anyway - compiling with full optimizations can take a long time. But other than that, no, I can't really think of a reason.Majoriemajority
There used to be many more bugs in GCC than there are now (remember the 2.x series?). There were plenty of programs for which GCC produced bad output at -O2 but not at -O1.Carbrey
Is this problem really still going on? IBM brought out a C & C++ compiler in about 1992 in which all optimizations were safe.Aloft
I'm never using anything but -O2 with GCC during development (and usually identical flags for release), no exceptions, no ifs and no whens. Works perfectly well, even in the debugger (except you might not be able to break at a particular line or watch a particular local when it has been completely optimized out, but that's not surprising -- you simply can't watch something that isn't there).Astrogation
36

If there is no undefined behavior, but there is definite broken behavior (either deterministic normal bugs, or indeterminate like race-conditions), it pays to turn off optimization so you can step through your code with a debugger.

Typically, when I reach this kind of state, I like to do a combination of:

  1. debug build (no optimizations) and step through the code
  2. sprinkled diagnostic statements to stderr so I can easily trace the run path

If the bug is more devious, I pull out valgrind and drd, and add unit-tests as needed, both to isolate the problem and to ensure that, once the problem is found, the solution works as expected.

In some extremely rare cases, the debug code works, but the release code fails. When this happens, almost always, the problem is in my code; aggressive optimization in release builds can reveal bugs caused by misunderstood lifetimes of temporaries, and so on... but even in this kind of situation, having a debug build helps to isolate the issues.

In short, there are some very good reasons why professional developers build and test both debug (non-optimized) and release (optimized) binaries. IMHO, having both debug and release builds pass unit-tests at all times will save you a lot of debugging time.
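
As a concrete sketch of that workflow with GCC (the file names here are hypothetical), both configurations build the same sources and run the same test binary:

g++ -O0 -g          app.cpp tests.cpp -o tests_debug     # debug: no optimization, full symbols
g++ -O2 -g -DNDEBUG app.cpp tests.cpp -o tests_release   # release: optimized, symbols kept for post-mortems
./tests_debug && ./tests_release                         # both builds must pass the same unit-tests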

Miler answered 22/10, 2011 at 5:47 Comment(4)
A program containing a data race has undefined behaviour (§1.10/21).Hypothesis
Also, when debugging large programs, a high optimization level can take longer to compile.Meteoroid
@robot1208: exactly; gcc -Og is aimed at that use-case: compile fairly fast, but without the -O0 anti-optimization / zero-effort behaviour of dumping values back to memory between every C statement. Possibly relevant for programs that still have some performance requirements to be tested, like games that might be unplayable if too slow.Greene
But you still lose some debugging functionality for local vars, especially in terms of setting new values for local vars and resuming as if making that change in the C abstract machine. Or even continuing at a different source line in the function, instead of recompiling the program with different source and restarting. (Supporting that is why -O0 asm is intentionally so clunky: see Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? for details on not doing constant-propagation etc.)Greene
30

Compiler optimisations have two disadvantages:

  1. Optimisations will almost always rearrange and/or remove code. This will reduce the effectiveness of debuggers, because there will no longer be a 1 to 1 correspondence between your source code and the generated code. Parts of the stack may be missing, and stepping through instructions may end up skipping over parts of the code in counterintuitive ways.
  2. Optimisation is usually expensive to perform, so your code will take longer to compile with optimisations turned on than otherwise. It is difficult to do anything productive while your code is compiling, so obviously shorter compile times are a good thing.

Some of the optimisations performed by -O3 can result in larger executables. This might not be desirable in some production code.

Another reason to not use optimisations is that the compiler that you are using may contain bugs that only exist when it is performing optimisation. Compiling without optimisation can avoid those bugs. If your compiler does contain bugs, a better option might be to report/fix those bugs, to change to a better compiler, or to write code that avoids those bugs completely.

If you want to be able to perform debugging on the released production code, then it might also be a good idea to not optimise the code.
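
To make point 1 above concrete, here is a minimal sketch (a hypothetical function): at -O2 a typical compiler folds the whole loop into a constant, so there is no i or sum left for a debugger to watch or step through, whereas at -O0 every iteration and both locals are still there.

/* At -O2, likely compiled to just "return 499500"; at -O0 the loop and
   both locals survive, so they can be stepped through and watched. */
int sum_to_1000(void)
{
    int sum = 0;
    for (int i = 0; i < 1000; ++i)   /* 0 + 1 + ... + 999 */
        sum += i;
    return sum;                      /* 499500 */
}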

Hypothesis answered 22/10, 2011 at 5:42 Comment(4)
I thought most of the compiler optimizations that result in larger code (like -funroll-loops) were in -O2, and -O3 is pretty much just -O2 + -fomit-frame-pointer. Am I operating off of old/outdated information?Lippold
@OmnipotentEntity: The GCC Manual is the definitive source. I won't bother repeating all the details, but yes -- -O2 can also lead to larger executables (-Os is the option to use for small executables). -fomit-frame-pointer is activated by -O (not -O3), -O3 contains optimisations that can potentially increase code size by a lot (as well as some other expensive to perform optimisations).Hypothesis
@OmnipotentEntity: Even GCC's -O3 hasn't included -funroll-loops for years; it's only on with -fprofile-use PGO. clang's -O2 does include it, though. But -fomit-frame-pointer has been on at -O1 since forever for some targets like x86-64, and has been for a while on i386 which changed later. godbolt.org/z/cheo97oWf GCC12 -O2 will include auto-vectorization (like clang) since many people are afraid of -O3 because of old folklore. (It's true that auto-vec does increase code-size, and sometimes isn't worth it, especially for cold code; use PGO to help GCC decide wisely.)Greene
Thanks! Looking through ancient gcc docs, it looks like -funroll-loops was never enabled by any -O level (at least since 2.9), and -fomit-frame-pointer was most commonly enabled at -O1. So I was just all around wrong.Lippold
11

3 Reasons

  1. It confuses the debugger, sometimes
  2. It's incompatible with some code patterns
  3. Not worth it: the optimizer is slow or buggy, takes too much memory, or produces code that's too big.

In case 2, imagine some OS code that deliberately changes pointer types. The optimizer can assume that objects of the wrong type could not be referenced, and may generate code that keeps stale copies of changing memory values in registers, producing the "wrong"¹ answer.
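
A hedged sketch of the kind of code this is describing (a hypothetical function, not taken from any real OS): under the strict-aliasing rules the optimizer may assume an int* and a float* cannot refer to the same object, keep *i cached in a register, and return the value written before the float store. gcc and clang's -fno-strict-aliasing defines this behaviour instead of leaving it undefined.

int alias_demo(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;   /* if f really points at the same memory as i, the optimizer
                    is allowed to assume this write cannot affect *i ...     */
    return *i;   /* ... so it may simply return the constant 1               */
}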

Case 3 is an interesting concern. Sometimes optimizers make code smaller but sometimes they make it bigger. Most programs are not the least bit CPU-bound and even for the ones that are, only 10% or less of the code is actually computationally-intensive. If there is any downside at all to the optimizer then it is only a win for less than 10% of a program.

If the generated code is larger, then it will be less cache-friendly. This might be worth it for a matrix algebra library with O(n³) algorithms in tiny little loops. But for something with more typical time complexity, overflowing the cache might actually make the program slower. Optimizers can be tuned for all this stuff, typically, but if the program is a web application, say, it would certainly be more developer-friendly if the compiler would just do the all-purpose things and allow the developer to just not open the fancy-tricks Pandora's box.
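
For what it's worth, when size is the concern the usual middle ground is a size-oriented optimization level rather than no optimization at all; a quick way to compare the effect on a single file (hypothetical file name):

gcc -O3 -c hot_path.c -o hot_path_O3.o
gcc -Os -c hot_path.c -o hot_path_Os.o
size hot_path_O3.o hot_path_Os.o        # compare the .text sizes of the two objects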


¹ Such programs are usually not standard-conforming, so the optimizer is technically "correct", but still not doing what the developer intended.

Rhapsodize answered 22/10, 2011 at 6:9 Comment(5)
For case 2, use -fno-strict-aliasing with compilers like GCC or clang, to define the behaviour of stuff like *(int*)&my_float. Compiling without optimization to work around strict-aliasing violations in your code is an extremely blunt instrument. I'd certainly believe that some people would not know any better, but it's not a good reason. Similarly with -fwrapv to define the behaviour of signed-integer overflow as wrapping instead of UB.Greene
Producing code that's too big makes little sense as a reason to disable optimization. All the bloat of store/reload makes larger code, and mainstream compilers have options like gcc/clang -Os, or I think MSVC -O1, to optimize for size (while still caring about speed some); that's usually how you'll get the most compact code. And indeed some people find that some codebases run fastest on some CPUs when compiled with -Os instead of -O2.Greene
It's not plausible for -Os code to ever be slower than -O0, barring some corner case of a missed-optimization compiler bug like in GCC optimization flag -O2 makes code much slower that -O0 (if -Os would also have inlined a strcmp against a string literal as an x86 repe cmpsb on any GCC version.) Also not very plausible for it to be larger. So if anyone is using anti-optimized debug builds for size reasons, they're doing it wrong. Or using compilers very different from GCC/clang/ICC/MSVC.Greene
@PeterCordes: Code using -O0 and register can sometimes outperform code processed at other optimization levels, at least on the ARM. Sometimes gcc applies heuristics which are expected to make code faster, but actually make it worse. For example, if a register-qualified object always holds a constant value throughout its lifetime, gcc may replace it with a constant even though using a register within a loop (as would happen with -O0) would be faster than using a constant within a loop.Oakleil
@PeterCordes: One wouldn't have to improve -O0 very much to make it be in many cases competitive with other optimization settings. Some simple things like observing that if arithmetic is performed on a register immediately before it is compared with zero, the comparison may be omitted, or if no downstream code will examine the upper 16 bits of a register there's no need to sign-extend it, and recognizing situations where a value is loaded or stored using an address that is the sum of two values already held in registers, would do a lot without affecting semantics of any programs.Oakleil
4

The reason is that you develop one application (the debug build) while your customers run a completely different application (the release build). If testing resources are low and/or the compiler used is not very popular, I would disable optimization for release builds.

MS publishes numerous hotfixes for optimization bugs in their MSVC x86 compiler. Fortunately, I've never encountered one in real life. But this was not the case with other compilers. The SH4 compiler in MS Embedded Visual C++ was very buggy.

Swithbart answered 22/10, 2011 at 6:54 Comment(0)
3

Two big reasons that I have seen arise from floating point math, and overly aggressive inlining. The former is caused by the fact that floating point math is extremely poorly defined by the C++ standard. Many processors perform calculations using 80 bits of precision, for instance, only dropping down to 64 bits when the value is put back into main memory. If one version of a routine flushes that value to memory frequently, while another only grabs the value once at the end, the results of the calculations can be slightly different. Just tweaking the optimizations for that routine may well be a better move than refactoring the code to be more robust to the differences.
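
As a hedged illustration of that effect (this is the behaviour often discussed as GCC bug 323), consider the following sketch: on a 32-bit x87 build at some optimization levels the message can actually appear, while on an SSE2/x86-64 build both sides are plain 64-bit doubles and it never does.

#include <stdio.h>

/* "q" may be spilled to memory and rounded to a 64-bit double, while
   "x / y" is recomputed and compared at 80-bit register precision, so
   the two values can differ.  Whether this fires depends entirely on
   the target, the compiler and the flags. */
void check(double x, double y)
{
    double q = x / y;
    if (q != x / y)
        printf("extended-precision surprise: %.20g\n", q);
}

int main(void)
{
    check(1.0, 3.0);
    return 0;
}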

Inlining can be problematic because, by its very nature, it generally results in larger object files. Perhaps this increase in code size is unacceptable for practical reasons: it needs to fit on a device with limited memory, for instance. Or perhaps the increase in code size makes the code slower. If a routine becomes big enough that it no longer fits in cache, the resultant cache misses can quickly outweigh the benefits inlining provided in the first place.
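
If a particular routine is the one being bloated, a narrower (and compiler-specific) fix than turning optimization off is to keep just that routine out of line; a sketch, assuming a GCC/Clang- or MSVC-style compiler:

#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE
#endif

/* Hypothetical large routine: NOINLINE keeps it as a single out-of-line
   copy instead of letting the optimizer duplicate its body at every call. */
NOINLINE static long checksum(const unsigned char *buf, long n)
{
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += buf[i];
    return sum;
}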

I frequently hear of people who, when working in a multi-threaded environment, turn off their debug build and immediately encounter hordes of new bugs due to newly uncovered race conditions and whatnot. The optimizer just revealed the underlying buggy code here, though, so turning it off in response is probably ill-advised.

Kovar answered 22/10, 2011 at 7:6 Comment(5)
Narrowing data by shuffling it between registers and cache/RAM is beyond the control of the compiler, because it can also happen when e.g. the kernel switches tasks and saves registers. People from numerics hate this, because it makes computations not 100% reproducible. You have this problem with or without optimizations. (FP math is defined in IEEE 754, and it is defined quite well I think; some optimizations may violate it, but only if you ask for non-compliant behavior, e.g. -ffast-math with gcc)Ultimatum
@eudoxos: I was referring to the C++ definition... with integral math, the exact results are specified in almost all cases, while with floating point math most bets are off.Kovar
@eudoxos, are you asserting that the kernel may only save 64 bits of an 80 bit x87 register on a task switch? That sounds... extraordinary to me.Lewie
@Russell Borogove: I tried to google it up to make sure; x87 has special instructions to save/load FP registers, and extended instructions to do so with SSE>=3. I found a paper hal.archives-ouvertes.fr/docs/00/12/81/24/PDF/… which extensively mentions register allocation, which does not make the same binary produce different results. I distinctly remember physicists complaining at a conference about the 80/64 narrowing unpredictability, though. Sorry for the noise, anyway.Ultimatum
@eudoxos: The FP non-determinism with x87 is between different binaries built from the same source with different compiler options. (e.g. optimized to keep extended precision temporaries in registers across statements or not; compilers default to violating the ISO C standard by doing it between statements instead of only within expressions, but you could use gcc -ffloat-store). The same binary is always deterministic; context-switches save the exact architectural state. Using dedicated instructions like xsave, but it would have been possible to use plain fstp m80 10-byte stores/reloads.Greene
2

Just happened to me. The code generated by swig for interfacing with Java is correct but won't work with -O2 on gcc.

Transmittance answered 22/10, 2011 at 5:29 Comment(2)
swig notoriously relies on aliasing aka using pointer casts to reinterpret the data type.Ethnography
Sounds like you need -fno-strict-aliasing -O2 then, rather than debug builds.Greene
1

Simple. Compiler optimization bugs.

Frei answered 22/10, 2011 at 5:29 Comment(2)
Can you give me some simple examples of such bugs?Tasty
One of them I can remember was with the gcc 2.95 compiler. The arithmetic for a certain integer expression was wrong when optimized with -O2 but correct when optimized with -O or no optimization. So, I had to turn down or turn off the optimization.Frei
1

Here is an example of why using an optimization flag can sometimes be dangerous, and why our tests should cover most of the code if we want to notice such an error.

Using clang (because gcc makes some optimizations even without an optimization flag, so the output is already affected):

File: a.cpp

#include <stdio.h>

int puts(const char *str) {
    fputs("Hello, world!\n", stdout);
    return 1;
}

int main() {
    printf("Goodbye!\n");
    return 0;
}

Without an optimization flag:

> clang --output withoutOptimization a.cpp; ./withoutOptimization

> Goodbye!

With an optimization flag (-O1):

> clang --output withO1 -O1 a.cpp; ./withO1

> Hello, world!
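
As the comments below point out, if redefining a standard library function like this is really intended, the usual fix is to stop the compiler from assuming the standard semantics of that name rather than to disable optimization; for example (behaviour may vary by compiler version):

> clang -fno-builtin -O1 a.cpp -o withO1NoBuiltin; ./withO1NoBuiltin

which should print "Goodbye!" again, because printf is no longer rewritten into a call to puts.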

Euchologion answered 15/4, 2014 at 12:14 Comment(8)
You can't really blame compiler optimizations for problems you get as a result of trying to redefine standard library functions, which leads to undefined behavior.Ring
I ain't blaming. I answer the question: "Why not always use compiler optimization". This is one of the reasons why. Because they might change the application behaviour. Am I right?Euchologion
Yes, optimization level often affects programs with undefined behaviour. If you want to redefine names reserved by the ISO C standard library, you need to compile with -fno-builtin. In this case, -fno-builtin-puts is unfortunately not sufficient for GCC, but -fno-builtin-printf is. godbolt.org/z/7c9fn83MT (GCC's printf optimizer still assumes that a function called puts will do what the ISO C standard says it should.) Clang is the same.Greene
More typical examples of UB worked around by disabling optimization is stuff like data races in multi-threaded code such as Multithreading program stuck in optimized mode but runs normally in -O0 (or with interrupts on a microcontroller)Greene
@PeterCordes: A lot of existing freestanding projects expect compilers to behave in a manner agnostic to whether they are freestanding or hosted, and it's common for such projects to include printf functions that are tailored to their individual needs. I really don't like the printf->puts substitution, especially since in many embedded projects it will often, at best, require the inclusion of code for an otherwise-unnecessary puts function which takes up more space than the eliminated newline characters.Oakleil
@supercat: GCC and clang both have -ffreestanding which makes this code work as expected: godbolt.org/z/x6Ya9E4rM (by implying -fno-builtin, I think). It can be used for a hosted built (as shown on Godbolt which actually runs the program); it's merely a code-gen option, not affecting linking. So GCC and clang both have an option named exactly for the use-case you mention; complaining that it's not the default for a hosted build seems unreasonable.Greene
@PeterCordes: Given that gcc and clang don't come with a standard library implementation, I question the assumption that puts will always be cheaper than printf while being semantically equivalent in all corner cases. If an underlying OS guarantees that its write-N-bytes-to-stream operation will behave atomically with respect to other processes that write to the same stream, but doesn't coordinate with any public locks, then a conforming puts("Hello world"); would be required to generate the string "Hello world\n" and output that using a single write operation, but the printf form could...Oakleil
...simply write the contents of the string literal directly.Oakleil
0

An optimization that is predicated on the idea that a program won't do X will be useful when processing tasks that don't involve doing X, but will be at best counter-productive when performing a task which could be best accomplished by doing X.

Because the C language is used for many purposes, the Standard deliberately allows compilers which are designed for specialized purposes to make assumptions about program behavior that would render them unsuitable for many other purposes. The authors of the Standard allowed implementations to extend the semantics of the language by specifying how they will behave in situations where the Standard imposes no requirements, and expected that quality implementations would seek to do so in cases where their customers would find it useful, without regard for whether the Standard required them to do so.

Programs that need to perform tasks not anticipated or accommodated by the Standard will often need to exploit constructs whose behavior is defined by many implementations, but not mandated by the Standard. Such programs are not "broken", but are merely written in a dialect that the Standard doesn't require all implementations to support.

As an example, consider the following function test and whether it satisfies the following behavioral requirements:

  1. If passed a value whose bottom 16 bits would match those of some power of 17, return the bottom 32 bits of that power of 17.

  2. Do not write to arr[65536] under any circumstances.

The code would appear as though it should obviously meet the second requirement, but can it be relied upon to do so?

#include <stdint.h>
int arr[65537];
uint32_t doSomething(uint32_t x)
{
    uint32_t i=1;
    while ((uint16_t)i != x)
        i*=17;
    if (x < 65536)
        arr[x] = 1;
    return i;
}
void test(uint32_t x)
{
    doSomething(x);
}

If the code is fed to clang with a non-zero optimization level, the generated machine code for test will fail the second requirement if x is 65536, since the generated code will be equivalent to simply arr[x] = 1;. Clang will perform this "optimization" even at -O1, and none of the normal options to limit broken optimizations will prevent it other than those which force C89 or C99 mode.

Oakleil answered 5/8, 2021 at 19:30 Comment(8)
Usually the right solution to problems like this is gcc -fwrapv -fno-strict-aliasing -O3 or similar, so you can still enable some optimization while telling the compiler to define behaviour that it normally doesn't. Options like this usually get added when new compiler versions want to start assuming that some form of UB doesn't happen, for exactly the reason you describe. (But it's not a perfect situation as you've ranted about all over the place on Stack Overflow. There are some things it's a pain to write safely with modern C compilers.)Greene
@PeterCordes: The options you describe do nothing to protect against arbitrary memory corruption in clang or gcc if a program compares for equality a just-past pointer for one object and a pointer to an object that happens to immediately follow it in address space--a situation explicitly defined by the Standard. They also do nothing to protect against memory corruption in clang if a program receives input that would cause it to loop endlessly. Note that optimizations that would cause such pointer comparisons to yield 0 or 1 arbitrarily, or would cause endless loops...Oakleil
...to be cleanly omitted, would be reasonable, but the optimizations by clang and gcc don't work that way. Instead, in situations where a pair of operations would each be rendered individually redundant by the other, clang and gcc may end up deciding to eliminate both. If a statement like if (x < 66000) arr[x] = 2; can't be relied upon to prevent a write to arr[66000] and above, not much of anything can be relied upon for anything.Oakleil
@PeterCordes: Do you know any options that would force clang to generate code meeting the specified requirements without having to disable optimizations entirely?Oakleil
I expect asm volatile(""); inside the loop might do the trick even in C++ mode, preventing infloop removal. I don't know of an option that would prevent that or the associated bug. That does sound like a real bug, though, could probably report it with that MCVE.Greene
@PeterCordes: The Standard doesn't forbid anything that compilers might do in case execution gets caught in an endless loop. A programmer working with a compiler that can cleanly omit loops in some cases in a manner that is agnostic with regard to whether they would terminate would be able to generate better machine code than one which had to ensure that all loops that might fail to terminate included dummy side effects. Since the Standard doesn't forbid clang's counter-productive behavior, however, I don't think clang's maintainers would view it as a bug.Oakleil
I didn't say asm volatile was a good solution, just the first I thought of. However, you only run into that if your have a loop that's intentionally infinite for some inputs. That seems like an unusual and pretty specific choice to make when writing that check, although I can imagine that it was reduced from code that didn't seem designed in a silly way. I think optimizing away the later check seems undesirable. But OTOH, that path of execution does encounter UB (infinite loop).Greene
@PeterCordes: The example was contrived to clearly demonstrate the issue. It's doubtful that clang would behave in problematic fashion with most non-contrived situations where some inputs might unintentionally cause a program to get stuck looping endlessly, but since clang is designed around the assumption that all possible behaviors would be equally acceptable if a program receives inputs that would get it stuck in a loop, such loops would need to be avoided at all costs even though that would negate any advantages to the "optimization".Oakleil
-13

One example is short-circuit boolean evaluation. Something like:

if (someFunc() && otherFunc()) {
  ...
}

A 'smart' compiler might realize that someFunc will always return false for some reason, making the entire statement evaluate to false, and decide to not call otherFunc to save CPU time. But if otherFunc contains some code that directly affects program execution (maybe it resets a global flag or something), it now won't perform that step and your program enters an unknown state.

Katelin answered 22/10, 2011 at 5:31 Comment(6)
hmm. If someFunc is always false, then otherFunc should never be evaluated, as the && operator must short-circuit.Proconsul
Your example is broken, if someFunc returns false, the generated code is not allowed to call otherFunc!Lawry
If someFunc returns false, then otherFunc should not be called at any optimization level. It would be a grievous violation of the C standard to do otherwise.Carbrey
@Proconsul the operator&& doesn't always short-circuit, because && and || can be overloaded and lose the short-circuit mechanism. Though overloading operators like &&, ||, and , is strongly discouraged.Rachael
Anyway, the answer is still incorrect. otherFunc is either short-circuited, or evaluated as intended if && gets overloaded.Rachael
The author of this answer hasn't logged in for 5 years; we should just vote to delete this brain-fart. && does short circuit in the abstract machine, and compilers obviously aren't allowed to change the visible side-effects when optimizing. Changing a global variable is a visible side-effect.Greene
