Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
I am disassembling this code on llvm clang Apple LLVM version 8.0.0 (clang-800.0.42.1):

#include <stdio.h>

int main() {
    float a=0.151234;
    float b=0.2;
    float c=a+b;
    printf("%f", c);
}

I compiled with no -O option, but I also tried -O0 (which gives the same result) and -O2 (which actually computes the value at compile time and stores it precomputed).

The resulting disassembly is the following (I removed the parts that are not relevant):

->  0x100000f30 <+0>:  pushq  %rbp
    0x100000f31 <+1>:  movq   %rsp, %rbp
    0x100000f34 <+4>:  subq   $0x10, %rsp
    0x100000f38 <+8>:  leaq   0x6d(%rip), %rdi       
    0x100000f3f <+15>: movss  0x5d(%rip), %xmm0           
    0x100000f47 <+23>: movss  0x59(%rip), %xmm1        
    0x100000f4f <+31>: movss  %xmm1, -0x4(%rbp)  
    0x100000f54 <+36>: movss  %xmm0, -0x8(%rbp)
    0x100000f59 <+41>: movss  -0x4(%rbp), %xmm0         
    0x100000f5e <+46>: addss  -0x8(%rbp), %xmm0
    0x100000f63 <+51>: movss  %xmm0, -0xc(%rbp)
    ...

Apparently it's doing the following:

  1. load the two floats into registers xmm0 and xmm1
  2. store them on the stack
  3. load one value (not the one xmm0 held earlier) from the stack into xmm0
  4. perform the addition
  5. store the result back to the stack

I find it inefficient because:

  1. Everything could be done in registers. I am not using a and b later, so the compiler could skip any operation involving the stack.
  2. Even if it wanted to use the stack, it could avoid reloading xmm0 from the stack if it performed the operations in a different order.

Given that the compiler is always right, why did it choose this strategy?

Aureus answered 18/11, 2018 at 23:16 Comment(3)
Because you didn't enable optimizations, and this is the simplest way to do it. – Maturity
Even though the basic answer is simple, thanks for writing up this well-formatted question. There is some interesting stuff to say, and this looks like a good place to put a canonical answer that I've often repeated as part of other answers. Now I can just link to this as a go-to for -O0 being a bad choice for looking at compiler-generated asm, and exactly what -O0 implies for the asm. – Maddocks
Don't try to predict execution time by looking at asm/C code; a modern CPU is an extremely complex black box, and if you aren't an expert you can easily be wrong. CPUs execute instructions out of order and at different speeds; pipelining, data dependencies, and superscalar execution mean a longer, dumber-looking program can run faster than a shorter, obvious one. That's the general rule: always measure by running, don't just look at the code. – Nagano

-O0 (unoptimized) is the default. It tells the compiler you want it to compile fast (short compile times), not to take extra time compiling to make efficient code.

(-O0 isn't literally no optimization; e.g. gcc will still eliminate code inside if(1 == 2){ } blocks. And gcc, more than most other compilers, still does things like using multiplicative inverses for division at -O0, because it transforms your C source through multiple internal representations of the logic before eventually emitting asm.)
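The multiplicative-inverse trick is easy to see for yourself: compile a function like the sketch below (the function name is mine) and look at the asm. Even at -O0, gcc typically avoids an actual div instruction for division by a compile-time constant.

```c
#include <assert.h>

/* Division by a constant: even at -O0, gcc usually compiles this
 * into a multiply by a fixed-point "magic" inverse plus shifts,
 * rather than emitting a slow div instruction. */
unsigned div13(unsigned x) {
    return x / 13;
}
```

(The result is of course identical either way; only the instructions chosen differ.)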

Plus, "the compiler is always right" is an exaggeration even at -O3. Compilers are very good at a large scale, but minor missed-optimizations are still common within single loops. Often with very low impact, but wasted instructions (or uops) in a loop can eat up space in the out-of-order execution reordering window, and be less hyper-threading friendly when sharing a core with another thread. See C++ code for testing the Collatz conjecture faster than hand-written assembly - why? for more about beating the compiler in a simple specific case.


More importantly, -O0 also implies treating all variables similarly to volatile, for consistent debugging: you can set a breakpoint or single-step, modify the value of a C variable, then continue execution and have the program work the way you'd expect from your C source running on the C abstract machine. So the compiler can't do any constant propagation or value-range simplification. (e.g. an integer that's known to be non-negative can simplify expressions using it, or make some if conditions always true or always false.)
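A minimal sketch of the constant-propagation point (the function name is mine): with optimization enabled, a compiler would fold this whole function down to returning 7, but at -O0 the comparison and branch are kept so a debugger can change x and take the other path.

```c
#include <assert.h>

/* With optimization, x is provably 3 so the branch is always taken
 * and the function folds to "return 7". At -O0 the store, reload,
 * compare, and branch all survive so a debugger can modify x. */
int branch_demo(void) {
    int x = 3;
    if (x >= 0)
        return 7;
    return -1;
}
```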

(It's not quite as bad as volatile: multiple references to the same variable within one statement don't always result in multiple loads; at -O0 compilers will still optimize somewhat within a single expression.)
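For instance, in a sketch like this (the function name is mine), both uses of *p sit inside one expression, so even at -O0 compilers typically emit a single load, where a genuinely volatile pointer would require two:

```c
#include <assert.h>

/* Both references to *p are in one statement, so at -O0 compilers
 * usually fold them into one load; a volatile int * would force
 * two separate loads. */
int sum_twice(int *p) {
    return *p + *p;
}
```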

Compilers have to specifically anti-optimize for -O0 by storing/reloading all variables to their memory address between statements. (In C and C++, every variable has an address unless it was declared with the (now obsolete) register keyword and has never had its address taken. Optimizing away the address is possible according to the as-if rule for other variables, but isn't done at -O0.)

Unfortunately, debug-info formats can't track the location of a variable through registers, so fully consistent debugging isn't possible without this slow-and-stupid code-gen.

If you don't need this, you can compile with -Og for light optimization, and without the anti-optimizations required for consistent debugging. The GCC manual recommends it for the usual edit/compile/run cycle, but you will get "optimized out" for many local variables with automatic storage when debugging. Globals and function args still usually have their actual values, at least at function boundaries.


Even worse, -O0 makes code that still works even if you use GDB's jump command to continue execution at a different source line. So each C statement has to be compiled into a fully independent block of instructions. (Is it possible to "jump"/"skip" in GDB debugger?)

For example, for() loops can't be transformed into idiomatic-for-asm do{}while() loops, among other restrictions.
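The loop rotation optimizers normally do can be written out by hand in C (a sketch; both function names are mine). The do{}while() form tests n > 0 once up front, so the loop body ends with a single conditional branch back to the top instead of an unconditional jump plus a test at the top.

```c
#include <assert.h>

/* A counted loop as you'd write it in C. */
int sum_for(const int *a, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* The do{}while() shape optimizers rotate it into: one guard test
 * before the loop, then a single branch at the bottom per iteration. */
int sum_do_while(const int *a, int n) {
    int total = 0;
    if (n > 0) {
        int i = 0;
        do {
            total += a[i];
            i++;
        } while (i < n);
    }
    return total;
}
```

At -O0 the compiler must keep the for() structure so GDB's jump can land on the loop condition as its own statement.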

For all the above reasons, (micro-)benchmarking un-optimized code is a huge waste of time; the results depend on silly details of how you wrote the source that don't matter when you compile with normal optimization. -O0 vs. -O3 performance is not linearly related; some code will speed up much more than others.

The bottlenecks in -O0 code will often be different from those at -O3: often a loop counter that's kept in memory creates a ~6-cycle loop-carried dependency chain. This can create interesting effects in the compiler-generated asm, like Adding a redundant assignment speeds up code when compiled without optimization (which is interesting from an asm perspective, but not for C).

"My benchmark optimized away otherwise" is not a valid justification for looking at the performance of -O0 code. See C loop optimization help for final assignment for an example and more details about the rabbit hole that tuning for -O0 is.


Getting interesting compiler output

If you want to see how the compiler adds 2 variables, write a function that takes args and returns a value. Remember you only want to look at the asm, not run it, so you don't need a main or any numeric literal values for anything that should be a runtime variable.

See also How to remove "noise" from GCC/clang assembly output? for more about this.

float foo(float a, float b) {
    float c=a+b;
    return c;
}

compiles with clang -O3 (on the Godbolt compiler explorer) to the expected

    addss   xmm0, xmm1
    ret

But with -O0 it spills the args to stack memory. (Godbolt uses debug info emitted by the compiler to colour-code asm instructions according to which C statement they came from. I've added line breaks to show blocks for each statement, but you can see this with colour highlighting on the Godbolt link above. Often very handy for finding the interesting part of an inner loop in optimized compiler output.)

gcc -fverbose-asm will put comments on every line showing the operand names as C vars. In optimized code that's often an internal tmp name, but in un-optimized code it's usually an actual variable from the C source. I've manually commented the clang output because it doesn't do that.

# clang7.0 -O0  also on Godbolt
foo:
    push    rbp
    mov     rbp, rsp                  # make a traditional stack frame
    movss   DWORD PTR [rbp-20], xmm0  # spill the register args
    movss   DWORD PTR [rbp-24], xmm1  # into the red zone (below RSP)

    movss   xmm0, DWORD PTR [rbp-20]  # a
    addss   xmm0, DWORD PTR [rbp-24]  # +b
    movss   DWORD PTR [rbp-4], xmm0   # store c

    movss   xmm0, DWORD PTR [rbp-4]   # reload c as the return value
    pop     rbp                       # epilogue
    ret

Fun fact: with register float c = a+b;, the return value can stay in XMM0 between statements instead of being spilled/reloaded. The variable has no address. (I included that version of the function in the Godbolt link.)

The register keyword has no effect in optimized code (except making it an error to take a variable's address, like how const on a local stops you from accidentally modifying something). I don't recommend using it, but it's interesting to see that it does actually affect un-optimized code.
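For reference, a sketch of that register variant (the function name is mine; compile it as C, since C++17 removed the register keyword):

```c
#include <assert.h>

/* Because c is register-qualified and never has its address taken,
 * it has no memory address, so even at -O0 the sum can stay in
 * XMM0 between the two statements instead of being spilled/reloaded. */
float foo_reg(float a, float b) {
    register float c = a + b;
    return c;
}
```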


Maddocks answered 18/11, 2018 at 23:34 Comment(1)
Note that at least clang actually starts out with each variable having memory allocated for it on the stack. One of the first optimisation passes (which I guess are omitted for -O0) turns these into a bunch of SSA variables if possible. So at least on clang, there is no "anti-optimisation" going on; the normal optimisations are just turned off. – Nonce

© 2022 - 2024 — McMap. All rights reserved.