Which GCC optimization flags affect binary size the most?

I am developing a C++ application for ARM using GCC. I have run into an issue where, if no optimizations are enabled, I am unable to create a binary (ELF) for my code because it will not fit in the available space. However, if I simply enable optimization for debugging (-Og), which is the lowest optimization level available to my knowledge, the code easily fits.

In both cases, -ffunction-sections, -fdata-sections, -fno-exceptions, and -Wl,--gc-sections are enabled.

  • Flash size: 512 kB
  • Without Optimizations: .text Overflows by ~200 kB
  • With -Og Optimizations: .text is ~290 kB

This is a huge difference in binary size even with minimal optimizations.

I took a look at 3.11 Options That Control Optimization for details on what optimizations are performed with the -Og flag, to see if that would give me any insight.

What optimization flags affect binary size the most? Is there anything I should be looking for to explain this massive difference?

Turnspit answered 27/4, 2022 at 14:31 Comment(8)
Maybe you could readelf -e <binary> and see which are the big sections that became small/nonexistent with -Og.Fugleman
You could simply test it. Do a build at each of the available optimization levels and compare the resulting sizes.Castiron
@Fugleman When I run readelf --all, I get a bunch of lines in the Symbol Table where the type is NOTYPE and the name is $t or $d, what does this mean?Turnspit
@PatrickWright No idea, sorry, I do not use that tool much.Fugleman
As indirectly pointed out in the answers, you need to try the various obvious ones: -O2, -O3, -Os, etc. And using readelf or some other binary tools (remember the ELF file size itself is really not relevant), check the size. As I have seen many times, optimizing for speed sometimes produces smaller code than optimizing for size (usually size wins, but don't assume the tool tries a zillion combinations; it follows a formula). You definitely do not want to add any debugging options, as those are counterproductive. And when you change the code, the optimizations can fall off a cliff, so go through the exercise again.Imogene
If you are already fine with disabling exceptions, why not disable rtti too: -fno-exceptions -fno-rtti -fno-unwind-tables -fno-asynchronous-unwind-tables. The answers have mentioned most of the big stuff, but here's something that helps if you have lots of templates and hard-coded constants that are "not really different": -fmerge-all-constantsHyetograph
Was -fno-exceptions misspelt in the actual compiler invocation? Or only here?Twosided
As I read the question, I get the feeling that it assumes optimizations are somewhat independent. X saves 40 kB, Y saves 70 kB, Z saves 3 kB, therefore Y+Z save 73 kB. In reality, optimizations are not independent. For instance, Dead Code Elimination means that some code is utterly removed, even code that would be subject to Common Subexpression Elimination. Inlining removes call barriers that can unlock further optimizations, e.g. CSE possibilities can become apparent. This shows the saving can be more or less than a simple sum.Seiber

Most of the extra code-size in an un-optimized build comes from the fact that the default -O0 also means a debug build: nothing is kept in registers across statements, so that debugging stays consistent even if you use a GDB j command to jump to a different source line in the same function. -O0 means a huge amount of store/reload vs. even the lightest level of optimization, which is especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to GCC equally.

Especially for modern C++, a debug build is disastrous because simple template wrapper functions that normally inline and optimize away to nothing in simple cases (or maybe one instruction), instead compile to actual function calls that have to set up args and run a call instruction. For example, for a std::vector, the operator[] member function can normally inline to a single ldr instruction, assuming the compiler has the .data() pointer in a register. But without inlining, every call-site takes multiple instructions.1


The options that affect code-size in the actual .text section2 the most: alignment of branch targets in general, or just of loops, costs some code-size. Other than that:

  • -ftree-vectorize - make SIMD versions of loops, also necessitating scalar cleanup if the compiler can't prove that the iteration count will be a multiple of the vector width. (Or that pointed-to arrays are non-overlapping if you don't use restrict; that may also need a scalar fallback. See the sketch after this list.) Enabled at -O3 in GCC11 and earlier. Enabled at -O2 in GCC12 and later, like clang.

  • -funroll-loops / -funroll-all-loops - not enabled by default even at -O3 in modern GCC. Enabled with profile-guided optimization (-fprofile-use), when it has profiling data from a -fprofile-generate build to know which loops are actually hot and worth spending code-size on. (And which are cold and thus should be optimized for size so you get fewer I-cache misses when they do run, and less eviction of other code.) PGO also influences vectorization decisions.

    Related to loop unrolling are heuristics (tuning knobs) that control loop peeling (fully unrolling) and how much to unroll. The normal way to set these is with -march=native, implying -mtune= whatever as well. -mtune=znver3 may favour big unroll factors (at least clang does), compared to -mtune=sandybridge or -mtune=haswell. But there are GCC options to manually adjust individual things, as discussed in comments on gcc: strange asm generated for simple loop and in How to ask GCC to completely unroll this loop (i.e., peel this loop)?
    There are options to override the weights and thresholds for other decision heuristics like inlining, too, but it's very rare you'd want to fine-tune that much unless you're working on refining the defaults, or finding good defaults for a new CPU.

  • -Os - optimize for size and speed, trying not to sacrifice too much speed. A good tradeoff if your code has a lot of I-cache misses, otherwise -O3 is normally faster, or at least that's the design goal for GCC. Can be worth trying different options to see if -O2 or -Os make your code faster than -O3 across some CPUs you care about; sometimes missed-optimizations or quirks of certain microarchitectures make a difference, as in Why does GCC generate 15-20% faster code if I optimize for size instead of speed? which has actual benchmarks from GCC4.6 to 4.8 (current at the time) for a specific small loop in a test program, on quite a few different x86 and ARM CPUs, with and without -march=native to actually tune for them. There's zero reason to expect that to be representative of other code, though, so you need to test yourself for your own codebase. (And for any given loop, small code changes could make a different compile option better on any given CPU.)

    And obviously -Os is very useful if you need your static code-size smaller to fit in some size limit.

  • -Oz optimizing for size only, even at a large cost in speed. GCC only very recently added this to current trunk, so expect it in GCC12 or 13. Presumably what I wrote below about clang's implementation of -Oz being quite aggressive also applies to GCC, but I haven't yet tested it.
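
To make the -ftree-vectorize point above concrete, here is a minimal sketch (my own toy loop, not code from the question): auto-vectorization turns it into a SIMD main loop plus a scalar tail, and possibly an overlap check with a scalar fallback, all of which is extra machine code.

#include <cstddef>

// Hypothetical example: with auto-vectorization enabled (-O3, or -O2 in GCC12+ / clang),
// this typically becomes a SIMD main loop, a scalar tail for the last n % vector-width
// elements, and (because dst and src might overlap) often a runtime overlap check
// plus a scalar fallback loop.  Marking both pointers __restrict lets the compiler
// drop the overlap check and fallback, shrinking the function.
void add_arrays(float *dst, const float *src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}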

Clang has similar options, including -Os. It also has a clang -Oz option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes).

GCC's -Os unfortunately chooses to use div instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size. (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC -Os still uses an inverse if you don't use a -mcpu= that implies udiv is even available, otherwise it uses udiv: https://godbolt.org/z/f4sa9Wqcj .

Clang's -Os still uses a multiplicative inverse with umull, only using udiv with -Oz. (or a call to __aeabi_uidiv helper function without any -mcpu option). So in that respect, clang -Os makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.
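
For reference, the division-by-a-constant case being compared is as simple as this (a minimal sketch, with my own function name):

// Division by a compile-time constant: the compiler can emit either a udiv
// instruction (small but slow on many cores, and only available if the selected
// -mcpu= has hardware divide) or a multiplicative-inverse sequence using umull
// plus shifts (a few bytes larger, much faster).  Which one you get depends on
// the compiler and on -Os vs. -Oz, as described above.
unsigned div_by_10(unsigned x) {
    return x / 10;
}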


Footnote 1: inlining or not for std::vector

#include <vector>
int foo(std::vector<int> &v) {
    return v[0] + v[1];
}

Godbolt with gcc at the default -O0 vs. -Os, using -mcpu=cortex-m7 just to randomly pick something. IDK if it's normal to use dynamic containers like std::vector on an actual microcontroller; probably not.

# -Os (same as -Og for this case, actually, omitting the frame pointer for this leaf function)
foo(std::vector<int, std::allocator<int> >&):
        ldr     r3, [r0]                @ load the _M_start member of the reference arg
        ldrd    r0, r3, [r3]            @ load a pair of words (v[0..1]) from there into r0 and r3
        add     r0, r0, r3              @ add them into the return-value register
        bx      lr

vs. a debug build (with name-demangling enabled for the asm)

# GCC -O0 -mcpu=cortex-m7 -mthumb
foo(std::vector<int, std::allocator<int> >&):
        push    {r4, r7, lr}             @ non-leaf function requires saving LR (the return address) as well as some call-preserved registers
        sub     sp, sp, #12
        add     r7, sp, #0              @ Use r7 as a frame pointer.  -O0 defaults to -fno-omit-frame-pointer
        str     r0, [r7, #4]            @ spill the incoming register arg to the stack


        movs    r1, #0                  @ 2nd arg for operator[]
        ldr     r0, [r7, #4]            @ reload the pointer to the control block as the first arg
        bl      std::vector<int, std::allocator<int> >::operator[](unsigned int)
        mov     r3, r0                  @ useless copy, but hey we told GCC not to spend any time optimizing.
        ldr     r4, [r3]                @ deref the reference (pointer) it returned, into a call-preserved register that will survive across the next call


        movs    r1, #1                  @ arg for the v[1]  operator[]
        ldr     r0, [r7, #4]
        bl      std::vector<int, std::allocator<int> >::operator[](unsigned int)
        mov     r3, r0
        ldr     r3, [r3]                @ deref the returned reference

        add     r3, r3, r4              @ v[1] + v[0]
        mov     r0, r3                  @ and copy into the return value reg because GCC didn't bother to add into it directly

        adds    r7, r7, #12             @ tear down the stack frame
        mov     sp, r7
        pop     {r4, r7, pc}            @ and return by popping saved-LR into PC

@ and there's an actual implementation of the operator[] function
@ it's 15 instructions long.  
@ But only one instance of this is needed for each type your program uses (vector<int>, vector<char*>, vector<my_foo>, etc.)
@ so it doesn't add up as much as each call-site
std::vector<int, std::allocator<int> >::operator[](unsigned int):
        push    {r7}
        sub     sp, sp, #12
  ...

As you can see, un-optimized GCC cares more about fast compile times than about even the simplest things, like avoiding useless mov reg,reg instructions within the code for evaluating one expression.


Footnote 2: metadata

If you count a whole ELF executable with metadata, not just the .text + .rodata + .data you'd need to burn to flash, then of course -g debug info is very significant for the size of the file, but it's basically irrelevant because it's not mixed in with the parts that are needed while running, so it just sits there on disk.

Symbol names and debug info can be stripped with gcc -s or strip.

Stack-unwind info is an interesting tradeoff between code-size and metadata. -fno-omit-frame-pointer wastes extra instructions and a register as a frame pointer, leading to larger machine-code size, but smaller .eh_frame stack-unwind metadata. (strip does not consider that "debug" info by default, even for C programs, as opposed to C++ where exception-handling might need it in non-debugging contexts.)

How to remove "noise" from GCC/clang assembly output? mentions how to get the compiler to omit some of that: -fno-asynchronous-unwind-tables omits .cfi directives in the asm output, and thus the metadata that goes into the .eh_frame section. Also -fno-exceptions -fno-rtti with C++ can reduce metadata. (Run-Time Type Information for reflection takes space.)
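
As a concrete (and entirely my own) illustration of what that metadata is attached to: the snippet below only builds with exceptions and RTTI enabled, and each feature drags in its own tables and support code.

#include <stdexcept>
#include <typeinfo>

struct Base { virtual ~Base() = default; };

// typeid on a polymorphic type needs RTTI: typeinfo objects and name strings
// end up in .rodata (this line is rejected under -fno-rtti).
const char *name_of(const Base &b) { return typeid(b).name(); }

// throw needs unwind tables (.ARM.exidx / .ARM.extab on ARM EABI, .eh_frame
// elsewhere) plus unwinder library code (rejected under -fno-exceptions).
int checked(int x) {
    if (x < 0) throw std::invalid_argument("negative");
    return x;
}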

Linker options that control alignment of sections / ELF segments can also take extra space; this is relevant for tiny executables, but it's basically a constant amount of space, not scaling with the size of the program. See also Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?

Persuade answered 28/4, 2022 at 2:16 Comment(6)
wrt -funroll-loops on clang: There are actually some targets where it defaults on for high optimization levels like -O3 (namely NVPTX), but even for lower optimization levels recent versions of clang have started aggressively vectorizing and unrolling small loops even at -O2 godbolt.org/z/qT8xsGnPeJobey
@SteveCox: Yes, clang vectorizes at -O2, and defaults to unrolling small loops by 2 or by 4, depending on how tiny the loop is. This answer is primarily about GCC, which never does even with -O3, even when vectorizing. (Except possibly with OpenMP? IDK, I didn't re-check.) Anyway, sometimes leads to the silly situation where GCC code spends 99% of its time in a rolled-up SIMD loop with a couple instructions, but 90% of the code-size of the function is fully-unrolled scalar cleanup. Especially for char* with wide vectors, so each vector holds a lot of elementsPersuade
This particularly surprised me when I saw clang generated this (gorgeous) 110 line monstrosity where they check for pointer aliasing before dropping down into a 4 way unrolled and vectorized loop: godbolt.org/z/7nq77nWzK. Compared to gcc: godbolt.org/z/7G95ofeffJobey
@SteveCox: That specific comparison with GCC is kind of silly because you told clang to vectorize but not GCC. -O2 isn't some universally agreed-upon thing, it has a meaning specific to the compiler. godbolt.org/z/hj37G5qhT shows GCC's -O3 vectorization with the aliasing check, also lots of code. (And as I said in this answer, GCC12 will enable vectorization at -O2, but turns out not as aggressive about checking for overlap and making multiple loop versions; GCC nightly still makes scalar asm for that case)Persuade
@SteveCox: BTW, GCC's loop unrolling doesn't usually invent multiple accumulators to hide FP latency, unlike clang which does. Even when GCC -funroll-loops does unroll, it doesn't help with the bottleneck in an FP reduction loop like a sum or dot product. godbolt.org/z/czadK5dod shows clang using XMM0 and XMM1 (unfortunately not more registers so it still bottlenecks at 1 vector per 2 clocks on Skylake), vs. GCC addps xmm0, [rdi] / addps xmm0, [rdi+16] etcPersuade
Oh, nice! GCC is finally enabling -ftree-vectorize by default at -O2, just like Clang. That's nice to hear. I wonder why loop unrolling is not also enabled? I still compile at -O3 with GCC as a matter of course because too many useful optimizations are not enabled with -O2. And, in spite of obsession over code size, I regularly find that my strategy of turning optimizations up to 11 will actually result in smaller, tighter code (or at least roughly equivalent), especially when taking advantage of things like LTO. -Oz will be useful for code golfing. Otherwise... Meh.Petrography

Which GCC optimization flags affect binary size the most?

It will vary somewhat depending on the program itself. The most accurate way to find out how each flag affects your program is to try it out and compare the result with the base level.

A good choice of base level for size optimisation is to use -Os, which enables all optimisations of -O2 except those that are expected to increase the binary size significantly, which are (at the moment):

-falign-functions
-falign-jumps
-falign-labels
-falign-loops
-fprefetch-loop-arrays
-freorder-blocks-algorithm=stc
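
As a hedged illustration of what those alignment flags cost (my own toy function, not from the question): at -O2 the compiler typically pads the function entry and the top of the loop out to an alignment boundary, padding that -Os omits.

// Hypothetical micro-example: with -O2 (-falign-functions, -falign-loops enabled),
// the function entry and the loop's branch target are usually padded to an
// alignment boundary; -Os drops that padding, saving a few bytes per function
// and per loop at a possible small cost in branch/fetch throughput.
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
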
Chenille answered 27/4, 2022 at 14:37 Comment(7)
There is also -Oz, which optimizes even more aggressively for size instead of speed, according to the manual.Wag
@JakobStark Oh cool. That appears to be new in the development version since it's not in GCC 11 (latest release at the moment).Chenille
Oh I didn't notice that it's that brandnew. I just stumbled over it while scolling through the man page linked in the question..Wag
@JakobStark: -Oz is only supported by clang. It's very aggressive, e.g. on x86 using push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes). -Os on both GCC and clang still cares some about speed, although for GCC it unfortunately chooses to use div instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size. godbolt.org/z/x9h4vx1YG. For ARM, with GCC -Os still uses an inverse if you don't use a -mcpu= that implies udiv is even available, otherwise udiv: godbolt.org/z/f4sa9WqcjPersuade
clang -Os is a good middle-ground, using umull multiplicative inverses but still generally favouring size in terms of alignment.Persuade
@PeterCordes, GCC recently added -Oz in the trunk. See gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.Skatole
@Pyautogui: Thanks, looks like it has the same intent as clang -Oz, aggressive stuff like push -1 / pop rax. Updated my answer to mention its existence for GCC, not just clang.Persuade

Fast doesn't mean small. In fact, a large part of speed optimization revolves around loop unrolling, which increases the amount of generated code by a lot.

If you want to optimize for size, use -Os, which is equivalent to -O2 except for all optimizations that increase size (again, like loop unrolling).
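
A minimal sketch of why unrolling costs size (my own example): when a loop like the one below is unrolled by 4, its body is duplicated four times and a remainder loop is added for the leftover iterations.

// With -funroll-loops (or clang's default unrolling at -O2/-O3), the body of
// this loop is typically duplicated several times, plus a small remainder loop
// for the iterations left over; faster per iteration, but more machine code.
void scale(float *a, int n, float k) {
    for (int i = 0; i < n; ++i)
        a[i] *= k;
}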

Rubellite answered 27/4, 2022 at 14:37 Comment(5)
To be precise, -O2 doesn't enable loop unrolling either.Chenille
@eerorika: Correct for GCC. But clang -O2 does unroll. clang likes to unroll tiny loops by 4, small loops by 2, but larger loops it doesn't unroll. (Also, clang -O2 enables auto-vectorization, unlike in GCC11 and earlier where that's only at -O3. But GCC12 and later will enable -ftree-vectorize at -O2 because it's generally useful on modern CPUs.)Persuade
@Blindy: Most of the code-size for an un-optimized build is the fact that the default -O0 also means a debug build, not keeping anything in registers across statements for consistent debugging even if you use a GDB j command to jump to a different source line in the same function. This means a huge amount of store/reload vs. even the lightest level of optimization, especially disastrous for code-size on a non-CISC that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?Persuade
"-Os" is for size NOT for speed...Rna
-Os - "Optimize for size. -Os enables all -O2 optimizations except those that often increase code size ... It also enables -finline-functions, causes the compiler to tune for code size rather than execution speed, and performs further optimizations designed to reduce code size. "Twosided

Try -s -z noseparate-code (found somewhere a few months ago on Stack Overflow, while wondering why a simple hello world in assembly was several kilobytes instead of a few bytes).

If I remember correctly, -s removes unused symbols and -z noseparate-code removes unneeded entries from the ELF header... (Also useful for Gentoo :)

Masera answered 4/5, 2022 at 16:20 Comment(0)
