Bubble sort slower with -O3 than -O2 with GCC

I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3 flag made it run even slower than with no flags at all! Meanwhile, -O2 made it run a lot faster, as expected.

Without optimisations:

time ./sort 30000

./sort 30000  1.82s user 0.00s system 99% cpu 1.816 total

-O2:

time ./sort 30000

./sort 30000  1.00s user 0.00s system 99% cpu 1.005 total

-O3:

time ./sort 30000

./sort 30000  2.01s user 0.00s system 99% cpu 2.007 total

The code:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int n;

void bubblesort(int *buf)
{
    bool changed = true;
    for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */
        changed = false;

        for (int x = 0; x < i-1; x++) {
            if (buf[x] > buf[x+1]) {
                /* swap */
                int tmp = buf[x+1];
                buf[x+1] = buf[x];
                buf[x] = tmp;

                changed = true;
            }
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
        return EXIT_FAILURE;
    }

    n = atoi(argv[1]);
    if (n < 1) {
        fprintf(stderr, "Invalid array size.\n");
        return EXIT_FAILURE;
    }

    int *buf = malloc(sizeof(int) * n);

    /* init buffer with random values */
    srand(time(NULL));
    for (int i = 0; i < n; i++)
        buf[i] = rand() % n + 1;

    bubblesort(buf);

    return EXIT_SUCCESS;
}

The assembly language generated for -O2 (from godbolt.org):

bubblesort:
        mov     r9d, DWORD PTR n[rip]
        xor     edx, edx
        xor     r10d, r10d
.L2:
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jle     .L13
.L5:
        movsx   rax, edx
        lea     rax, [rdi+rax*4]
.L4:
        mov     esi, DWORD PTR [rax]
        mov     ecx, DWORD PTR [rax+4]
        add     edx, 1
        cmp     esi, ecx
        jle     .L2
        mov     DWORD PTR [rax+4], esi
        mov     r10d, 1
        add     rax, 4
        mov     DWORD PTR [rax-4], ecx
        cmp     r8d, edx
        jg      .L4
        mov     r9d, r8d
        xor     edx, edx
        xor     r10d, r10d
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jg      .L5
.L13:
        test    r10b, r10b
        jne     .L14
.L1:
        ret
.L14:
        lea     eax, [r9-2]
        cmp     r9d, 2
        jle     .L1
        mov     r9d, r8d
        xor     edx, edx
        mov     r8d, eax
        xor     r10d, r10d
        jmp     .L5

And the same for -O3:

bubblesort:
        mov     r9d, DWORD PTR n[rip]
        xor     edx, edx
        xor     r10d, r10d
.L2:
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jle     .L13
.L5:
        movsx   rax, edx
        lea     rcx, [rdi+rax*4]
.L4:
        movq    xmm0, QWORD PTR [rcx]
        add     edx, 1
        pshufd  xmm2, xmm0, 0xe5
        movd    esi, xmm0
        movd    eax, xmm2
        pshufd  xmm1, xmm0, 225
        cmp     esi, eax
        jle     .L2
        movq    QWORD PTR [rcx], xmm1
        mov     r10d, 1
        add     rcx, 4
        cmp     r8d, edx
        jg      .L4
        mov     r9d, r8d
        xor     edx, edx
        xor     r10d, r10d
        lea     r8d, [r9-1]
        cmp     r8d, edx
        jg      .L5
.L13:
        test    r10b, r10b
        jne     .L14
.L1:
        ret
.L14:
        lea     eax, [r9-2]
        cmp     r9d, 2
        jle     .L1
        mov     r9d, r8d
        xor     edx, edx
        mov     r8d, eax
        xor     r10d, r10d
        jmp     .L5

It seems to me like the only significant difference is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I can't tell what on earth it's attempting with those pshufd instructions... is this just a failed attempt at SIMD? Or are the couple of extra instructions just enough to push the loop out of my instruction cache?

Timings were done on an AMD Ryzen 5 3600.

Mons answered 9/10, 2021 at 2:35 Comment(7)
@Abel: gcc -Ofast is just a shortcut for -O3 -ffast-math, but there's no FP math here. If you're going to try anything, try -O3 -march=native to let it use AVX2 in case GCC's vectorization strategy could help with wider vectors instead of hurt, whatever it's trying to do. Although I don't think so; it's just doing a 64-bit load and shuffle, not even 128-bit with SSE2.Livy
At least on older versions of gcc, -Os (optimize for space) sometimes produced the fastest code because of the size of the instruction cache on x86-64. I don't know if that would matter here or if it's still applicable in current versions of gcc but it might be interesting to try it and compare.Tengdin
@DavidConrad: -Os would make GCC choose not to auto-vectorize, so it would be about the same as -O2 I'd expect, not shooting itself in the foot with store-forwarding stalls and increased latency before it can detect branch mispredicts.Livy
You should include the assembly code that your actual compiler outputs, not from godbolt.org.Abell
@user253751: disagree; as long as the querent picked the same GCC version on Godbolt as they have locally so the instructions are the same, Godbolt's nice filtering of directives is better. And linking the source+asm on Godbolt makes it better for anyone who wants to see what other GCC versions / options do.Livy
(update on that last comment: it is sometimes relevant for performance questions to include actual disassembly, so we can look for code alignment issues wrt. 32-byte boundaries, especially on Skylake-family CPUs where the JCC erratum mitigation can create performance pot-holes unexpectedly. You won't see that from Godbolt, and its binary output won't necessarily be linked with identical CRT code so addresses may differ.)Livy
If you're benchmarking a bubble sort, you're using the wrong algorithm.Balkhash

This is a regression in GCC11/12.
GCC10 and earlier were doing separate dword loads, even if it merged them for a qword store.

It looks like GCC's naïveté about store-forwarding stalls is hurting its auto-vectorization strategy here. See also Store forwarding by example for some practical benchmarks on Intel with hardware performance counters, and What are the costs of failed store-to-load forwarding on x86? Also Agner Fog's x86 optimization guides.

(gcc -O3 enables -ftree-vectorize and a few other options not included by -O2, e.g. if-conversion to branchless cmov, which is another way -O3 can hurt with data patterns GCC didn't expect. By comparison, Clang enables auto-vectorization even at -O2, although some of its optimizations are still only on at -O3.)
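
If you want to confirm the vectorizer is to blame, one experiment (untested here; sort.c is my guess at the file name) would be to disable both of GCC's vectorizers on top of -O3:

gcc -O3 -fno-tree-vectorize -fno-tree-slp-vectorize sort.c -o sort

That should produce scalar code much closer to the -O2 version for this loop.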

It's doing 64-bit loads (and branching to store or not) on pairs of ints. This means that if the previous iteration swapped, this load comes half from that store and half from fresh memory, so we get a store-forwarding stall after every swap. But bubble sort often has long chains of swapping on every iteration as an element bubbles far, so this is really bad.
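
To make the overlap concrete, here are the relevant instructions from the -O3 loop above, with my annotations:

        movq    xmm0, QWORD PTR [rcx]   # 8-byte load of buf[x] and buf[x+1]
        ...                             # compare the two halves
        movq    QWORD PTR [rcx], xmm1   # on a swap: 8-byte store of the swapped pair
        ...
        add     rcx, 4                  # advance by one int, not two
        ...
        jg      .L4                     # the next load from [rcx] gets its low 4 bytes
                                        # from that store, but its high 4 bytes from
                                        # L1d cache, so it can't be store-forwarded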

(Bubble sort is bad in general, especially if implemented naively without keeping the previous iteration's second element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so it's fair enough to want to try.)

Anyway, this is pretty clearly an anti-optimization you should report on GCC Bugzilla with the "missed-optimization" keyword. Scalar loads are cheap, and store-forwarding stalls are costly. (Can modern x86 implementations store-forward from more than one prior store? no, nor can microarchitectures other than in-order Atom efficiently load when it partially overlaps with one previous store, and partially from data that has to come from the L1d cache.)

Even better would be to keep buf[x+1] in a register and use it as buf[x] in the next iteration, avoiding a store and load. (Like good hand-written asm bubble sort examples, a few of which exist on Stack Overflow.)
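
For example, a pass that carries the bubbling element in a register could look like this in C (a minimal sketch of the idea, not taken from any of those answers; note it stores buf[x] unconditionally each iteration, which is the price of the simple formulation):

#include <stdbool.h>

/* Sketch: keep the bubbling element in a register, so the inner loop
   does one load and one store per element instead of two loads plus a
   conditional pair of stores. No load ever reads a just-stored value,
   so there are no store-forwarding stalls at all. */
void bubblesort_reg(int *buf, int n)
{
    for (bool changed = true; changed && n > 1; n--) {
        changed = false;
        int cur = buf[0];               /* the logical buf[x], in a register */
        for (int x = 0; x < n - 1; x++) {
            int next = buf[x + 1];      /* the only load in the loop body */
            if (cur > next) {
                buf[x] = next;          /* cur keeps bubbling rightward */
                changed = true;
            } else {
                buf[x] = cur;           /* settle cur here; carry next instead */
                cur = next;
            }
        }
        buf[n - 1] = cur;               /* final slot for the carried element */
    }
}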

If it weren't for the store-forwarding stalls (which AFAIK GCC doesn't know about in its cost model), this strategy might be about break-even. SSE4.1 for a branchless pminsd / pmaxsd comparator might be interesting, but that would mean always storing, and the C source doesn't do that.
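
For reference, such a comparator could look something like this with intrinsics (my sketch, untested; sort2 is a hypothetical helper, and note the unconditional store — recovering the "changed" flag would need extra work, e.g. comparing old vs. new):

#include <smmintrin.h>      /* SSE4.1 */

/* Sketch: branchless compare-and-swap of one adjacent pair using
   pminsd/pmaxsd (_mm_min_epi32/_mm_max_epi32). */
static inline void sort2(int *p)
{
    __m128i pair = _mm_loadl_epi64((const __m128i *)p);              /* [a, b, x, x] */
    __m128i rev  = _mm_shuffle_epi32(pair, _MM_SHUFFLE(3, 2, 0, 1)); /* [b, a, x, x] */
    __m128i mn   = _mm_min_epi32(pair, rev);                         /* pminsd */
    __m128i mx   = _mm_max_epi32(pair, rev);                         /* pmaxsd */
    __m128i res  = _mm_unpacklo_epi32(mn, mx);                       /* [min, max, ...] */
    _mm_storel_epi64((__m128i *)p, res);                             /* always stores */
}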


If this strategy of double-width load had any merit, it would be better implemented with pure integer on a 64-bit machine like x86-64, where you can operate on just the low 32 bits with garbage (or valuable data) in the upper half. E.g.,

## What GCC should have done,
## if it was going to use this 64-bit load strategy at all

        movsx   rax, edx           # apparently it wasn't able to optimize away your half-width signed loop counter into pointer math
        lea     rcx, [rdi+rax*4]   # Usually not worth an extra instruction just to avoid an indexed load and indexed store, but let's keep it for easy comparison.
.L4:
        mov     rax, [rcx]       # into RAX instead of XMM0
        add     edx, 1
            #  pshufd  xmm2, xmm0, 0xe5
            #  movd    esi, xmm0
            #  movd    eax, xmm2
            #  pshufd  xmm1, xmm0, 225
        mov     rsi, rax
        rol     rax, 32   # swap halves, just like the pshufd
        cmp     esi, eax  # or eax, esi?  I didn't check which is which
        jle     .L2
        mov     QWORD PTR [rcx], rax   # conditionally store the swapped qword

(Or with BMI2 available from -march=native, rorx rsi, rax, 32 can copy-and-swap in one uop. Without BMI2, mov and swapping the original instead of the copy saves latency if running on a CPU without mov-elimination, such as Ice Lake with updated microcode.)

So total latency from load to compare is just an integer load plus one ALU operation (the rotate), vs. XMM load -> movd. And it's fewer ALU uops. This does nothing to help with the store-forwarding stall problem, though, which is still a showstopper. This is just an integer SWAR implementation of the same strategy, replacing 2x pshufd and 2x movd r32, xmm with just mov + rol.

Actually, there's no reason to use 2x pshufd here. Even using XMM registers, GCC could have done one shuffle that swapped the low two elements, setting up for both the store and the movd. So even with XMM regs, this was sub-optimal. But clearly two different parts of GCC emitted those two pshufd instructions; one even printed the shuffle constant in hex while the other used decimal! I assume one was swapping and the other was just trying to extract vec[1], the high element of the qword.
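
That is, one shuffle can produce both the value to compare against and the qword to store (my sketch of the loop body, reusing GCC's register assignments from above):

        movq    xmm0, QWORD PTR [rcx]   # [a, b]
        pshufd  xmm1, xmm0, 0xE1        # [b, a]: the one shuffle
        movd    esi, xmm0               # esi = a
        movd    eax, xmm1               # eax = b, the low element of the swapped copy
        cmp     esi, eax
        jle     .L2                     # already in order
        movq    QWORD PTR [rcx], xmm1   # store the already-swapped pair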


slower than no flags at all

The default is -O0, consistent-debugging mode that spills all variables to memory after every C statement, so it's pretty horrible and creates big store-forwarding latency bottlenecks. (Somewhat as if every variable were volatile.) But that's successful store forwarding, not stalls, so "only" ~5 cycles, though still much worse than the 0 cycles of a value kept in a register. (A few modern microarchitectures, including Zen 2, have some special cases with lower latency.) The extra store and load instructions that have to go through the pipeline don't help either.
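
For example, at -O0 a single statement like x++ compiles to a load/modify/store round trip through the variable's stack slot, roughly like this (illustrative of the pattern, not actual GCC output):

        mov     eax, DWORD PTR [rbp-4]  # reload x from its stack slot
        add     eax, 1
        mov     DWORD PTR [rbp-4], eax  # spill it right back
        # the next statement that reads x reloads it, putting ~5 cycles of
        # store-forwarding latency on the critical path of every loop-carried variable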

It's generally not interesting to benchmark -O0. -O1 or -Og should be your go-to baseline for the compiler to do the basic amount of optimization a normal person would expect, without anything fancy, but also without intentionally gimping the asm by skipping register allocation.


Semi-related: optimizing bubble sort for size instead of speed can involve memory-destination rotate (creating store-forwarding stalls for back-to-back swaps), or a memory-destination xchg (implicit lock prefix -> very slow). See this Code Golf answer.
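
Those tricks look roughly like this (my sketch; see the linked answer for the real thing):

        rol     QWORD PTR [rcx], 32     # memory-destination rotate: swaps the two
                                        # dwords in place, but the load+store means a
                                        # store-forwarding stall on back-to-back swaps

        # or alternatively:
        mov     eax, DWORD PTR [rcx+4]
        xchg    eax, DWORD PTR [rcx]    # memory-destination xchg: implicit lock prefix
        mov     DWORD PTR [rcx+4], eax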

Livy answered 9/10, 2021 at 3:09 Comment(4)
"(Bubble Sort is bad in general, especially if implemented naively without keeping the previous iteration's 2nd element around in a register. It can be interesting to analyze the asm details of exactly why it sucks, so fair enough for wanting to try.)" When you say this, you mean even compared to other O(N^2) sorting algorithms, yes?Freightage
@KarlKnechtel: Yes, precisely, like I explained in my answer linked from the start of that sentence you quoted; that's why I linked it. Simple sorting algorithms have their place for small problem sizes, e.g. as the base-case for divide-and-conquer sorts like MergeSort; it's common for such algorithms to use InsertionSort below a size threshold like maybe 16. Or like in this case, just as an experiment to see how well branch-prediction and other CPU microarchitectural features do at running "simple" loops. And also how well compilers do.Livy
Excellent answer, especially the recommendation and rationale for reporting this to GCC.Octonary
@PeterMortensen - Thanks for the edit, although I had to fix a couple things (e.g. [] link inside another [] didn't work, and also "the assembly language" doesn't read well to talk about a compiler's output. You could say "the assembly code", but I think it's still 100% clear and actually easier to read to just say "the asm". Concision is valuable, so IMO it's not always better to expand things. Sometimes it is on the whole better, perhaps for beginners, so I put up with some amount of that even when I think it's unnecessary.)Livy
