I made a bubble sort implementation in C, and was testing its performance when I noticed that the -O3
flag made it run even slower than no flags at all! Meanwhile -O2
was making it run a lot faster as expected.
Without optimisations:
time ./sort 30000
./sort 30000 1.82s user 0.00s system 99% cpu 1.816 total
-O2
:
time ./sort 30000
./sort 30000 1.00s user 0.00s system 99% cpu 1.005 total
-O3
:
time ./sort 30000
./sort 30000 2.01s user 0.00s system 99% cpu 2.007 total
The code:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>
int n;
void bubblesort(int *buf)
{
bool changed = true;
for (int i = n; changed == true; i--) { /* will always move at least one element to its rightful place at the end, so can shorten the search by 1 each iteration */
changed = false;
for (int x = 0; x < i-1; x++) {
if (buf[x] > buf[x+1]) {
/* swap */
int tmp = buf[x+1];
buf[x+1] = buf[x];
buf[x] = tmp;
changed = true;
}
}
}
}
int main(int argc, char *argv[])
{
if (argc != 2) {
fprintf(stderr, "Usage: %s <arraysize>\n", argv[0]);
return EXIT_FAILURE;
}
n = atoi(argv[1]);
if (n < 1) {
fprintf(stderr, "Invalid array size.\n");
return EXIT_FAILURE;
}
int *buf = malloc(sizeof(int) * n);
/* init buffer with random values */
srand(time(NULL));
for (int i = 0; i < n; i++)
buf[i] = rand() % n + 1;
bubblesort(buf);
return EXIT_SUCCESS;
}
The assembly language generated for -O2
(from godbolt.org):
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rax, [rdi+rax*4]
.L4:
mov esi, DWORD PTR [rax]
mov ecx, DWORD PTR [rax+4]
add edx, 1
cmp esi, ecx
jle .L2
mov DWORD PTR [rax+4], esi
mov r10d, 1
add rax, 4
mov DWORD PTR [rax-4], ecx
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
And the same for -O3
:
bubblesort:
mov r9d, DWORD PTR n[rip]
xor edx, edx
xor r10d, r10d
.L2:
lea r8d, [r9-1]
cmp r8d, edx
jle .L13
.L5:
movsx rax, edx
lea rcx, [rdi+rax*4]
.L4:
movq xmm0, QWORD PTR [rcx]
add edx, 1
pshufd xmm2, xmm0, 0xe5
movd esi, xmm0
movd eax, xmm2
pshufd xmm1, xmm0, 225
cmp esi, eax
jle .L2
movq QWORD PTR [rcx], xmm1
mov r10d, 1
add rcx, 4
cmp r8d, edx
jg .L4
mov r9d, r8d
xor edx, edx
xor r10d, r10d
lea r8d, [r9-1]
cmp r8d, edx
jg .L5
.L13:
test r10b, r10b
jne .L14
.L1:
ret
.L14:
lea eax, [r9-2]
cmp r9d, 2
jle .L1
mov r9d, r8d
xor edx, edx
mov r8d, eax
xor r10d, r10d
jmp .L5
It seems like the only significant difference to me is the apparent attempt to use SIMD, which seems like it should be a large improvement, but I also can't tell what on earth it's attempting with those pshufd
instructions... is this just a failed attempt at SIMD? Or maybe the couple of extra instructions is just about edging out my instruction cache?
Timings were done on an AMD Ryzen 5 3600.
gcc -Ofast
is just a shortcut for-O3 -ffast-math
, but there's no FP math here. If you're going to try anything, try-O3 -march=native
to let it use AVX2 in case GCC's vectorization strategy could help with wider vectors instead of hurt, whatever it's trying to do. Although I don't think so; it's just doing a 64-bit load and shuffle, not even 128-bit with SSE2. – Livy-Os
(optimize for space) sometimes produced the fastest code because of the size of the instruction cache on x86-64. I don't know if that would matter here or if it's still applicable in current versions of gcc but it might be interesting to try it and compare. – Tengdin-O2
I'd expect, not shooting itself in the foot with store-forwarding stalls and increased latency before it can detect branch mispredicts. – Livy