My program adds float arrays, and the loop is unrolled 4x when compiled with maximum optimizations by MSVC and G++. I didn't understand why both compilers chose to unroll 4x, so I did some testing: a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations only occasionally gave a p-value of ~0.03, 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.
Whether I set the compiler to use 128-bit or 256-bit vectors, it always unrolled 4x. With 128-bit vectors that is 4 × 16 = 64 bytes per array per iteration, exactly one 64-byte cache line; with 256-bit vectors it is 4 × 32 = 128 bytes, two cache lines (significant or coincidence?).
The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact on a memory-bound program that sequentially reads and writes gigabytes of floats. Should unrolling give any benefit in this case? It's also possible there was no significant difference and my sample size simply wasn't large enough.
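For reference, this is roughly what I mean by "manually unrolling" the scalar loop; the function names are mine, and the 4x version assumes size is a multiple of 4:

#include <cstdint>

// one element per iteration (what the compiler sees)
void add1(float* a, const float* b, const float* c, int64_t size) {
    for (int64_t i = 0; i < size; ++i)
        a[i] = b[i] + c[i];
}

// manually unrolled 4x; assumes size is a multiple of 4
void add4(float* a, const float* b, const float* c, int64_t size) {
    for (int64_t i = 0; i < size; i += 4) {
        a[i + 0] = b[i + 0] + c[i + 0];
        a[i + 1] = b[i + 1] + c[i + 1];
        a[i + 2] = b[i + 2] + c[i + 2];
        a[i + 3] = b[i + 3] + c[i + 3];
    }
}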
I found this blog that says manually unrolling an array copy is faster for medium-sized arrays and that streaming is fastest for longer arrays. Their AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as from manual unrolling. Benchmark in the blog, with source here.
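I haven't tried it yet, but my reading of that approach is something like the sketch below: one whole cache line (two 32-byte vectors) per array per iteration, a prefetch a few lines ahead, and non-temporal stores for the output. The function name, the prefetch distance, and the alignment assumptions (64-byte-aligned arrays, length a multiple of 16) are mine, not the blog's.

#include <immintrin.h>
#include <cstdint>

// Sketch: process one 64-byte cache line (two __m256) per array per iteration,
// prefetch a few lines ahead, and use non-temporal stores so the output
// doesn't pollute the cache. Assumes 64-byte alignment and size % 16 == 0.
void add_stream(float* a, const float* b, const float* c, int64_t size) {
    for (int64_t i = 0; i < size; i += 16) {
        _mm_prefetch((const char*)(b + i) + 512, _MM_HINT_T0);
        _mm_prefetch((const char*)(c + i) + 512, _MM_HINT_T0);
        __m256 b0 = _mm256_load_ps(b + i);
        __m256 b1 = _mm256_load_ps(b + i + 8);
        __m256 c0 = _mm256_load_ps(c + i);
        __m256 c1 = _mm256_load_ps(c + i + 8);
        _mm256_stream_ps(a + i,     _mm256_add_ps(b0, c0));
        _mm256_stream_ps(a + i + 8, _mm256_add_ps(b1, c1));
    }
    _mm_sfence(); // make the non-temporal stores globally visible
}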
#include <immintrin.h>
#include <cstddef>

int main() {
    // example of manually unrolling float arrays
    // 10 x 32-byte vectors per array
    size_t bytes = sizeof(__m256) * 10;
    // 32-byte alignment; 64 bytes would guarantee each pair of
    // vectors starts on its own cache line
    size_t alignment = sizeof(__m256);
    __m256* a = (__m256*) _mm_malloc(bytes, alignment);
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment);
    // b and c are left uninitialized; this is just to inspect the codegen
    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //   2 x 32-byte vectors from b
        //   2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);
        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);
        // special bonus for consuming whole cache lines?
    }
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
}
Original source for 3 unique float arrays
for (int64_t i = 0; i < size; ++i) {
a[i] = b[i] + c[i];
}
MSVC with AVX2 instructions
a[i] = b[i] + c[i];
00007FF7E2522370 vmovups ymm2,ymmword ptr [rax+rcx]
00007FF7E2522375 vmovups ymm1,ymmword ptr [rcx+rax-20h]
00007FF7E252237B vaddps ymm1,ymm1,ymmword ptr [rax-20h]
00007FF7E2522380 vmovups ymmword ptr [rdx+rax-20h],ymm1
00007FF7E2522386 vaddps ymm1,ymm2,ymmword ptr [rax]
00007FF7E252238A vmovups ymm2,ymmword ptr [rcx+rax+20h]
00007FF7E2522390 vmovups ymmword ptr [rdx+rax],ymm1
00007FF7E2522395 vaddps ymm1,ymm2,ymmword ptr [rax+20h]
00007FF7E252239A vmovups ymm2,ymmword ptr [rcx+rax+40h]
00007FF7E25223A0 vmovups ymmword ptr [rdx+rax+20h],ymm1
00007FF7E25223A6 vaddps ymm1,ymm2,ymmword ptr [rax+40h]
00007FF7E25223AB add r9,20h
00007FF7E25223AF vmovups ymmword ptr [rdx+rax+40h],ymm1
00007FF7E25223B5 lea rax,[rax+80h]
00007FF7E25223BC cmp r9,r10
00007FF7E25223BF jle main$omp$2+0E0h (07FF7E2522370h)
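This loop advances rax by 80h = 128 bytes per iteration, i.e. four 32-byte ymm vectors (two cache lines) per array, which matches the 4x unroll.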
MSVC with default instructions
a[i] = b[i] + c[i];
00007FF71ECB2372 movups xmm0,xmmword ptr [rax-10h]
00007FF71ECB2376 add r9,10h
00007FF71ECB237A movups xmm1,xmmword ptr [rcx+rax-10h]
00007FF71ECB237F movups xmm2,xmmword ptr [rax+rcx]
00007FF71ECB2383 addps xmm1,xmm0
00007FF71ECB2386 movups xmm0,xmmword ptr [rax]
00007FF71ECB2389 addps xmm2,xmm0
00007FF71ECB238C movups xmm0,xmmword ptr [rax+10h]
00007FF71ECB2390 movups xmmword ptr [rdx+rax-10h],xmm1
00007FF71ECB2395 movups xmm1,xmmword ptr [rcx+rax+10h]
00007FF71ECB239A movups xmmword ptr [rdx+rax],xmm2
00007FF71ECB239E movups xmm2,xmmword ptr [rcx+rax+20h]
00007FF71ECB23A3 addps xmm1,xmm0
00007FF71ECB23A6 movups xmm0,xmmword ptr [rax+20h]
00007FF71ECB23AA addps xmm2,xmm0
00007FF71ECB23AD movups xmmword ptr [rdx+rax+10h],xmm1
00007FF71ECB23B2 movups xmmword ptr [rdx+rax+20h],xmm2
00007FF71ECB23B7 add rax,40h
00007FF71ECB23BB cmp r9,r10
00007FF71ECB23BE jle main$omp$2+0D2h (07FF71ECB2372h)
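This loop advances rax by 40h = 64 bytes per iteration, i.e. four 16-byte xmm vectors per array, so each iteration consumes exactly one 64-byte cache line per stream.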
Comments:

-funroll-loops is only enabled by default as part of -fprofile-use, when profiling data is available. You're not on a Mac or something where g++ is actually clang++, are you? LLVM's optimizer does unroll tiny loops by 4, small loops by 3 or 2. – Unfetter

struct alignas(64) Test { float f[16]; }; and avoid all the implementation-defined _ stuff. Your original code will optimize just fine if the alignment is known from the struct. – Metrist

… __m256* instead of _mm256_loadu_ps). Unlike with AVX-512, where misaligned pointers can give a slowdown even for DRAM, like 15% or so. (AVX1 CPUs like Sandybridge do have worse slowdowns for misaligned 256-bit load/store, at least on cache-line splits. But otherwise, DRAM is so slow that there's time to absorb the line-split penalties.) – Unfetter

… _mm_prefetch or _mm256_stream_si256, or both. The prefetch example in the blog seems to benefit from using a whole cache line per iteration. – Melonie
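If I follow Metrist's suggestion, I think it looks something like the minimal sketch below; only the Test struct is from the comment, the add function and the idea of reshaping the arrays into cache-line-sized structs are my guesses:

#include <cstdint>

// One struct = one 64-byte cache line = 16 floats.
struct alignas(64) Test { float f[16]; };

// Plain element-wise add; because each Test is known to be 64-byte aligned,
// the compiler can use aligned vector loads/stores without runtime
// alignment checks or peeling.
void add(Test* a, const Test* b, const Test* c, int64_t n) {
    for (int64_t i = 0; i < n; ++i)
        for (int j = 0; j < 16; ++j)
            a[i].f[j] = b[i].f[j] + c[i].f[j];
}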