Is there a special benefit to consuming whole cache lines between iterations of a loop?

My program adds float arrays, and the loop gets unrolled 4x when compiled with maximum optimizations by MSVC and G++. I didn't understand why both compilers chose 4x, so I did some testing: a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations only occasionally gave a p-value of ~0.03, 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.

If I set the compiler to use 128-bit vectors or 256-bit vectors, it always unrolled 4x, so each iteration consumes a multiple of the 64-byte cache line size (significant or coincidence?).

The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact on a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size just wasn't large enough.

I found this blog post that says manually unrolling an array copy is faster for medium-sized arrays and streaming is fastest for longer arrays. Its AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as from manual unrolling. Benchmark from the blog post, with source, here.

#include <iostream>
#include <immintrin.h>

int main() {
    // example of manually unrolling float arrays
    size_t bytes = sizeof(__m256) * 10;
    size_t alignment = sizeof(__m256);
    // 10 x 32-byte vectors
    __m256* a = (__m256*) _mm_malloc(bytes, alignment); 
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment); 

    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //      2 x 32-byte vectors from b
        //      2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);

        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);

        // special bonus for consuming whole cache lines?
    }
    // free the _mm_malloc'd buffers
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
}

Original source for 3 unique float arrays

for (int64_t i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
}

MSVC with AVX2 instructions

            a[i] = b[i] + c[i];
00007FF7E2522370  vmovups     ymm2,ymmword ptr [rax+rcx]  
00007FF7E2522375  vmovups     ymm1,ymmword ptr [rcx+rax-20h]  
00007FF7E252237B  vaddps      ymm1,ymm1,ymmword ptr [rax-20h]  
00007FF7E2522380  vmovups     ymmword ptr [rdx+rax-20h],ymm1  
00007FF7E2522386  vaddps      ymm1,ymm2,ymmword ptr [rax]  
00007FF7E252238A  vmovups     ymm2,ymmword ptr [rcx+rax+20h]  
00007FF7E2522390  vmovups     ymmword ptr [rdx+rax],ymm1  
00007FF7E2522395  vaddps      ymm1,ymm2,ymmword ptr [rax+20h]  
00007FF7E252239A  vmovups     ymm2,ymmword ptr [rcx+rax+40h]  
00007FF7E25223A0  vmovups     ymmword ptr [rdx+rax+20h],ymm1  
00007FF7E25223A6  vaddps      ymm1,ymm2,ymmword ptr [rax+40h]  
00007FF7E25223AB  add         r9,20h  
00007FF7E25223AF  vmovups     ymmword ptr [rdx+rax+40h],ymm1  
00007FF7E25223B5  lea         rax,[rax+80h]  
00007FF7E25223BC  cmp         r9,r10  
00007FF7E25223BF  jle         main$omp$2+0E0h (07FF7E2522370h) 

MSVC with default instructions

            a[i] = b[i] + c[i];
00007FF71ECB2372  movups      xmm0,xmmword ptr [rax-10h]  
00007FF71ECB2376  add         r9,10h  
00007FF71ECB237A  movups      xmm1,xmmword ptr [rcx+rax-10h]  
00007FF71ECB237F  movups      xmm2,xmmword ptr [rax+rcx]  
00007FF71ECB2383  addps       xmm1,xmm0  
00007FF71ECB2386  movups      xmm0,xmmword ptr [rax]  
00007FF71ECB2389  addps       xmm2,xmm0  
00007FF71ECB238C  movups      xmm0,xmmword ptr [rax+10h]  
00007FF71ECB2390  movups      xmmword ptr [rdx+rax-10h],xmm1  
00007FF71ECB2395  movups      xmm1,xmmword ptr [rcx+rax+10h]  
00007FF71ECB239A  movups      xmmword ptr [rdx+rax],xmm2  
00007FF71ECB239E  movups      xmm2,xmmword ptr [rcx+rax+20h]  
00007FF71ECB23A3  addps       xmm1,xmm0  
00007FF71ECB23A6  movups      xmm0,xmmword ptr [rax+20h]  
00007FF71ECB23AA  addps       xmm2,xmm0  
00007FF71ECB23AD  movups      xmmword ptr [rdx+rax+10h],xmm1  
00007FF71ECB23B2  movups      xmmword ptr [rdx+rax+20h],xmm2  
00007FF71ECB23B7  add         rax,40h  
00007FF71ECB23BB  cmp         r9,r10  
00007FF71ECB23BE  jle         main$omp$2+0D2h (07FF71ECB2372h)  
Melonie answered 19/6, 2022 at 5:11 Comment(12)
For contiguous accesses, I wouldn't expect that to be a reason why unrolling helps; I'd expect unrolling to have the same benefit even if the array address % 64 == 16 or 32, so 64 contiguous bytes span 2 lines. If striding down a column of a matrix, it could be much better to consume whole cache lines so you don't have to touch those lines again for the next (set of) column(s) on the next outer iteration.Unfetter
Also, G++ decided to unroll? What version and options? -funroll-loops is only enabled by default as part of -fprofile-use when profiling data is available. You're not on a Mac or something where g++ is actually clang++ are you? LLVM's optimizer does unroll tiny loops by 4, small loops by 3 or 2.Unfetter
Doing an AVX load is more efficient than doing 4 separate register loads even if you are memory bound. So the compiler will unroll the loop until it can do a combined large load, but beyond that there is no benefit to unrolling further.Metrist
You should make a struct alignas(64) Test { float f[16]; }; and avoid all the implementation-defined underscore stuff. Your original code will optimize just fine if the alignment is known from the struct.Metrist
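A minimal sketch of that alignas(64) suggestion (the function and loop are placeholders, not the commenter's code); with the alignment expressed in the type, a plain loop can auto-vectorize without _mm_malloc or intrinsics:

#include <cstddef>

// Hypothetical wrapper: 16 floats fill exactly one 64-byte cache line.
struct alignas(64) Test {
    float f[16];
};

// The compiler knows every Test starts on a cache-line boundary, so it is
// free to use aligned vector loads/stores when it vectorizes this loop.
void add(Test* a, const Test* b, const Test* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (int j = 0; j < 16; ++j)
            a[i].f[j] = b[i].f[j] + c[i].f[j];
}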
@GoswinvonBrederlow So an aligned move is significantly faster than unaligned? I will have to do this!Melonie
@PeterCordes I will check my compiler options at work. I'm writing this question from memory. Maybe I'm remembering wrong that G++ unrolled.Melonie
If you're bottlenecked on DRAM bandwidth, misalignment (vs. the vector width) makes barely any difference, like a couple percent for modern AVX2 CPUs. (Except to correctness if you use alignment-required loads like you're doing here, with deref of __m256* instead of _mm256_loadu_ps). Unlike with AVX-512 where misaligned pointers can give a slowdown even for DRAM, like 15% or so. (AVX1 CPUs like Sandybridge do have worse slowdowns for misaligned 256-bit load/store, at least on cache-line splits. But otherwise, DRAM is so slow that there's time to absorb the line-split penalties.)Unfetter
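A sketch of what that unaligned-safe variant of the question's inner loop might look like on plain float pointers (the names and trip count are illustrative assumptions):

// Same 2x-unrolled add, but with loadu/storeu on float*, so correctness no
// longer depends on 32-byte alignment. Assumes n is a multiple of 16 floats.
for (size_t i = 0; i < n; i += 16) {            // 16 floats = 2 x __m256
    __m256 b0 = _mm256_loadu_ps(b + i);
    __m256 c0 = _mm256_loadu_ps(c + i);
    __m256 b1 = _mm256_loadu_ps(b + i + 8);
    __m256 c1 = _mm256_loadu_ps(c + i + 8);
    _mm256_storeu_ps(a + i,     _mm256_add_ps(b0, c0));
    _mm256_storeu_ps(a + i + 8, _mm256_add_ps(b1, c1));
}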
Aligning your arrays by the vector width is typically fine, you don't need to go all the way to 64-byte lines or 128-byte pairs of lines. Or to 2MiB largepages.Unfetter
@PeterCordes I have read some of your other answers about streaming, and after reading the blog I added to my question, I'm curious to know when you think it is better to use _mm_prefetch or _mm256_stream_si256 or both. The prefetch example in the blog seems to benefit from using a whole cache line per iteration.Melonie
Software prefetch sometimes gains a little bit for sequential access, but is hard to tune and depends on the system and other load. If all cores are busy, there's often no bandwidth going to waste so no need for it. NT stores are good (or can be, see this Q&A for occasional drawbacks) if you're not going to re-read the destination data soon. But that's a sign you should cache-block your problem to get L2 cache hits, instead of bottlenecking on memory bandwidth. Avoid wasting CPU cycles mostly just copying.Unfetter
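Purely as an illustration of combining those two intrinsics (the prefetch distance, alignment assumptions, and function name are guesses that would need tuning, not a recommendation):

#include <immintrin.h>
#include <cstddef>

// Hypothetical: add b and c into a with non-temporal stores, prefetching ahead.
// _mm256_stream_ps is the float flavour of _mm256_stream_si256. Assumes a, b, c
// are 32-byte aligned (for the aligned loads and NT stores) and n is a multiple of 8.
void add_stream(float* a, const float* b, const float* c, std::size_t n) {
    const std::size_t ahead = 512 / sizeof(float);   // prefetch ~8 lines ahead (a guess)
    for (std::size_t i = 0; i < n; i += 8) {
        _mm_prefetch(reinterpret_cast<const char*>(b + i + ahead), _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char*>(c + i + ahead), _MM_HINT_T0);
        __m256 sum = _mm256_add_ps(_mm256_load_ps(b + i), _mm256_load_ps(c + i));
        _mm256_stream_ps(a + i, sum);                // bypass the cache for the destination
    }
    _mm_sfence();   // make the NT stores globally visible before a is read elsewhere
}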
You're writing in C++ with intrinsics, so you're not limited to simply gluing together library functions over arrays like you would be in NumPy or whatever. Increase your computational intensity (ALU work per unit of memory bandwidth, or per time data is loaded into registers) by doing more work in a single pass, even if that means sometimes redoing the same work if it's cheap, and by cache-blocking for L2 size. Spending a lot of effort optimizing single-thread memory bandwidth should be a last resort, or the last thing on the todo list after already doing more useful optimizations in other code.Unfetter
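A rough sketch of that cache-blocking idea (the block size and the second pass are invented for illustration; a real kernel would fuse whatever work the program actually needs):

#include <algorithm>
#include <cstddef>

// Hypothetical: walk the arrays in chunks small enough that the a/b/c slices
// stay resident in L2 (~64 KiB each here, a guess), so the second pass hits
// in cache instead of re-streaming everything from DRAM.
void two_passes_blocked(float* a, const float* b, const float* c,
                        std::size_t n, float s) {
    const std::size_t block = 16 * 1024;              // 16K floats = 64 KiB per array
    for (std::size_t base = 0; base < n; base += block) {
        const std::size_t end = std::min(n, base + block);
        for (std::size_t i = base; i < end; ++i)      // pass 1
            a[i] = b[i] + c[i];
        for (std::size_t i = base; i < end; ++i)      // pass 2 reuses data still hot in L2
            a[i] = a[i] * s + c[i];
    }
}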
@Melonie Not sure if the difference is significant anymore. But there still is some difference between aligned and unaligned loads. Data crossing cache lines will make the biggest difference because it will take up 2 cache entries so cache efficiency is reduced. Less of a problem with larger and larger arrays as the whole array will use 1 extra cache entry. But for short vectors a doubling of the cache usage can be fatal to speed.Metrist

I think a compiler's decision to unroll a loop can be influenced by various factors, including instruction pipelining, instruction-level parallelism, and memory access patterns. Unrolling can expose more opportunities for the compiler to optimize instruction scheduling and reduce per-iteration loop overhead, potentially improving performance.

In your case, since you are dealing with memory-bound operations, the main bottleneck is probably memory access rather than computation. Unrolling could still help a bit by increasing memory prefetching opportunities and by reducing loop overhead.
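For example, a 4x manual unroll of the original loop (a sketch of that transformation, assuming size is a multiple of 4) shows both effects: the four additions are independent, and the counter update and branch are paid once per four elements:

for (int64_t i = 0; i < size; i += 4) {
    a[i + 0] = b[i + 0] + c[i + 0];   // four independent adds that can
    a[i + 1] = b[i + 1] + c[i + 1];   // overlap in the pipeline
    a[i + 2] = b[i + 2] + c[i + 2];
    a[i + 3] = b[i + 3] + c[i + 3];
}   // i += 4 and the compare/branch run once per 4 elements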

Dissertate answered 16/5, 2024 at 11:15 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Classicist
