Is there a special benefit to consuming whole cache lines between iterations of a loop?

My program adds float arrays, and the loop gets unrolled 4x when compiled with maximum optimizations by MSVC and G++. I didn't understand why both compilers chose 4x, so I did some testing: a t-test on runtimes for manually unrolling 1-vs-2 or 1-vs-4 iterations only occasionally gave a p-value of ~0.03, 2-vs-4 was rarely < 0.05, and 2-vs-8+ was always > 0.05.

If I set the compiler to use 128-bit vectors or 256-bit vectors, it always unrolled 4x, so each iteration consumes a multiple of the 64-byte cache line size (significant or coincidence?).

The reason I'm thinking about cache lines is that I didn't expect unrolling to have any impact on a memory-bound program that sequentially reads and writes gigabytes of floats. Should there be a benefit to unrolling in this case? It's also possible there was no significant difference and my sample size just wasn't large enough.

I found this blog post that says manually unrolling an array copy is faster for medium-sized arrays and streaming is fastest for longer arrays. Its AvxAsyncPFCopier and AvxAsyncPFUnrollCopier functions seem to benefit from using whole cache lines as well as from manual unrolling. Benchmark from the blog post, with source, here.

#include <iostream>
#include <immintrin.h>

int main() {
    // example of manually unrolling float arrays
    size_t bytes = sizeof(__m256) * 10;
    size_t alignment = sizeof(__m256);
    // 10 x 32-byte vectors
    __m256* a = (__m256*) _mm_malloc(bytes, alignment); 
    __m256* b = (__m256*) _mm_malloc(bytes, alignment);
    __m256* c = (__m256*) _mm_malloc(bytes, alignment); 

    for (int i = 0; i < 10; i += 2) {
        // cache miss?
        // load 2 x 64-byte cache lines:
        //      2 x 32-byte vectors from b
        //      2 x 32-byte vectors from c
        a[i + 0] = _mm256_add_ps(b[i + 0], c[i + 0]);

        // cache hit?
        a[i + 1] = _mm256_add_ps(b[i + 1], c[i + 1]);

        // special bonus for consuming whole cache lines?
    }
    // free the _mm_malloc'd buffers
    _mm_free(a);
    _mm_free(b);
    _mm_free(c);
}

Original source for 3 unique float arrays

for (int64_t i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
}

MSVC with AVX2 instructions

            a[i] = b[i] + c[i];
00007FF7E2522370  vmovups     ymm2,ymmword ptr [rax+rcx]  
00007FF7E2522375  vmovups     ymm1,ymmword ptr [rcx+rax-20h]  
00007FF7E252237B  vaddps      ymm1,ymm1,ymmword ptr [rax-20h]  
00007FF7E2522380  vmovups     ymmword ptr [rdx+rax-20h],ymm1  
00007FF7E2522386  vaddps      ymm1,ymm2,ymmword ptr [rax]  
00007FF7E252238A  vmovups     ymm2,ymmword ptr [rcx+rax+20h]  
00007FF7E2522390  vmovups     ymmword ptr [rdx+rax],ymm1  
00007FF7E2522395  vaddps      ymm1,ymm2,ymmword ptr [rax+20h]  
00007FF7E252239A  vmovups     ymm2,ymmword ptr [rcx+rax+40h]  
00007FF7E25223A0  vmovups     ymmword ptr [rdx+rax+20h],ymm1  
00007FF7E25223A6  vaddps      ymm1,ymm2,ymmword ptr [rax+40h]  
00007FF7E25223AB  add         r9,20h  
00007FF7E25223AF  vmovups     ymmword ptr [rdx+rax+40h],ymm1  
00007FF7E25223B5  lea         rax,[rax+80h]  
00007FF7E25223BC  cmp         r9,r10  
00007FF7E25223BF  jle         main$omp$2+0E0h (07FF7E2522370h) 

MSVC with default instructions

            a[i] = b[i] + c[i];
00007FF71ECB2372  movups      xmm0,xmmword ptr [rax-10h]  
00007FF71ECB2376  add         r9,10h  
00007FF71ECB237A  movups      xmm1,xmmword ptr [rcx+rax-10h]  
00007FF71ECB237F  movups      xmm2,xmmword ptr [rax+rcx]  
00007FF71ECB2383  addps       xmm1,xmm0  
00007FF71ECB2386  movups      xmm0,xmmword ptr [rax]  
00007FF71ECB2389  addps       xmm2,xmm0  
00007FF71ECB238C  movups      xmm0,xmmword ptr [rax+10h]  
00007FF71ECB2390  movups      xmmword ptr [rdx+rax-10h],xmm1  
00007FF71ECB2395  movups      xmm1,xmmword ptr [rcx+rax+10h]  
00007FF71ECB239A  movups      xmmword ptr [rdx+rax],xmm2  
00007FF71ECB239E  movups      xmm2,xmmword ptr [rcx+rax+20h]  
00007FF71ECB23A3  addps       xmm1,xmm0  
00007FF71ECB23A6  movups      xmm0,xmmword ptr [rax+20h]  
00007FF71ECB23AA  addps       xmm2,xmm0  
00007FF71ECB23AD  movups      xmmword ptr [rdx+rax+10h],xmm1  
00007FF71ECB23B2  movups      xmmword ptr [rdx+rax+20h],xmm2  
00007FF71ECB23B7  add         rax,40h  
00007FF71ECB23BB  cmp         r9,r10  
00007FF71ECB23BE  jle         main$omp$2+0D2h (07FF71ECB2372h)  
Melonie answered 19/6, 2022 at 5:11 Comment(12)
For contiguous accesses, I wouldn't expect that to be a reason why unrolling helps; I'd expect unrolling to have the same benefit even if the array address % 64 == 16 or 32, so 64 contiguous bytes span 2 lines. If striding down a column of a matrix, it could be much better to consume whole cache lines so you don't have to touch those lines again for the next (set of) column(s) on the next outer iteration.Unfetter
Also, G++ decided to unroll? What version and options? -funroll-loops is only enabled by default as part of -fprofile-use when profiling data is available. You're not on a Mac or something where g++ is actually clang++ are you? LLVM's optimizer does unroll tiny loops by 4, small loops by 3 or 2.Unfetter
Doing an AVX load is more efficient than doing 4 separate register loads even if you are memory bound. So the compiler will unroll the loop until it can do a combined large load, but beyond that there is no benefit to unrolling further.Metrist
You should make a struct alignas(64) Test { float f[16]; }; and avoid all the implementation-defined underscore stuff. Your original code will optimize just fine if the alignment is known from the struct.Metrist
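A minimal sketch of that alignas(64) suggestion (the function and loop are placeholders, not the commenter's code); with the alignment expressed in the type, a plain loop can auto-vectorize without _mm_malloc or intrinsics:

#include <cstddef>

// Hypothetical wrapper: 16 floats fill exactly one 64-byte cache line.
struct alignas(64) Test {
    float f[16];
};

// The compiler knows every Test starts on a cache-line boundary, so it is
// free to use aligned vector loads/stores when it vectorizes this loop.
void add(Test* a, const Test* b, const Test* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (int j = 0; j < 16; ++j)
            a[i].f[j] = b[i].f[j] + c[i].f[j];
}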
@GoswinvonBrederlow So an aligned move is significantly faster than unaligned? I will have to do this!Melonie
@PeterCordes I will check my compiler options at work. I'm writing this question from memory. Maybe I'm remembering wrong that G++ unrolled.Melonie
If you're bottlenecked on DRAM bandwidth, misalignment (vs. the vector width) makes barely any difference, like a couple percent for modern AVX2 CPUs. (Except to correctness if you use alignment-required loads like you're doing here, with deref of __m256* instead of _mm256_loadu_ps). Unlike with AVX-512 where misaligned pointers can give a slowdown even for DRAM, like 15% or so. (AVX1 CPUs like Sandybridge do have worse slowdowns for misaligned 256-bit load/store, at least on cache-line splits. But otherwise, DRAM is so slow that there's time to absorb the line-split penalties.)Unfetter
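A sketch of what that unaligned-safe variant of the question's inner loop might look like on plain float pointers (the names and trip count are illustrative assumptions):

// Same 2x-unrolled add, but with loadu/storeu on float*, so correctness no
// longer depends on 32-byte alignment. Assumes n is a multiple of 16 floats.
for (size_t i = 0; i < n; i += 16) {            // 16 floats = 2 x __m256
    __m256 b0 = _mm256_loadu_ps(b + i);
    __m256 c0 = _mm256_loadu_ps(c + i);
    __m256 b1 = _mm256_loadu_ps(b + i + 8);
    __m256 c1 = _mm256_loadu_ps(c + i + 8);
    _mm256_storeu_ps(a + i,     _mm256_add_ps(b0, c0));
    _mm256_storeu_ps(a + i + 8, _mm256_add_ps(b1, c1));
}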
Aligning your arrays by the vector width is typically fine, you don't need to go all the way to 64-byte lines or 128-byte pairs of lines. Or to 2MiB largepages.Unfetter
@PeterCordes I have read some of your other answers about streaming, and after reading the blog I added to my question, I'm curious to know when you think it is better to use _mm_prefetch or _mm256_stream_si256 or both. The prefetch example in the blog seems to benefit from using a whole cache line per iteration.Melonie
Software prefetch sometimes gains a little bit for sequential access, but is hard to tune and depends on the system and other load. If all cores are busy, there's often no bandwidth going to waste so no need for it. NT stores are good (or can be, see this Q&A for occasional drawbacks) if you're not going to re-read the destination data soon. But that's a sign you should cache-block your problem to get L2 cache hits, instead of bottlenecking on memory bandwidth. Avoid wasting CPU cycles mostly just copying.Unfetter
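Purely as an illustration of combining those two intrinsics (the prefetch distance, alignment assumptions, and function name are guesses that would need tuning, not a recommendation):

#include <immintrin.h>
#include <cstddef>

// Hypothetical: add b and c into a with non-temporal stores, prefetching ahead.
// _mm256_stream_ps is the float flavour of _mm256_stream_si256. Assumes a, b, c
// are 32-byte aligned (for the aligned loads and NT stores) and n is a multiple of 8.
void add_stream(float* a, const float* b, const float* c, std::size_t n) {
    const std::size_t ahead = 512 / sizeof(float);   // prefetch ~8 lines ahead (a guess)
    for (std::size_t i = 0; i < n; i += 8) {
        _mm_prefetch(reinterpret_cast<const char*>(b + i + ahead), _MM_HINT_T0);
        _mm_prefetch(reinterpret_cast<const char*>(c + i + ahead), _MM_HINT_T0);
        __m256 sum = _mm256_add_ps(_mm256_load_ps(b + i), _mm256_load_ps(c + i));
        _mm256_stream_ps(a + i, sum);                // bypass the cache for the destination
    }
    _mm_sfence();   // make the NT stores globally visible before a is read elsewhere
}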
You're writing in C++ with intrinsics, so you're not limited to simply gluing together library functions over arrays like you would be in NumPy or whatever. Increase your computational intensity (ALU work per unit of memory bandwidth, or per time data is loaded into registers) by doing more work in a single pass, even if that means sometimes redoing the same work if it's cheap, and by cache-blocking for L2 size. Spending a lot of effort optimizing single-thread memory bandwidth should be a last resort, or the last thing on the todo list after already doing more useful optimizations in other code.Unfetter
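A rough sketch of that cache-blocking idea (the block size and the second pass are invented for illustration; a real kernel would fuse whatever work the program actually needs):

#include <algorithm>
#include <cstddef>

// Hypothetical: walk the arrays in chunks small enough that the a/b/c slices
// stay resident in L2 (~64 KiB each here, a guess), so the second pass hits
// in cache instead of re-streaming everything from DRAM.
void two_passes_blocked(float* a, const float* b, const float* c,
                        std::size_t n, float s) {
    const std::size_t block = 16 * 1024;              // 16K floats = 64 KiB per array
    for (std::size_t base = 0; base < n; base += block) {
        const std::size_t end = std::min(n, base + block);
        for (std::size_t i = base; i < end; ++i)      // pass 1
            a[i] = b[i] + c[i];
        for (std::size_t i = base; i < end; ++i)      // pass 2 reuses data still hot in L2
            a[i] = a[i] * s + c[i];
    }
}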
@Melonie Not sure if the difference is significant anymore. But there still is some difference between aligned and unaligned loads. Data crossing cache lines will make the biggest difference because it will take up 2 cache entries so cache efficiency is reduced. Less of a problem with larger and larger arrays as the whole array will use 1 extra cache entry. But for short vectors a doubling of the cache usage can be fatal to speed.Metrist

I think a compiler's decision to unroll a loop can be influenced by various factors, including instruction pipelining, instruction-level parallelism, and memory access patterns. Unrolling can expose more opportunities for the compiler to optimize instruction scheduling and reduce per-iteration loop overhead, potentially improving performance.

In your case, since you are dealing with memory-bound operations, the main bottleneck is probably memory access rather than computation. Unrolling could still help a bit by increasing memory prefetching opportunities and by reducing loop overhead.
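For example, a 4x manual unroll of the original loop (a sketch of that transformation, assuming size is a multiple of 4) shows both effects: the four additions are independent, and the counter update and branch are paid once per four elements:

for (int64_t i = 0; i < size; i += 4) {
    a[i + 0] = b[i + 0] + c[i + 0];   // four independent adds that can
    a[i + 1] = b[i + 1] + c[i + 1];   // overlap in the pipeline
    a[i + 2] = b[i + 2] + c[i + 2];
    a[i + 3] = b[i + 3] + c[i + 3];
}   // i += 4 and the compare/branch run once per 4 elements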

Dissertate answered 16/5, 2024 at 11:15 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Classicist
