`_mm256_dp_ps` is only useful for dot products of 2 to 4 elements; for longer vectors, use vertical SIMD in a loop and reduce to scalar at the end. Using `_mm256_dp_ps` and `_mm256_add_ps` in a loop would be much slower.
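For the short case where a dot-product instruction is actually appropriate, here's a minimal sketch of a 4-element dot product with the 128-bit `_mm_dp_ps` (not part of the code below; assumes SSE4.1, untested):

```cpp
#include <immintrin.h>

// Dot product of exactly 4 floats; this is the size range where dpps is fine.
inline float dot4( const float* a, const float* b )
{
    const __m128 va = _mm_loadu_ps( a );
    const __m128 vb = _mm_loadu_ps( b );
    // Immediate 0xF1: multiply all 4 lanes, write the sum to the lowest lane only
    const __m128 dp = _mm_dp_ps( va, vb, 0xF1 );
    return _mm_cvtss_f32( dp );
}
```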
Unlike MSVC and ICC, GCC and clang require you to enable (with command-line options) the ISA extensions you use intrinsics for.
The code below is probably close to the theoretical performance limit of your CPU. Untested.

Compile it with clang or gcc using `-O3 -march=native`. (It requires at least `-mavx -mfma`, but the `-mtune` options implied by `-march` are good too, and so are `-mpopcnt` and the other things `-march=native` enables. Tune options are critical for this to compile efficiently on most CPUs with FMA, specifically `-mno-avx256-split-unaligned-load`: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?)

Or compile it with MSVC `-O2 -arch:AVX2`.
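Since gcc and clang only define `__AVX__` and `__FMA__` when those extensions are enabled, a small guard like the sketch below (my suggestion, not required by the code) turns missing `-mavx`/`-mfma` options into a clear compile-time error:

```cpp
// Optional compile-time check: fail early if AVX / FMA weren't enabled.
// gcc and clang define these macros only when -mavx / -mfma (or an -march
// implying them) is passed; MSVC doesn't define __FMA__, so skip it there.
#if !defined(_MSC_VER) && ( !defined(__AVX__) || !defined(__FMA__) )
#error "Enable AVX and FMA, e.g. compile with -mavx -mfma or -march=native"
#endif
```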
#include <immintrin.h>
#include <vector>
#include <assert.h>
// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an integer offset argument to make it easier for the compiler to emit good code.
// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_mul_ps( a, b );
}
// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
constexpr int lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_fmadd_ps( a, b, acc );
}
// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
assert( a.size() == b.size() );
assert( 0 == ( a.size() % 32 ) );
if( a.empty() )
return 0.0f;
const float* p1 = a.data();
const float* const p1End = p1 + a.size();
const float* p2 = b.data();
// Process initial 32 values. Nothing to add yet, just multiplying.
__m256 dot0 = mul8<0>( p1, p2 );
__m256 dot1 = mul8<1>( p1, p2 );
__m256 dot2 = mul8<2>( p1, p2 );
__m256 dot3 = mul8<3>( p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
// Process the rest of the data.
// The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
// Unrolling manually for 2 reasons:
// 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
// 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
while( p1 < p1End )
{
dot0 = fma8<0>( dot0, p1, p2 );
dot1 = fma8<1>( dot1, p1, p2 );
dot2 = fma8<2>( dot2, p1, p2 );
dot3 = fma8<3>( dot3, p1, p2 );
p1 += 8 * 4;
p2 += 8 * 4;
}
// Add 32 values into 8
const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
// Add 8 values into 4
const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
// Add 4 values into 2
const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
// Add 2 lower values into the final result
const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
// Return the lowest lane of the result vector.
// The intrinsic below compiles into a no-op; modern compilers return floats in the lowest lane of the xmm0 register.
return _mm_cvtss_f32( r1 );
}
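A hypothetical usage sketch (not part of the original code); note this version asserts that both vectors have the same length, a multiple of 32:

```cpp
#include <cstdio>

int main()
{
    // 1024 is a multiple of 32, as required by the asserts above
    std::vector<float> a( 1024, 1.0f ), b( 1024, 2.0f );
    // Expected result: 1024 * 1.0 * 2.0 = 2048
    std::printf( "%f\n", dotProductFma( a, b ) );
    return 0;
}
```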
Possible further improvements:

- Unroll by 8 vectors instead of 4. I've checked gcc 9.2 asm output; the compiler only used 8 of the 16 available vector registers.
- Make sure both input vectors are aligned, e.g. use a custom allocator which calls `_aligned_malloc` / `_aligned_free` on MSVC, or `aligned_alloc` / `free` on gcc & clang. Then replace `_mm256_loadu_ps` with `_mm256_load_ps`. See the sketch after this list.
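As mentioned in the list above, a minimal sketch of such an aligned allocator (an illustration, not part of the original code; the dot product function would then need to accept a vector with this allocator, or raw pointers):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#ifdef _MSC_VER
#include <malloc.h>
#endif

// Allocator returning 32-byte aligned memory, enough for _mm256_load_ps.
template<typename T, std::size_t Alignment = 32>
struct AlignedAllocator
{
    using value_type = T;

    AlignedAllocator() = default;
    template<typename U>
    AlignedAllocator( const AlignedAllocator<U, Alignment>& ) noexcept {}

    T* allocate( std::size_t n )
    {
#ifdef _MSC_VER
        void* p = _aligned_malloc( n * sizeof( T ), Alignment );
#else
        // std::aligned_alloc requires the size to be a multiple of the alignment
        const std::size_t bytes = ( n * sizeof( T ) + Alignment - 1 ) / Alignment * Alignment;
        void* p = std::aligned_alloc( Alignment, bytes );
#endif
        if( !p )
            throw std::bad_alloc();
        return static_cast<T*>( p );
    }

    void deallocate( T* p, std::size_t ) noexcept
    {
#ifdef _MSC_VER
        _aligned_free( p );
#else
        std::free( p );
#endif
    }
};

template<typename T, typename U, std::size_t A>
bool operator==( const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>& ) noexcept { return true; }
template<typename T, typename U, std::size_t A>
bool operator!=( const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>& ) noexcept { return false; }

// std::vector<float, AlignedAllocator<float>> then guarantees .data() is 32-byte
// aligned, so the kernels can use _mm256_load_ps instead of _mm256_loadu_ps.
```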
To auto-vectorize a simple scalar dot product, you'd also need OpenMP SIMD or `-ffast-math` (implied by `-Ofast`) to let the compiler treat FP math as associative even though it's not (because of rounding). But GCC won't use multiple accumulators when auto-vectorizing, even if it does unroll, so you'd bottleneck on FMA latency, not load throughput.
(2 loads per FMA means the throughput bottleneck for this code is vector loads, not actual FMA operations.)
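For comparison, a sketch of the plain scalar loop that relies on auto-vectorization (assumes gcc/clang with `-O3 -march=native` plus either `-fopenmp-simd` so the pragma takes effect, or `-ffast-math` without the pragma):

```cpp
#include <cstddef>

float dotProductScalar( const float* a, const float* b, std::size_t n )
{
    float sum = 0.0f;
    // The reduction clause lets the compiler reorder the additions,
    // which is what makes vectorizing this dependency chain legal.
    #pragma omp simd reduction( + : sum )
    for( std::size_t i = 0; i < n; i++ )
        sum += a[ i ] * b[ i ];
    return sum;
}
```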
Update 2023: because this answer collected many upvotes, here's another version which supports vectors of arbitrary length, not necessarily a multiple of 32 elements. The main loop is the same; the difference is the handling of the remainder.

As you can see, it's relatively tricky to handle the remainder in a way that is both performant and fair with regard to summation order (summation order affects the numerical precision of the result). The key part of the implementation is the `_mm256_maskload_ps` conditional load.
#include <immintrin.h>
#include <vector>
#include <algorithm>
#include <assert.h>
#include <stdint.h>
// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with an integer offset argument to make it easier for the compiler to emit good code.
// Returns acc + ( p1 * p2 ), for 8 float lanes
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
constexpr ptrdiff_t lanes = offsetRegs * 8;
const __m256 a = _mm256_loadu_ps( p1 + lanes );
const __m256 b = _mm256_loadu_ps( p2 + lanes );
return _mm256_fmadd_ps( a, b, acc );
}
#ifdef __AVX2__
inline __m256i makeRemainderMask( ptrdiff_t missingLanes )
{
// Make a mask of 8 bytes
// These aren't branches, they should compile to conditional moves
missingLanes = std::max( missingLanes, (ptrdiff_t)0 );
// When 8 or more lanes are missing the mask is all zeros; otherwise it keeps
// ( 8 - missingLanes ) bytes of ones at the bottom. The shift count stays
// below 64, so the shift is well-defined.
const uint64_t mask = ( missingLanes < 8 ) ? ( ~(uint64_t)0 >> ( missingLanes * 8 ) ) : 0;
// Sign extend the bytes into int32 lanes in AVX vector
__m128i tmp = _mm_cvtsi64_si128( (int64_t)mask );
return _mm256_cvtepi8_epi32( tmp );
}
#else
// Aligned by 64 bytes
// The load will only touch a single cache line, no penalty for unaligned load
alignas( 64 ) static const int s_remainderLoadMask[ 16 ] = {
-1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0 };
inline __m256i makeRemainderMask( ptrdiff_t missingLanes )
{
// These aren't branches, they compile to conditional moves
missingLanes = std::max( missingLanes, (ptrdiff_t)0 );
missingLanes = std::min( missingLanes, (ptrdiff_t)8 );
// Unaligned load from a constant array
const int* rsi = &s_remainderLoadMask[ missingLanes ];
return _mm256_loadu_si256( ( const __m256i* )rsi );
}
#endif
// Same as fma8(), but loads conditionally using the mask
// When the mask has all bits set, this is equivalent to fma8(), just 1 instruction longer
// When the mask is a zero vector, the function doesn't load anything and returns `acc` unchanged
template<int offsetRegs>
inline __m256 fma8rem( __m256 acc, const float* p1, const float* p2, ptrdiff_t rem )
{
constexpr ptrdiff_t lanes = offsetRegs * 8;
// Generate the mask for conditional loads
// The implementation depends on whether AVX2 is enabled with compiler switches
const __m256i mask = makeRemainderMask( ( 8 + lanes ) - rem );
// These conditional load instructions produce zeros for the masked out lanes
const __m256 a = _mm256_maskload_ps( p1 + lanes, mask );
const __m256 b = _mm256_maskload_ps( p2 + lanes, mask );
return _mm256_fmadd_ps( a, b, acc );
}
// Compute dot product of float vectors, using 8-wide FMA instructions
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
assert( a.size() == b.size() );
const size_t length = a.size();
if( length == 0 )
return 0.0f;
const float* p1 = a.data();
const float* p2 = b.data();
// Compute length of the remainder;
// We want a remainder of length [ 1 .. 32 ] instead of [ 0 .. 31 ]
const ptrdiff_t rem = ( ( length - 1 ) % 32 ) + 1;
const float* const p1End = p1 + length - rem;
// Initialize accumulators with zeros
__m256 dot0 = _mm256_setzero_ps();
__m256 dot1 = _mm256_setzero_ps();
__m256 dot2 = _mm256_setzero_ps();
__m256 dot3 = _mm256_setzero_ps();
// Process the majority of the data.
// The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
// Unrolling manually for 2 reasons:
// 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
// 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
while( p1 < p1End )
{
dot0 = fma8<0>( dot0, p1, p2 );
dot1 = fma8<1>( dot1, p1, p2 );
dot2 = fma8<2>( dot2, p1, p2 );
dot3 = fma8<3>( dot3, p1, p2 );
p1 += 32;
p2 += 32;
}
// Handle the last, possibly incomplete batch of length [ 1 .. 32 ]
// To save multiple branches, we load that entire batch with `vmaskmovps` conditional loads
// On modern CPUs, the performance of such loads is pretty close to normal full vector loads
dot0 = fma8rem<0>( dot0, p1, p2, rem );
dot1 = fma8rem<1>( dot1, p1, p2, rem );
dot2 = fma8rem<2>( dot2, p1, p2, rem );
dot3 = fma8rem<3>( dot3, p1, p2, rem );
// Add 32 values into 8
dot0 = _mm256_add_ps( dot0, dot2 );
dot1 = _mm256_add_ps( dot1, dot3 );
dot0 = _mm256_add_ps( dot0, dot1 );
// Add 8 values into 4
__m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0 ),
_mm256_extractf128_ps( dot0, 1 ) );
// Add 4 values into 2
r4 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
// Add 2 lower values into the scalar result
r4 = _mm_add_ss( r4, _mm_movehdup_ps( r4 ) );
// Return the lowest lane of the result vector.
// The intrinsic below compiles into a no-op; modern compilers return floats in the lowest lane of the xmm0 register.
return _mm_cvtss_f32( r4 );
}
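A hypothetical usage sketch for this version, with a length that isn't a multiple of 32 to exercise the masked remainder path:

```cpp
#include <cstdio>

int main()
{
    // 1000 = 31 * 32 + 8, so the final batch is handled by the masked loads
    std::vector<float> a( 1000 ), b( 1000, 1.0f );
    for( size_t i = 0; i < a.size(); i++ )
        a[ i ] = float( i );
    // Expected result: 0 + 1 + ... + 999 = 499500
    std::printf( "%f\n", dotProductFma( a, b ) );
    return 0;
}
```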
From the comments:

`dpps` is only worth considering for something like a 3 or 4-element dot product. For larger arrays, you want vertical FMA into multiple accumulators (Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)). You'll want `-mfma -mavx2`, or better `-march=native`. And yes, you need to enable target options for any intrinsic you want to use, along with `-O3`. – Boleslaw

`-O3` can't auto-vectorize a naive dot product unless you use OpenMP or `-ffast-math` to tell the compiler to treat FP math as associative. – Boleslaw

I'll use `-Ofast` instead of `-O3`, will update my question. – Cabasset

`-Ofast` is currently equivalent to `-O3 -ffast-math`, so yes, that would work. But unfortunately GCC won't use multiple accumulators even if it does unroll an FP loop, so you'll bottleneck on SIMD FMA latency instead of throughput. (Factor of 8 on Skylake.) – Boleslaw