Improving performance of floating-point dot-product of an array with SIMD

I have this function to compute a piece of array of double's:

void avx2_mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) noexcept
{
    __m256i accumulator = _mm256_setzero_pd();

    for (std::size_t block = 0; block < 64; block += 4)
    {
        __m256i lhs = _mm256_set_pd(
            lhs_arr[block    ],
            lhs_arr[block + 1],
            lhs_arr[block + 2],
            lhs_arr[block + 3]);

        __m256i rhs = _mm256_set_pd(
            rhs_arr[block    ],
            rhs_arr[block + 1],
            rhs_arr[block + 2],
            rhs_arr[block + 3]);

        accumulator = _mm256_add_pd(accumulator, _mm256_mul_pd(lhs, rhs));
    }
    double* res = reinterpret_cast<double*>(&accumulator);
    sum += res[0] + res[1] + res[2] + res[3];
}

, and the performance of this code is not what I wanted. I believe the way to improve it is to avoid creating a temporary double array just to sum its elements, but I don't know how to do that.

By the way, _mm256_setzero_pd makes the whole function about twice as slow compared to _mm256_setzero_si256.

My flags: -O3 -ftree-vectorize -march=native

P.S. It's not a real problem, just a design question.

Irresoluble answered 20/1, 2021 at 21:53 Comment(7)
please provide a minimal reproducible example, your example won't compile. Are you using compiler optimisations? How fast is the code? How fast do you want it to be? Are your inputs aligned to the appropriate boundaries? If not could they be?Dedie
I'm sorry, typo was corrected.Irresoluble
Firstly, your register type is wrong. It should be __m256d, not __m256i. Secondly, with properly aligned input arrays (32-byte alignment), you should be using _mm256_load_pd to get the data. Otherwise, you're relying on the compiler to understand this is what those _mm256_set_pd calls were trying to achieve, and it probably won't. As a minor improvement, you could consider using horizontal additions for the final sum (using _mm256_hadd_pd)Contessacontest
Did you look at the generated asm? set_pd puts the numbers in the reverse order of memory. You could also look at the asm the compiler generates for naive code, autovectorization should work well for this with -ffast-math.Warmongering
@paddy: _mm256_hadd_pd is not particularly useful for an efficient horizontal sum of a single vector, unless you're optimizing for code-size over speed. But yes, definitely use load or loadu intrinsics with __m256d, and yes the hsum could be better (see Get sum of values stored in __m256d with SSE/AVX)Rules
Note that _mm256_set_pd will put the first argument into the highest part of the target register, making it hard for a compiler to replace it by a simple load/loadu (_mm256_setr_pd does not reverse the order).Creamcolored
Almost a duplicate of How do you load/store from/to an array of doubles with GNU C Vector Extensions?, but only because my answer has a section about Intel intrinsics, while the question asked for GNU C native vector stuff without intrinsics. Also related: Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? - except this appears not to be GCC, probably clang. GCC wouldn't let you get away with implicit conversion between __m256i and __m256dRules
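
To illustrate the element-order point raised in the comments above, here is a minimal, hedged sketch (the function and variable names are purely illustrative): _mm256_set_pd takes its first argument as the highest element, so it reverses memory order, while _mm256_setr_pd and _mm256_loadu_pd keep memory order and are easy for the compiler to turn into a single vmovupd.

#include <immintrin.h>

void order_demo(const double* p)   // p points to at least 4 doubles
{
    // _mm256_set_pd: FIRST argument becomes the HIGHEST element,
    // so this is the reverse of memory order:
    __m256d reversed = _mm256_set_pd(p[0], p[1], p[2], p[3]); // {p[3], p[2], p[1], p[0]}

    // These two are equivalent and preserve memory order:
    __m256d a = _mm256_setr_pd(p[0], p[1], p[2], p[3]);       // {p[0], p[1], p[2], p[3]}
    __m256d b = _mm256_loadu_pd(p);                           // same as above
    (void)reversed; (void)a; (void)b;
}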

Some of the suggestions were already mentioned in the comments, but I will try to suggest something in addition to that.

Most SIMD floating-point instructions on Haswell and newer CPUs have a reciprocal throughput lower than their latency, which means performance can be improved by executing multiple instructions in parallel. For example, according to Agner Fog's instruction tables, vaddpd has a latency of 3 cycles and a reciprocal throughput of 1 cycle on Haswell, meaning the CPU can potentially keep 3 such additions in flight at once. Even more vmulpd instructions can execute in parallel, given its latency of 5 cycles and reciprocal throughput of 0.5 cycles.
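
As a hedged back-of-the-envelope illustration of this, the number of independent dependency chains needed to saturate an execution unit is roughly latency divided by reciprocal throughput (equivalently, latency times instructions issued per cycle), using the Haswell numbers quoted above:

// Rule of thumb: independent chains needed ≈ latency / reciprocal throughput.
// Haswell numbers from Agner Fog's tables, as cited in the paragraph above:
constexpr int vaddpd_chains = 3 * 1;   // latency 3, 1 issued per cycle  -> 3 chains
constexpr int vmulpd_chains = 5 * 2;   // latency 5, 2 issued per cycle  -> 10 chains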

Your code likely does not leverage this instruction-level parallelism (ILP), because each loop iteration depends on the accumulator value updated in the previous iteration. The compiler is not allowed to perform many optimizations here, such as reordering floating-point math operations, as that could produce mathematically different results. As a result, your algorithm becomes latency-bound.

You could mitigate that by using compiler-specific options, like -ffast-math for gcc and compatible compilers, but it is more portable to just rewrite your algorithm with this parallelism in mind.

I will also incorporate other recommendations in the code below, such as:

  • Fix the incorrect vector type, which should be __m256d.
  • Use dedicated load intrinsics to load whole vectors from memory instead of relying on the compiler to optimize _mm256_set_pd into a load.
  • Use FMA intrinsics instead of relying on the compiler to fuse the _mm256_add_pd+_mm256_mul_pd pair. FMA reduces the latency of the calculation, making the addition effectively free. It also produces more precise results, because there is no intermediate rounding between the multiplication and the addition. Note that FMA is a separate CPU feature not included in AVX; on Intel it was introduced alongside AVX2 (Haswell), so it is not available on CPUs supporting only AVX.
  • Use proper intrinsics to extract the final sum from the vector (this will likely be optimized away in the final assembly, since the double is stored in a vector register anyway).
void avx2_mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) noexcept
{
    __m256d accumulator1 = _mm256_setzero_pd();
    __m256d accumulator2 = _mm256_setzero_pd();

    for (std::size_t block = 0; block < 64; block += 4 * 2)
    {
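        // Unaligned full-vector loads; elements stay in memory order (unlike _mm256_set_pd).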
        __m256d lhs1 = _mm256_loadu_pd(lhs_arr + block);
        __m256d lhs2 = _mm256_loadu_pd(lhs_arr + block + 4);
        __m256d rhs1 = _mm256_loadu_pd(rhs_arr + block);
        __m256d rhs2 = _mm256_loadu_pd(rhs_arr + block + 4);

        accumulator1 = _mm256_fmadd_pd(lhs1, rhs1, accumulator1);
        accumulator2 = _mm256_fmadd_pd(lhs2, rhs2, accumulator2);
    }

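    // Combine the two independent accumulator chains.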
    accumulator1 = _mm256_add_pd(accumulator1, accumulator2);

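    // Horizontal sum: reduce 256 bits -> 128 bits -> scalar in the low element.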
    __m128d accumulator = _mm_add_pd(_mm256_castpd256_pd128(accumulator1),
        _mm256_extractf128_pd(accumulator1, 1));
    accumulator = _mm_add_pd(accumulator,
        _mm_unpackhi_pd(accumulator, accumulator));

    sum += _mm_cvtsd_f64(accumulator);
}

In the code above, I used two separate accumulators, so the CPU is now able to run two chains of accumulation in parallel. Increasing the parallelism further could be beneficial (see the performance numbers mentioned above), but it becomes more awkward if the block length is not divisible by the number of accumulators times the number of elements per vector. You may need to set up tail processing, which can add a slight performance overhead.
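
For reference, a hedged sketch of what such tail processing might look like for an arbitrary length n (the function name and signature are illustrative, not part of the question): the main loop consumes pairs of 4-element vectors, and a scalar loop finishes the remaining n % 8 elements.

#include <immintrin.h>
#include <cstddef>

void mul_block_n(double& sum, const double* lhs_arr, const double* rhs_arr,
                 std::size_t n) noexcept
{
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();

    std::size_t block = 0;
    for (; block + 8 <= n; block += 8)
    {
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(lhs_arr + block),
                               _mm256_loadu_pd(rhs_arr + block), acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(lhs_arr + block + 4),
                               _mm256_loadu_pd(rhs_arr + block + 4), acc2);
    }

    acc1 = _mm256_add_pd(acc1, acc2);
    __m128d acc = _mm_add_pd(_mm256_castpd256_pd128(acc1),
                             _mm256_extractf128_pd(acc1, 1));
    acc = _mm_add_pd(acc, _mm_unpackhi_pd(acc, acc));
    double result = _mm_cvtsd_f64(acc);

    // Scalar tail: whatever is left after the last full pair of vectors.
    for (; block < n; ++block)
        result += lhs_arr[block] * rhs_arr[block];

    sum += result;
}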

Note that, as mentioned before, this algorithm may produce results that are not strictly equal to the original because of the different order of arithmetic operations and the use of FMA, and hence a different accumulation of rounding error. However, this is often not a problem, especially with the high precision of double.


Some of the suggestions that were not used in the code above:

  • Horizontal addition (_mm256_hadd_pd) was not used to accumulate the final sum because, on current Intel (up to Coffee Lake) and AMD (up to Zen 2) processors, the vhaddpd instruction has slightly more latency than the vunpckhpd+vaddpd pair, even though the latter forms a dependency chain. This may change in future processors, making horizontal addition beneficial; it can still be useful today to save some code size (a hedged sketch of that variant follows this list).
  • The code uses unaligned loads from memory, as opposed to _mm256_load_pd, because unaligned loads on modern CPUs that support AVX2 have no performance penalty as long as the loads are actually aligned at run time, or at least don't cross a cache line boundary. There is overhead when a cache line, and especially a page boundary, is crossed, but usually this still doesn't degrade performance much on modern CPUs (there is some performance data in this post by Peter Cordes, as well as the materials linked there).
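
For completeness, here is a hedged sketch of the _mm256_hadd_pd-based reduction mentioned in the first bullet above (the helper name is illustrative); it trades a few uops of speed for smaller code:

#include <immintrin.h>

// Illustrative helper: horizontal sum of a __m256d {a, b, c, d} via vhaddpd.
static inline double hsum256_hadd(__m256d v) noexcept
{
    __m256d h  = _mm256_hadd_pd(v, v);        // {a+b, a+b, c+d, c+d}
    __m128d lo = _mm256_castpd256_pd128(h);   // a+b in the low element
    __m128d hi = _mm256_extractf128_pd(h, 1); // c+d in the low element
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi)); // (a+b) + (c+d)
}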
Soldierly answered 21/1, 2021 at 12:41 Comment(5)
Good answer, but one minor thing: _mm256_hadd_pd is slow, it decodes into multiple µops on all CPUs. I usually do it like this instead: gist.github.com/Const-me/0c08f7bef1fd32f31db81bf1c1cb72a1Schaffhausen
@Schaffhausen Yes, it seems horizontal addition instructions have slightly more latency than the unpack+add pair. I've updated the answer, thanks.Soldierly
Misaligned loads have a latency and throughput penalty on any cache-line split, as the load execution unit needs to access L1d a 2nd time for the other side of the split. (There are also a limited number of split buffers, and a perf counter for cases where the CPU runs out.) In practice the perf difference is usually small (a few %) for misaligned inputs with 256b vectors when the data isn't already hot in L1d cache - the bottleneck of waiting for data from outer caches or especially DRAM gives the core + L1d time to hide the costs. Unlike with AVX512, where misalignment costs more.Rules
@PeterCordes When the data is not in L1d cache then misaligned access penalty, if any, is insignificant compared to the latency of lower level caches and RAM. There is extra penalty when the access crosses page boundary. Otherwise, if the data is in cache, I had the impression that the penalty is removed in the recent x86 CPUs. Do you have a reference that shows that's not the case?Soldierly
How can I accurately benchmark unaligned access speed on x86_64 has my test results from Skylake: with the same address repeatedly (hot in L1d) cache-line split loads have 1c throughput, vs. 0.5c for non-split. (And ~3.8c throughput for page-split loads). There is truly zero penalty for misaligned loads that don't cross a cache line boundary, e.g. loading a dword when address % 64 <= 60Rules

As Marc Glisse alluded to in the comments above, you'll probably want to set -ffast-math in your compiler flags. This is one of those functions that is so easy for the compiler to optimise that you're better off just writing the code in C++ directly.

void mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) {
    double res = 0;
    for(int i = 0; i < 64; ++i) {
        res += lhs_arr[i] * rhs_arr[i];
    }
    sum += res;
}

This C++ code produces output identical to your SIMD code.

https://godbolt.org/z/cddPMd
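
For reference, a hedged example of a compile command combining the flags from the question with -ffast-math (the file name is purely illustrative):

g++ -O3 -march=native -ffast-math -c dot_product.cpp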

Denial answered 21/1, 2021 at 3:37 Comment(1)
Or you can use #pragma omp simd reduction (something) to let the compiler pretend FP math is associative for this one loop, without having to use -ffast-math for your whole file (or program, with LTO). OpenMP might also get GCC to unroll with multiple accumulators where it wouldn't by default (except maybe with profile-guided optimization), letting it hide FP latency. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for the effect of that on an FP dot product.)Rules
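
A hedged sketch of what the per-loop pragma approach from this comment might look like when applied to the loop in the answer above (the function name is illustrative; compile with -fopenmp-simd or -fopenmp so the pragma takes effect):

void mul_64_block_omp(double& sum, const double* lhs_arr, const double* rhs_arr)
{
    double res = 0;
    // Allow this one reduction to be re-associated and vectorized,
    // without enabling -ffast-math for the whole translation unit.
    #pragma omp simd reduction(+ : res)
    for (int i = 0; i < 64; ++i)
        res += lhs_arr[i] * rhs_arr[i];
    sum += res;
}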
