Some of the suggestions were already mentioned in the comments, but I will try to suggest something in addition to that.
Most SIMD floating point instructions on Haswell and newer CPUs have a reciprocal throughput lower than their latency, which means performance can be improved if multiple independent instructions execute in parallel. For example, according to Agner Fog's instruction tables, vaddpd has a latency of 3 and a reciprocal throughput of 1 clock cycle on Haswell, meaning that the CPU can potentially execute 3 such instructions in parallel. Even more vmulpd instructions can be executed in parallel, given its latency of 5 and reciprocal throughput of 0.5 clock cycles.
Your code likely does not leverage this instruction-level parallelism (ILP) because each loop iteration depends on the accumulator value that was updated on the previous iteration. The compiler cannot fix this for you: it is not allowed to perform many optimizations, such as reordering math operations on FP numbers, because that could produce mathematically different results. As a result, your algorithm becomes latency-bound.
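To make the latency-bound pattern concrete, here is a minimal sketch of a single-accumulator loop (my illustration, not your exact code; it assumes n is a multiple of 4): every addition has to wait for the previous one to complete, so the extra throughput of the FP units goes unused.

#include <immintrin.h>
#include <cstddef>

// One long dependency chain through `acc`: the loop runs at roughly one
// vector addition per latency (3 cycles on Haswell), not per throughput.
double dot_single_chain(const double* lhs, const double* rhs, std::size_t n)
{
    __m256d acc = _mm256_setzero_pd();
    for (std::size_t i = 0; i < n; i += 4)  // n assumed to be a multiple of 4
    {
        __m256d prod = _mm256_mul_pd(_mm256_loadu_pd(lhs + i),
                                     _mm256_loadu_pd(rhs + i));
        acc = _mm256_add_pd(acc, prod);     // depends on the previous iteration
    }
    __m128d lo = _mm_add_pd(_mm256_castpd256_pd128(acc),
                            _mm256_extractf128_pd(acc, 1));
    lo = _mm_add_pd(lo, _mm_unpackhi_pd(lo, lo));
    return _mm_cvtsd_f64(lo);
}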
You could mitigate this by using compiler-specific options, like -ffast-math for gcc and compatible compilers, which allow the compiler to reorder and interleave the additions, but it is more portable to just rewrite your algorithm with this parallelism in mind.
I will also incorporate other recommendations in the code below, such as:
- Fix the incorrect vector type, which should be __m256d.
- Use the dedicated load intrinsics to load whole vectors from memory instead of relying on the compiler to optimize _mm256_set_pd.
- Use FMA intrinsics instead of relying on the compiler to fuse the _mm256_mul_pd + _mm256_add_pd pair. The FMA instruction reduces the latency of the calculation, making the addition effectively free. FMA also produces more precise results, because there is no rounding between the multiplication and the addition. Note that FMA is a separate instruction set extension that on Intel CPUs arrived together with AVX2 (Haswell); it is not available on CPUs that only support the original AVX.
- Use the proper intrinsics to extract the final sum from the vector (these will probably be optimized away in the final assembly, since the double is kept in a vector register anyway).
#include <immintrin.h>
#include <cstddef>

// Computes the dot product of two 64-element arrays and adds it to `sum`.
void avx2_mul_64_block(double& sum, double* lhs_arr, double* rhs_arr) noexcept
{
    // Two independent accumulators allow two FMA dependency chains to
    // run in parallel.
    __m256d accumulator1 = _mm256_setzero_pd();
    __m256d accumulator2 = _mm256_setzero_pd();
    for (std::size_t block = 0; block < 64; block += 4 * 2)
    {
        __m256d lhs1 = _mm256_loadu_pd(lhs_arr + block);
        __m256d lhs2 = _mm256_loadu_pd(lhs_arr + block + 4);
        __m256d rhs1 = _mm256_loadu_pd(rhs_arr + block);
        __m256d rhs2 = _mm256_loadu_pd(rhs_arr + block + 4);
        accumulator1 = _mm256_fmadd_pd(lhs1, rhs1, accumulator1);
        accumulator2 = _mm256_fmadd_pd(lhs2, rhs2, accumulator2);
    }
    // Combine the two accumulators and reduce the vector to a scalar.
    accumulator1 = _mm256_add_pd(accumulator1, accumulator2);
    __m128d accumulator = _mm_add_pd(_mm256_castpd256_pd128(accumulator1),
                                     _mm256_extractf128_pd(accumulator1, 1));
    accumulator = _mm_add_pd(accumulator,
                             _mm_unpackhi_pd(accumulator, accumulator));
    sum += _mm_cvtsd_f64(accumulator);
}
In the code above, I used two separate accumulators, so the CPU is now able to execute two chains of accumulation in parallel. Further increasing parallelism could be beneficial (see the performance numbers mentioned above), but it may be more problematic if the block length is not divisible by the number of accumulators times the number of elements in the vector. You may need to set up tail processing, and that may have some slight performance overhead.
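If you do need to handle arbitrary lengths, one possible shape of that tail handling (a sketch of mine with made-up names, not code from the question) is to run the unrolled loop over the full 8-element chunks and finish the leftovers with a scalar loop:

#include <immintrin.h>
#include <cstddef>

// Hypothetical generalization: process full 8-element chunks with two
// accumulators, then finish the remainder with plain scalar math.
double dot_product(const double* lhs, const double* rhs, std::size_t n) noexcept
{
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
    {
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(lhs + i),
                               _mm256_loadu_pd(rhs + i), acc1);
        acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(lhs + i + 4),
                               _mm256_loadu_pd(rhs + i + 4), acc2);
    }
    acc1 = _mm256_add_pd(acc1, acc2);
    __m128d acc = _mm_add_pd(_mm256_castpd256_pd128(acc1),
                             _mm256_extractf128_pd(acc1, 1));
    acc = _mm_add_pd(acc, _mm_unpackhi_pd(acc, acc));
    double sum = _mm_cvtsd_f64(acc);
    for (; i < n; ++i)          // scalar tail for the leftover elements
        sum += lhs[i] * rhs[i];
    return sum;
}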
Note that, as mentioned before, this algorithm may produce results that are not strictly equal to the original because of the different order of arithmetic operations and the use of FMA, and hence a different accumulation of rounding error. However, this is often not a problem, especially given the high precision of double.
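A trivial usage sketch (assuming the function above is in scope; the test values are made up) that also shows the kind of last-bit difference against a plain scalar loop:

#include <cstdio>

int main()
{
    double lhs[64], rhs[64];
    for (int i = 0; i < 64; ++i)
    {
        lhs[i] = 0.5 * i;
        rhs[i] = 64.0 - i;
    }

    double simd_sum = 0.0;
    avx2_mul_64_block(simd_sum, lhs, rhs);   // function from the answer above

    double ref_sum = 0.0;
    for (int i = 0; i < 64; ++i)
        ref_sum += lhs[i] * rhs[i];

    // The two results may differ in the last bits due to FMA and the
    // different summation order.
    std::printf("%.17g\n%.17g\n", simd_sum, ref_sum);
}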
Some of the suggestions that were not used in the code above:
- Horizontal addition (_mm256_hadd_pd) was not used to accumulate the final sum because on current Intel (up to Coffee Lake) and AMD (up to Zen 2) processors the vhaddpd instruction has slightly higher latency than the vunpckhpd + vaddpd pair, even though the latter forms a dependency chain. This may change in future processors, and horizontal addition may then become beneficial. Horizontal addition can still be useful on current CPUs to save some code size, though (a sketch of that variant follows this list).
- The code uses unaligned loads from memory, as opposed to _mm256_load_pd, because on modern CPUs that support AVX2 unaligned loads don't have a performance penalty as long as the addresses are actually aligned at run time, or at least don't cross a cache line boundary. There is overhead when a cache line and especially a page boundary is crossed, but usually this still doesn't degrade performance too much on modern CPUs (there is some performance data in this post by Peter Cordes, as well as the materials linked there). If you can guarantee 32-byte alignment of the arrays, aligned loads are shown in the second sketch below.
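For completeness, a minimal sketch (my illustration, not part of the original code) of the more compact _mm256_hadd_pd-based reduction mentioned in the first point above:

#include <immintrin.h>

// Alternative horizontal sum using vhaddpd: shorter code, but slightly
// higher latency on current CPUs than the unpack + add sequence above.
double hsum_hadd(__m256d v)
{
    // hadd within each 128-bit lane: {v0+v1, v0+v1, v2+v3, v2+v3}
    __m256d t = _mm256_hadd_pd(v, v);
    // add the two 128-bit lanes and take the lowest element
    __m128d lo = _mm256_castpd256_pd128(t);
    __m128d hi = _mm256_extractf128_pd(t, 1);
    return _mm_cvtsd_f64(_mm_add_pd(lo, hi));
}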
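And a small sketch of the aligned-load alternative, assuming you control allocation and can guarantee 32-byte alignment (the array name is made up):

#include <immintrin.h>

// With a 32-byte aligned array, _mm256_load_pd may be used instead of
// _mm256_loadu_pd; it faults on a misaligned address, so only do this
// when the alignment is guaranteed.
alignas(32) static double aligned_lhs[64];

__m256d load_first_chunk()
{
    return _mm256_load_pd(aligned_lhs);
}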
For reference, the comments mentioned at the beginning of this answer:
- […] __m256d, not __m256i. Secondly, with properly aligned input arrays (32-byte alignment), you should be using _mm256_load_pd to get the data. Otherwise, you're relying on the compiler to understand that this is what those _mm256_set_pd calls were trying to achieve, and it probably won't. As a minor improvement, you could consider using horizontal additions for the final sum (using _mm256_hadd_pd). – Contessacontest
- _mm256_hadd_pd is not particularly useful for an efficient horizontal sum of a single vector, unless you're optimizing for code size over speed. But yes, definitely use the load or loadu intrinsics with __m256d, and yes, the hsum could be better (see "Get sum of values stored in __m256d with SSE/AVX"). – Rules
- _mm256_set_pd will put the first argument into the highest part of the target register, making it hard for a compiler to replace it by a simple load/loadu (_mm256_setr_pd does not reverse the order). – Creamcolored
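To illustrate the last comment's point about argument order, a small sketch of mine:

#include <immintrin.h>

// All three vectors below hold the same value; note the reversed argument
// order needed with _mm256_set_pd.
void set_order_example(const double* data)  // data points to at least 4 doubles
{
    __m256d a = _mm256_set_pd(data[3], data[2], data[1], data[0]);   // highest element first
    __m256d b = _mm256_setr_pd(data[0], data[1], data[2], data[3]);  // memory order
    __m256d c = _mm256_loadu_pd(data);                               // plain unaligned load
    (void)a; (void)b; (void)c;
}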