One main issue is that sum is a long double-typed variable. Operations on this type cannot be vectorized using SIMD instructions, at least not on x86-64 CPUs with Clang or GCC. AFAIK, other architectures like ARM do not natively support such an extended-precision type, but compilers like GCC and Clang still support a >64-bit long double precision by default thanks to a kind of software emulation (i.e. generating a sequence of many instructions). Other compilers, like MSVC, do not seem to support extended precision (i.e. long double is 64-bit wide and operations can be vectorized). On compilers supporting native x86-64 extended precision, the long double type is meant to use the (obsolete) x87 FPU operating on 80-bit wide numbers. In practice, the long double type can be even wider in memory due to alignment constraints in the target ABI. There is no 80-bit wide SIMD instruction in x86: SIMD registers have a power-of-two size (typically 128-bit or 256-bit).
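To make this concrete, here is a minimal sketch (hypothetical code, not the code from the question) contrasting the two accumulator types; only the second loop can be turned into packed SIMD additions, and even then only with the relaxed floating-point flags discussed below:

#include <cstddef>

// Accumulating into a long double forces scalar x87 (or software-emulated)
// operations, while a plain double accumulator maps directly to SIMD lanes.
long double sum_ld(const double* x, std::size_t n) {
    long double s = 0.0L;                 // 80-bit extended precision
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];                        // cannot be vectorized
    return s;
}

double sum_d(const double* x, std::size_t n) {
    double s = 0.0;                       // 64-bit IEEE-754 double
    for (std::size_t i = 0; i < n; ++i)
        s += x[i];                        // vectorizable (with the right flags)
    return s;
}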
The loop is not vectorized due to the min-max operations that cause loop dependency issues. This can be seen with -Rpass-analysis=loop-vectorize. By the way, please note that -fopenmp-simd is needed for Clang not to ignore OpenMP SIMD pragmas. In practice, the compiler reports the following messages:
remark: loop not vectorized: value that could not be identified as reduction is used outside the loop
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
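For reference, here is a hypothetical loop of roughly the shape discussed (the function and variable names are mine, not from the question), compiled with something like clang++ -O3 -mavx2 -mfma -fopenmp-simd -Rpass-analysis=loop-vectorize to get such a report:

#include <cstddef>

// Hypothetical reduction loop: the +, min and max reductions are the ones
// the vectorizer complains about without relaxed floating-point semantics.
void stats(const double* x, std::size_t n,
           double& sum, double& mn, double& mx) {
    double s = 0.0, lo = mn, hi = mx;
    #pragma omp simd reduction(+:s) reduction(min:lo) reduction(max:hi)
    for (std::size_t i = 0; i < n; ++i) {
        s += x[i];
        lo = x[i] < lo ? x[i] : lo;       // min update
        hi = x[i] > hi ? x[i] : hi;       // max update
    }
    sum = s; mn = lo; mx = hi;
}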
The dependency issue can easily be removed using -ffast-math. However, this flag breaks the IEEE-754 standard (see this post for more information), so it may produce an unsafe program, especially if you operate on NaN/Inf values. -ffast-math is a composite flag enabling others like -fassociative-math, which is the one needed here. However, it is not enough in this case, and AFAIK it is already implied by the OpenMP SIMD pragma anyway.
The minimal set of options required for enabling the vectorization seems to be -fno-signed-zeros -ffinite-math-only. In fact, the minimal Clang-specific options are -fno-signed-zeros -Xclang -menable-no-nans. With these additional options (and -O3 -mavx2 -mfma on my machine), Clang produces the following (efficient) assembly code:
.LBB0_5: # =>This Inner Loop Header: Depth=1
vmovupd ymm8, ymmword ptr [rdi + r9]
vmovupd ymm9, ymmword ptr [rdi + r9 + 32]
vaddpd ymm1, ymm8, ymm1
vaddpd ymm7, ymm9, ymm7
vmulpd ymm10, ymm8, ymm8
vaddpd ymm0, ymm10, ymm0
vmulpd ymm10, ymm9, ymm9
vaddpd ymm6, ymm10, ymm6
vminpd ymm3, ymm8, ymm3
vminpd ymm4, ymm9, ymm4
vmaxpd ymm2, ymm8, ymm2
vmaxpd ymm5, ymm9, ymm5
add r9, 64
cmp r8, r9
jne .LBB0_5
This means the input array x must not contain any -0.0 or NaN values. The former is generally not much of an issue. However, the latter is apparently an issue, since you explicitly check for the presence of NaN values with xi == xi. Besides, one can see that the compiler did not check for the presence of NaN values in the loop (as expected with the above compiler flags).
The only solution I see to support NaN values is to unroll the loop yourself and let the SLP vectorizer merge the instructions (of each chunk). This is a bit cumbersome to do and not very flexible (one needs to choose a SIMD vector size and take care of the last chunk), but it works well with Clang (see the sketch below). Using an intermediate SIMD-friendly function and #pragma omp declare simd may help to avoid the manual vectorization, but I doubt this will fix the issue with the NaNs.
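Below is a rough sketch of this manual unrolling (hypothetical code: the Stats struct, the stats_unrolled name and the chunk width of 4 are my own choices, not anything from the question). Each of the four lanes keeps its own partial sum, sum of squares, minimum and maximum, which gives the SLP vectorizer straight-line, independent operations to merge, while the NaN check stays explicit:

#include <cstddef>
#include <limits>

struct Stats { double sum, sum2, mn, mx; bool has_nan; };

Stats stats_unrolled(const double* x, std::size_t n) {
    constexpr std::size_t W = 4;                 // assumed SIMD width (4 doubles = 256 bits)
    double s[W] = {}, s2[W] = {};
    double lo[W], hi[W];
    bool nan = false;
    for (std::size_t k = 0; k < W; ++k) {
        lo[k] =  std::numeric_limits<double>::infinity();
        hi[k] = -std::numeric_limits<double>::infinity();
    }
    std::size_t i = 0;
    for (; i + W <= n; i += W) {                 // full chunks of W elements
        for (std::size_t k = 0; k < W; ++k) {    // straight-line body for the SLP vectorizer
            double v = x[i + k];
            nan = nan | (v != v);                // branchless NaN check
            s[k]  += v;
            s2[k] += v * v;
            lo[k]  = v < lo[k] ? v : lo[k];
            hi[k]  = v > hi[k] ? v : hi[k];
        }
    }
    Stats r{0.0, 0.0, lo[0], hi[0], nan};
    for (std::size_t k = 0; k < W; ++k) {        // combine the W lanes
        r.sum  += s[k];
        r.sum2 += s2[k];
        r.mn    = lo[k] < r.mn ? lo[k] : r.mn;
        r.mx    = hi[k] > r.mx ? hi[k] : r.mx;
    }
    for (; i < n; ++i) {                         // scalar tail for the last chunk
        double v = x[i];
        r.has_nan = r.has_nan | (v != v);
        r.sum  += v;
        r.sum2 += v * v;
        r.mn    = v < r.mn ? v : r.mn;
        r.mx    = v > r.mx ? v : r.mx;
    }
    return r;
}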
1/0 and -1/0 cause undefined behaviour because they are integers. AFAIK, 1.0/0.0 should not, but it is better to use std::numeric_limits<double>::infinity() in C++ or just a macro like INFINITY. – Kramlich
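For completeness, a small sketch of the initializations the comment suggests (variable names are mine):

#include <cmath>    // INFINITY macro
#include <limits>

const double pos_inf = std::numeric_limits<double>::infinity();   // preferred in C++
const double neg_inf = -std::numeric_limits<double>::infinity();
const double pos_inf_macro = INFINITY;                             // C macro, also usable in C++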