OpenMP SIMD Multi-Reductions (Sum, Min and Max) in Single For Loop

I have the following loop to calculate basic summary statistics (mean, standard deviation, minimum and maximum) in C++, skipping missing values (x is a double vector):

int k = 0;
long double sum = 0.0, sq_sum = 0.0;
double minv = 1.0/0.0, maxv = -1.0/0.0;

#pragma omp simd reduction(+:sum,sq_sum,k) reduction(max:maxv) reduction(min:minv)
for (int i = 0; i < n; ++i) {
  double xi = x[i];
  int tmp = xi == xi;
  sum += tmp ? xi : 0.0;
  sq_sum += tmp ? xi * xi : 0.0;
  k += tmp ? 1 : 0;
  minv = minv > xi ? xi : minv;
  maxv = maxv < xi ? xi : maxv;
}

This loop does not vectorize on clang 16, which has something to do with the min and max values. It vectorizes without min and max. Is there a smart way to vectorize this loop?

Intoxicated answered 22/9, 2023 at 19:54 Comment(3)
If you are not tied to OpenMP, note that there are (native) SIMD-friendly languages that can be used to efficiently vectorize this loop very well (like ISPC: see the result) while properly handling -0.0, NaNs and infinities (and generally producing faster code with some compiler-flag tuning).Kramlich
Unfortunately I'm tied to OpenMP...Intoxicated
Note that 1/0 and -1/0 cause undefined behaviour because they are integer divisions. AFAIK, 1.0/0.0 does not, but it is better to use std::numeric_limits<double>::infinity() in C++, or just a macro like INFINITY.Kramlich

One main issue is that sum is a long double-typed variable. Operations on this type cannot be vectorized using SIMD instructions, at least not on x86-64 CPUs with Clang or GCC. AFAIK, other architectures like ARM do not natively support such an extended-precision type, but compilers like GCC and Clang still support >64-bit long double precision by default thanks to a kind of software emulation (i.e. generating a sequence of many instructions). Other compilers, like MSVC, do not seem to support extended precision (i.e. long double is 64-bit wide and operations on it can be vectorized). On compilers supporting native x86-64 extended precision, the long double type is meant to use the (obsolete) x87 FPU operating on 80-bit wide numbers. In practice, the long double type can be wider in memory due to alignment constraints in the target ABI. There is no 80-bit wide SIMD instruction in x86; SIMD registers have a power-of-two size (typically 128-bit or 256-bit).

The loop is not vectorized due to the min/max operations, which cause loop-dependency issues. This can be seen with -Rpass-analysis=loop-vectorize. By the way, please note that -fopenmp-simd is needed for Clang not to ignore OpenMP SIMD pragmas. In practice, the compiler reports the following messages:

remark: loop not vectorized: value that could not be identified as reduction is used outside the loop
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

The dependency issue can easily be removed using -ffast-math. However, this flag breaks the IEEE-754 standard (see this post for more information), so it may produce an unsafe program, especially if you operate on NaN/Inf values. -ffast-math is a composite flag enabling others like -fassociative-math, which is the one needed here. However, it is not enough in this case, and it is AFAIK already implied by the OpenMP SIMD pragma.

The minimal set of options required for enabling the vectorization seems to be -fno-signed-zeros -ffinite-math-only. In fact, the minimal Clang-specific options are -fno-signed-zeros -Xclang -menable-no-nans. With these additional options (and -O3 -mavx2 -mfma on my machine), Clang produces the following (efficient) assembly code:

.LBB0_5:                                # =>This Inner Loop Header: Depth=1
        vmovupd ymm8, ymmword ptr [rdi + r9]
        vmovupd ymm9, ymmword ptr [rdi + r9 + 32]
        vaddpd  ymm1, ymm8, ymm1
        vaddpd  ymm7, ymm9, ymm7
        vmulpd  ymm10, ymm8, ymm8
        vaddpd  ymm0, ymm10, ymm0
        vmulpd  ymm10, ymm9, ymm9
        vaddpd  ymm6, ymm10, ymm6
        vminpd  ymm3, ymm8, ymm3
        vminpd  ymm4, ymm9, ymm4
        vmaxpd  ymm2, ymm8, ymm2
        vmaxpd  ymm5, ymm9, ymm5
        add     r9, 64
        cmp     r8, r9
        jne     .LBB0_5

This means the input array x must not contain any -0.0 or NaN values. The former is generally not much of an issue. The latter, however, clearly is here, since you explicitly check for NaN values with xi == xi. Besides, one can see that the compiler did not check for NaN values in the loop (as expected with the above compiler flags).

The only solution I see to support NaN values is to unroll the loop yourself and let the SLP vectorizer merge the instructions of each chunk. This is a bit cumbersome and not very flexible (one needs to choose a SIMD vector size and take care of the last chunk), but it works well with Clang. Using an intermediate SIMD-friendly function and #pragma omp declare simd may help avoid the manual vectorization, but I doubt it will fix the issue with NaNs.

Cyrilla answered 23/9, 2023 at 12:44 Comment(15)
Thanks a lot for the detailed answer! I see the long double issue, but I'm using ARM64, so that was not the problem for me. Thanks for pointing out, though, that this type does not exist on ARM. The code is meant to be cross-platform compatible, so unfortunately I cannot set dangerous compiler flags. As indicated, I do get a vectorized version of the code, including the missing-value check, without min and max. I see the issue of loop dependencies; I think with multithreading it should work if each thread has its own copy of minv and maxv, but I guess SIMD does not really create copies.Intoxicated
The min/max operations are harder to vectorize because of the conditionals, which are affected by the presence of a NaN and can completely change the result of the long dependency chain (whose length is not known at compile time). For example, a NaN in minv causes a cascade of NaN results, while a NaN in xi does not impact minv. This puts significant pressure on the optimizer (e.g. the mathematical recurrence check). Theoretically the dependency chain should be broken by the OpenMP pragma, but not all OMP information is provided to optimizers (e.g. most of it is not in GCC).Kramlich
This is why manual unrolling can solve this issue: it manually breaks the dependency chain, so it is easy for the SLP vectorizer to generate SIMD instructions. I tried with an x2 unrolling and Clang successfully vectorized it. Not all compilers are able to do that, but if you use basic loops, I expect mainstream optimizing compilers to generate good code. Note that there are pragmas to specify this to compilers (ivdep), but they are non-portable, and the OpenMP pragma is supposed to do that anyway.Kramlich
Regarding multithreading versus SIMD, copies are possible in SIMD: they are called broadcasts. This is a very common operation. AFAIR, this is the default behaviour in OpenMP. By the way, minv and maxv are initialized to 0, so minv cannot be bigger than 0 and maxv cannot be smaller than 0. Is this expected? In summary, this is IMHO a missed optimization of the compiler that may be fixed later.Kramlich
Ok, thanks for clarifying. I'll see what can be done here, but manual vectorization will likely not be portable. Regarding NaN, I'm not sure I understand it: minv and maxv cannot be NaN, as any comparison with NaN evaluates to FALSE. I'd expect the compiler to understand this.Intoxicated
Ahh yes, I should edit this. Actually minv is initialized to positive infinity and maxv to negative infinity.Intoxicated
Yes, an optimizing compiler should see that (I expected Clang to do it), but this is not as simple as it seems, especially due to the dependency chain of conditionals. One can track the hundreds of optimization IR passes to find which one fails, but this is long, tedious work (and the result is generally that there is a missed optimization in a given pass or that the pass order is not optimal)...Kramlich
Regarding manual vectorization, my point is not to use SIMD intrinsics, but just to add inner loops operating on chunks. This should generate good code on mainstream architectures (it may just be sub-optimal on future or very old ones). The inner loop will be unrolled by Clang and SLP-vectorized (the chunk size must be a constant). Note that there is a missed optimization in Clang preventing some auto-vectorization with inner loops like that (I reported the issue to Clang developers long ago, but it has not received much attention so far). One has to try this solution to be sure.Kramlich
For Windows, at least with MSVC, long double is the same format as double (IEEE binary64), so it's useless but doesn't prevent vectorization. (Like gcc -mlong-double-64). I think some of the non-x86 platforms that don't have HW support for anything wider than double also use long double = double, so you might get vectorization on AArch64 for example.Vociferous
@PeterCordes Interesting. I thought MSVC supported it, since I remember it did (at least >10 years ago when I did some tests). It turns out they indeed only support a 64-bit long double. Surprisingly, it turns out GCC and Clang use a 128-bit long double on both ARM and x86 (even though the x87 FPU is 80-bit wide and ARM AFAIK has no scalar >64-bit FP instructions)! On all platforms, long double and double are recognized as different types (even when they have the same size). See: godbolt.org/z/319dParEc. (I assume the OS does not impact the behaviour of compilers)Kramlich
GCC/clang's default on x86-64 is -mlong-double-80. The x86-64 System V ABI has long double as the 80-bit x87 format with alignof(long double)==16, thus sizeof(long double) == 16 as well. In i386 SysV, it only has alignof() = 4, so a sizeof() = 12. (Size has to be a multiple of the alignment so you can make an array without violating the alignment guarantee in any element.) x86-64 SysV's choice to align it by 16 was probably excessive, but it does mean that it never splits across a wider boundary. It's already slow to load/store on real hardware (including at the time), though.Vociferous
Re: OS: Yes, surprisingly I think even MinGW GCC targeting Windows defaults to -mlong-double-80, which is somewhat weird because that's not ABI-compatible with MSVC. But that's just based on what people have reported in SO comments; I haven't tried it.Vociferous
On AArch64, I presume long double is _Float128 / __float128 (gcc.gnu.org/onlinedocs/gcc/Floating-Types.html). GCC calls a helper function to multiply or add. e.g. __multf3, tetra-word float. Thanks for checking that AArch64 didn't just use long double = double.Vociferous
@PeterCordes I edited the answer to include more information about the long double. Thank you.Kramlich
If you really need the 128b long double, cpufun.substack.com/p/portable-support-for-128b-floats may interest you.Zeiler
