How to avoid floating point exceptions in unused SIMD lanes
C

I like to run my code with floating point exceptions enabled. I do this under Linux using:

feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW );

So far so good.

The issue I am having is that the compiler (I use clang 8) sometimes decides to use SIMD instructions to do a scalar division. Fine: if that is faster, even for a single scalar, why not.

But the result is that an unused lane in the SIMD register can contain a zero.

And when the SIMD division is executed, a floating point exception is thrown.

Does that mean that floating point exceptions cannot be used at all if you allow the compiler to use sse/avx extensions?

In my case, this line of C code:

float a0, min, a, d;
...
a0 = (min - a) / (d);

...is executed as:

divps  %xmm2,%xmm3

Which then throws a:

Thread 1 "noisetuner" received signal SIGFPE, Arithmetic exception.
Cowberry answered 28/7, 2020 at 1:51 Comment(5)
Does clang have an equivalent for GCC's -ftrapping-math to make FP exceptions a visible side-effect? (Note that GCC's version of that option is on by default, but is actually broken: it fails to stop GCC from doing some optimizations that change the number or type of FP exceptions, possibly including from 0 to non-zero IIRC.) - Yorick
clang doesn't complain when I feed it -ftrapping-math, but that doesn't fix it. To stop the FPE, I have to supply the -mno-mmx -mno-sse arguments. - Cowberry
File a bug report. - Paranoia
Are you sure it generates a divps and not a divss? Can you provide a minimal reproducible example? - Autoharp
@Autoharp Not OP, but it's very easy to repro, see here: godbolt.org/z/Wd98eG - Everard

I think you have found a bug in clang or maybe in llvm.

Here’s how I reproduced it. Clang 10.0 emits the same code, i.e. it has that bug as well. That vdivps instruction has valid data only in the first 2 lanes of the vectors; in the upper 2 lanes it computes 0.0 / 0.0, so you get a runtime exception if you unmask these exceptions in the MXCSR register, as you are doing.
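To make the lane problem concrete, here is a minimal sketch, assuming x86-64 and a compiler that ships `<xmmintrin.h>`. The function names `packed_div_flags` and `scalar_div_flags` are made up for illustration. Instead of trapping, the code reads the sticky exception flags from MXCSR, so it shows the same effect without a SIGFPE. The volatile copies are only there to discourage the compiler from folding or narrowing the divide.

```c
#include <xmmintrin.h>   /* SSE intrinsics and MXCSR helper macros (x86 only) */

/* Packed divide: lane 0 holds num / den, lanes 1..3 hold 0.0f / 0.0f.
   Returns the MXCSR exception flags raised by the operation. */
unsigned packed_div_flags(float num, float den, float *result)
{
    volatile float n[4] = { num, 0.0f, 0.0f, 0.0f };
    volatile float d[4] = { den, 0.0f, 0.0f, 0.0f };
    volatile float sink[4];
    float a[4], b[4], out[4];
    int i;

    _MM_SET_EXCEPTION_STATE(0);              /* clear the sticky flags */
    for (i = 0; i < 4; ++i) {                /* runtime values, not constants */
        a[i] = n[i];
        b[i] = d[i];
    }
    _mm_storeu_ps(out, _mm_div_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    for (i = 0; i < 4; ++i)                  /* keep all four lanes live */
        sink[i] = out[i];

    *result = sink[0];
    return _MM_GET_EXCEPTION_STATE();
}

/* Scalar divide: on x86-64 this is normally a single-lane divss. */
unsigned scalar_div_flags(float num, float den, float *result)
{
    volatile float n = num, d = den, q;

    _MM_SET_EXCEPTION_STATE(0);
    q = n / d;
    *result = q;
    return _MM_GET_EXCEPTION_STATE();
}
```

Under those assumptions, calling `packed_div_flags(6.0f, 3.0f, &r)` should return a value with `_MM_EXCEPT_INVALID` set even though the only lane anyone cares about is a perfectly valid 6 / 3, while `scalar_div_flags` with the same inputs should not. With FE_INVALID unmasked, that same flag becomes the SIGFPE from the question.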

Microsoft, Intel and gcc don’t emit divps for that code. If you can, switch to gcc and it should be good.

Update: Clang 10+ has an option controlling such optimizations, -ffp-exception-behavior=maytrap, take a look: https://godbolt.org/z/WG7bEE
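As a rough sketch of the kind of code this flag matters for: the function below is a hypothetical example (the name `ratio_sum` and its body are assumptions, not the code behind the Godbolt link). It has two independent float divisions, which an optimizer may choose to pack into the low two lanes of one divps. Whether a given clang version actually does so is not guaranteed.

```c
/* Hypothetical reproduction shape: two independent scalar divisions.
   An optimizer may pack them into lanes 0 and 1 of a single divps,
   leaving lanes 2 and 3 with whatever happens to be in the registers. */
float ratio_sum(float a, float b, float c, float d)
{
    float x = a / c;
    float y = b / d;
    return x + y;
}
```

One way to check is to compile it twice, for example with `clang -O2 -S` and with `clang -O2 -ffp-exception-behavior=maytrap -S`, and compare whether the output contains divps or two divss instructions.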

Everard answered 28/7, 2020 at 20:39 Comment(4)
Note that gcc misses that optimization even with -fno-trapping-math and #pragma STDC FENV_ACCESS OFF: godbolt.org/z/bGoe1n. So unfortunately even when you do want the optimization, you can't get it with GCC. (Even with -ffast-math, actually). Ironically, storing x and y to a dst[0] and dst[1] output array (making divps an even better optimization, no shuffle needed) defeats both clang and GCC's auto-vectorizer. - Yorick
Looks like clang could easily avoid this by replacing both movsd by movddup (which at least on not too old architectures has the same port usage). - Autoharp
@chtz: oh good point, yes on Nehalem and later (and some AMD I think/hope), movddup is a pure load with the broadcast handled by the load port, no vector ALU. That would require special exception-safe vectorization support to look for, which may not succeed enough of the time to be worth looking for (given the cost in compile time). - Yorick
The -ffp-exception-behavior=maytrap flag makes this issue go away on clang-10. Thanks. - Cowberry
