Generally nobody unmasks FP exceptions; if you did, you'd need shuffles to e.g. duplicate one of the elements so the top element is doing the same division as one of the other elements, or to put some other known-safe value there.
Maybe you can get away with only shuffling the divisor, if you can assume the dividend is non-NaN in that element.
With AVX512 you could suppress exceptions for an element using zero-masking, but until then there's no such feature. Also AVX512 lets you override the rounding mode + Suppress All Exceptions (SAE) without masking, so you could make nearest-even explicit to get SAE. But that suppresses exceptions for all elements.
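For example, a minimal sketch in C with Intel intrinsics (the helper names and the choice of duplicating lane 0 are mine, just for illustration):

```c
#include <immintrin.h>

// 3 useful lanes, unknown garbage in lane 3 of both inputs.
static inline __m128 div3_dup_lane0(__m128 num, __m128 den) {
    // Duplicate lane 0 into lane 3 of both operands, so the top lane repeats
    // exactly the same division as lane 0 instead of dividing garbage.
    __m128 n = _mm_shuffle_ps(num, num, _MM_SHUFFLE(0, 2, 1, 0));
    __m128 d = _mm_shuffle_ps(den, den, _MM_SHUFFLE(0, 2, 1, 0));
    return _mm_div_ps(n, d);
}

#ifdef __AVX512VL__
static inline __m128 div3_maskz(__m128 num, __m128 den) {
    // AVX-512 zero-masking: lane 3 isn't computed at all, so it can't raise
    // an exception; it just becomes 0.0 in the result.
    return _mm_maskz_div_ps(0x7, num, den);
}
#endif
```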
Seriously, don't enable FP exceptions. Compilers barely know how (or simply don't know how) to optimize in a way that's safe when the number of exceptions raised is a visible side effect. For example, GCC's `-ftrapping-math` is on by default, but it's broken.
I wouldn't assume LLVM is any better; its default FP strictness probably still allows optimizations that could give one SIGFPE where the source would have raised 2 or 4. Maybe even optimizations that raise 0 exceptions where the source would raise 1, or vice versa, like GCC's broken and near-useless default.
Enabling FP exceptions might be useful for debugging, though, if you expect to never have any of a certain kind of exception. But you can probably deal with the occasional false positive from a SIMD instruction by ignoring ones with that source address.
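If you do go that route for debugging on Linux/glibc, it might look like this (hedged sketch; `feenableexcept` is a GNU extension, not portable ISO C, and the function name here is my own):

```c
#define _GNU_SOURCE      // for feenableexcept()
#include <fenv.h>

// Unmask the exception kinds you believe should never fire, so they trap (SIGFPE).
// Leave inexact / underflow masked; those fire constantly in normal FP code.
void enable_fp_traps_for_debugging(void) {
    feenableexcept(FE_INVALID | FE_DIVBYZERO);
}
```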
If there's a tradeoff between performance and exception-correctness, most users of a library would rather that it maximized performance.
Even clearing and then checking the sticky FP masked-exception flags with `<fenv.h>` stuff is rarely done, and requires controlled circumstances to make use of. I wouldn't have any expectations about the flag state after a library function call, especially not one that uses any SIMD.
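For reference, that fenv pattern looks roughly like this (sketch using standard C99 `<fenv.h>`; the wrapper is hypothetical, and you'd also need strict-FP / `FENV_ACCESS` semantics so the compiler doesn't move FP ops across the fenv calls):

```c
#include <fenv.h>
#include <stdio.h>

// Run `work` with freshly cleared sticky flags, then inspect one of them.
void check_divbyzero_around(void (*work)(void)) {
    feclearexcept(FE_ALL_EXCEPT);       // clear all sticky masked-exception flags
    work();                             // the code under test
    if (fetestexcept(FE_DIVBYZERO))     // query whichever flags you care about
        puts("a division by zero happened somewhere in work()");
}
```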
Avoid subnormals in the garbage element
You can get slowdowns from subnormals (aka denormals) if MXCSR doesn't have FTZ and DAZ set, i.e. the normal case unless you compiled with (the Rust equivalent of) `-ffast-math`.
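If you don't want a full fast-math build but do want that behaviour, you can set the two MXCSR bits yourself, per thread, much like a fast-math build's startup code does. A sketch using the standard x86 intrinsic-header macros (the function name is mine):

```c
#include <immintrin.h>   // _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE

// Set FTZ + DAZ in this thread's MXCSR.
void set_ftz_daz(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);             // FTZ: flush subnormal *results* to 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);     // DAZ: treat subnormal *inputs* as 0
}
```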
Producing a NaN or +-Inf takes no extra time on typical x86 hardware with SSE / AVX instructions. (Fun fact: NaN is slow, too, with legacy x87 math, even on modern HW.) So it's safe to `_mm_or_ps` with a `cmpps` result to create a NaN in some elements of a vector before a math operation, for example, or to `_mm_and_ps` to create some zeros in the divisor before division.
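A sketch of both tricks together (the constant masks and the helper name are mine; in real code the masks might come from a `cmpps` result instead of constants):

```c
#include <immintrin.h>

// 3 useful lanes in num/den, garbage in lane 3 of both.
static inline __m128 div3_sanitized(__m128 num, __m128 den) {
    const __m128 low3 = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1)); // lanes 0..2 all-ones
    const __m128 top1 = _mm_castsi128_ps(_mm_set_epi32(-1, 0, 0, 0));   // lane 3 all-ones (a NaN)
    __m128 n = _mm_and_ps(num, low3);   // lane 3 of the numerator becomes +0.0
    __m128 d = _mm_or_ps(den, top1);    // lane 3 of the divisor becomes NaN
    return _mm_div_ps(n, d);            // lane 3 computes 0.0 / NaN = NaN: harmless and not slow
}
```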
But be careful about what garbage is in your padding, because it could lead to spurious subnormals. `0.0` and NaN (e.g. the all-ones bit pattern) are generally safe.
Usually avoid horizontal stuff with SIMD. SIMD vec != geometry vec.
Using only 3 out of 4 elements of a SIMD vector is usually a bad idea, because it typically means you're using a single SIMD vector to hold a single geometry vector, instead of three vectors holding 4 `x` coords, 4 `y` coords, and 4 `z` coords.
Shuffles / horizontal stuff mostly costs extra instructions (except for broadcast loads of a scalar that was already in memory), but you often need a lot of shuffles if you're using SIMD this way. There are cases where you can't vectorize over an array of things, but you can still get a speedup with SIMD.
If you're just using this partial-vector stuff for the leftover elements of an odd-sized operation, then great: one partial vector is much better than 3 scalar iterations. But most people asking about using only 3 of 4 vector elements are asking because they're using SIMD wrong, e.g. adding two geometry vectors held as SIMD vectors is still cheap, but a dot product needs shuffles. See https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ for some nice material on how to use SIMD the right way (SoA vs. AoS and so on). If you already know about that and are just using 3-element vectors for the odd corner case, not for most of the work, then that's fine.
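As a sketch of the SoA idea (the arrays, function name, and signature are made up for illustration): with x/y/z in separate arrays, four dot products per iteration need only vertical multiplies and adds, no shuffles.

```c
#include <immintrin.h>
#include <stddef.h>

// Dot products of n 3D vectors (ax,ay,az) with (bx,by,bz); n a multiple of 4 here.
void dot3_soa(const float *ax, const float *ay, const float *az,
              const float *bx, const float *by, const float *bz,
              float *out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128 d = _mm_mul_ps(_mm_loadu_ps(ax + i), _mm_loadu_ps(bx + i));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(ay + i), _mm_loadu_ps(by + i)));
        d = _mm_add_ps(d, _mm_mul_ps(_mm_loadu_ps(az + i), _mm_loadu_ps(bz + i)));
        _mm_storeu_ps(out + i, d);   // 4 independent dot products per store
    }
}
```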
Padding to a multiple of the vector width is generally great for odd sizes, but another option for some algos is a final unaligned vector that ends at the end of your data. A partially-overlapping store is fine, unless it's an in-place algorithm and you have to worry about not doing an element twice. (Or about store-forwarding stalls even for idempotent operations like AND-masking or clamping).
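A sketch of that final-overlapping-vector idea, using clamping as the example since it's idempotent, so redoing a few elements is harmless (the function and the n >= 4 assumption are mine):

```c
#include <immintrin.h>
#include <stddef.h>

// Clamp n floats to [0, 1]; assumes n >= 4.
void clamp01(float *dst, const float *src, size_t n) {
    const __m128 lo = _mm_setzero_ps(), hi = _mm_set1_ps(1.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(dst + i, _mm_min_ps(hi, _mm_max_ps(lo, _mm_loadu_ps(src + i))));
    if (i < n) {
        // Final unaligned vector that ends exactly at the end of the data;
        // it partially overlaps the last full vector, which is fine for an
        // idempotent op like clamping (possible store-forwarding stall aside).
        __m128 v = _mm_loadu_ps(src + n - 4);
        _mm_storeu_ps(dst + n - 4, _mm_min_ps(hi, _mm_max_ps(lo, v)));
    }
}
```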
Getting zeros for free
If you had just 2 `float` elements left over, a `movsd` load will load + zero-extend into an XMM register. You might as well get the compiler to do that instead of a `movaps`.
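In intrinsics that's roughly (sketch; the helper name is mine, and the `double*` cast is the common idiom for getting a `movsd` load; if you're wary of strict aliasing, `memcpy` into a `double` and use `_mm_set_sd` instead):

```c
#include <immintrin.h>

// Load 2 floats and zero the upper 2 lanes: compiles to a single movsd load.
static inline __m128 load2_zero_high(const float *p) {
    return _mm_castpd_ps(_mm_load_sd((const double *)p));
}
```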
Otherwise, if shuffling together 3 scalars, `insertps` can zero elements. Or you might have known-zero high parts of XMM regs from `movss` loads from memory. So using a `0.0` as part of a vector-from-scalar initializer (like C++ `_mm_set_ps()`) can be free for the compiler.
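For example (sketch; the helper name is mine):

```c
#include <immintrin.h>

// Build [x, y, z, 0.0]; with x/y/z freshly loaded from memory the compiler can
// often use movss (high lanes already zero) + insertps, so the 0.0 costs nothing.
static inline __m128 make_vec3(float x, float y, float z) {
    return _mm_set_ps(0.0f, z, y, x);   // _mm_set_ps takes args high-to-low
}
```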
With AVX, you can consider using a masked load (https://www.felixcloutier.com/x86/vmaskmov) if you're worried about padding causing a subnormal. But that's somewhat slower than `vmovaps`, and masked stores are much more expensive on AMD, even on Ryzen.
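A sketch of that (needs AVX; the mask constant and helper name are mine):

```c
#include <immintrin.h>

// Load only the low 3 floats; the unselected lane 3 reads as 0.0 instead of
// whatever padding happens to be there. Compiles to vmaskmovps.
static inline __m128 load3_masked(const float *p) {
    const __m128i mask = _mm_set_epi32(0, -1, -1, -1);   // sign bit set = load that lane
    return _mm_maskload_ps(p, mask);
}
```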