"Safe" SIMD arithmetic on aligned vectors of odd size?
Let's say I have some 16-byte-aligned structure that just wraps a 3 x f32 array:

#[repr(C, align(16))]
pub struct Vector(pub [f32; 3]);

Now I want to divide two instances of it, like this:

use core::arch::x86_64;

let a = Vector([1f32, 2f32, 3f32]);
let b = Vector([4f32, 5f32, 6f32]);
let mut q = Vector([0f32, 0f32, 0f32]);

unsafe {
    let a1 = x86_64::_mm_load_ps(a.0.as_ptr());
    let b1 = x86_64::_mm_load_ps(b.0.as_ptr());
    let q1 = x86_64::_mm_div_ps(a1, b1);
    x86_64::_mm_store_ps(q.0.as_mut_ptr(), q1);
}

It does the division, but there's a problem: the 4th element contains garbage that, among other things, can be a signalling NaN. And if certain exception flags are unmasked, SIGFPE will be triggered. I want to avoid this somehow, without silencing the signal completely. I.e. I either want to silence it only for the 4th pair of elements, or put some sane values there. What is the best, fastest way to do this? Or is there a better approach in general?

Quadriplegia answered 8/10, 2019 at 6:39 Comment(0)
Generally nobody unmasks FP exceptions. If they are unmasked, you'd need shuffles to e.g. duplicate one of the elements so the top element does the same division as one of the other elements, or performs some other known-safe operation.
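
To sketch that "duplicate a lane" idea (the helper name and the use of unaligned loads are mine, not from the question):

```rust
use core::arch::x86_64::*;

// Sketch (not from the question): divide [f32; 4] vectors whose 4th lane
// may hold garbage. Broadcasting lane 2 into lane 3 of both operands makes
// the top lane repeat the z division, so it cannot raise any FP exception
// the real data wouldn't raise anyway.
pub fn div3(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    unsafe {
        let a1 = _mm_loadu_ps(a.as_ptr());
        let b1 = _mm_loadu_ps(b.as_ptr());
        // imm8 0b10_10_01_00 selects source lanes [0, 1, 2, 2]
        let a_safe = _mm_shuffle_ps(a1, a1, 0b10_10_01_00);
        let b_safe = _mm_shuffle_ps(b1, b1, 0b10_10_01_00);
        let q = _mm_div_ps(a_safe, b_safe);
        let mut out = [0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), q);
        out
    }
}
```

The garbage in lane 3 of both inputs is simply never used; the cost is two shuffles per division.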

Maybe you can get away with only shuffling the divisor, if you can assume the dividend is non-NaN in that element.

With AVX512 you could suppress exceptions for an element using zero-masking, but until then there's no such feature. Also AVX512 lets you override the rounding mode + Suppress All Exceptions (SAE) without masking, so you could make nearest-even explicit to get SAE. But that suppresses exceptions for all elements.


Seriously, don't enable FP exceptions. Compilers barely know (and often simply don't know) how to optimize in a way that's safe when the number of exceptions raised is a visible side-effect. E.g. GCC's -ftrapping-math is on by default, but it's broken.

I wouldn't assume LLVM is any better; the default strict FP probably still does optimizations that could give one SIGFPE where the source would have raised 2 or 4. Maybe even optimizations that raise 0 where the source would raise 1, or vice versa, like GCC's broken and near-useless default.

Enabling FP exceptions might be useful for debugging, though, if you expect to never have any of a certain kind of exception. But you can probably deal with the occasional false positive from a SIMD instruction by ignoring ones with that source address.


If there's a tradeoff between performance and exception-correctness, most users of a library would rather that it maximized performance.

Even clearing and then checking the sticky FP masked-flags with fenv stuff is rarely done, and requires controlled circumstances to make use of. I wouldn't have any expectations for a library function call, especially not one that used any SIMD.


Avoid subnormals in the garbage element

You can get slowdowns from subnormals (aka denormals), if MXCSR doesn't have FTZ and DAZ set. (i.e. the normal case, unless you compiled with (the Rust equivalent of) -ffast-math.)

Producing a NaN or +-Inf takes no extra time for typical x86 hardware with SSE / AVX instructions. (Fun fact: NaN is slow with legacy x87 math, even on modern HW). So it's safe to _mm_or_ps with a cmpps result to create a NaN in some elements of a vector before a math operation, for example. Or _mm_and_ps to create some zeros in the divisor before division.

But be careful about what garbage is in your padding because it could lead to spurious subnormals. 0.0 and NaN (all ones) are generally always safe.
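
A sketch of sanitizing the padding lane this way (the helper name is illustrative, not from the answer):

```rust
use core::arch::x86_64::*;

// Sketch: force lane 3 to +0.0 with a bitwise AND, so whatever garbage
// sits in the padding cannot be a subnormal (or a signalling NaN).
pub fn zero_lane3(v: [f32; 4]) -> [f32; 4] {
    unsafe {
        // all-ones in lanes 0..=2, all-zeros in lane 3
        let mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
        let cleaned = _mm_and_ps(_mm_loadu_ps(v.as_ptr()), mask);
        let mut out = [0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), cleaned);
        out
    }
}
```

Note that a 0.0 divisor still raises the (normally masked) divide-by-zero exception and produces Inf, which is fine performance-wise on this hardware, but matters if exceptions are unmasked.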


Usually avoid horizontal stuff with SIMD. SIMD vec != geometry vec.

Using only 3 out of 4 elements of a SIMD vector is usually a bad idea because it usually means you're using a single SIMD vector to hold a single geometry vector, instead of three vectors of 4 x coords, 4 y coords, and 4 z coords.

Shuffles / horizontal stuff mostly costs extra instructions (except for broadcast loads of a scalar that was already in memory), but you often need a lot of shuffles if you're using SIMD this way. There are cases where you can't vectorize over an array of things, but you can still get a speedup with SIMD.

If you're just using this partial-vector stuff for the leftover elements of an odd-sized operation then great, one partial vector is much better than 3 scalar iterations. But most people asking about using only 3 of 4 vector elements are asking because they're using SIMD wrong, e.g. adding geometry-vector as a SIMD vector is still cheap, but dot-product needs shuffles. See https://deplinenoise.wordpress.com/2015/03/06/slides-simd-at-insomniac-games-gdc-2015/ for some nice stuff about how to use SIMD the right way (SoA vs. AoS and so on). If you already know about that and are just using 3-element vectors for the odd corner case, not for most of the work, then that's fine.

Padding to a multiple of the vector width is generally great for odd sizes, but another option for some algos is a final unaligned vector that ends at the end of your data. A partially-overlapping store is fine, unless it's an in-place algorithm and you have to worry about not doing an element twice. (Or about store-forwarding stalls even for idempotent operations like AND-masking or clamping).
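
For a pure-vertical, not-in-place operation like elementwise division, the overlapping-final-vector cleanup can be sketched like this (the function and its contract are my invention):

```rust
use core::arch::x86_64::*;

// Sketch: elementwise a / b over slices of equal length (len >= 4),
// using a final unaligned vector that ends exactly at the end of the
// data. It may recompute up to 3 elements already done by the main
// loop, which is harmless here because the operation is pure-vertical
// and out does not alias a or b.
pub fn div_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len();
    assert!(n >= 4 && b.len() == n && out.len() == n);
    unsafe {
        let mut i = 0;
        while i + 4 <= n {
            let q = _mm_div_ps(_mm_loadu_ps(a.as_ptr().add(i)),
                               _mm_loadu_ps(b.as_ptr().add(i)));
            _mm_storeu_ps(out.as_mut_ptr().add(i), q);
            i += 4;
        }
        if i < n {
            // Final, possibly-overlapping vector covering the leftovers.
            let j = n - 4;
            let q = _mm_div_ps(_mm_loadu_ps(a.as_ptr().add(j)),
                               _mm_loadu_ps(b.as_ptr().add(j)));
            _mm_storeu_ps(out.as_mut_ptr().add(j), q);
        }
    }
}
```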


Getting zeros for free

If you had just 2 float elements left over, a movsd load will load + zero-extend into an XMM register. You might as well get the compiler to do that instead of a movaps.

Otherwise, if shuffling together 3 scalars, insertps can zero elements. Or you might have known zero high parts of xmm regs from movss loads from memory. So using a 0.0 as part of a vector-from-scalar initializer (like C++ _mm_set_ps()) can be free for the compiler.
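
For example, building a vector from 3 scalars with an explicit 0.0 in the top lane (the helper name is mine):

```rust
use core::arch::x86_64::*;

// Sketch: a vector-from-scalars initializer with a known-zero top lane.
// _mm_set_ps takes its arguments high-to-low, so the 0.0 lands in lane 3;
// the compiler can often materialize that zero for free.
pub fn load3_zero_top(v: [f32; 3]) -> [f32; 4] {
    unsafe {
        let x = _mm_set_ps(0.0, v[2], v[1], v[0]);
        let mut out = [0f32; 4];
        _mm_storeu_ps(out.as_mut_ptr(), x);
        out
    }
}
```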

With AVX, you can consider using a masked load if you're worried about padding causing a subnormal. https://www.felixcloutier.com/x86/vmaskmov. But that's somewhat slower than vmovaps. And masked stores are much more expensive on AMD, even Ryzen.

Thundersquall answered 8/10, 2019 at 6:51 Comment(5)
I don't want to enable FP exceptions myself, but I want my code (which is a library) to behave correctly if they're enabled by either the compiler or the user. If in general nobody unmasks FP exceptions, I can just use conditional compilation to let the user enable/disable them. Just wanted to be sure it is in general acceptable to have code that does not handle them correctly by default.Quadriplegia
@L117: yeah, very few people will have expectations of strict FP-exception behaviour for an implementation of any library function. Nobody knows what you do internally. Given the choice between perf and exception semantics, most people would prefer perf. There's very little you can usefully do upon catching a SIGFPE, other than maybe log it, so I think it's very rare to care about it. GCC's exception-semantics-preserving option has never worked, so that tells you something! But you do want to be careful with partial vectors to avoid subnormals; added another section to my answer.Thundersquall
I clearly understand the difference between geometric vector and SIMD vector. But what do you mean it's bad to use a SIMD vector to hold a single geometric vector? Is it about using 3 of 8 available "cells" of 256-bits wide register? It is sure not as effective as using 8 of 8, but it still much faster than using scalar operations.Quadriplegia
My code actually chops vectors of arbitrary size into chunks and then performs operations. E.g. If I have two vectors f32x37 each, they will be turned into five chunks: [f32x8, f32x8, f32x8, f32x8, f32x5] that fill SIMD registers entirely (Except the last one, f32x5. That's what my question is all about.).Quadriplegia
@L117: Ah, yes using partial vectors as part of your cleanup for odd lengths is totally fine. I left that section in for the benefit of future readers that find this question for the wrong reason and are making the mistake I was warning against. Padding is great, but when it's not an option another cleanup trick is to do an unaligned final vector that ends at the end of your data. Works most easily for pure-vertical non-in-place stuff, e.g. copy-and-divide so you can safely reload and redo some elements in the overlap. e.g. a memcpy can end with a maybe-overlapping unaligned load+store.Thundersquall
In Rust, as in C, sizeof is always a multiple of alignof: a necessity, since sizeof is used as the stride in arrays and array elements need to be aligned properly.

As a result, even though you only use 12 bytes for your struct, its sizeof is 16 bytes anyway, with 4 bytes of "padding".

I would therefore propose a very pragmatic fix: own the padding. Instead of making the internals of the struct visible, give it a constructor and an accessor... and pad it to 16 bytes with the value 1.0.

#[repr(C, align(16))]
pub struct Vector([f32; 4]);

impl Vector {
    pub fn new(v: [f32; 3]) -> Vector {
        Vector([v[0], v[1], v[2], 1.0])
    }

    pub fn pad(&mut self, pad: f32) { self.0[3] = pad; }

    pub fn as_ptr(&self) -> *const f32 { self.0.as_ptr() }
}

Then you can perform operations with confidence that no garbage byte is used.
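
For illustration, here is the padded struct in use (the div helper is my addition, and the field is left public here only for brevity; with 1.0 padding, the top lane computes an exact 1.0 / 1.0 and raises no exceptions):

```rust
use core::arch::x86_64::*;

#[repr(C, align(16))]
pub struct Vector(pub [f32; 4]);

impl Vector {
    pub fn new(v: [f32; 3]) -> Vector {
        // Own the padding: lane 3 is always 1.0.
        Vector([v[0], v[1], v[2], 1.0])
    }
}

// Sketch of an operation on padded vectors: lane 3 divides 1.0 by 1.0,
// which is exact and exception-free.
pub fn div(a: &Vector, b: &Vector) -> Vector {
    unsafe {
        let q = _mm_div_ps(_mm_load_ps(a.0.as_ptr()),
                           _mm_load_ps(b.0.as_ptr()));
        let mut out = Vector([0.0; 4]);
        _mm_store_ps(out.0.as_mut_ptr(), q);
        out
    }
}
```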

Improbity answered 8/10, 2019 at 7:59 Comment(3)
Padding with 0.0 can be done for free sometimes, e.g. if shuffling together scalars loaded from memory the upper elements of your XMM regs will be known to be zero. 0.0/0.0 has no performance penalty for divps, but does raise exceptions. But since sub can create a 0.0 from 1.0 padding, a / (b-c) will do that if they start with 1.0 padding.Thundersquall
Still, agreed on owning the padding. But using only 3 out of 4 elements of a SIMD vector is usually a bad idea because it usually means you're using a single SIMD vector to hold a single geometry vector, instead of three vectors of 4 x coords, 4 y coords, and 4 z coords. Shuffles / horizontal stuff sucks. I mean sure there are cases where you can't vectorize over an array of things, but with careful use of SIMD you can still get a speedup.Thundersquall
This makes sense. But my code uses experimental, incomplete, nightly features (const_generics, specialization) a lot. Because of this it's not easy to own the 4th element. Actually, it's not even always the 4th element; the vector structure is defined as pub struct Vector<T, L, const SIZE: usize> where L: ReprWrapper<Wrapped = [T; { SIZE }]> { repr: L }Quadriplegia
