Cast array of wrapper structs to SIMD vector
Say I have a wrapper struct, serving as a phantom type.

struct Wrapper {
  float value;
}

Is it legal to load an array of this struct directly into a SIMD intrinsic type such as __m256? For example,

alignas(32) Wrapper arr[8] = {};
static_assert(sizeof(Wrapper) == sizeof(float));
__m256 x = _mm256_load_ps(reinterpret_cast<float*>(arr));


// or (I think this is equivalent):
__m256 y = *(__m256 *)arr;

Discussion:

To me it seems like the pointer arithmetic rules don't apply: if I can assert that the types are compatible (e.g. there is no padding), then the cast should be valid because of the special aliasing properties of the vector type. But I haven't seen this particular usage in practice, and I worry that a compiler might apply the pointer-arithmetic rules or some other rule to the load operation and trigger UB. Given that this is a non-standard extension, can I be sure?

Muriate answered 27/6, 2022 at 21:14 Comment(5)
As long as sizeof(Wrapper) == sizeof(float) there is no UB there, because __m256 can alias anything, so it can also alias a Wrapper[], and the layout of that matches what AVX expects. But why not define Wrapper as alignas(32) float values[8], or even better template <std::size_t N> struct Wrapper { alignas(32) float values[N]; };?Rattish
@GoswinvonBrederlow: Why would you want to bake a specific SIMD width into the data structure the rest of your program is going to be touching?Schmitz
@PeterCordes because the alignment gives you some percentage points improvement. What bakes the width into your code is using _m256 and _mm256_load_ps.Rattish
@GoswinvonBrederlow: The code in the question already uses alignas(32) on the array of Wrapper arr[]. You should still do that, but I was assuming you'd want this for an arbitrary-sized array of structs. In that case it would be inconvenient to access the overall ith element, like you'd need arr[i/8].values[i%8]. But yeah if your use-case is equally convenient, putting the array inside the struct is fine.Schmitz
@PeterCordes Having to add alignas every time you use the wrapper is tedious, that's why I suggested adding the alignment to the wrapper itself. And as you write arr[i/8].values[i%8] is tedious which is why I think the template is even better. Make the wrapper a single 32 byte aligned block of floats of whatever size is needed. It will add padding to the next multiple of 32.Rattish
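A minimal sketch of the template idea from the comments above. The name `WrapperBlock` is hypothetical (the comments just call it `Wrapper`); the point is that the 32-byte alignment lives in the type itself, and object size rounds up to the alignment, giving the padding-to-a-multiple-of-32 behaviour the commenter describes:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the commenters' suggestion: bake the 32-byte
// alignment into the wrapper type, with the element count as a parameter.
template <std::size_t N>
struct WrapperBlock {
    alignas(32) float values[N];  // aligned for 256-bit AVX loads
};

// The layout properties an aligned vector load needs: contiguous floats,
// 32-byte alignment, and size padded up to a multiple of the alignment.
static_assert(alignof(WrapperBlock<8>) == 32, "32-byte aligned");
static_assert(sizeof(WrapperBlock<8>) == 8 * sizeof(float), "no padding");
static_assert(sizeof(WrapperBlock<5>) == 32, "padded to alignment");
```

Every instance (including stack and heap objects in C++17 and later) then satisfies the alignment requirement without repeating `alignas(32)` at each use site.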

This is fully safe

You're not directly dereferencing the float*, only passing it to _mm256_load_ps, which does an aliasing-safe load. In terms of language-lawyering, you can look at _mm256_load_ps / _mm256_store_ps as doing a memcpy (to/from a private local variable), except that it's UB if the pointer isn't 32-byte aligned.
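The memcpy model can be demonstrated directly. This sketch uses the 128-bit _mm_load_ps rather than the question's _mm256_load_ps only so it compiles and runs on baseline x86-64 without AVX flags; the semantics argued for above are the same. The intrinsic load and an explicit memcpy into a vector object produce identical results:

```cpp
#include <immintrin.h>
#include <cstring>

struct Wrapper { float value; };

// The intrinsic load: aliasing-safe, but UB if p isn't 16-byte aligned.
inline __m128 load_via_intrinsic(const Wrapper *p) {
    return _mm_load_ps(reinterpret_cast<const float *>(p));
}

// The "as if" model the answer describes: a memcpy into a private vector
// object (minus the alignment requirement, which memcpy doesn't have).
inline __m128 load_via_memcpy(const Wrapper *p) {
    __m128 v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```

Both functions compile to a single aligned or unaligned vector load on any mainstream x86 compiler.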

Interconvertibility between Wrapper* and float* isn't really relevant; you're not dereferencing a float*.

If you'd been using _mm_load_ss(arr) on a buggy GCC version that implements it as _mm_set_ss( *ptr ) instead of using a may_alias typedef for float, then that would matter. (Unfortunately even current GCC still has that bug; _mm_loadu_si32 was fixed in GCC 11.3, but not the older _ss and _sd loads.) But that is a compiler bug, IMO. _mm_load_ps is aliasing-safe, so it makes no sense for _mm_load_ss not to be, when they both take a float*. If you wanted a load with normal C aliasing/alignment semantics, to promise more to the optimizer, you'd just dereference the pointer yourself with _mm_set_ss( *foo ).
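A sketch of what the may_alias typedef mentioned above looks like. This is a GCC/Clang extension (the same attribute is on the __m128 type itself in those compilers' headers); the function name `load_scalar` is hypothetical:

```cpp
// GCC/Clang extension: a float typedef that is allowed to alias any type.
typedef float aliasing_float __attribute__((may_alias));

struct Wrapper { float value; };

// A scalar load that is safe even when the storage is really a Wrapper,
// an int holding float bits, or a char buffer: the may_alias attribute
// exempts this dereference from strict-aliasing analysis.
inline float load_scalar(const void *p) {
    return *static_cast<const aliasing_float *>(p);
}
```

A correct _mm_load_ss implementation dereferences through a typedef like this; the buggy versions dereference a plain float*, which the optimizer may assume doesn't alias non-float objects.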


The exact aliasing semantics of Intel Intrinsics are not AFAIK documented anywhere. A lot of x86-specific code has been developed with MSVC, which doesn't enforce strict aliasing at all, i.e. it's like gcc -fno-strict-aliasing, defining the behaviour of stuff like *(int*)my_float and even encouraging it for type-punning.
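For code that has to be correct under strict aliasing too, not just under MSVC's -fno-strict-aliasing-like semantics, the standard replacement for the `*(int*)my_float` pun is a memcpy (or std::bit_cast in C++20). A small sketch; the function name is hypothetical:

```cpp
#include <cstdint>
#include <cstring>

// Portable replacement for the MSVC-style pun *(int*)&my_float:
// memcpy is defined behaviour under strict aliasing, and mainstream
// compilers optimize this small fixed-size copy to a single move.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

GCC and Clang compile this to the same instruction as the unsafe cast, so there's no performance reason to prefer the UB version.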

Not sure about Intel's compiler historically, but I'm guessing it also didn't do type-based aliasing optimizations; otherwise they'd hopefully have defined better intrinsics for movd 32-bit integer loads/stores much earlier than _mm_loadu_si32, which only arrived in the last few years. You can tell from the void* arg that it's recent: Intel previously did insane stuff like _mm_loadl_epi64(__m128i*) for a movq load, taking a pointer to a 16-byte object but only loading the low 8 bytes (with no alignment requirement).
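The _mm_loadl_epi64 oddity described above can be seen in its signature: it nominally takes a __m128i*, but only ever reads 8 bytes, so in practice callers cast a pointer to just 8 bytes of storage. A sketch (the wrapper name is hypothetical):

```cpp
#include <immintrin.h>
#include <cstdint>

// _mm_loadl_epi64 takes a __m128i* but performs an 8-byte movq load.
// Callers routinely pass a pointer to only 8 bytes of storage, relying
// on the intrinsic never touching the "upper half" of the nominal
// 16-byte __m128i object that the parameter type implies exists.
inline __m128i load_low_qword(const std::uint64_t *p) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i *>(p));
}
```

A void*-taking design like _mm_loadu_si32's avoids both the misleading pointee size and the cast entirely.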

So a lot of Intel intrinsics stuff seemed pretty casual about C and C++ safety rules, like it was designed by people who thought of C as a portable assembler. Or at least that their intrinsics were supposed to work that way.


As I pointed out in my answer you linked in the question (Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?), Intel's intrinsics API effectively requires compilers to support creating misaligned pointers as long as you don't deref them yourself. Including misaligned float* for _mm_loadu_ps, which supports any alignment, not just multiples of 4.

You could probably argue that supporting Intel's intrinsics API (in a way that's compatible with the examples Intel's published) might not require supporting arbitrary casting between pointer types (without deref), but in practice all x86 compilers do, because they target a flat memory model with byte-addressable memory.

With the existence of intrinsics for gather and scatter, use-cases like passing a 0 base with pointer-valued elements to _mm256_i64gather_epi64 (e.g. to walk 4 linked lists in parallel) require that a C++ implementation use a sane object representation for pointers if it wants to support them.

As usual with Intel intrinsics, I don't think there's documentation that 100% nails down proof that it would be safe to use _mm_load_ps on a struct { int a; float b[3]; };, but I think everyone working with intrinsics expects that to be the case. And nobody would want to use a compiler that broke it for cases where memcpy with the same source pointer would be safe.

But in your case, you don't even need to depend on any de-facto guarantees here, beyond the fact that _mm256_load_ps itself is an aliasing-safe load. You've correctly shown that it's 100% safe to create that float* in ISO C, and pass it to an opaque function.


And yes, deref of an __m256* is exactly equivalent to _mm256_load_ps, and is in fact how most compilers implement _mm256_load_ps.

(By comparison, _mm256_loadu_ps would cast to a pointer to a less-aligned 32-byte vector type that isn't part of the documented API, like GCC's __m256_u*. Or maybe pass it to a builtin function. But however the compiler makes it happen, it's equivalent to a memcpy, including the lack of an alignment requirement.)
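The lack of an alignment requirement is easy to demonstrate. This sketch uses the 128-bit _mm_loadu_ps so it runs on baseline x86-64, but _mm256_loadu_ps behaves the same way: a load from a deliberately misaligned address is fine, exactly as a memcpy from that address would be.

```cpp
#include <immintrin.h>

// _mm_loadu_ps has no alignment requirement: like memcpy, it may read
// starting at any byte offset. The aligned _mm_load_ps on the same
// pointer would be UB.
inline __m128 load_unaligned(const float *p) {
    return _mm_loadu_ps(p);
}
```

Passing `buf + 1` for a 16-byte-aligned `buf` gives an address that is only 4-byte aligned, which the unaligned load handles without complaint.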

Schmitz answered 28/6, 2022 at 11:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.