Cast array of wrapper structs to SIMD vector
Say I have a wrapper struct, serving as a phantom type.

struct Wrapper {
  float value;
}

Is it legal to load an array of this struct directly into a SIMD intrinsic type such as __m256? For example,

alignas(32) Wrapper arr[8] = {};
static_assert(sizeof(Wrapper) == sizeof(float));
__m256 x = _mm256_load_ps(reinterpret_cast<float*>(arr));


// or (I think this is equivalent):
__m256 y = *(__m256 *)arr;

Discussion:

To me it seems like the pointer arithmetic rules don't apply: if I can assert that the types are compatible (e.g. there is no padding), then the cast should be valid because of the special aliasing properties of the vector type. But I haven't seen this particular usage in practice, and I worry that a compiler might apply the pointer-arithmetic rules or some other rule to the load operation and trigger UB. Given that this is a non-standard extension, can I be sure?

Muriate answered 27/6, 2022 at 21:14 Comment(5)
As long as sizeof(Wrapper) == sizeof(float) there is no UB there, because __m256 can alias anything, so it can also alias a Wrapper[], and the layout of that matches what AVX expects. But why not define Wrapper as alignas(32) float values[8], or even better template <std::size_t N> struct Wrapper { alignas(32) float values[N]; };?Rattish
@GoswinvonBrederlow: Why would you want to bake a specific SIMD width into the data structure the rest of your program is going to be touching?Schmitz
@PeterCordes because the alignment gives you some percentage points improvement. What bakes the width into your code is using _m256 and _mm256_load_ps.Rattish
@GoswinvonBrederlow: The code in the question already uses alignas(32) on the array of Wrapper arr[]. You should still do that, but I was assuming you'd want this for an arbitrary-sized array of structs. In that case it would be inconvenient to access the overall ith element, like you'd need arr[i/8].values[i%8]. But yeah if your use-case is equally convenient, putting the array inside the struct is fine.Schmitz
@PeterCordes Having to add alignas every time you use the wrapper is tedious, that's why I suggested adding the alignment to the wrapper itself. And as you write arr[i/8].values[i%8] is tedious which is why I think the template is even better. Make the wrapper a single 32 byte aligned block of floats of whatever size is needed. It will add padding to the next multiple of 32.Rattish
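A minimal sketch of the template idea from the comments above. The name `WrapperBlock` is hypothetical (the comments just call it `Wrapper`); the point is that the 32-byte alignment lives in the type itself, and object size rounds up to the alignment, giving the padding-to-a-multiple-of-32 behaviour the commenter describes:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the commenters' suggestion: bake the 32-byte
// alignment into the wrapper type, with the element count as a parameter.
template <std::size_t N>
struct WrapperBlock {
    alignas(32) float values[N];  // aligned for 256-bit AVX loads
};

// The layout properties an aligned vector load needs: contiguous floats,
// 32-byte alignment, and size padded up to a multiple of the alignment.
static_assert(alignof(WrapperBlock<8>) == 32, "32-byte aligned");
static_assert(sizeof(WrapperBlock<8>) == 8 * sizeof(float), "no padding");
static_assert(sizeof(WrapperBlock<5>) == 32, "padded to alignment");
```

Every instance (including stack and heap objects in C++17 and later) then satisfies the alignment requirement without repeating `alignas(32)` at each use site.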

This is fully safe

You're not directly dereferencing the float*, only passing it to _mm256_load_ps, which does an aliasing-safe load. In terms of language-lawyering, you can look at _mm256_load_ps / _mm256_store_ps as doing a memcpy (to/from a private local variable), except that it's UB if the pointer isn't 32-byte aligned.
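The memcpy model can be demonstrated directly. This sketch uses the 128-bit _mm_load_ps rather than the question's _mm256_load_ps only so it compiles and runs on baseline x86-64 without AVX flags; the semantics argued for above are the same. The intrinsic load and an explicit memcpy into a vector object produce identical results:

```cpp
#include <immintrin.h>
#include <cstring>

struct Wrapper { float value; };

// The intrinsic load: aliasing-safe, but UB if p isn't 16-byte aligned.
inline __m128 load_via_intrinsic(const Wrapper *p) {
    return _mm_load_ps(reinterpret_cast<const float *>(p));
}

// The "as if" model the answer describes: a memcpy into a private vector
// object (minus the alignment requirement, which memcpy doesn't have).
inline __m128 load_via_memcpy(const Wrapper *p) {
    __m128 v;
    std::memcpy(&v, p, sizeof(v));
    return v;
}
```

Both functions compile to a single aligned or unaligned vector load on any mainstream x86 compiler.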

Interconvertibility between Wrapper* and float* isn't really relevant; you're not dereferencing a float*.

If you'd been using _mm_load_ss(arr) on a buggy GCC version that implements it as _mm_set_ss( *ptr ) instead of using a may_alias typedef for float, then that would matter. (Unfortunately even current GCC still has that bug; _mm_loadu_si32 was fixed in GCC 11.3, but not the older _ss and _sd loads.) But that is a compiler bug, IMO. _mm_load_ps is aliasing-safe, so it makes no sense for _mm_load_ss not to be, when they both take a float*. If you wanted a load with normal C aliasing/alignment semantics, to promise more to the optimizer, you'd just dereference the pointer yourself with _mm_set_ss( *foo ).
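A sketch of what the may_alias typedef mentioned above looks like. This is a GCC/Clang extension (the same attribute is on the __m128 type itself in those compilers' headers); the function name `load_scalar` is hypothetical:

```cpp
// GCC/Clang extension: a float typedef that is allowed to alias any type.
typedef float aliasing_float __attribute__((may_alias));

struct Wrapper { float value; };

// A scalar load that is safe even when the storage is really a Wrapper,
// an int holding float bits, or a char buffer: the may_alias attribute
// exempts this dereference from strict-aliasing analysis.
inline float load_scalar(const void *p) {
    return *static_cast<const aliasing_float *>(p);
}
```

A correct _mm_load_ss implementation dereferences through a typedef like this; the buggy versions dereference a plain float*, which the optimizer may assume doesn't alias non-float objects.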


The exact aliasing semantics of Intel Intrinsics are not AFAIK documented anywhere. A lot of x86-specific code has been developed with MSVC, which doesn't enforce strict aliasing at all, i.e. it's like gcc -fno-strict-aliasing, defining the behaviour of stuff like *(int*)my_float and even encouraging it for type-punning.
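For code that has to be correct under strict aliasing too, not just under MSVC's -fno-strict-aliasing-like semantics, the standard replacement for the `*(int*)my_float` pun is a memcpy (or std::bit_cast in C++20). A small sketch; the function name is hypothetical:

```cpp
#include <cstdint>
#include <cstring>

// Portable replacement for the MSVC-style pun *(int*)&my_float:
// memcpy is defined behaviour under strict aliasing, and mainstream
// compilers optimize this small fixed-size copy to a single move.
inline std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

GCC and Clang compile this to the same instruction as the unsafe cast, so there's no performance reason to prefer the UB version.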

Not sure about Intel's compiler historically, but I'm guessing it also didn't do type-based aliasing optimizations; otherwise they'd hopefully have defined better intrinsics for movd 32-bit integer loads/stores much earlier than _mm_loadu_si32, which only arrived in the last few years. You can tell from the void* arg that it's recent: Intel previously did insane stuff like _mm_loadl_epi64(__m128i*) for a movq load, taking a pointer to a 16-byte object but only loading the low 8 bytes (with no alignment requirement).
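The _mm_loadl_epi64 oddity described above can be seen in its signature: it nominally takes a __m128i*, but only ever reads 8 bytes, so in practice callers cast a pointer to just 8 bytes of storage. A sketch (the wrapper name is hypothetical):

```cpp
#include <immintrin.h>
#include <cstdint>

// _mm_loadl_epi64 takes a __m128i* but performs an 8-byte movq load.
// Callers routinely pass a pointer to only 8 bytes of storage, relying
// on the intrinsic never touching the "upper half" of the nominal
// 16-byte __m128i object that the parameter type implies exists.
inline __m128i load_low_qword(const std::uint64_t *p) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i *>(p));
}
```

A void*-taking design like _mm_loadu_si32's avoids both the misleading pointee size and the cast entirely.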

So a lot of Intel intrinsics stuff seemed pretty casual about C and C++ safety rules, like it was designed by people who thought of C as a portable assembler. Or at least that their intrinsics were supposed to work that way.


As I pointed out in my answer you linked in the question (Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?), Intel's intrinsics API effectively requires compilers to support creating misaligned pointers as long as you don't deref them yourself. Including misaligned float* for _mm_loadu_ps, which supports any alignment, not just multiples of 4.

You could probably argue that supporting Intel's intrinsics API (in a way that's compatible with the examples Intel's published) might not require supporting arbitrary casting between pointer types (without deref), but in practice all x86 compilers do, because they target a flat memory model with byte-addressable memory.

With the existence of intrinsics for gather and scatter, use-cases like passing a 0 base with pointer-valued elements to _mm256_i64gather_epi64 (e.g. to walk 4 linked lists in parallel) require that a C++ implementation use a sane object representation for pointers if it wants to support them.

As usual with Intel intrinsics, I don't think there's documentation that 100% nails down proof that it would be safe to use _mm_load_ps on a struct { int a; float b[3]; };, but I think everyone working with intrinsics expects that to be the case. And nobody would want to use a compiler that broke it for cases where memcpy with the same source pointer would be safe.

But in your case, you don't even need to depend on any de-facto guarantees here, beyond the fact that _mm256_load_ps itself is an aliasing-safe load. You've correctly shown that it's 100% safe to create that float* in ISO C, and pass it to an opaque function.


And yes, deref of an __m256* is exactly equivalent to _mm256_load_ps, and is in fact how most compilers implement _mm256_load_ps.

(By comparison, _mm256_loadu_ps would cast to a pointer to a less-aligned 32-byte vector type that isn't part of the documented API, like GCC's __m256_u*. Or maybe pass it to a builtin function. But however the compiler makes it happen, it's equivalent to a memcpy, including the lack of an alignment requirement.)
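The lack of an alignment requirement is easy to demonstrate. This sketch uses the 128-bit _mm_loadu_ps so it runs on baseline x86-64, but _mm256_loadu_ps behaves the same way: a load from a deliberately misaligned address is fine, exactly as a memcpy from that address would be.

```cpp
#include <immintrin.h>

// _mm_loadu_ps has no alignment requirement: like memcpy, it may read
// starting at any byte offset. The aligned _mm_load_ps on the same
// pointer would be UB.
inline __m128 load_unaligned(const float *p) {
    return _mm_loadu_ps(p);
}
```

Passing `buf + 1` for a 16-byte-aligned `buf` gives an address that is only 4-byte aligned, which the unaligned load handles without complaint.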

Schmitz answered 28/6, 2022 at 11:35 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.