This is fully safe
You're not directly dereffing the float*
, only passing it to _mm256_load_ps
which does an aliasing-safe load. In terms of language-lawyering, you can look at _mm256_load_ps
/ _mm256_store_ps
as doing a memcpy
(to a private local variable), except it's UB if the pointer isn't 32-byte aligned.
Interconvertibility between Wrapper*
and float*
isn't really relevant; you're not derefing a float*
.
If you'd been using _mm_load_ss(arr)
on a buggy GCC version that implements it as _mm_set_ss( *ptr )
instead using a may_alias
typdef for float
, then that would matter. (Unfortunately even current GCC still has that bug; _mm_loadu_si32
was fixed in GCC11.3 but not the older _ss
and _sd
loads.) But that is a compiler bug, IMO. _mm_load_ps
is aliasing-safe, so it makes no sense that _mm_load_ss
wouldn't be, when they both take float*
. If you wanted a load with normal C aliasing/alignment semantics to promise more to the optimizer, you'd just deref yourself, using _mm_set_ss( *foo )
.
The exact aliasing semantics of Intel Intrinsics are not AFAIK documented anywhere. A lot of x86-specific code has been developed with MSVC, which doesn't enforce strict aliasing at all, i.e. it's like gcc -fno-strict-aliasing
, defining the behaviour of stuff like *(int*)my_float
and even encouraging it for type-punning.
Not sure about Intel's compiler historically, but I'm guessing it also didn't do type-based aliasing optimizations, otherwise they hopefully would have defined better intrinsics for movd
32-bit integer loads/stores much earlier than _mm_loadu_si32
in the last few years. You can tell from the void*
arg that it's recent: Intel previously did insane stuff like _mm_loadl_epi64(__m128i*)
for a movq
load, taking a pointer to a 16-byte object but only loading the low 8 bytes (with no alignment requirement).
So a lot of Intel intrinsics stuff seemed pretty casual about C and C++ safety rules, like it was designed by people who thought of C as a portable assembler. Or at least that their intrinsics were supposed to work that way.
As I pointed out in my answer you linked in the question (Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?), Intel's intrinsics API effectively requires compilers to support creating misaligned pointers as long as you don't deref them yourself. Including misaligned float*
for _mm_loadu_ps
, which supports any alignment, not just multiples of 4.
You could probably argue that supporting Intel's intrinsics API (in a way that's compatible with the examples Intel's published) might not require supporting arbitrary casting between pointer types (without deref), but in practice all x86 compilers do, because they target a flat memory model with byte-addressable memory.
With the existence of intrinsics for gather and scatter, use-cases like using a 0
base with pointer elements for _mm256_i64gather_epi64
(e.g. to walk 4 linked lists in parallel) require that a C++ implementation use a sane object-representation for pointers if they want to support that.
As usual with Intel intrinsics, I don't think there's documentation that 100% nails down proof that it would be safe to use _mm_load_ps
on a struct { int a; float b[3]; };
, but I think everyone working with intrinsics expects that to be the case. And nobody would want to use a compiler that broke it for a cases where memcpy
with the same source pointer would be safe.
But in your case, you don't even need to depend on any de-facto guarantees here, beyond the fact that _mm256_load_ps
itself is an aliasing-safe load. You've correctly shown that it's 100% safe to create that float*
in ISO C, and pass it to an opaque function.
And yes, deref of an __m256*
is exactly equivalent to _mm256_load_ps
, and is in fact how most compilers implement _mm256_load_ps
.
(By comparison, _mm256_loadu_ps
would cast to a pointer to a less-aligned 32-byte vector type which isn't part of the documented API, like GCC's __m256_u*
. Or maybe pass it to a builtin function. But however the compiler makes it happen, it's equivalent to a memcpy, including the lack of alignment requirement.)
sizeof(Wrapper) == sizeof(float)
there is no UB there. Because_m256
can alias anything so it can also alias aWrapper[]
and the layout of that is such that it matches what AVR expects. But why not define Wrapper asalign(32) float values[8]
or even bettertemplate <std::size_t N> Wrapper { align(32) float values[N]; };
? – Rattish_m256
and_mm256_load_ps
. – Rattishalignas(32)
on the array ofWrapper arr[]
. You should still do that, but I was assuming you'd want this for an arbitrary-sized array of structs. In that case it would be inconvenient to access the overalli
th element, like you'd needarr[i/8].values[i%8]
. But yeah if your use-case is equally convenient, putting the array inside the struct is fine. – Schmitzalignas
every time you use the wrapper is tedious, that's why I suggested adding the alignment to the wrapper itself. And as you writearr[i/8].values[i%8]
is tedious which is why I think the template is even better. Make the wrapper a single 32 byte aligned block of floats of whatever size is needed. It will add padding to the next multiple of 32. – Rattish