As a modification to hirschhornsalz's solution, if `i` is a compile-time constant, you could avoid the union path entirely by using a shuffle:
    #include <immintrin.h>

    template<unsigned i>
    float vectorGetByIndex(__m128 V)
    {
        // reject out-of-range indices at compile time
        static_assert(i < 4, "an __m128 holds only 4 floats");
        // shuffle V so that the element you want is moved to the
        // least-significant element of the vector (V[0])
        V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
        // return the value in V[0]
        return _mm_cvtss_f32(V);
    }
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; `_mm_cvtss_f32` is free and will compile to zero instructions. This will inline as just a `shufps` (or nothing for `i==0`).
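A minimal usage sketch of the above, to make the element numbering concrete. The template is repeated so the snippet stands alone; the `static_assert` and the helper `sumOfElements1And3` are my own illustrative additions, not part of the original code.

```cpp
#include <immintrin.h>

template<unsigned i>
float vectorGetByIndex(__m128 V)
{
    // illustrative guard: an __m128 only has elements 0..3
    static_assert(i < 4, "an __m128 holds only 4 floats");
    // broadcast element i to every position, then read the low element
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    return _mm_cvtss_f32(V);
}

// Hypothetical helper, for illustration only: adds elements 1 and 3.
float sumOfElements1And3(__m128 V)
{
    return vectorGetByIndex<1>(V) + vectorGetByIndex<3>(V);
}
```

Note that `_mm_setr_ps(a, b, c, d)` puts `a` in element 0, whereas `_mm_set_ps` takes its arguments in the opposite order.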
Compilers are smart enough to optimize away the shuffle for `i==0` (except for long-obsolete ICC13), so there's no need for an `if (i)`: https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile `vectorGetByIndex<2>` into `movhlps xmm0, xmm0`, which is 1 byte shorter than `shufps` and produces the same low element. You could manually do this with `switch`/`case` for other compilers since `i` is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
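If you did want to hand-pick the instruction, a rough sketch of the `switch`/`case` idea could look like this. The name `vectorGetByIndexManual` is hypothetical, and whether a given compiler really emits `movhlps` for the `_mm_movehl_ps` case is something to check in its asm output rather than assume.

```cpp
#include <immintrin.h>

// Hypothetical variant: chooses the shuffle per index by hand.
// Since i is a template parameter, the switch is resolved at compile time.
template<unsigned i>
float vectorGetByIndexManual(__m128 V)
{
    static_assert(i < 4, "an __m128 holds only 4 floats");
    switch (i) {
    case 0:
        // element 0 is already the scalar float: no shuffle needed
        return _mm_cvtss_f32(V);
    case 2:
        // movhlps copies the high two elements to the low half,
        // so the new element 0 is the old element 2
        return _mm_cvtss_f32(_mm_movehl_ps(V, V));
    default:
        // general case: shufps with an immediate
        return _mm_cvtss_f32(_mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i)));
    }
}
```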
Note that SSE4.1 `_mm_extract_ps(V, i);` is not a useful shuffle here: `extractps r/m32, xmm, imm` can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an `int`, so it would actually compile to `extractps` + `cvtsi2ss` to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to `extractps eax, xmm0, i` / `movd xmm0, eax`, which is terrible vs. `shufps`.)
The only case where `extractps` would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For `i!=0`; otherwise it would use `movss`.) To leave the result in an XMM register as a scalar float, `shufps` is good.
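For the store-to-memory case, here is a sketch of what the source might look like. `vectorStoreByIndex` is a hypothetical name, and the `__SSE4_1__` check is an assumption about GCC/clang-style predefined macros (MSVC does not define it, so it takes the fallback). Whether the compiler actually folds this into a single `extractps` with a memory destination is up to its optimizer.

```cpp
#include <immintrin.h>
#include <cstring>

// Hypothetical helper: writes element i of V to *dst.
template<unsigned i>
void vectorStoreByIndex(float *dst, __m128 V)
{
    static_assert(i < 4, "an __m128 holds only 4 floats");
#ifdef __SSE4_1__
    // _mm_extract_ps returns the raw FP bit-pattern as an int;
    // memcpy moves those bits to the float without a numeric conversion
    const int bits = _mm_extract_ps(V, i);
    std::memcpy(dst, &bits, sizeof bits);
#else
    // SSE1 fallback: shuffle the element to the bottom, then store it
    _mm_store_ss(dst, _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i)));
#endif
}
```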
(SSE4.1 `insertps` would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
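To illustrate what `insertps` adds, here is a sketch that moves element `i` to the bottom and zeroes the other three elements. `vectorIsolateElement` is a hypothetical name, and the `__SSE4_1__` check is again an assumption about GCC/clang-style macros; the fallback produces the same result with SSE1 only.

```cpp
#include <immintrin.h>

// Hypothetical helper: returns { V[i], 0, 0, 0 }.
template<unsigned i>
__m128 vectorIsolateElement(__m128 V)
{
    static_assert(i < 4, "an __m128 holds only 4 floats");
#ifdef __SSE4_1__
    // insertps immediate: bits 7:6 = source element, bits 5:4 = destination
    // element (0 here), bits 3:0 = zero mask (0xE zeroes elements 1..3)
    return _mm_insert_ps(V, V, (i << 6) | 0x0E);
#else
    // SSE1 fallback: shuffle element i to the bottom, then merge it
    // into the low element of an all-zero vector
    const __m128 low = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    return _mm_move_ss(_mm_setzero_ps(), low);
#endif
}
```

If you only need the scalar, the plain `shufps` version is enough, since the upper elements of a scalar float's register are allowed to hold garbage.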
`m128_f32`) to do it. But it only masks the performance problem. – Jaxartes
`_mm_cvtss_f32(V)`, and other elements by first shuffling the desired value into the low element. – Lawmaker
`return V[i]`. – Bedlamite
`[i]` after the `V.m128_f32` - since you say this works on MSVC. And that change obviously doesn't affect the clang error message, and the detail isn't really material to what you're asking. I've tried twice to submit this as an edit to the question, but most reviewers feel that I'm changing the intent of the question, so it's not happening. – Senlac