I have been using the following "trick" in C code with SSE2 for single precision floats for a while now:
static inline __m128 SSEI_m128shift(__m128 data)
{
    /* Byte-shift right by 4 bytes == move each float one position
       toward index 0, zero-filling the top element. */
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(data), 4));
}
For data like [1.0, 2.0, 3.0, 4.0], it results in [2.0, 3.0, 4.0, 0.0], i.e. it does a left shift by one position and fills the vacated element with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc, at least).
I am somehow failing to wrap my head around doing the same with AVX2. How could I achieve this in an efficient manner?
With gcc, I recommend using gcc vector extensions instead of architecture-specific intrinsics where possible. In particular, you can use __builtin_shuffle(data, (fvectype){0}, (ivectype){1, 2, 3, 4}). Be aware, though, that AVX vectors of more than 128 bits are composed of lanes, and lane-crossing instructions (which are unavoidable when extending your example straightforwardly) are a fair bit slower than in-lane operations (roughly 3 times slower), so it may be a good idea to review whether you actually need this. – Expletive

gcc compiles the gcc vector extension code to the following assembly:

vmovaps %ymm0, %ymm1
vxorps %xmm0, %xmm0, %xmm0
vperm2f128 $33, %ymm0, %ymm1, %ymm0
vpalignr $4, %ymm1, %ymm0, %ymm0

You can reverse-engineer that into Intel intrinsics if you like. Alternatively, a sane solution would be the gcc vector extensions themselves. – Expletive

Use vpermd to do a lane-crossing shuffle with 32-bit elements, then vpblendd to blend in a 0.0 element where you want it. – Perla

gcc seems to like the code just as well in a loop. Could you explain how vpermd/vpblendd would be preferable? Agner Fog and uops.info show vpalignr to be fast; it apparently doesn't count as a lane-crossing instruction. – Expletive

vpermd costs the same as vperm2f128 on Intel hardware, maybe somewhat less on Zen 1. vpblendd is 1 uop for any vector ALU port on Intel, so it avoids a potential shuffle-port bottleneck from vperm2f128 + vpalignr. vpalignr is just an in-lane shuffle; that's why we need vperm2f128 to set up for it. – Perla

vpalignr being in-lane is (to me) not at all obvious, since it moves data from the low lane of at least one input to the high lane of the output. – Expletive

That's how Intel chose to extend palign to 256 bits, and why GCC needed vperm2f128. See the 256-bit diagram in felixcloutier.com/x86/palignr. – Perla

vpermd/vpermps and a vblendps: godbolt.org/z/RW7_ds, which requires a shuffle vector (technically, one could use the same vector for the blend as well). It should be possible with just a vperm2f128 and a vpalignr (the vperm2f128 can set one half to 0), requiring no shuffle vector, but two operations on p5. – Hestia