Left-shift (of float32 array) with AVX2 and filling up with a zero

I have been using the following "trick" in C code with SSE2 for single precision floats for a while now:

static inline __m128 SSEI_m128shift(__m128 data)
{
    /* byte-wise right shift by 4 drops element 0 and shifts a zero in at the top */
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(data), 4));
}

For data like [1.0, 2.0, 3.0, 4.0], it results in [2.0, 3.0, 4.0, 0.0], i.e. it shifts every element one position toward the front and fills the vacated last slot with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc at least).

I am somehow failing to wrap my head around doing the same with AVX2. How could I achieve this in an efficient manner?

Similar questions: 1, 2, 3

Unclassical answered 23/5, 2020 at 11:52 Comment(14)
If you're using gcc, I recommend using gcc vector extensions instead of architecture-specific intrinsics where possible. In particular, you can use __builtin_shuffle(data, (fvectype){0}, (ivectype){1, 2, 3, 4}). Be aware, though, that AVX vectors of more than 128 bits are composed of lanes, and lane-crossing instructions (which are unavoidable when extending your example straightforwardly) are a fair bit slower than in-lane operations (~3 times slower), so it may be a good idea to review whether you actually need this. – Expletive
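For reference, here is a minimal sketch of that vector-extension suggestion extended to a 256-bit vector; the typedef names v8sf/v8si and the function name are made up for illustration, and the indices simply extend the comment's 128-bit example to eight elements:

/* GCC vector-extension sketch (illustrative names, GCC/Clang only) */
typedef float v8sf __attribute__((vector_size(32)));
typedef int   v8si __attribute__((vector_size(32)));

static inline v8sf v8sf_shift(v8sf data)
{
    /* Indices 0..7 select from data, 8..15 from the zero vector,
     * so {1,...,8} drops element 0 and appends a single zero. */
    return __builtin_shuffle(data, (v8sf){0}, (v8si){1, 2, 3, 4, 5, 6, 7, 8});
}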
@EOF Thanks for the pointer. If I was planning on using the arch-specific intrinsics, do you have any idea about how to do what I want? :) – Unclassical
Sure. gcc compiles the gcc vector intrinsics to the following assembly: vmovaps %ymm0, %ymm1; vxorps %xmm0, %xmm0, %xmm0; vperm2f128 $33, %ymm0, %ymm1, %ymm0; vpalignr $4, %ymm1, %ymm0, %ymm0. You can reverse-engineer that into Intel intrinsics if you like. Alternatively, a sane solution would be gcc vector intrinsics. – Expletive
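That assembly maps roughly onto Intel intrinsics as follows (an untested sketch; the function name is invented here, and 0x21 is the $33 immediate from the listing):

#include <immintrin.h>

/* Illustrative intrinsics rendering of the assembly above */
static inline __m256 AVX2I_m256shift(__m256 data)
{
    /* vperm2f128: build [ data[4..7] | 0 0 0 0 ] */
    __m256 hi_zero = _mm256_permute2f128_ps(data, _mm256_setzero_ps(), 0x21);
    /* vpalignr: per 128-bit lane, shift the concatenation hi_zero:data
     * right by one 4-byte element, giving [d1..d7, 0.0] */
    return _mm256_castsi256_ps(
        _mm256_alignr_epi8(_mm256_castps_si256(hi_zero),
                           _mm256_castps_si256(data), 4));
}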
@EOF That's ... helpful. – Unclassical
You're welcome. In case you want to do this right, here's a godbolt link with the implementation. – Expletive
@EOF: another way to do this shuffle (which would be better in a loop where you can load vector constants once outside the loop): vpermd to do a lane-crossing shuffle with 32-bit elements, then vpblendd to blend in a 0.0 element where you want it. – Perla
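A sketch of that idea, expressed with the float variants of those instructions (vpermps/vblendps); the function name and constants are illustrative, and the index vector is a constant that a compiler can hoist out of a loop:

#include <immintrin.h>

/* Illustrative permute-and-blend variant */
static inline __m256 AVX2I_m256shift_perm(__m256 data)
{
    const __m256i idx = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 7);
    __m256 shifted = _mm256_permutevar8x32_ps(data, idx);          /* vpermps */
    /* blend a 0.0 into the top element (bit 7 of the immediate) */
    return _mm256_blend_ps(shifted, _mm256_setzero_ps(), 0x80);    /* vblendps */
}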
@PeterCordes Well, gcc seems to like the code just as well in a loop. Could you explain how vpermd/vpblendd would be preferable? Agner Fog and uops.info show vpalignr to be fast; it apparently doesn't count as a lane-crossing instruction. – Expletive
@EOF: vpermd costs the same as vperm2f128 on Intel hardware, maybe somewhat less on Zen 1. vpblendd is 1 uop for any vector ALU port on Intel, so it avoids a potential shuffle-port bottleneck from vperm2f128 + vpalignr. vpalignr is just an in-lane shuffle; that's why we need vperm2f128 to set up for it. – Perla
@PeterCordes Hmm, ok. Though vpalignr being in-lane is (to me) not at all obvious, since it moves data from the low lane of at least one input to the high lane of the output. – Expletive
@EOF: No it doesn't; that's why it's so hard to use / such a bad design for extending palignr to 256 bits, and why GCC needed vperm2f128. See the 256-bit diagram in felixcloutier.com/x86/palignr – Perla
@PeterCordes Ohhhh, the lanes from the sources are effectively rotated into the corresponding lane of the destination! That's... not great. Well, at least I now seem to understand the instruction, so thank you for that as well. – Expletive
gcc and clang optimize this to a vpermd/vpermps and a vblendps: godbolt.org/z/RW7_ds, which requires a shuffle vector (technically, one could use the same vector for the blend as well). It should be possible with just a vperm2f128 and a vpalignr (the vperm2f128 can set one half to 0) -- requiring no shuffle vector, but two operations on p5. – Hestia
@Hestia That's the opposite direction. This would be the right direction, and it's ok for clang, but gcc doesn't like that at all. – Expletive
@EOF, argh.. you are right -- I still sometimes get confused with left and right (also, OP actually appears to require a right shift, despite the title saying "left-shift" ...). – Hestia
