Emulating shifts on 32 bytes with AVX

I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.

Much to my disappointment, I discover that the shift instructions _mm256_slli_si256 and _mm256_srli_si256 operate only on the two halves of the AVX registers separately and zeroes are introduced in between. (This is by contrast with _mm_slli_si128 and _mm_srli_si128 that handle whole SSE registers.)
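To illustrate the per-lane behaviour, here is a small self-contained test (illustrative only; it assumes an AVX2-capable compiler and -mavx2, and the variable names are just for the demo):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint8_t in[32], out[32];
    for (int i = 0; i < 32; i++) in[i] = (uint8_t)i;

    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    __m256i s = _mm256_slli_si256(v, 4);     /* shifts each 128-bit lane independently */
    _mm256_storeu_si256((__m256i *)out, s);

    for (int i = 0; i < 32; i++) printf("%2d ", out[i]);
    printf("\n");   /* prints 0 0 0 0 0 1 ... 11, then 0 0 0 0 16 17 ... 27 */
    return 0;
}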

Can you recommend a short substitute?

UPDATE:

_mm256_slli_si256(A, N) is efficiently achieved with

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), 16 - N)

for 0 < N < 16, or with

_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N - 16)

for shifts larger than 16 bytes.

But the question remains for _mm256_srli_si256.

Capua answered 11/8, 2014 at 17:14 Comment(12)
How about reminding us what those slli instructions do, or even better what you want to do exactly? Did you look at the code generated by gcc with __builtin_shuffle or clang with its own syntax?Photomicroscope
And what do you mean by "only the upper half" "the rest is zeroed"? That's not what Intel's doc says.Photomicroscope
The reason why there is no 32-byte shift is that the hardware simply can't do it. The hardware is SIMD, and a full-vector shift is not SIMD. If you find that you're needing such instructions, it might be worth reconsidering the design. You're probably trying to do something non-SIMD using SIMD which often leads to an avalanche of other (performance) problems as well. If it's an issue of misalignment, just use misaligned memory access. On Haswell, misaligned access is almost as fast as aligned access.Author
@Marc Glisse: "The empty low-order bytes are cleared (set to all '0')." software.intel.com/sites/products/documentation/doclib/iss/2013/…Capua
@Mysticial: as written in my post, the SSE _mm_slli_si128 performs a full shift. And so did psrlq/psllq in "old" MMX. I assume implementing a full 256-bit barrel shifter was too much to ask. I am working on neighborhood image processing functions, which are inherently mixed-aligned.Capua
@YvesDaoust I believe you are misinterpreting that doc. In each 128-bit half, the data is shifted to the left and 0s are used to fill in the empty space on the right. "Low order" is to be understood as inside the 128-bit lane. It does not zero a whole lane. By the way, Intel's html doc of the compiler intrinsics sucks, it is often unreadable or wrong, the PDF instruction set reference is much more helpful.Photomicroscope
@Marc Glisse: that's right, I am updating the question. The problem remains, anyway, as some of the bytes are dropped.Capua
@Paul R: my question is not a duplicate as it holds for both left and right shifts. The previous one only solves the case of a left shift very efficiently with a _mm256_alignr_epi8 instruction. Unfortunately, there is no _mm256_alignl_epi8 correspondence.Capua
You don't need _mm256_alignl_epi8 (which is why there is no instruction or intrinsic for this) - _mm256_alignr_epi8 works for both left and right shift cases (just switch the arguments and adjust the shift value).Poul
If you reopen the question I can provide a complete solution.Capua
@YvesDaoust: OK - voting to re-open, but ideally this question needs to be merged with its earlier doppelgänger.Poul
When migrating 128-bit SIMD to AVX-256, it's generally easier to think about the problem in terms of two glued together 128-bit operations, instead of a whole 256-bit operation. Not always ideal, but makes translating them a snap and usually performs better than shoehorning it in with permutes.Euthanasia

From different inputs, I gathered these solutions. The key to crossing the inter-lane barrier is the align instruction, _mm256_alignr_epi8.

_mm256_slli_si256(A, N)

0 < N < 16:   _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)

N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))

16 < N < 32:  _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)

_mm256_srli_si256(A, N)

0 < N < 16:   _mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)

N = 16:       _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))

16 < N < 32:  _mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
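For a compile-time constant N, each of these cases is at most two instructions. As a minimal sketch, here is how the 0 < N < 16 case wraps up for a fixed count of 4 bytes (the helper names and the count are mine, chosen only for illustration):

#include <immintrin.h>

/* full 256-bit byte shifts by a fixed N = 4; each compiles to vperm2i128 + vpalignr */
static inline __m256i slli_si256_full_4 (__m256i a)
   {
   /* low lane of a moved into the high lane, low lane zeroed */
   __m256i lo_to_hi = _mm256_permute2x128_si256 (a, a, _MM_SHUFFLE(0, 0, 2, 0));
   return _mm256_alignr_epi8 (a, lo_to_hi, 16 - 4);
   }

static inline __m256i srli_si256_full_4 (__m256i a)
   {
   /* high lane of a moved into the low lane, high lane zeroed */
   __m256i hi_to_lo = _mm256_permute2x128_si256 (a, a, _MM_SHUFFLE(2, 0, 0, 1));
   return _mm256_alignr_epi8 (hi_to_lo, a, 4);
   }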
Capua answered 12/8, 2014 at 12:54 Comment(6)
The key to crossing the inter-lane barrier is _mm256_permute2x128_si256, surely ?Poul
No, I mean performing an operation that assembles bytes from two different lanes. As the doc states, the processor creates a "32-bytes composite" before shifting. The permute handles whole lanes.Capua
On Ryzen and KNL, _mm256_permute2x128_si256 is slower than _mm256_permute4x64_epi64 for permuting lanes of a single vector like you're doing here.Postern
@PeterCordes: significantly ?Capua
Yes, on Ryzen vperm2i128 is 8 uops, lat=3 tput=3. vpermq is 3 uops, lat=2, tput=2. (Those are actually for the FP equivalents, vperm2f128 and vpermpd, since Agner Fog omitted a lot of AVX2 integer stuff for Ryzen). On KNL, vpermq has twice the throughput and 1c lower latency. There's no downside on any CPU, AFAIK; vpermq is always at least as good as vperm2i128 for shuffling within a single vector. Plus, it can fold a load as a memory source operand.Postern
Update, on Zen2 / Zen3, vperm2i128 is faster (1 uop) than vpermq (2 uops). So it's a tradeoff between Zen1 vs. Zen2/3. :/ uops.infoPostern
U
5

Here is a function to bit-shift a ymm register left using AVX2. I use it to shift left by one bit, though it looks like it works for shifts of up to 63 bits.

//----------------------------------------------------------------------------
// bit shift left a 256-bit value using ymm registers
//          __m256i *data - data to shift
//          int count     - number of bits to shift
// return:  __m256i       - carry out bit(s)

static __m256i bitShiftLeft256ymm (__m256i *data, int count)
   {
   __m256i innerCarry, carryOut, rotate;

   innerCarry = _mm256_srli_epi64 (*data, 64 - count);                        // carry outs in bit 0 of each qword
   rotate     = _mm256_permute4x64_epi64 (innerCarry, 0x93);                  // rotate ymm left 64 bits
   innerCarry = _mm256_blend_epi32 (_mm256_setzero_si256 (), rotate, 0xFC);   // clear lower qword
   *data      = _mm256_slli_epi64 (*data, count);                             // shift all qwords left
*data      = _mm256_or_si256 (*data, innerCarry);                          // propagate carries from low qwords
   carryOut   = _mm256_xor_si256 (innerCarry, rotate);                        // clear all except lower qword
   return carryOut;
   }

//----------------------------------------------------------------------------
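Not part of the original function, but as an illustration of how the returned carry can be used, here is a hypothetical wrapper that chains two calls to shift a 512-bit value (stored as a low and a high __m256i word) left by count bits:

//----------------------------------------------------------------------------
// hypothetical: bit shift left a 512-bit value split across two ymm registers
//          __m256i *lo, *hi - low and high 256-bit words of the value
//          int count        - number of bits to shift (1..63, as above)
// return:  __m256i          - carry out of the full 512-bit value

static __m256i bitShiftLeft512ymm (__m256i *lo, __m256i *hi, int count)
   {
   __m256i carryLo = bitShiftLeft256ymm (lo, count);   // bits shifted out of the low word
   __m256i carryHi = bitShiftLeft256ymm (hi, count);   // bits shifted out of the whole value
   *hi = _mm256_or_si256 (*hi, carryLo);               // inject the low word's carry into the high word
   return carryHi;
   }

//----------------------------------------------------------------------------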
Unreflective answered 11/8, 2014 at 19:29 Comment(2)
Interesting. Six instructions is still a lot. I am only looking for byte shifts.Capua
For byte shifts, 4 instructions should do: shift left, shift right, bring lower lane up, or.Photomicroscope
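A minimal sketch of that four-instruction idea, for a fixed full left shift of N = 4 bytes (the helper name and the count are illustrative, not from the comment above): shift left per lane, shift right per lane to isolate the spill-over, bring the low lane's spill-over up, then OR.

#include <immintrin.h>

static inline __m256i slli_si256_full_4instr (__m256i a)
   {
   __m256i left  = _mm256_slli_si256 (a, 4);        // each 128-bit lane shifted left by 4 bytes
   __m256i right = _mm256_srli_si256 (a, 16 - 4);   // top 4 bytes of each lane moved to its bottom
   // bring the low lane's spill-over up into the high lane, zero the low lane
   __m256i carry = _mm256_permute2x128_si256 (right, right, _MM_SHUFFLE(0, 0, 2, 0));
   return _mm256_or_si256 (left, carry);            // merge: full 32-byte left shift by 4
   }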

If the shift count is a multiple of 4 bytes, vpermd (_mm256_permutevar8x32_epi32) with the right shuffle mask will do the trick with one instruction (or more, if you actually need to zero the shifted-in bytes instead of copying a different element over them).

To support variable (multiple-of-4B) shift counts, you could load the control mask from a window into an array of 0 0 0 0 0 0 0 1 2 3 4 5 6 7 0 0 0 0 0 0 0 or something, except that 0 is just the bottom element, and doesn't zero things out. For more on this idea for generating a mask from a sliding window, see my answer on another question.

This answer is pretty minimal, since vpermd doesn't directly solve the problem. I point it out as an alternative that might work in some cases where you're looking for a full vector shift.
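As a sketch of that idea for one fixed case, here is a full 32-byte right shift by 8 bytes (two dwords); the helper name and index vector are mine, and the final blend is only needed if the shifted-in elements must be zero:

#include <immintrin.h>

static inline __m256i srli_si256_full_8bytes (__m256i a)
   {
   const __m256i idx = _mm256_setr_epi32 (2, 3, 4, 5, 6, 7, 0, 0);   // last two indices are don't-care
   __m256i shifted   = _mm256_permutevar8x32_epi32 (a, idx);         // vpermd: lane-crossing dword shuffle
   // vpermd copies elements rather than shifting in zeros, so clear the top two dwords explicitly
   return _mm256_blend_epi32 (shifted, _mm256_setzero_si256 (), 0xC0);
   }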

Postern answered 31/3, 2016 at 13:19 Comment(0)
