Shuffling a vector by number of bytes

Is there any way to left-shift (v[0] -> v[1]) a __m128i by n bytes, where n is only known at runtime?

I'm currently restricted to AVX1 but if AVX2/512 makes this much easier I'm very interested.

I found _mm_bslli_si128(__m128i, imm) aka _mm_slli_si128, but its imm has to be a compile-time constant; the asm instruction pslldq requires an immediate. Other than that, it does what I want.

Placida answered 27/8, 2022 at 5:49 Comment(9)
Run-time-variable shifts only exist in forms that take a vector control operand, not an integer, unfortunately. Load a sliding window from an array of int8_t shufmask[] = {..., -1, -1, -1, 0, 1, 2, ..., 14, 15, -1, -1, -1, -1, ...} or something like that, for use with pshufb (_mm_shuffle_epi8). Of course that only works for a __m128i, not shifting across the 16-byte boundary in a __m256i, but you seem to be talking about integer stuff with AVX1, so 16-byte vectors? If you have 4-byte elements, AVX2 has vpermd / vpermps which are full lane-crossing shuffles with 4-byte granularity.Farcy
@PeterCordes __m128i is brilliant, that's all I'm using anyway. I have vectors of 16x bytes. Could you elaborate on that shufMask, as I don't fully understand the sequence of numbers included in the example?Placida
Ok, it's confusing to describe __m128i as an "AVX vector" or "AVX register", because the thing that was new with AVX was YMM registers, 32-byte. Before fleshing out the details into an answer, I wanted to confirm element size and total width; you should edit your question to include that info from your comment.Farcy
@PeterCordes No worries. I'm loading the first 16 bytes of a string into a __m128i AVX register. At the moment I'm restricted to AVX1 but I will be moving to AVX2 within a few weeks, so if AVX2 provides substantially easier/better features I would be interested in that answer too.Placida
Besides using pshufb as suggested by @PeterCordes, you can also store the register to memory (to an area that is followed or preceded by 0s) and do an unaligned load with an offset.Trompe
@user997112: Like I said, you should edit your question to clearly describe what it is that you're doing, not just leave it in comments.Farcy
@Trompe won't that kill the latency completely?Placida
Yes, @chtz's suggestion has highish latency, but ok throughput as part of a bunch of different surrounding code. Same as Quickest way to shift/rotate byte vector with SIMD where I suggested the same thing for a case where there is no single-instruction shuffle, and described the cost. But in this case I think you'd only consider that for a __m256i with shift counts that aren't a multiple of 4.Farcy
@Placida You did not say anything about whether you need to optimize latency or throughput (or code size or register usage or ...). For both latency and throughput it would actually be helpful to know the surrounding code (like the entire critical loop where you want to shift your register, and how you determine the shift amount).Trompe

Run-time-variable shifts only exist in forms that take a vector control operand, not an integer, unfortunately. Before AVX-512, the only variable-control shift with byte granularity is SSSE3 pshufb, aka _mm_shuffle_epi8. This is fine for a __m128i, but rather inconvenient for a __m256i (see footnote 1) because it does two separate 16-byte shuffles in the two halves of a YMM register, just like the 256-bit versions of all instructions that were originally SSE, i.e. not lane-crossing.

__m128i with SSSE3 or later

pshufb will zero bytes where the shuffle mask's byte has the high bit set, otherwise take the corresponding byte of the source vector. So we can use this to shift in zeros at the same time as moving our data.
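In scalar terms, that per-byte behavior can be modeled like this (a reference sketch, not from the original answer; the function name is made up):

#include <stdint.h>

// Scalar model of pshufb: high bit of the control byte selects zero,
// otherwise the low 4 bits index into the source vector.
static void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t shuf[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (shuf[i] & 0x80) ? 0 : src[shuf[i] & 0x0F];
}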

{ 0, 1, 2, ..., 14, 15} is the identity shuffle, what we need for a shift count of zero.
{-1, 0, 1, ..., 13, 14} is the mask for a left-shift by one: zero the low byte, shift the others.
The pattern continues in an obvious way up to all -1, to shift out all the bytes, if you want to support that.

I'm using notation like C arrays, with the low element at the left. Not like diagrams in Intel's manuals where the highest-numbered element is at the left, such that pslldq (_mm_bslli_si128) actually makes sense as a left shift. But that's because we're going to want to create a C array that those shuffle control vectors can be loaded from. Note that they overlap so we only need 32 or 31 bytes, not 16x __m128i = 256 bytes.

#include <immintrin.h>
#include <stdint.h>

__m128i variable_pslldq(__m128i v, unsigned long int count)
{
    // aligned so any 16-byte window into this can't split across a wider boundary
    alignas(32) static const int8_t shuffle[] = {
      -1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1, -1, -1, -1, -1, -1,  // 16 bytes
       0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15   // 16 bytes
    };
    // OPTIONAL: wrap the shift count to the 0..15 range, if a bounds check is desired.
    // With masking, the first byte of the array is never used.
    //count &= 15;

    const int8_t *identity_shuffle = shuffle + 16;  // hand-hold GCC into doing the +16 for free
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&identity_shuffle[-count]);

    return _mm_shuffle_epi8(v, shufmask);
}

On Godbolt, GCC/clang targeting a PIE executable compile this to 3 instructions, two of them very cheap integer ALU instructions:

# GCC12.2 -O3 -march=haswell  (with name demangling)
variable_pslldq(long long __vector(2), unsigned long):
        lea     rax, variable_pslldq(long long __vector(2), unsigned long)::shuffle[rip+16]
    # note the +16 as part of the LEA.  Clang is similar but leaves the +16 for the pshufb addressing mode, which is worse.
        sub     rax, rdi
        vpshufb xmm0, xmm0, XMMWORD PTR [rax]
        ret

In a non-PIE executable, it could be even better, neg rdi / vpshufb xmm0, [shuffle+16 + rdi]. But compilers aren't smart enough to do that. And most production code these days is built into PIE executables or shared libraries.
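As a quick sanity check of the function above (a hypothetical test harness, not part of the original answer):

#include <stdio.h>

int main()
{
    __m128i v = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    __m128i r = variable_pslldq(v, 3);   // shift left by 3 bytes
    int8_t out[16];
    _mm_storeu_si128((__m128i*)out, r);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);           // prints: 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13
    return 0;
}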

This sliding-window technique is similar to Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

For a right shift, you'd use the same array but have -1 elements at the end, after the 15. (e.g. make it global or in a namespace, not two separate static arrays for two functions, probably alignas(64), so both left and right shift functions can share one 48-byte array.)

With a signed int shift count, you could even support both shift directions in one function using &shuffle[16-count], if you don't mask the count. (Or 16+count if you want negative to be left and positive to be right.)

Repeating 0,1,...,14,15, 0,1,...,14,15 instead of leading -1s would give you a rotate of bytes within a __m128i, i.e. shift in bytes from the vector itself instead of zeros: a variable-count palignr (_mm_alignr_epi8).
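Untested sketch tying the last three paragraphs together (the 48-byte array and the function names are mine, not standard):

// One shared 48-byte array serves both shift directions, and alignas(64)
// keeps any 16-byte window within one cache line.
alignas(64) static const int8_t shuffle48[] = {
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  // shifted in by left shifts
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,  // identity
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1   // shifted in by right shifts
};

__m128i variable_psrldq(__m128i v, unsigned count)    // right shift, count in 0..16
{
    // count=1 loads {1,2,...,15,-1}: every byte moves one place toward element 0
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shuffle48[16 + count]);
    return _mm_shuffle_epi8(v, shufmask);
}

__m128i variable_pslldq_signed(__m128i v, int count)  // -16..16, negative = right shift
{
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shuffle48[16 - count]);
    return _mm_shuffle_epi8(v, shufmask);
}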


Footnote 1: __m256i with AVX2, not AVX-512: This is harder. Soonts' answer on Is there a way to shuffle a 8bitX32 ymm register right/left by N positions (c++) branches on the shift count being >= 16 bytes. (And for the n%16 part of the shift count, uses the same strategy of taking a sliding window of a constant array as the shuffle control.)

If you had 4-byte elements, AVX2 has vpermd / vpermps which are full lane-crossing shuffles with 4-byte granularity; you could load a mask for those.
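For example, an untested sketch of that (my names; vpermd can't zero elements, so this ANDs with a second sliding window to zero the shifted-in lanes):

__m256i variable_pslldq_dwords(__m256i v, unsigned n)   // shift left by n dwords, n in 0..8
{
    alignas(64) static const int32_t idx[]  = { 0,0,0,0,0,0,0,0,  0,1,2,3,4,5,6,7 };
    alignas(64) static const int32_t keep[] = { 0,0,0,0,0,0,0,0, -1,-1,-1,-1,-1,-1,-1,-1 };
    __m256i perm = _mm256_loadu_si256((const __m256i*)&idx[8 - n]);
    __m256i mask = _mm256_loadu_si256((const __m256i*)&keep[8 - n]);
    // vpermd moves the dwords; the AND zeroes the n low lanes
    return _mm256_and_si256(_mm256_permutevar8x32_epi32(v, perm), mask);
}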

You might also consider @chtz's suggestion of storing along with some zeros into an array, and doing an unaligned reload of that. That has highish latency, but ok throughput as part of a bunch of different surrounding code. Same as my answer on Quickest way to shift/rotate byte vector with SIMD where I suggested the same thing for a case where there is no single-instruction shuffle, and described the cost of the store-forwarding stall.

You wouldn't want that for __m128i unless you find that the shuffle array usually cache-misses (which would mean this code doesn't run very often overall in the program). But in that case, a store-forwarding stall is probably cheaper.

This could be reasonable for a __m256i, if you can't guarantee that shift counts will be a multiple of 4.
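An untested sketch of that store + unaligned-reload approach for __m256i (the buffer layout is an assumption; expect the store-forwarding stall on the reload, as discussed):

__m256i variable_pslldq_mem(__m256i v, unsigned count)   // count in 0..32 bytes
{
    alignas(64) int8_t buf[64] = {0};   // zeros below the stored vector feed the left shift
    _mm256_store_si256((__m256i*)(buf + 32), v);
    // zeros above the vector would give a right shift the same way
    return _mm256_loadu_si256((const __m256i*)(buf + 32 - count));
}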


AVX-512VBMI (Ice Lake and later) has lane-crossing vpermb; it can't zero out elements with a negative mask, but you can use AVX-512 zero-masking to get the job done, e.g. with ((uint32_t)-1) << count as the mask for its zero-masking intrinsic, _mm256_maskz_permutexvar_epi8(__mmask32 k, __m256i idx, __m256i a). This can use a 32-byte sliding window onto a 64-byte array.

Or a 64-byte window onto a 128-byte array, but that would be guaranteed to split across a cache-line boundary, unlike with 16 or 32-byte vectors. For that case, you might consider subtraction to generate the shuffle control, as shown below for the 16-byte case. That would allow a compare-into-mask to generate the zero-masking constant. (vpcmpb or vpcmpub to compare n against each element of the 0..63 vector, so the mask is true only for elements >= n. Since you'd be broadcasting anyway for the subtract, this is just one extra instruction to create the mask, instead of mov-immediate / shift / kmov or something, and it handles corner cases like shift count == 64 to shift out all the bytes.)
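An untested sketch of the 32-byte-window version described above (names are mine; count is assumed to be in 0..31 so the mask shift is well-defined):

__m256i variable_pslldq_vbmi(__m256i v, unsigned count)  // needs AVX-512VBMI + VL; count in 0..31
{
    alignas(64) static const int8_t idx[] = {
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // filler: lands in zero-masked lanes
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
    };
    __m256i shufmask = _mm256_loadu_si256((const __m256i*)&idx[32 - count]);
    __mmask32 k = (__mmask32)(((uint32_t)-1) << count);   // zero the low 'count' bytes
    return _mm256_maskz_permutexvar_epi8(k, shufmask, v);
}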


Alternate shuffle mask generation: broadcast + subtract from constant

Another way to express the shuffle mask we want is {0-n, 1-n, 2-n, ..., 15-n}. For any n>=1, 0-n will be negative, zeroing the low byte. And so on, for any n up to 128. This is good for supporting larger shift counts that shift out all the bytes.

__m128i variable_pslldq_slower(__m128i v, unsigned count)
{
    __m128i shufmask = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    shufmask = _mm_sub_epi8(shufmask, _mm_set1_epi8(count));
    return _mm_shuffle_epi8(v, shufmask);
}

_mm_set1_epi8 with a non-constant arg will use pshufb or AVX2 vpbroadcastb which need to run on a shuffle execution unit, and we still need a vector constant loaded from memory.

# GCC12 -O3 -march=haswell
variable_pslldq_slower(long long __vector(2), unsigned int):
        vmovdqa xmm1, XMMWORD PTR .LC0[rip]
        vmovd   xmm2, edi
        vpbroadcastb    xmm2, xmm2
        vpsubb  xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm1
        ret

This can start loading the vector constant without a data dependency on the shift count, which could help if it was cold in cache and the shift count comes from a dependency chain of decent length (or another load). But otherwise costs more throughput, especially vector shuffle throughput. (A software prefetch on the shuffle mask array could help equally well.)



Farcy answered 30/8, 2022 at 1:49 Comment(6)
At the beginning you mention "{ 0, 1, 2, ..., 14, 15} is the identity shuffle" and "{-1, 0, 1, ..., 13, 14} is the mask for a left-shift by one". This makes sense. However, in the code you have "-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15". This I don't understand.Placida
@user997112: Notice that we index with shuffle+16-count, so a count of 15 ends up loading -1 x 15, 0, so the high byte of the output gets the low byte of the input, the rest are zeroed. Maybe it would have been clearer to write it as shuffle+16-count instead of a separate const int8_t *identity_shuffle = shuffle+16 and using a negative index relative to that? But anyway, we index relative to the middle of that array, taking a window into it with as many low -1 elements as we need.Farcy
Okay, I think I have narrowed down my confusion. If we take count = 3, the shuffle vector is {-1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}? The first three shuffle elements seem to represent relative mappings: minus one meaning shift right by one? However, the remaining shuffle elements seem to represent absolute mappings: output[3] = input[0], output[4] = input[1]... is this correct?Placida
@user997112: No, it's a control vector for pshufb. High bit set (like -1) means zero that byte in the output, as described in the 2nd paragraph of this answer, and Intel's manual entry for pshufb. Otherwise use the low 4 bits as an index into the original vector. output[i] = src[ shuf[i] ].Farcy
And recall that x86 is little-endian, but "left" is towards higher-numbered elements, so towards higher-address elements when loading/storing. In the notation we're using, that's confusingly to the right. See Convention for displaying vector registers where I actually used a similar example of taking a window of an array as a pshufb control vector, showing how high element on the left like [ 12, 11, ..., 1, 0, -1, -1, -1 ] makes it easier to see as a left shift.Farcy
Got it. Thank you, really appreciate your answers and comments.Placida
