Shuffling a vector by number of bytes

Is there any way to left-shift (v[0] -> v[1]) a __m128i by n bytes, where n is only known at runtime?

I'm currently restricted to AVX1 but if AVX2/512 makes this much easier I'm very interested.

I found _mm_bslli_si128(__m128i, imm) aka _mm_slli_si128, but its imm has to be a compile-time constant; the asm instruction pslldq requires an immediate. Other than that, it does what I want.

Placida answered 27/8, 2022 at 5:49 Comment(9)
Run-time-variable shifts only exist in forms that take a vector control operand, not an integer, unfortunately. Load a sliding window from an array of int8_t shufmask[] = {..., -1, -1, -1, 0, 1, 2, ..., 14, 15, -1, -1, -1, -1, ...} or something like that, for use with pshufb (_mm_shuffle_epi8). Of course that only works for a __m128i, not shifting across the 16-byte boundary in a __m256i, but you seem to be talking about integer stuff with AVX1, so 16-byte vectors? If you have 4-byte elements, AVX2 has vpermd / vpermps which are full lane-crossing shuffles with 4-byte granularity.Farcy
@PeterCordes __m128i is brilliant, that's all I'm using anyway. I have vectors of 16x bytes. Could you elaborate on that shufMask, as I don't fully understand the sequence of numbers included in the example?Placida
Ok, it's confusing to describe __m128i as an "AVX vector" or "AVX register", because the thing that was new with AVX was YMM registers, 32-byte. Before fleshing out the details into an answer, I wanted to confirm element size and total width; you should edit your question to include that info from your comment.Farcy
@PeterCordes No worries. I'm loading the first 16 bytes of a string into a __m128i AVX register. At the moment I'm restricted to AVX1 but I will be moving to AVX2 within a few weeks, so if AVX2 provides substantially easier/better features I would be interested in that answer too.Placida
Besides using pshufb as suggested by @PeterCordes, you can also store the register to memory (to an area that is followed or preceded by 0s) and do an unaligned load with an offset.Trompe
@user997112: Like I said, you should edit your question to clearly describe what it is that you're doing, not just leave it in comments.Farcy
@Trompe won't that kill the latency completely?Placida
Yes, @chtz's suggestion has highish latency, but ok throughput as part of a bunch of different surrounding code. Same as Quickest way to shift/rotate byte vector with SIMD where I suggested the same thing for a case where there is no single-instruction shuffle, and described the cost. But in this case I think you'd only consider that for a __m256i with shift counts that aren't a multiple of 4.Farcy
@Placida You did not say anything about whether you need to optimize latency or throughput (or code size or register usage or ...). For both latency and throughput it would actually be helpful to know the surrounding code (like the entire critical loop where you want to shift your register, and how you determine the shift amount).Trompe

Run-time-variable shifts only exist in forms that take a vector control operand, not an integer, unfortunately. Before AVX-512, the only variable-control shift with byte granularity is SSSE3 pshufb, aka _mm_shuffle_epi8. This is fine for a __m128i, but rather inconvenient for a __m256i (see footnote 1) because it does two separate 16-byte shuffles in the two halves of a YMM register, just like the 256-bit versions of all instructions that were originally SSE, i.e. not lane-crossing.

__m128i with SSSE3 or later

pshufb will zero bytes where the shuffle mask's byte has the high bit set, otherwise take the corresponding byte of the source vector. So we can use this to shift in zeros at the same time as moving our data.
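In scalar terms, that per-byte behavior can be modeled like this (a reference sketch, not from the original answer; the function name is made up):

#include <stdint.h>

// Scalar model of pshufb: high bit of the control byte selects zero,
// otherwise the low 4 bits index into the source vector.
static void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t shuf[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (shuf[i] & 0x80) ? 0 : src[shuf[i] & 0x0F];
}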

{ 0, 1, 2, ..., 14, 15} is the identity shuffle, what we need for a shift count of zero.
{-1, 0, 1, ..., 13, 14} is the mask for a left-shift by one: zero the low byte, shift the others.
The pattern continues in an obvious way up to all -1, to shift out all the bytes, if you want to support that.

I'm using notation like C arrays, with the low element at the left. Not like diagrams in Intel's manuals where the highest-numbered element is at the left, such that pslldq (_mm_bslli_si128) actually makes sense as a left shift. But that's because we're going to want to create a C array that those shuffle control vectors can be loaded from. Note that they overlap so we only need 32 or 31 bytes, not 16x __m128i = 256 bytes.

#include <immintrin.h>
#include <stdint.h>

__m128i variable_pslldq(__m128i v, unsigned long int count)
{
    // aligned so any 16-byte window into this can't split across a wider boundary
    alignas(32) static const int8_t shuffle[] = {
      -1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1, -1, -1, -1, -1, -1,  // 16 bytes
       0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15   // 16 bytes
    };
    // OPTIONAL: wrap the shift count to the 0..15 range, if a bounds check is desired.
    // With masking, the first byte of the array is never used.
    //count &= 15;

    const int8_t *identity_shuffle = shuffle + 16;  // hand-hold GCC into doing the +16 for free
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&identity_shuffle[-count]);

    return _mm_shuffle_epi8(v, shufmask);
}

On Godbolt, GCC/clang targeting a PIE executable compile this to 3 instructions, two of them very cheap integer ALU instructions:

# GCC12.2 -O3 -march=haswell  (with name demangling)
variable_pslldq(long long __vector(2), unsigned long):
        lea     rax, variable_pslldq(long long __vector(2), unsigned long)::shuffle[rip+16]
    # note the +16 as part of the LEA.  Clang is similar but leaves the +16 for the pshufb addressing mode, which is worse.
        sub     rax, rdi
        vpshufb xmm0, xmm0, XMMWORD PTR [rax]
        ret

In a non-PIE executable, it could be even better, neg rdi / vpshufb xmm0, [shuffle+16 + rdi]. But compilers aren't smart enough to do that. And most production code these days is built into PIE executables or shared libraries.
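As a quick sanity check of the function above (a hypothetical test harness, not part of the original answer):

#include <stdio.h>

int main()
{
    __m128i v = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    __m128i r = variable_pslldq(v, 3);   // shift left by 3 bytes
    int8_t out[16];
    _mm_storeu_si128((__m128i*)out, r);
    for (int i = 0; i < 16; i++)
        printf("%d ", out[i]);           // prints: 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13
    return 0;
}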

This sliding-window technique is similar to Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all

For a right shift, you'd use the same array but have -1 elements at the end, after the 15. (e.g. make it global or in a namespace, not two separate static arrays for two functions, probably alignas(64), so both left and right shift functions can share one 48-byte array.)

With a signed int shift count, you could even support both shift directions in one function using &shuffle[16-count], if you don't mask the count. (Or 16+count if you want negative to be left and positive to be right.)

Repeating 0,1,...,14,15, 0,1,...,14,15 instead of leading -1s would give you a rotate of bytes within a __m128i, i.e. shift in bytes from the vector itself instead of zeros: a variable-count palignr (_mm_alignr_epi8).
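Untested sketch tying the last three paragraphs together (the 48-byte array and the function names are mine, not standard):

// One shared 48-byte array serves both shift directions, and alignas(64)
// keeps any 16-byte window within one cache line.
alignas(64) static const int8_t shuffle48[] = {
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,  // shifted in by left shifts
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,  // identity
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1   // shifted in by right shifts
};

__m128i variable_psrldq(__m128i v, unsigned count)    // right shift, count in 0..16
{
    // count=1 loads {1,2,...,15,-1}: every byte moves one place toward element 0
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shuffle48[16 + count]);
    return _mm_shuffle_epi8(v, shufmask);
}

__m128i variable_pslldq_signed(__m128i v, int count)  // -16..16, negative = right shift
{
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shuffle48[16 - count]);
    return _mm_shuffle_epi8(v, shufmask);
}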


Footnote 1: __m256i with AVX2, not AVX-512: This is harder. Soonts' answer on Is there a way to shuffle a 8bitX32 ymm register right/left by N positions (c++) branches on the shift count being >= 16 bytes. (And for the n%16 part of the shift count, uses the same strategy of taking a sliding window of a constant array as the shuffle control.)

If you had 4-byte elements, AVX2 has vpermd / vpermps which are full lane-crossing shuffles with 4-byte granularity; you could load a mask for those.
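For example, an untested sketch of that (my names; vpermd can't zero elements, so this ANDs with a second sliding window to zero the shifted-in lanes):

__m256i variable_pslldq_dwords(__m256i v, unsigned n)   // shift left by n dwords, n in 0..8
{
    alignas(64) static const int32_t idx[]  = { 0,0,0,0,0,0,0,0,  0,1,2,3,4,5,6,7 };
    alignas(64) static const int32_t keep[] = { 0,0,0,0,0,0,0,0, -1,-1,-1,-1,-1,-1,-1,-1 };
    __m256i perm = _mm256_loadu_si256((const __m256i*)&idx[8 - n]);
    __m256i mask = _mm256_loadu_si256((const __m256i*)&keep[8 - n]);
    // vpermd moves the dwords; the AND zeroes the n low lanes
    return _mm256_and_si256(_mm256_permutevar8x32_epi32(v, perm), mask);
}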

You might also consider @chtz's suggestion of storing along with some zeros into an array, and doing an unaligned reload of that. That has highish latency, but ok throughput as part of a bunch of different surrounding code. Same as my answer on Quickest way to shift/rotate byte vector with SIMD where I suggested the same thing for a case where there is no single-instruction shuffle, and described the cost of the store-forwarding stall.

You wouldn't want that for __m128i unless you find that the shuffle array usually cache-misses (which would mean this code doesn't run very often overall in the program). But in that case, a store-forwarding stall is probably cheaper.

This could be reasonable for a __m256i, if you can't guarantee that shift counts will be a multiple of 4.
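An untested sketch of that store + unaligned-reload approach for __m256i (the buffer layout is an assumption; expect the store-forwarding stall on the reload, as discussed):

__m256i variable_pslldq_mem(__m256i v, unsigned count)   // count in 0..32 bytes
{
    alignas(64) int8_t buf[64] = {0};   // zeros below the stored vector feed the left shift
    _mm256_store_si256((__m256i*)(buf + 32), v);
    // zeros above the vector would give a right shift the same way
    return _mm256_loadu_si256((const __m256i*)(buf + 32 - count));
}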


AVX-512VBMI (Ice Lake and later) has lane-crossing vpermb; it can't zero out elements with a negative mask, but you can use AVX-512 zero-masking to get the job done, e.g. with ((uint32_t)-1) << count as the mask for its zero-masking intrinsic, _mm256_maskz_permutexvar_epi8(__mmask32 k, __m256i idx, __m256i a). This can use a 32-byte sliding window onto a 64-byte array.

Or a 64-byte window onto a 128-byte array, but that would be guaranteed to split across a cache-line boundary, unlike with 16 or 32-byte vectors. For that case, you might consider subtraction to generate the shuffle control, as shown below for the 16-byte case. That would allow a compare-into-mask to generate the zero-masking constant. (vpcmpb or vpcmpub to compare n against each element of the 0..63 vector, so the mask is true only for elements >= n. Since you'd be broadcasting anyway for the subtract, this is just one extra instruction to create the mask, instead of mov-immediate / shift / kmov or something, and it handles corner cases like shift count == 64 to shift out all the bytes.)
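An untested sketch of the 32-byte-window version described above (names are mine; count is assumed to be in 0..31 so the mask shift is well-defined):

__m256i variable_pslldq_vbmi(__m256i v, unsigned count)  // needs AVX-512VBMI + VL; count in 0..31
{
    alignas(64) static const int8_t idx[] = {
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  // filler: lands in zero-masked lanes
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
    };
    __m256i shufmask = _mm256_loadu_si256((const __m256i*)&idx[32 - count]);
    __mmask32 k = (__mmask32)(((uint32_t)-1) << count);   // zero the low 'count' bytes
    return _mm256_maskz_permutexvar_epi8(k, shufmask, v);
}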


Alternate shuffle mask generation: broadcast + subtract from constant

Another way to express the shuffle mask we want is {0-n, 1-n, 2-n, ..., 15-n}. For any n>=1, 0-n will be negative, zeroing the low byte. And so on, for any n up to 128. This is good for supporting larger shift counts that shift out all the bytes.

__m128i variable_pslldq_slower(__m128i v, unsigned count)
{
    __m128i shufmask = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    shufmask = _mm_sub_epi8(shufmask, _mm_set1_epi8(count));
    return _mm_shuffle_epi8(v, shufmask);
}

_mm_set1_epi8 with a non-constant arg will use pshufb or AVX2 vpbroadcastb which need to run on a shuffle execution unit, and we still need a vector constant loaded from memory.

# GCC12 -O3 -march=haswell
variable_pslldq_slower(long long __vector(2), unsigned int):
        vmovdqa xmm1, XMMWORD PTR .LC0[rip]
        vmovd   xmm2, edi
        vpbroadcastb    xmm2, xmm2
        vpsubb  xmm1, xmm1, xmm2
        vpshufb xmm0, xmm0, xmm1
        ret

This can start loading the vector constant without a data dependency on the shift count, which could help if it was cold in cache and the shift count comes from a dependency chain of decent length (or another load). But otherwise costs more throughput, especially vector shuffle throughput. (A software prefetch on the shuffle mask array could help equally well.)



Farcy answered 30/8, 2022 at 1:49 Comment(6)
At the beginning you mention "{ 0, 1, 2, ..., 14, 15} is the identity shuffle" and "{-1, 0, 1, ..., 13, 14} is the mask for a left-shift by one". This makes sense. However, in the code you have "-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15". This I don't understand.Placida
@user997112: Notice that we index with shuffle+16-count, so a count of 15 ends up loading -1 x 15, 0, so the high byte of the output gets the low byte of the input, the rest are zeroed. Maybe it would have been clearer to write it as shuffle+16-count instead of a separate const int8_t *identity_shuffle = shuffle+16 and using a negative index relative to that? But anyway, we index relative to the middle of that array, taking a window into it with as many low -1 elements as we need.Farcy
Okay, I think I have narrowed down my confusion. If we take count = 3, the shuffle vector is {-1, -1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}? The first three shuffle elements seem to represent relative mappings: minus one meaning shift right by one? However, the remaining shuffle elements seem to represent absolute mappings: output[3] = input[0], output[4] = input[1]... is this correct?Placida
@user997112: No, it's a control vector for pshufb. High bit set (like -1) means zero that byte in the output, as described in the 2nd paragraph of this answer, and Intel's manual entry for pshufb. Otherwise use the low 4 bits as an index into the original vector. output[i] = src[ shuf[i] ].Farcy
And recall that x86 is little-endian, but "left" is towards higher-numbered elements, so towards higher-address elements when loading/storing. In the notation we're using, that's confusingly to the right. See Convention for displaying vector registers where I actually used a similar example of taking a window of an array as a pshufb control vector, showing how high element on the left like [ 12, 11, ..., 1, 0, -1, -1, -1 ] makes it easier to see as a left shift.Farcy
Got it. Thank you, really appreciate your answers and comments.Placida
