Run-time-variable shifts only exist in forms that take a vector control operand, not an integer, unfortunately. Before AVX-512, the only variable-control shuffle with byte granularity is SSSE3 `pshufb`, aka `_mm_shuffle_epi8`. This is fine for a `__m128i`, but rather inconvenient for `__m256i` (footnote 1), because it does two 16-byte shuffles in the two halves of a YMM register, just like the 256-bit versions of all instructions that were originally SSE, i.e. not lane-crossing.
For a `__m128i` with SSSE3 or later, `pshufb` will zero bytes where the shuffle mask's byte has the high bit set, otherwise take the corresponding byte of the source vector. So we can use this to shift in zeros at the same time as moving our data.
- `{ 0, 1, 2, ..., 14, 15}` is the identity shuffle, what we need for a shift count of zero.
- `{-1, 0, 1, ..., 13, 14}` is the mask for a left-shift by one: zero the low byte, shift the others.
- The pattern continues in the obvious way up to all `-1`, to shift out all the bytes if you want to support that.
I'm using notation like C arrays, with the low element at the left. Not like the diagrams in Intel's manuals where the highest-numbered element is at the left, such that `pslldq` (`_mm_bslli_si128`) actually makes sense as a left shift. But that's because we're going to want to create a C array that those shuffle control vectors can be loaded from. Note that they overlap, so we only need 32 or 31 bytes, not 16x `__m128i` = 256 bytes.
#include <immintrin.h>
#include <stdint.h>

__m128i variable_pslldq(__m128i v, unsigned long int count)
{
    // aligned so any 16-byte window into this can't split across a wider boundary
    alignas(32) static const int8_t shuffle[] = {
        -1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1, -1, -1, -1, -1, -1,  // 16 bytes
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15   // 16 bytes
    };
    // OPTIONAL: mask the shift count
    //count &= 15;  // wrap the shift count to the 0..15 range, if bounds-check desired
    // This means the first byte of the array is never used
    const int8_t *identity_shuffle = shuffle + 16;  // hand-hold GCC into doing the +16 for free
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&identity_shuffle[-count]);
    return _mm_shuffle_epi8(v, shufmask);
}
On Godbolt, GCC/clang targeting a PIE executable compiles this to 3 instructions, two of them being very cheap integer ALU:
# GCC12.2 -O3 -march=haswell (with name demangling)
variable_pslldq(long long __vector(2), unsigned long):
lea rax, variable_pslldq(long long __vector(2), unsigned long)::shuffle[rip+16]
# note the +16 as part of the LEA. Clang is similar but leaves the +16 for the pshufb addressing mode, which is worse.
sub rax, rdi
vpshufb xmm0, xmm0, XMMWORD PTR [rax]
ret
In a non-PIE executable, it could be even better: `neg rdi` / `vpshufb xmm0, [shuffle+16 + rdi]`. But compilers aren't smart enough to do that. And most production code these days is built into PIE executables or shared libraries.
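As a quick sanity check (hypothetical test code, not part of the original answer), the variable-count version should produce the same result as the immediate-count `_mm_slli_si128` for any given count:

```c++
#include <stdio.h>
#include <string.h>

int main()
{
    __m128i v = _mm_setr_epi8(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16);
    __m128i a = variable_pslldq(v, 3);
    __m128i b = _mm_slli_si128(v, 3);   // immediate byte shift, for comparison
    printf("%s\n", memcmp(&a, &b, 16) == 0 ? "match" : "MISMATCH");
}
```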
This sliding-window technique is similar to Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all.
For a right shift, you'd use the same array but have `-1` elements at the end, after the `15`. (e.g. make it global or in a namespace, not two separate static arrays for two functions, probably `alignas(64)`, so both left- and right-shift functions can share one 48-byte array, as in the sketch below.)
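For example, a minimal sketch of that shared 48-byte table plus the right-shift half (the names `shift_shuffle` and `variable_psrldq` are my own, nothing standard):

```c++
// One shared table: leading -1s feed left shifts, trailing -1s feed right shifts.
// alignas(64) so no 16-byte window into it can split across a cache-line boundary.
alignas(64) static const int8_t shift_shuffle[48] = {
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
     0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
    -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
};

__m128i variable_psrldq(__m128i v, unsigned count)   // count = 0..16 bytes
{
    // Slide the window toward the trailing -1s: result byte i comes from
    // source byte i+count, or is zeroed once i+count goes past 15.
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shift_shuffle[16 + count]);
    return _mm_shuffle_epi8(v, shufmask);
}
```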
With a signed `int` shift count, you could even support both shift directions in one function using `&shuffle[16-count]`, if you don't mask the count. (Or `16+count` if you want negative to be left and positive to be right.)
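A sketch of that bidirectional version, reusing the `shift_shuffle` table from the sketch above (positive count = left shift, negative = right shift):

```c++
__m128i variable_bshift(__m128i v, int count)   // count = -16..16, positive = left shift
{
    // No masking: counts outside -16..16 would index outside the 48-byte table.
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&shift_shuffle[16 - count]);
    return _mm_shuffle_epi8(v, shufmask);
}
```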
Repeating `0,1,...,14,15, 0,1,2,...,15` instead of leading `-1`s would give you a rotate of bytes within a `__m128i`, i.e. shift in bytes from the vector instead of zeros, i.e. a variable-count `palignr` (`_mm_alignr_epi8`).
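A sketch of that rotate (my own naming; bytes move toward element 0, the same direction as `psrldq`, but wrapping around):

```c++
__m128i variable_byte_rotr(__m128i v, unsigned count)   // rotate by count bytes
{
    // Identity pattern repeated twice: indices wrap around instead of going
    // negative, so bytes from the other end are shifted in rather than zeros.
    alignas(32) static const int8_t rot[32] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15
    };
    __m128i shufmask = _mm_loadu_si128((const __m128i*)&rot[count & 15]);
    return _mm_shuffle_epi8(v, shufmask);
}
```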
Footnote 1: `__m256i` with AVX2, not AVX-512: This is harder. Soonts' answer on Is there a way to shuffle a 8bitX32 ymm register right/left by N positions (c++) branches on the shift count being >= 16 bytes. (And for the `n%16` part of the shift count, it uses the same strategy of taking a sliding window of a constant array as the shuffle control.)
If you had 4-byte elements, AVX2 has `vpermd` / `vpermps`, which are full lane-crossing shuffles with 4-byte granularity; you could load a shuffle control for those the same way. (They can't zero elements on their own, though, so the sketch below also needs an AND with a sliding window of 0 / -1 dwords.)
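A rough sketch of that for a left shift in dword units (array and function names are mine, purely for illustration):

```c++
__m256i variable_shift_left_dwords(__m256i v, unsigned count)   // count = 0..8 dwords
{
    // Two sliding windows: one for the vpermd control, one of 0 / -1 dwords to
    // zero the elements that should have been shifted in as zeros.
    alignas(64) static const int32_t idx_win[16]  = { 0,0,0,0,0,0,0,0,  0,1,2,3,4,5,6,7 };
    alignas(64) static const int32_t keep_win[16] = { 0,0,0,0,0,0,0,0, -1,-1,-1,-1,-1,-1,-1,-1 };
    __m256i idx  = _mm256_loadu_si256((const __m256i*)&idx_win[8 - count]);
    __m256i keep = _mm256_loadu_si256((const __m256i*)&keep_win[8 - count]);
    __m256i shuffled = _mm256_permutevar8x32_epi32(v, idx);   // vpermd: lane-crossing
    return _mm256_and_si256(shuffled, keep);
}
```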
You might also consider @chtz's suggestion of storing along with some zeros into an array, and doing an unaligned reload of that. That has highish latency, but ok throughput as part of a bunch of different surrounding code. Same as my answer on Quickest way to shift/rotate byte vector with SIMD where I suggested the same thing for a case where there is no single-instruction shuffle, and described the cost of the store-forwarding stall.
You wouldn't want that for a `__m128i` unless you find that the shuffle array usually cache-misses (which would mean this code doesn't run very often overall in the program). But in that case, a store-forwarding stall is probably cheaper.
This could be reasonable for a `__m256i`, if you can't guarantee that shift counts will be a multiple of 4.
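A minimal sketch of that store + unaligned-reload idea for a 256-bit left shift by a byte count (my own naming; the zeros stored around the data are what get shifted in):

```c++
__m256i variable_pslldq_256_spill(__m256i v, unsigned count)   // count = 0..32 bytes
{
    // 64-byte buffer of zeros with the vector stored in the top half; reloading
    // at a lower address shifts in zero bytes from below.  Expect a
    // store-forwarding stall when count != 0, as discussed above.
    alignas(64) uint8_t buf[64] = {0};
    _mm256_store_si256((__m256i*)(buf + 32), v);
    return _mm256_loadu_si256((const __m256i*)(buf + 32 - count));
}
```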
AVX-512VBMI (Ice Lake and later) has lane-crossing `vpermb`; it can't zero out elements with a negative mask, but you can use AVX-512 zero-masking to get the job done, e.g. with `((uint32_t)-1) << count` as the mask for the zero-masking intrinsic for it, `_mm256_maskz_permutexvar_epi8(__mmask32 k, __m256i idx, __m256i a)`. This can use a 32-byte sliding window onto a 64-byte array.
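A sketch of how that could look (table layout and function name are my own; assumes AVX-512VBMI + AVX-512VL):

```c++
__m256i variable_pslldq_256_vbmi(__m256i v, unsigned count)   // count = 0..31 bytes
{
    // 32-byte sliding window onto a 64-byte table.  The low half is don't-care
    // (those elements get zero-masked anyway), the high half is the identity 0..31.
    alignas(64) static const int8_t win[64] = {
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
        16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
    };
    __m256i   idx  = _mm256_loadu_si256((const __m256i*)&win[32 - count]);
    __mmask32 keep = (__mmask32)(0xFFFFFFFFu << count);   // zero the low `count` bytes
    return _mm256_maskz_permutexvar_epi8(keep, idx, v);   // count == 32 would make the << UB
}
```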
Or a 64-byte window onto a 128-byte array, but that would be guaranteed to split across a cache-line boundary, unlike with 16- or 32-byte vectors. For that case, you might consider subtraction to generate the shuffle control, as shown below for the 16-byte case. That would allow a compare-into-mask to generate the zero-masking constant. (`vpcmpb` or `vpcmpub` to compare `n` against each element of the 0..63 vector, so the mask is true only for elements >= n. Since you'd be broadcasting anyway for the subtract, this is just one extra instruction to create the mask, instead of mov-immediate / shift / kmov or something, and it handles corner cases like shift count == 64 to shift out all the bits.)
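For 64-byte vectors, that subtract + compare-into-mask approach might look something like this (a sketch assuming AVX-512VBMI + AVX-512BW; the function name is made up):

```c++
__m512i variable_pslldq_512(__m512i v, unsigned count)   // count = 0..64 bytes
{
    // iota = {0, 1, ..., 63}; element 0 is the last argument of _mm512_set_epi8
    const __m512i iota = _mm512_set_epi8(
        63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,
        47,46,45,44,43,42,41,40,39,38,37,36,35,34,33,32,
        31,30,29,28,27,26,25,24,23,22,21,20,19,18,17,16,
        15,14,13,12,11,10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m512i   vcount = _mm512_set1_epi8((char)count);
    __m512i   idx    = _mm512_sub_epi8(iota, vcount);          // {0-n, 1-n, ..., 63-n}
    __mmask64 keep   = _mm512_cmpge_epu8_mask(iota, vcount);   // true only for elements >= n
    return _mm512_maskz_permutexvar_epi8(keep, idx, v);        // vpermb with zero-masking
}
```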
Alternate shuffle mask generation: broadcast + subtract from constant
Another way to express the shuffle mask we want is `{0-n, 1-n, 2-n, ..., 15-n}`. For any `n >= 1`, `0-n` will be negative, zeroing the low byte, and so on for any `n` up to 128. This is good for supporting larger shift counts that shift out all the bytes.
__m128i variable_pslldq_slower(__m128i v, unsigned count)
{
__m128i shufmask = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
shufmask = _mm_sub_epi8(shufmask, _mm_set1_epi8(count));
return _mm_shuffle_epi8(v, shufmask);
}
`_mm_set1_epi8` with a non-constant arg will use `pshufb` or AVX2 `vpbroadcastb`, which need to run on a shuffle execution unit, and we still need a vector constant loaded from memory.
# GCC12 -O3 -march=haswell
variable_pslldq_slower(long long __vector(2), unsigned int):
vmovdqa xmm1, XMMWORD PTR .LC0[rip]
vmovd xmm2, edi
vpbroadcastb xmm2, xmm2
vpsubb xmm1, xmm1, xmm2
vpshufb xmm0, xmm0, xmm1
ret
This can start loading the vector constant without a data dependency on the shift count, which could help if it was cold in cache and the shift count comes from a dependency chain of decent length (or another load). But otherwise costs more throughput, especially vector shuffle throughput. (A software prefetch on the shuffle mask array could help equally well.)