As you know, we have these shift instructions in SIMD SSE: PSLL (W/D/Q) and PSRL (W/D/Q). There's no PSLLB instruction, so how can we shift vectors of 8-bit values (single bytes)?
In the special case of left-shift-by-one, you can use paddb xmm0, xmm0.
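If you prefer intrinsics, that special case is just adding the vector to itself (a minimal sketch; the helper name is mine, not a real intrinsic):

#include <emmintrin.h>

// Left shift of each byte by 1: adding a byte to itself doubles it,
// which is exactly paddb xmm0, xmm0.
static inline __m128i slli1_epi8(__m128i v) {
    return _mm_add_epi8(v, v);
}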
As Jester points out in comments, the best option to emulate the non-existent psrlb and psllb is to use a wider shift and then mask off any bits that crossed element boundaries.
e.g.
psrlw xmm0, 2 ; doesn't matter what size (w/d/q): performance is the same for all sizes on all CPUs
pand xmm0, [mask_right2]
section .rodata
align 16
;; required mask depends on the shift count
mask_right2: times 16 db 0xff >> 2      ; 16 bytes of 0x3f
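For reference, a minimal intrinsics sketch of the same shift-then-mask idea, hard-coded for a right shift by 2 (the function name is made up for this example):

#include <emmintrin.h>

// Emulate the non-existent psrlb: do a 16-bit shift, then AND away the
// bits that crossed into each byte from its neighbour.  For a count of
// n, the mask is 0xff >> n in every byte.
static inline __m128i srli2_epi8(__m128i v) {
    __m128i shifted = _mm_srli_epi16(v, 2);                    // psrlw xmm, 2
    return _mm_and_si128(shifted, _mm_set1_epi8(0xff >> 2));   // pand with 0x3f bytes
}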
Or broadcast 0x3f into a vector register ahead of a loop some other way, like vpbroadcastd or vbroadcastss from a dword in memory, SSE3 movddup from a qword, or just a movdqa vector load. (vpbroadcastb takes an extra ALU uop, unlike dword or wider broadcasts which are just simple loads.) Or generate it on the fly with a sequence like pcmpeqd xmm0,xmm0 / psrlw xmm0, 8+2 / packuswb xmm0,xmm0. With the right choice of shift count, you can generate any byte pattern of the form 2^n - 1 (repeated zeros and then repeated ones).
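A hedged intrinsics sketch of that generate-on-the-fly sequence, again hard-coded for a shift count of 2 (helper name is mine):

#include <emmintrin.h>

// Materialize set1_epi8(0xff >> 2) without loading from memory:
// all-ones vector -> psrlw by 8+2 -> packuswb back down to bytes.
static inline __m128i mask_right2_reg(void) {
    __m128i zero = _mm_setzero_si128();
    __m128i ones = _mm_cmpeq_epi32(zero, zero);   // pcmpeqd: all bits set
    __m128i w    = _mm_srli_epi16(ones, 8 + 2);   // each word = 0x003f
    return _mm_packus_epi16(w, w);                // each byte = 0x3f
}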
mov r32, imm32 / movd xmm, r32 and a shuffle is also an option, but probably won't save instruction bytes compared to the pcmpeqw / ... sequence. (Note that the register-source version of VBROADCASTSS is AVX2-only, which doesn't matter here since 256b integer shifts are also AVX2-only.)
For a variable-count vector shift, creating the mask in an integer register and broadcasting it to a vector is one option (use pshufb with an all-zero register to broadcast the low byte, or use imul eax, eax, 0x01010101 to go from a byte to a dword for movd + pshufd). You could also use the pcmpeqd method to create an all-ones vector, shift it with psrlw xmm0, xmm1 (variable count from a register), and then pack or pshufb it down to bytes.
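As a sketch of the variable-count case using the first option (integer-register mask plus pshufb broadcast; helper name is mine, assumes SSSE3 and a count in 0..7):

#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8

// Variable-count logical right shift of bytes: build the 0xff >> n mask
// in a GPR, broadcast its low byte with pshufb against an all-zero
// selector, then psrlw with the count in an XMM register and mask.
static inline __m128i srl_epi8_var(__m128i v, unsigned n)
{
    __m128i cnt  = _mm_cvtsi32_si128((int)n);                  // movd: count for psrlw
    __m128i mask = _mm_shuffle_epi8(_mm_cvtsi32_si128((int)(0xffu >> n)),
                                    _mm_setzero_si128());      // broadcast the mask byte
    return _mm_and_si128(_mm_srl_epi16(v, cnt), mask);
}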
I don't see any similarly efficient way to emulate arithmetic right-shift (the non-existent PSRAB). The high byte of each word is handled correctly by PSRAW. Shifting the low byte of each word up to the high position would let another PSRAW copy its sign bit as many times as required.
;; vpblendvb is 2 uops on Intel so this is worse throughput in loops than the pxor/paddb version
;; Latency may be the same on Skylake because this has some ILP.
; input in xmm0. Using AVX to save on mov instructions
VPSLLDQ xmm1, xmm0, 1   ; or VPSLLW xmm1, xmm0, 8; VPSLLDQ distributes one of the uops to the shuffle port instead of the shift port
VPSRAW xmm1, xmm1, 8+2 ; shift low bytes back to final destination
VPSRAW xmm0, xmm0, 2 ; shift high bytes, leaving garbage in low bytes
VPBLENDVB xmm0, xmm1, xmm0, xmm2 ; (where xmm2 holds a mask of alternating 0 and -1 bytes, -1 in the high byte of each word, e.g. generated with pcmpeqw / psllw 8). This insn is fairly slow
There is no immediate-blend with byte granularity, because a single immediate byte can only encode 8 elements.
Without VPBLENDVB (possibly better even when it's available, if generating or loading a constant for it is slow):
;; Probably worse than the PXOR/PADDB version, if 2 constants are cheap to load
;; Needs no vector constants, but this is inefficient vs. versions with constants.
VPSLLDQ xmm1, xmm0, 1 ; or VPSLLW 8
VPSRAW xmm1, xmm1, n ; low bytes in the wrong place
VPSRAW xmm0, xmm0, 8+n ; shift high bytes all the way to the bottom of the element
VPSLLW xmm0, xmm0, 8 ; high bytes back in place, with zero in the low byte. (VPSLLDQ can't work: PSRAW 8+n leaves garbage we need to clear)
VPSRLW xmm1, xmm1, 8 ; shift low bytes into place, leaving zero in the high byte. (VPSRLDQ 1 could do this, if we started with VPSLLW instead of VPSLLDQ)
VPOR xmm0, xmm0, xmm1
Using PAND/PANDN/POR with a constant (alternating 0/-1 bytes) in a register would also work (with far less pressure on the shift port) for doing a byte-blend, and is a better choice if you have to do this in a loop.
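A sketch of that AND/ANDN/OR byte-blend, folded into the whole emulation for a fixed count of 2 (the helper name and the choice of 0xff00 as the constant are mine):

#include <emmintrin.h>

// Arithmetic right shift of bytes by 2, blending the two psraw results
// with pand/pandn/por instead of (v)pblendvb.
static inline __m128i srai2_epi8_blend(__m128i v)
{
    const __m128i hi_mask = _mm_set1_epi16((short)0xff00);   // -1 in the high byte of each word
    __m128i lo = _mm_slli_epi16(v, 8);             // move low bytes up to the high position
    lo = _mm_srai_epi16(lo, 8 + 2);                // sign-extend and shift; result lands in low bytes
    __m128i hi = _mm_srai_epi16(v, 2);             // high bytes correct, garbage in low bytes
    return _mm_or_si128(_mm_and_si128(hi_mask, hi),        // keep high bytes from hi
                        _mm_andnot_si128(hi_mask, lo));     // keep low bytes from lo
}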
Assuming each byte is zero-extended, e.g. after unpacking nibbles to bytes with AND + shift/AND. (Works for any field width, just adjust the constants.)
Flip the high zeros, and the sign bit, with XOR. Add 1 to the sign bit so it will restore the correct sign bit, and either clear the high bits via carry propagation (if it became 0 and carried out) or leave them set (if it became 1 and didn't carry).
; hoist the constants out of a loop if you're looping, of course.
; input in XMM0, upper bits of each byte already zeroed
pxor xmm0, [const_0xf8] ; 1111 s'xxx
paddb xmm0, [const_0x08] ; 0000 0xxx or 1111 1xxx
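The same two instructions with intrinsics, for a 4-bit field already zero-extended into each byte (a sketch; helper name is mine, constants as in the comments above):

#include <emmintrin.h>

// Redo sign extension in each byte that currently holds 0000 sxxx:
// XOR flips the high zeros plus the sign bit, then PADDB either clears
// the high ones by carry propagation (positive) or leaves them set (negative).
static inline __m128i sext4_epi8(__m128i v)
{
    v = _mm_xor_si128(v, _mm_set1_epi8((char)0xf8));   // 1111 s'xxx
    return _mm_add_epi8(v, _mm_set1_epi8(0x08));        // 0000 0xxx or 1111 1xxx
}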
Emulating psrab this way is still possible with only 2 constants from memory, without needing the input bytes to be zero-extended first. This is most likely the best option for a loop, especially if you have registers to spare to hoist the loads of those constants. (0xf0 can be used with vpandn to isolate a low nibble if you also need that.)
psrld xmm0, 4 ; ???? sxxx (s = sign bit, xxx = lower bits)
por xmm0, xmm5 ; set1_epi8(0xf0) ; 1111 sxxx
pxor xmm0, xmm6 ; set1_epi8(0x08) ; 1111 s'xxx
paddb xmm0, xmm6 ; set1_epi8(0x08) ; 0000 0xxx or 1111 1xxx
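In intrinsics form (a sketch analogous to the pshufb-LUT srai_4_epi8 further down, but using the POR/PXOR/PADDB constants instead; the helper name is mine):

#include <emmintrin.h>

// psrab by 4 using only two vector constants (0xf0 and 0x08); no
// pre-zeroing of the high bits is required because POR sets them.
static inline __m128i srai4_epi8_const(__m128i v)
{
    v = _mm_srli_epi32(v, 4);                           // ???? sxxx
    v = _mm_or_si128 (v, _mm_set1_epi8((char)0xf0));    // 1111 sxxx
    v = _mm_xor_si128(v, _mm_set1_epi8(0x08));          // 1111 s'xxx
    return _mm_add_epi8(v, _mm_set1_epi8(0x08));        // 0000 0xxx or 1111 1xxx
}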
I don't think we can avoid using 2 separate booleans. We need PXOR to counter PADDB or PSUBB flipping the sign bit, but only POR can set bits regardless of their old value.
We could isolate the sign bit and left-shift it before adding or subtracting (pand + pslld + paddb) but that would be worse, especially without AVX for 3-operand instructions to avoid movdqa. It would also be more total instructions including the POR that we would still need.
Upsides: fewer uops than the vpblendvb version (vpblendvb costs 2 uops on Intel).
Downside: less of a win over the vpblendvb version on AMD Zen / Zen2, where vpblendvb is a single-uop instruction with only 1c latency.

Instead of pxor / paddb, use pshufb to look up a new value for each byte, based on the low 4 bits. Unfortunately, pshufb zeros a lane if the selector has its high bit set, so we can't use it on raw psrld results that might have shifted in a non-zero high bit.
const __m128i sext_lut = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,
-8, -7, -6, -5, -4, -3, -2, -1);
return _mm_shuffle_epi8(sext_lut, v);
With AVX for 3-operand non-destructive, this can be a single instruction reusing a lookup table in a register. Without, it'll need a movdqa
to copy the LUT.
Shifting with this:
__m128i srai_4_epi8(__m128i v) {
v = _mm_srli_epi32(v, 4);
v = _mm_and_si128(v, _mm_set1_epi8(0x0f));
const __m128i sext_lut = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7,
-8, -7, -6, -5, -4, -3, -2, -1);
return _mm_shuffle_epi8(sext_lut, v);
}
(x ^ m) - m – Analyzer

x<<=4 ; x>>=4 for Deinterleave vector of nibbles using SIMD (godbolt.org/z/sGafhr). Doing that with the high bits zero turns out to be equivalent to doing (x^m) + m with the high bits 1, like I came up with for this answer. – Unheardof
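A scalar illustration of the equivalence that comment describes, for a 4-bit field in a byte (plain C, just to show the identity; the function names are mine):

#include <stdint.h>

// Sign-extend a 4-bit value x (0..15) to a signed byte, two ways:
//   high bits zero: (x ^ 0x08) - 0x08              classic (x ^ m) - m
//   high bits one:  ((x | 0xf0) ^ 0x08) + 0x08     the XOR/ADD form used above
static inline int8_t sext4_sub(uint8_t x) { return (int8_t)((x ^ 0x08) - 0x08); }
static inline int8_t sext4_add(uint8_t x) { return (int8_t)((((x | 0xf0u) ^ 0x08u) + 0x08u) & 0xff); }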
Here's another way to emulate "psrab" which works for SSE or AVX with 1 scratch register:
__ punpckhbw(scratch, src); // junk in low bytes
__ punpcklbw(dst, src); // junk in low bytes
__ psraw(scratch, 8 + shift);
__ psraw(dst, 8 + shift);
__ packsswb(dst, scratch); // pack words to get result
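A sketch of the same unpack/shift/pack idea with intrinsics (helper name is mine; it uses the value as both unpack operands instead of a separate scratch register, with a count of 2 as an example):

#include <emmintrin.h>

// psrab emulation: widen bytes to words (source byte in the high byte),
// psraw by 8 + count, then packsswb.  Results fit in a byte after the
// shift, so the signed saturation in the pack never kicks in.
static inline __m128i srai2_epi8_pack(__m128i v)
{
    __m128i lo = _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8 + 2);
    __m128i hi = _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 8 + 2);
    return _mm_packs_epi16(lo, hi);
}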
paddb to clear or leave set some high 1 bits to redo sign-extension in each byte. See the bottom of my answer. – Unheardof
psrldq xmm0, 1; pand xmm0, [mask]; mask dd 0x7f7f7f7f, 0x7f7f7f7f, 0x7f7f7f7f, 0x7f7f7f7f. Note that a left shift by one is simply paddb xmm0, xmm0. For larger shifts, adjust the mask as appropriate. – Pileum

psrldq uses byte offsets not bits. So use psrlq. – Pileum

0x0f of course. Just keep the low 4 bits. – Pileum