SIMD unpack can benefit from blend instructions instead of and/andn/or because we can blend at dword / word / byte boundaries. We only need to AND once at the end to clear the high bit of each byte.
#include <immintrin.h>

static inline
__m128i ascii_unpack7x8_sse4(__m128i v)
{
    __m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                               7, 8, 9,10,11,12,13, -1);
    v = _mm_shuffle_epi8(v, separate_7B_halves);

    // separate each u64 qword into 2 u32 halves, with the good bits at the bottom
    __m128i shifted = _mm_slli_epi64(v, 4);
#ifdef __AVX2__
    v = _mm_blend_epi32(v, shifted, 0b1010);   // vpblendd is very efficient, 1 uop any port
#else
    v = _mm_castps_si128(_mm_blend_ps(         // blendps has extra bypass latency between integer insns, but is single-uop
            _mm_castsi128_ps(v), _mm_castsi128_ps(shifted), 0b1010));
#endif

    // Separate each u32 into u16
    shifted = _mm_slli_epi32(v, 2);
    v = _mm_blend_epi16(v, shifted, 0b10101010);   // high halves of pairs from shifted

    // Separate each u16 into bytes, with one of two strategies
#if 0   // this strategy is simpler but worse
    // shifted = _mm_add_epi16(v, v);                            // v<<1
    // v = _mm_blendv_epi8(v, shifted, _mm_set1_epi16(0xff00));
    // v = _mm_and_si128(v, _mm_set1_epi8(0x7f));                // clear garbage from high bits
#else
    __m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80));   // isolate hi half
    v = _mm_and_si128(v, _mm_set1_epi16(0x007f));             // clear high garbage
    v = _mm_add_epi16(v, hi);    // high halves left 1 (x+=x), low halves stay (x+=0)
    // both ways need two vector constants and 3 instructions, but pblendvb can be
    // slower and has an awkward requirement of having the control vector in XMM0
#endif
    return v;
}
With AVX2 available, clang compiles it to this nice asm (Godbolt):
# clang -O3 -march=x86-64-v3 (implies AVX2+BMI2, basically Haswell with generic tuning)
ascii_unpack7x8_sse4(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm1, xmm0, 4
vpblendd xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
vpslld xmm1, xmm0, 2
vpblendw xmm0, xmm0, xmm1, 170 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI0_1] # in a loop, these constants would be in registers
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
vpaddw xmm0, xmm0, xmm1
ret
With just SSE4.1, compilers need several movdqa instructions, as in GCC's output. And out-of-order exec will have an extra 1 or 2 cycles of latency to hide because of bypass-forwarding delays for integer shifts forwarding to an FP blendps, on Intel CPUs (https://agner.org/optimize/). But that's fine; we're doing this in a loop over an array, and modern CPUs have deep enough out-of-order exec. (A sketch of such a loop follows GCC's output below.)
# gcc -O3 -march=x86-64-v2 # SSE4.2, Nehalem. Actually only using SSE4.1
ascii_unpack7x8_sse4(long long __vector(2)):
movdqa xmm1, xmm0 # silly compiler wastes a MOV
pshufb xmm1, XMMWORD PTR .LC0[rip]
movdqa xmm0, xmm1 # save unshifted v
psllq xmm0, 4
blendps xmm1, xmm0, 10 # 0b1010 = 0xA
movdqa xmm0, xmm1
pslld xmm0, 2
pblendw xmm1, xmm0, 170 # 0b10101010 = 0xAA
movdqa xmm0, XMMWORD PTR .LC1[rip] # after inlining, probably a reg-copy
pand xmm0, xmm1 # and two PAND xmm,xmm
pand xmm1, XMMWORD PTR .LC2[rip]
paddw xmm0, xmm1
ret
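To make the "loop over an array" usage concrete, here's a minimal hedged sketch of a caller (not from the original answer; it assumes the packed stream is laid out as consecutive 14-byte groups, and that reading 16 bytes per group, i.e. 2 bytes past the final group, and writing 16 output bytes per group are both allowed by the buffer sizes):

#include <stddef.h>
#include <immintrin.h>

// Hypothetical driver loop.  The vector constants inside ascii_unpack7x8_sse4
// get hoisted into registers once it's inlined into the loop.
void unpack7_stream_sse4(unsigned char *dst, const unsigned char *src, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++) {
        __m128i packed = _mm_loadu_si128((const __m128i*)(src + 14*i)); // bytes 14..15 of the load are ignored by the shuffle
        __m128i chars  = ascii_unpack7x8_sse4(packed);                  // 16 bytes, one 7-bit char in each
        _mm_storeu_si128((__m128i*)(dst + 16*i), chars);
    }
}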
If AVX2 is available, an __m256i version of this is straightforward and wouldn't need the blendps fallback. That may be better than scalar pdep (BMI2). AVX2 vpsrlvd or vpsrlvq (per-element shift counts) seem like they should help, but we find ourselves needing to move bits across dword boundaries, and those can only shift left or right, not alternate directions per element. (AVX-512 has variable-count rotates (32 and 64-bit) and 16-bit variable-count shifts; rotates let you go right or left with the same instruction.)
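As a rough illustration of that (an assumption, not code from the original answer: the name ascii_unpack7x16_avx2 is hypothetical, and it assumes the caller has already put one 14-byte chunk at the bottom of each 128-bit lane, e.g. with two 16-byte loads combined by vinserti128), the 256-bit version is just the same steps widened, with vpblendd replacing the blendps fallback:

#ifdef __AVX2__
// Hypothetical 256-bit version: unpacks two independent 14-byte (16 x 7-bit) chunks,
// one at the bottom of each 128-bit lane.
static inline
__m256i ascii_unpack7x16_avx2(__m256i v)
{
    // vpshufb shuffles within each 128-bit lane, which is exactly what we want here
    __m256i separate_7B_halves = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1,
                                                  0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1);
    v = _mm256_shuffle_epi8(v, separate_7B_halves);

    __m256i shifted = _mm256_slli_epi64(v, 4);
    v = _mm256_blend_epi32(v, shifted, 0b10101010);   // high dword of each qword from shifted

    shifted = _mm256_slli_epi32(v, 2);
    v = _mm256_blend_epi16(v, shifted, 0b10101010);   // high word of each dword from shifted (imm8 repeats per lane)

    __m256i hi = _mm256_and_si256(v, _mm256_set1_epi16(0x3f80));  // isolate hi half
    v = _mm256_and_si256(v, _mm256_set1_epi16(0x007f));           // clear high garbage
    return _mm256_add_epi16(v, hi);   // shift hi halves left by 1 into place
}
#endif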
The shift element size could be 64 each time; our blends drop the bits that would get shifted into the low element of a pair. For the final step, paddw is 1 byte smaller than psllw/d/q because it has no immediate operand, and it can run on more ports on most CPUs. Especially Haswell, where shifts can only run on port 0, but paddw can run on port 1 or 5. (This code has no instruction-level parallelism within one iteration, so we rely on out-of-order exec to overlap execution of multiple iterations.)
Skylake through Alder Lake run SIMD shifts on p01, SIMD integer adds on p015, blendps on p015, pblendw on p5 (p15 for Alder Lake), and pblendvb as 1 uop for p015 (only the non-AVX encoding; vpblendvb is 2 uops for p015). Zen 3, for example, has plenty of throughput for all of these.
The final step avoiding _mm_blendv_epi8 has several advantages:

- Both ways need two vector constants and 3 instructions. (And there's no difference in the minimum number of movdqa register-copies a compiler has to invent without non-destructive AVX instructions.)
- The AND/AND/ADD version has better ILP: the two ANDs can run in parallel.
- SSE4.1 pblendvb can be slower (e.g. Haswell runs it as 2 uops for port 5) and has the awkward requirement of having the control vector in XMM0. Some compilers may waste instructions satisfying that hard-reg constraint, maybe even when inlining into a loop, unlike when we look at how this helper function compiles on its own.
- vpblendvb (the AVX encoding of it) is 2 uops (for any port) on newer Intel, or 3 on Alder Lake, presumably as the price of having 4 operands (3 inputs and a separate output). The AVX encoding is also slow on Alder Lake E-cores (4 uops, 3.23 cycle throughput): https://uops.info/.
- AMD CPUs don't have this problem; Zen 3, for example, runs vpblendvb as 1 uop for either of two ports.
The only possible upside to the blend version is that its constants are easier to construct on the fly. GCC 12 has started preferring to construct some constants on the fly when AVX is available, but does a rather bad job of it, using 10-byte mov r64, imm64 / vmovq / vpunpcklqdq instead of 5-byte mov reg, imm32 / ... / vpbroadcastd or pshufd v,v,0, or instead of starting with an all-ones vector and shifting.

Actually, the constants for the non-blend way can also be generated from an all-ones vector: psrlw xmm, 9 gives 0x007f in each word, and shifting that 7-bit mask left by 7 gives 0x3f80. So with AVX that's 3 total instructions for both masks, with no memory access. Unfortunately compilers don't know how to do this optimization, so it's a moot point.
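For illustration, the by-hand version with intrinsics would look something like the sketch below (hedged: compilers will normally constant-propagate this straight back into a memory constant, so it mostly just documents the idea):

// Hypothetical by-hand constant setup: 3 instructions with AVX (pcmpeqd / psrlw / psllw).
__m128i all_ones = _mm_set1_epi32(-1);           // compilers materialize this as (v)pcmpeqd same,same
__m128i lo7_mask = _mm_srli_epi16(all_ones, 9);  // 0x007f in each 16-bit element
__m128i hi7_mask = _mm_slli_epi16(lo7_mask, 7);  // 0x3f80 in each 16-bit element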
AVX-512F / BW, without AVX-512VBMI / AVX-512GFNI
If you have Ice Lake / Zen 4 features, you want @Harold's answer; as I commented there, it's slightly better than AVX-512 vpmultishiftqb (parallel bitfield-extract within a qword).

But if not, with Skylake-X / Cascade Lake features (AVX-512BW and F) you have masking and variable-count rotates. This saves 2 instructions vs. the SSE4 version (built with AVX2); it feels like there should be room to save more, especially at the final step within 16-bit elements. But masking has byte granularity, there is no vprolvw, and there is still no byte shift, unlike AArch64, which can shift elements in both directions at byte granularity.

Splitting things apart, doing different things to the pieces, and then merging with a merge-masking vmovdqa could work, but I don't think it would help.
#ifdef __AVX512BW__
// pre-Ice Lake, without AVX-512VBMI or AVX512-GFNI
__m128i ascii_unpack7x8_avx512bw(__m128i v)
{
    // for YMM or ZMM, use VPERMW, or VPERMB if we have AVX512VBMI, since
    // unfortunately VPERMW isn't single-uop on Intel CPUs that support both.
    __m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                               7, 8, 9,10,11,12,13, -1);
    v = _mm_shuffle_epi8(v, separate_7B_halves);
    v = _mm_slli_epi64(v, 4);                          // 00000HGFE | DCBA0000  // dword halves
    v = _mm_rolv_epi32(v, _mm_set_epi32(2, 32-2, 2, 32-2));
                                                       // 00HG|FE00 | 00DC|BA00 // u16 chunks of a u64
    v = _mm_mask_srli_epi16(v, (__mmask8)0b0101'0101, v, 2);  // 00HG | 00FE | 00DC | 00BA

    // Separate each u16 into bytes
    __m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80));    // isolate hi half
    v = _mm_add_epi16(v, hi);    // high halves left 1 (x+=x), low halves stay (x+=0)
    // 0H0G | 0F0E | 0D0C | 0B0A in each qword.
    return v;
}
#endif
Clang (Godbolt) optimizes the masked right-shift into a variable-count right shift, which is a good idea for a stand-alone function not in a loop, especially when we're loading other constants.

This uses more non-immediate constants, but fewer uops. A wider version of this, using vpermw to unpack 14-byte chunks to 16-byte lanes, might have to do something to introduce zero bits where they're needed, perhaps using zero-masking on the shuffle. But I think we'd still need vpshufb within lanes so it can zero those high bits. (A hedged sketch of that wider version follows the compiler output below.)

Having those known zeros that we move around with shifts and rotates is what lets us get away with only one and and one add at the end, unlike the blending version, where elements end up with high garbage so we need to mask both ways.
# clang -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm0, xmm0, 4
vprolvd xmm0, xmm0, xmmword ptr [rip + .LCPI1_1]
vpsrlvw xmm0, xmm0, xmmword ptr [rip + .LCPI1_2]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI1_3]
vpaddw xmm0, xmm1, xmm0
ret
These constants would of course be loaded into registers.
Just 6 uops; shifts run on port 0 or 1, shuffles on port 5, on Skylake, with VPAND and VPADD able to run on any of the 3 vector ALU ports. So it's a good balance, not running into back-end throughput bottlenecks on a specific port. (vs. 8 uops with clang's AVX build of the SSE4 version)
GCC uses the masking as requested; again, the constant init will get hoisted out of loops, including the kmovb into k1.
# gcc -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
mov eax, 85 # 0x55
vpsllq xmm0, xmm0, 4
kmovb k1, eax
movabs rax, 4575727041462157184 # 0x3F803F803F803F80 silly to use a 64-bit immediate
vprolvd xmm0, xmm0, XMMWORD PTR .LC3[rip]
vpbroadcastq xmm1, rax
vpsrlw xmm0{k1}, xmm0, 2
vpand xmm1, xmm0, xmm1
vpaddw xmm0, xmm0, xmm1
ret
Same instructions doing the work, just setting up constants differently. (Except for vpsrlw xmm0{k1}, xmm0, 2 to shift some elements but not others.)
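To make the vpermw idea from earlier concrete, here's a hedged, untested sketch (an assumption, not from the original answer) of a 256-bit AVX-512BW version: vpermw moves each 14-byte (7-word) chunk to the bottom of a 128-bit lane, and the in-lane vpshufb then both splits the 7-byte halves and zeroes the top byte of each qword, so no separate zero-masking appears to be needed; the rest is the 128-bit routine widened. (With AVX-512VBMI, a zero-masked vpermb could presumably do both shuffles at once, but with VBMI available you'd likely prefer the approach in @Harold's answer anyway.)

#if defined(__AVX512BW__) && defined(__AVX512VL__)
// Hypothetical wider version, assuming 28 packed bytes sit in the low 28 bytes of v.
__m256i ascii_unpack7x16_avx512bw(__m256i v)
{
    // Move 7-word chunk k to the bottom of lane k.  Words 7 and 15 of the result
    // are don't-care: the in-lane vpshufb below never reads lane bytes 14..15.
    __m256i word_idx = _mm256_setr_epi16(0, 1, 2, 3, 4, 5, 6, 6,
                                         7, 8, 9,10,11,12,13,13);
    v = _mm256_permutexvar_epi16(word_idx, v);

    __m256i separate_7B_halves = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1,
                                                  0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1);
    v = _mm256_shuffle_epi8(v, separate_7B_halves);    // also zeroes the top byte of each qword

    // Same per-qword steps as the 128-bit version, just widened
    v = _mm256_slli_epi64(v, 4);
    v = _mm256_rolv_epi32(v, _mm256_set_epi32(2, 32-2, 2, 32-2, 2, 32-2, 2, 32-2));
    v = _mm256_mask_srli_epi16(v, (__mmask16)0x5555, v, 2);
    __m256i hi = _mm256_and_si256(v, _mm256_set1_epi16(0x3f80));
    return _mm256_add_epi16(v, hi);
}
#endif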
Comments:

pext could do this fast, 64 bits at a time, but CPUs with it also have AVX2 normally, so that would be the competition. (And AMD doesn't have fast pext until Zen 3, despite supporting it in Excavator and Zen 1/2 via slow microcode.) – Shawn

pext, or AVX2 for variable-count 32-bit shifts (or AVX-512 for 16-bit). (Or abusing pmulhuw for right shifts of different elements with power-of-2 multipliers?) – Shawn
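To make the scalar BMI2 route mentioned in the answer and in these comments concrete, here's a minimal sketch (hypothetical; the bit order assumes the first char sits in the low 7 bits of each 56-bit group, which depends on the packing convention):

#include <stdint.h>
#include <immintrin.h>

// Hypothetical scalar comparison point: pdep expands 56 packed bits (eight 7-bit
// chars) to one char per byte.  pext would do the reverse (re-packing).
static inline uint64_t unpack7x8_pdep(uint64_t packed56)   // low 56 bits used
{
    return _pdep_u64(packed56, 0x7f7f7f7f7f7f7f7fULL);
}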