SIMD unpack can benefit from blend instructions instead of and/andn/or because we can blend at dword / word / byte boundaries. We only need to AND once at the end to clear the high bit of each byte.
#include <immintrin.h>

static inline
__m128i ascii_unpack7x8_sse4(__m128i v)
{
    __m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                               7, 8, 9,10,11,12,13, -1);
    v = _mm_shuffle_epi8(v, separate_7B_halves);

    // separate each u64 qword into 2 u32 halves, with the good bits at the bottom
    __m128i shifted = _mm_slli_epi64(v, 4);
#ifdef __AVX2__
    v = _mm_blend_epi32(v, shifted, 0b1010);   // vpblendd is very efficient, 1 uop any port
#else
    v = _mm_castps_si128(_mm_blend_ps(         // blendps has extra bypass latency between integer insns, but is single-uop
            _mm_castsi128_ps(v), _mm_castsi128_ps(shifted), 0b1010));
#endif

    // Separate each u32 into u16
    shifted = _mm_slli_epi32(v, 2);
    v = _mm_blend_epi16(v, shifted, 0b10101010);   // high halves of pairs from shifted

    // Separate each u16 into bytes, with one of two strategies
#if 0   // this strategy is simpler but worse
    // shifted = _mm_add_epi16(v, v);                            // v<<1
    // v = _mm_blendv_epi8(v, shifted, _mm_set1_epi16(0xff00));
    // v = _mm_and_si128(v, _mm_set1_epi8(0x7f));                // clear garbage from high bits
#else
    __m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80));   // isolate hi half
    v = _mm_and_si128(v, _mm_set1_epi16(0x007f));             // clear high garbage
    v = _mm_add_epi16(v, hi);    // high halves left 1 (x+=x), low halves stay (x+=0)
    // both ways need two vector constants and 3 instructions, but pblendvb can be
    // slower and has an awkward requirement of having the control vector in XMM0
#endif
    return v;
}
With AVX2 available, clang compiles it to this nice asm (Godbolt):
# clang -O3 -march=x86-64-v3 (implies AVX2+BMI2, basically Haswell with generic tuning)
ascii_unpack7x8_sse4(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm1, xmm0, 4
vpblendd xmm0, xmm0, xmm1, 10 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
vpslld xmm1, xmm0, 2
vpblendw xmm0, xmm0, xmm1, 170 # xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3],xmm0[4],xmm1[5],xmm0[6],xmm1[7]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI0_1] # in a loop, these constants would be in registers
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
vpaddw xmm0, xmm0, xmm1
ret
With just SSE4.1, compilers need several movdqa instructions, as in GCC's output. And out-of-order exec will have an extra 1 or 2 cycles of latency to hide because of bypass-forwarding delays for integer shifts forwarding to an FP blendps, on Intel CPUs (https://agner.org/optimize/). But that's fine; we're doing this in a loop over an array, and modern CPUs have deep enough out-of-order exec. (A sketch of such a loop follows GCC's output below.)
# gcc -O3 -march=x86-64-v2 # SSE4.2, Nehalem. Actually only using SSE4.1
ascii_unpack7x8_sse4(long long __vector(2)):
movdqa xmm1, xmm0 # silly compiler wastes a MOV
pshufb xmm1, XMMWORD PTR .LC0[rip]
movdqa xmm0, xmm1 # save unshifted v
psllq xmm0, 4
blendps xmm1, xmm0, 10 # 0b1010 = 0xA
movdqa xmm0, xmm1
pslld xmm0, 2
pblendw xmm1, xmm0, 170 # 0b10101010 = 0xAA
movdqa xmm0, XMMWORD PTR .LC1[rip] # after inlining, probably a reg-copy
pand xmm0, xmm1 # and two PAND xmm,xmm
pand xmm1, XMMWORD PTR .LC2[rip]
paddw xmm0, xmm1
ret
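To make the "loop over an array" usage concrete, here's a minimal hedged sketch of a caller (not from the original answer; it assumes the packed stream is laid out as consecutive 14-byte groups, and that reading 16 bytes per group, i.e. 2 bytes past the final group, and writing 16 output bytes per group are both allowed by the buffer sizes):

#include <stddef.h>
#include <immintrin.h>

// Hypothetical driver loop.  The vector constants inside ascii_unpack7x8_sse4
// get hoisted into registers once it's inlined into the loop.
void unpack7_stream_sse4(unsigned char *dst, const unsigned char *src, size_t ngroups)
{
    for (size_t i = 0; i < ngroups; i++) {
        __m128i packed = _mm_loadu_si128((const __m128i*)(src + 14*i)); // bytes 14..15 of the load are ignored by the shuffle
        __m128i chars  = ascii_unpack7x8_sse4(packed);                  // 16 bytes, one 7-bit char in each
        _mm_storeu_si128((__m128i*)(dst + 16*i), chars);
    }
}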
If AVX2 is available, an __m256i version of this is straightforward and wouldn't need the blendps fallback. That may be better than scalar pdep (BMI2). AVX2 vpsrlvd or vpsrlvq (per-element shift counts) seem like they should help, but we find ourselves needing to move bits across dword boundaries, and those can only shift left or right, not alternate directions per element. (AVX-512 has variable-count rotates (32 and 64-bit) and 16-bit variable-count shifts; rotates let you go right or left with the same instruction.)
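As a rough illustration of that (an assumption, not code from the original answer: the name ascii_unpack7x16_avx2 is hypothetical, and it assumes the caller has already put one 14-byte chunk at the bottom of each 128-bit lane, e.g. with two 16-byte loads combined by vinserti128), the 256-bit version is just the same steps widened, with vpblendd replacing the blendps fallback:

#ifdef __AVX2__
// Hypothetical 256-bit version: unpacks two independent 14-byte (16 x 7-bit) chunks,
// one at the bottom of each 128-bit lane.
static inline
__m256i ascii_unpack7x16_avx2(__m256i v)
{
    // vpshufb shuffles within each 128-bit lane, which is exactly what we want here
    __m256i separate_7B_halves = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1,
                                                  0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1);
    v = _mm256_shuffle_epi8(v, separate_7B_halves);

    __m256i shifted = _mm256_slli_epi64(v, 4);
    v = _mm256_blend_epi32(v, shifted, 0b10101010);   // high dword of each qword from shifted

    shifted = _mm256_slli_epi32(v, 2);
    v = _mm256_blend_epi16(v, shifted, 0b10101010);   // high word of each dword from shifted (imm8 repeats per lane)

    __m256i hi = _mm256_and_si256(v, _mm256_set1_epi16(0x3f80));  // isolate hi half
    v = _mm256_and_si256(v, _mm256_set1_epi16(0x007f));           // clear high garbage
    return _mm256_add_epi16(v, hi);   // shift hi halves left by 1 into place
}
#endif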
The shift element size could be 64 each time; our blends drop the bits that would get shifted into the low element of a pair. For the final step, paddw is 1 byte smaller than psllw/d/q because it has no immediate operand, and it can run on more ports on most CPUs. Especially Haswell, where shifts can only run on port 0, but paddw can run on port 1 or 5. (This code has no instruction-level parallelism within one iteration, so we rely on out-of-order exec to overlap execution of multiple iterations.)
Skylake through Alder Lake run SIMD shifts on p01, SIMD integer adds on p015, blendps on p015, pblendw on p5 (p15 for Alder Lake), and pblendvb as 1 uop for p015 (only the non-AVX encoding; vpblendvb is 2 uops for p015). Zen 3, for example, has plenty of throughput for all of these.
The final step avoiding _mm_blendv_epi8 has several advantages:

- Both ways need two vector constants and 3 instructions. (And there's no difference in the minimum number of movdqa register-copies a compiler has to invent without non-destructive AVX instructions.)
- The AND/AND/ADD version has better ILP: the two ANDs can run in parallel.
- SSE4.1 pblendvb can be slower (e.g. Haswell runs it as 2 uops for port 5) and has the awkward requirement of having the control vector in XMM0. Some compilers may waste instructions satisfying that hard-reg constraint, maybe even when inlining into a loop, unlike when we look at how this helper function compiles on its own.
- vpblendvb (the AVX encoding of it) is 2 uops (for any port) on newer Intel, or 3 on Alder Lake, presumably as the price of having 4 operands (3 inputs and a separate output). The AVX encoding is also slow on Alder Lake E-cores (4 uops, 3.23 cycle throughput): https://uops.info/.
- AMD CPUs don't have this problem; Zen 3, for example, runs vpblendvb as 1 uop for either of two ports.
The only possible upside to the blend version is that its constants are easier to construct on the fly. GCC 12 has started preferring to construct some constants on the fly when AVX is available, but does a rather bad job of it, using 10-byte mov r64, imm64 / vmovq / vpunpcklqdq instead of 5-byte mov reg, imm32 / ... / vpbroadcastd or pshufd v,v,0, or instead of starting with an all-ones vector and shifting.

Actually, the constants for the non-blend way can also be generated from an all-ones vector: psrlw xmm, 9 gives 0x007f in each word, and shifting that 7-bit mask left by 7 gives 0x3f80. So with AVX that's 3 total instructions for both masks, with no memory access. Unfortunately compilers don't know how to do this optimization, so it's a moot point.
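For illustration, the by-hand version with intrinsics would look something like the sketch below (hedged: compilers will normally constant-propagate this straight back into a memory constant, so it mostly just documents the idea):

// Hypothetical by-hand constant setup: 3 instructions with AVX (pcmpeqd / psrlw / psllw).
__m128i all_ones = _mm_set1_epi32(-1);           // compilers materialize this as (v)pcmpeqd same,same
__m128i lo7_mask = _mm_srli_epi16(all_ones, 9);  // 0x007f in each 16-bit element
__m128i hi7_mask = _mm_slli_epi16(lo7_mask, 7);  // 0x3f80 in each 16-bit element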
AVX-512F / BW, without AVX-512VBMI / AVX-512GFNI
If you have Ice Lake / Zen 4 features, you want @Harold's answer; as I commented there, it's slightly better than AVX-512 vpmultishiftqb (parallel bitfield-extract within a qword).

But if not, with Skylake-X / Cascade Lake features (AVX-512BW and F) you have masking and variable-count rotates. This saves 2 instructions vs. the SSE4 version (built with AVX2); it feels like there should be room to save more, especially at the final step within 16-bit elements. But masking has byte granularity, there is no vprolvw, and there is still no byte shift, unlike AArch64, which can shift elements in both directions at byte granularity.

Splitting things apart, doing different things to the pieces, and then merging with a merge-masking vmovdqa could work, but I don't think it would help.
#ifdef __AVX512BW__
// pre-Ice Lake, without AVX-512VBMI or AVX512-GFNI
__m128i ascii_unpack7x8_avx512bw(__m128i v)
{
    // for YMM or ZMM, use VPERMW, or VPERMB if we have AVX512VBMI, since
    // unfortunately VPERMW isn't single-uop on Intel CPUs that support both.
    __m128i separate_7B_halves = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                               7, 8, 9,10,11,12,13, -1);
    v = _mm_shuffle_epi8(v, separate_7B_halves);
    v = _mm_slli_epi64(v, 4);                          // 00000HGFE | DCBA0000  // dword halves
    v = _mm_rolv_epi32(v, _mm_set_epi32(2, 32-2, 2, 32-2));
                                                       // 00HG|FE00 | 00DC|BA00 // u16 chunks of a u64
    v = _mm_mask_srli_epi16(v, (__mmask8)0b0101'0101, v, 2);  // 00HG | 00FE | 00DC | 00BA

    // Separate each u16 into bytes
    __m128i hi = _mm_and_si128(v, _mm_set1_epi16(0x3f80));    // isolate hi half
    v = _mm_add_epi16(v, hi);    // high halves left 1 (x+=x), low halves stay (x+=0)
    // 0H0G | 0F0E | 0D0C | 0B0A in each qword.
    return v;
}
#endif
Clang (Godbolt) optimizes the masked right-shift into a variable-count right shift, which is a good idea for a stand-alone function not in a loop, especially when we're loading other constants.

This uses more non-immediate constants, but fewer uops. A wider version of this, using vpermw to unpack 14-byte chunks to 16-byte lanes, might have to do something to introduce zero bits where they're needed, perhaps using zero-masking on the shuffle. But I think we'd still need vpshufb within lanes so it can zero those high bits. (A hedged sketch of that wider version follows the compiler output below.)

Having those known zeros that we move around with shifts and rotates is what lets us get away with only one and and one add at the end, unlike the blending version, where elements end up with high garbage so we need to mask both ways.
# clang -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = xmm0[0,1,2,3,4,5,6],zero,xmm0[7,8,9,10,11,12,13],zero
vpsllq xmm0, xmm0, 4
vprolvd xmm0, xmm0, xmmword ptr [rip + .LCPI1_1]
vpsrlvw xmm0, xmm0, xmmword ptr [rip + .LCPI1_2]
vpand xmm1, xmm0, xmmword ptr [rip + .LCPI1_3]
vpaddw xmm0, xmm1, xmm0
ret
These constants would of course be loaded into registers.
Just 6 uops; shifts run on port 0 or 1, shuffles on port 5, on Skylake, with VPAND and VPADD able to run on any of the 3 vector ALU ports. So it's a good balance, not running into back-end throughput bottlenecks on a specific port. (vs. 8 uops with clang's AVX build of the SSE4 version)
GCC uses the masking as requested; again, the constant init will get hoisted out of loops, including the kmovb into k1.
# gcc -O3 -march=x86-64-v4
ascii_unpack7x8_avx512bw(long long __vector(2)):
vpshufb xmm0, xmm0, XMMWORD PTR .LC0[rip]
mov eax, 85 # 0x55
vpsllq xmm0, xmm0, 4
kmovb k1, eax
movabs rax, 4575727041462157184 # 0x3F803F803F803F80 silly to use a 64-bit immediate
vprolvd xmm0, xmm0, XMMWORD PTR .LC3[rip]
vpbroadcastq xmm1, rax
vpsrlw xmm0{k1}, xmm0, 2
vpand xmm1, xmm0, xmm1
vpaddw xmm0, xmm0, xmm1
ret
Same instructions doing the work, just setting up constants differently. (Except for vpsrlw xmm0{k1}, xmm0, 2 to shift some elements but not others.)
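To make the vpermw idea from earlier concrete, here's a hedged, untested sketch (an assumption, not from the original answer) of a 256-bit AVX-512BW version: vpermw moves each 14-byte (7-word) chunk to the bottom of a 128-bit lane, and the in-lane vpshufb then both splits the 7-byte halves and zeroes the top byte of each qword, so no separate zero-masking appears to be needed; the rest is the 128-bit routine widened. (With AVX-512VBMI, a zero-masked vpermb could presumably do both shuffles at once, but with VBMI available you'd likely prefer the approach in @Harold's answer anyway.)

#if defined(__AVX512BW__) && defined(__AVX512VL__)
// Hypothetical wider version, assuming 28 packed bytes sit in the low 28 bytes of v.
__m256i ascii_unpack7x16_avx512bw(__m256i v)
{
    // Move 7-word chunk k to the bottom of lane k.  Words 7 and 15 of the result
    // are don't-care: the in-lane vpshufb below never reads lane bytes 14..15.
    __m256i word_idx = _mm256_setr_epi16(0, 1, 2, 3, 4, 5, 6, 6,
                                         7, 8, 9,10,11,12,13,13);
    v = _mm256_permutexvar_epi16(word_idx, v);

    __m256i separate_7B_halves = _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1,
                                                  0, 1, 2, 3, 4, 5, 6, -1,
                                                  7, 8, 9,10,11,12,13, -1);
    v = _mm256_shuffle_epi8(v, separate_7B_halves);    // also zeroes the top byte of each qword

    // Same per-qword steps as the 128-bit version, just widened
    v = _mm256_slli_epi64(v, 4);
    v = _mm256_rolv_epi32(v, _mm256_set_epi32(2, 32-2, 2, 32-2, 2, 32-2, 2, 32-2));
    v = _mm256_mask_srli_epi16(v, (__mmask16)0x5555, v, 2);
    __m256i hi = _mm256_and_si256(v, _mm256_set1_epi16(0x3f80));
    return _mm256_add_epi16(v, hi);
}
#endif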
Comments:

pext could do this fast, 64 bits at a time, but CPUs with it also have AVX2 normally, so that would be the competition. (And AMD doesn't have fast pext until Zen 3, despite supporting it in Excavator and Zen 1/2 via slow microcode.) – Shawn

pext, or AVX2 for variable-count 32-bit shifts (or AVX-512 for 16-bit). (Or abusing pmulhuw for right shifts of different elements with power-of-2 multipliers?) – Shawn
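To make the scalar BMI2 route mentioned in the answer and in these comments concrete, here's a minimal sketch (hypothetical; the bit order assumes the first char sits in the low 7 bits of each 56-bit group, which depends on the packing convention):

#include <stdint.h>
#include <immintrin.h>

// Hypothetical scalar comparison point: pdep expands 56 packed bits (eight 7-bit
// chars) to one char per byte.  pext would do the reverse (re-packing).
static inline uint64_t unpack7x8_pdep(uint64_t packed56)   // low 56 bits used
{
    return _pdep_u64(packed56, 0x7f7f7f7f7f7f7f7fULL);
}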