Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap, for example, but it's the opposite of place-value printing order for the base-2 representation of a single 16-bit number.

The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases; you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).
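As a reference for what every version below computes, here's the trivial scalar loop (my sketch, not part of the original answer): bit i of the mask becomes byte i of the 16-byte result, LSB-first.

```cpp
#include <cstdint>

// Reference (slow) version: byte i of out = bit i of the mask,
// matching the LSB-first order used throughout this answer.
void mask_to_bytes_ref(uint16_t mask, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (mask >> i) & 1;
}
```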
Table of contents:
- SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array. (SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
- BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
- Pure C++ with a multiply bithack: pretty reasonable for scalar.
- AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
- AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
- See also "is there an inverse instruction to the movemask instruction in intel avx2?" for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)

See @aqrit's "How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD" answer.

Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.
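The punpck chain just described, spelled out as a compilable SSE2-only sketch (my adaptation; needs only baseline x86-64, no SSSE3):

```cpp
#include <immintrin.h>
#include <cstdint>

typedef unsigned __int128 u128;

// SSE2-only: broadcast each mask byte across 8 vector bytes with three
// unpacks instead of one SSSE3 pshufb, then isolate and booleanize bits.
u128 mask_to_u128_sse2(unsigned bitmap)
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);    // punpcklbw: b0 b0 b1 b1 ...
    v = _mm_unpacklo_epi16(v, v);   // punpcklwd: b0 x4, b1 x4
    v = _mm_unpacklo_epi32(v, v);   // punpckldq: b0 x8, b1 x8
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7),
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7) );
    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));   // non-zero -> 1 : 0 -> 0
    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;
}
```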
#include <immintrin.h>

typedef unsigned __int128 u128;

u128 mask_to_u128_SSSE3(unsigned bitmap)
{
    const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
    __m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle);  // SSSE3 pshufb

    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );

    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));   // non-zero -> 1 : 0 -> 0

    // return v;   // if you want a SIMD vector result

    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;   // optimizes to movq / pextrq (with SSE4)
}
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v = _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)
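That cmpeq + sub ASCII trick, as a compilable SSE2-only sketch (my adaptation, using the punpck broadcast so it doesn't need -mssse3):

```cpp
#include <immintrin.h>

// '0' - 0 = '0',  '0' - (-1) = '1':  cmpeq produces 0 / -1 bytes,
// then subtracting from set1('0') yields ASCII digits directly.
void mask_to_ascii16(unsigned bitmap, char out[16])
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi16(v, v);
    v = _mm_unpacklo_epi32(v, v);    // low mask byte x8, high mask byte x8
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7),
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7) );
    v = _mm_and_si128(v, bitselect);
    v = _mm_cmpeq_epi8(v, bitselect);            // 0 / -1
    v = _mm_sub_epi8(_mm_set1_epi8('0'), v);     // '0' / '1', LSB-first
    _mm_storeu_si128((__m128i*)out, v);
}
```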
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
BMI2 pdep: good on Intel, bad on AMD

BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency).
#include <stdint.h>

typedef unsigned __int128 u128;

inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
    return ((u128)hi << 64) | lo;
}
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles

#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
    // fast on Intel, slow on AMD
    uint64_t tobytes = 0x0101010101010101ULL;
    uint64_t lo = _pdep_u64(bitmap, tobytes);
    uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
    return assemble_halves(lo, hi);
}
#endif
Good if you want the result in scalar registers (not one vector); otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
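For reference, pdep scatters the low bits of its source into the positions of set bits in its mask, lowest first. A portable emulation (illustrative sketch only, far slower than the instruction) shows what the two _pdep_u64 calls above compute:

```cpp
#include <cstdint>

// Deposit the low bits of src into the set-bit positions of mask,
// from low to high -- same result as BMI2 pdep.
uint64_t pdep_emul(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t srcbit = 1; mask != 0; srcbit <<= 1) {
        if (src & srcbit)
            result |= mask & -mask;   // lowest remaining mask bit
        mask &= mask - 1;             // clear it
    }
    return result;
}
```

With mask = 0x0101010101010101, source bit i lands in the low bit of byte i, which is exactly the unpack this answer wants.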
Portable C++ with a magic multiply bithack

Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64.

This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions.

See @phuclv's answer on "How to create a byte out of 8 bool values (and vice versa)?" for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.
auto mask_to_u128_magic_mul(uint32_t bitmap) {
    // 0x8040201008040201 places the 8 shifted copies of the input 9 bits
    // apart, so no copy can carry into another.  (The 7-bit-spaced
    // 0x0102040810204080 looks like it would give LSB -> LSB directly,
    // but its copies overlap: e.g. bitmap = 0x81 carries into the byte
    // for bit 1 and produces a wrong result.)
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x0101010101010101ULL;
    uint64_t lo = (MAGIC*(uint8_t)bitmap) >> 7;
    uint64_t hi = (MAGIC*(bitmap>>8)) >> 7;
    // The multiply maps bit i to byte 7-i (MSB-first printing order in a
    // char array after memcpy on little-endian).  bswap for LSB -> LSB:
    return assemble_halves(__builtin_bswap64(lo & MASK),
                           __builtin_bswap64(hi & MASK));
}

If you want MSB-first printing order in a char array after memcpy instead, just drop the two __builtin_bswap64 calls: the multiply already produces bit i in byte 7-i. Byte-swapping the result halves is cheap (one bswap instruction each, or folded into a store with movbe); note that unlike the SIMD versions you can't simply byte-reverse the magic constant to flip the order, because only the 9-bit-spaced constant keeps the shifted copies from carrying into each other.

Otherwise, if you truly want a 128-bit integer whose low bit matches the low bit of the u16 input, the version above already does that, independent of host memory order; there's no byte access to wider types.

Expected asm for x86-64 (hand-adjusted to the bswap version, not verbatim compiler output):

mask_to_u128_magic_mul(unsigned int):
        movzx   eax, dil
        movabs  rdx, 0x8040201008040201
        imul    rax, rdx
        shr     rax, 7
        shr     edi, 8
        imul    rdx, rdi
        shr     rdx, 7
        movabs  rcx, 0x0101010101010101
        and     rax, rcx
        and     rdx, rcx
        bswap   rax
        bswap   rdx
        ret
# returns in RDX:RAX
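A multiply-unpack can be verified exhaustively against the bit-by-bit definition. This sketch (my addition) uses the 9-bit-spaced constant 0x8040201008040201, whose shifted copies never overlap and so never carry into each other, then bswaps to LSB-first order:

```cpp
#include <cstdint>

// Carry-free unpack: bit i of b -> byte i of the result (0 or 1).
// 0x8040201008040201 places copies of b 9 bits apart (no overlap),
// leaving bit i in byte 7-i; bswap flips that to LSB-first.
uint64_t unpack8_lsb_first(uint8_t b)
{
    const uint64_t MAGIC = 0x8040201008040201ULL;
    const uint64_t MASK  = 0x0101010101010101ULL;
    return __builtin_bswap64(((MAGIC * b) >> 7) & MASK);
}

// Exhaustive check against the definition, all 256 inputs.
bool unpack8_matches_reference()
{
    for (unsigned b = 0; b < 256; b++) {
        uint64_t got = unpack8_lsb_first((uint8_t)b);
        for (int i = 0; i < 8; i++)
            if (((got >> (8 * i)) & 0xFF) != ((b >> i) & 1))
                return false;
    }
    return true;
}
```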
AVX-512

This is easy with AVX-512BW; you can use the mask for a zero-masked move of a repeated 0x01 constant.
__m128i bits_to_bytes_avx512bw(unsigned mask16) {
    __m128i v = _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
    return v;

    // Or for an unsigned __int128 result:
    // alignas(16) unsigned __int128 tmp;
    // _mm_store_si128((__m128i*)&tmp, v);  // should optimize into vmovq / vpextrq
    // return tmp;
}
Or avoid a memory constant (because compilers can do set1(-1) with just a vpcmpeqd xmm0,xmm0): do a zero-masked absolute-value of -1. The constant setup can be hoisted, same as with set1(1).

__m128i bits_to_bytes_avx512bw_noconst(unsigned mask16) {
    __m128i ones = _mm_set1_epi8(-1);   // extra instruction *off* the critical path
    return _mm_maskz_abs_epi8(mask16, ones);
}
But note that if doing further vector stuff, the result of maskz_mov might be able to optimize into other operations. For example vec += maskz_mov could optimize into a merge-masked add. But if not, vmovdqu8 xmm{k}{z}, xmm needs an ALU port like vpabsb xmm{k}{z}, xmm, but vpabsb can't run on port 5 on Skylake/Ice Lake. (A zero-masked vpsubb from a zeroed register would avoid that possible throughput problem, but then you'd be setting up 2 registers just to avoid loading a constant. In hand-written asm, you'd just materialize set1(1) using vpcmpeqd / vpabsb yourself if you wanted to avoid a 4-byte broadcast-load of a constant.)
(Godbolt compiler explorer with gcc and clang -O3 -march=skylake-avx512. Clang sees through the masked vpabsb and compiles it the same as the first version, with a memory constant.)
Even better if you can use a vector 0 / -1 instead of 0 / 1: use return _mm_movm_epi8(mask16). That compiles to just kmovd k0, edi / vpmovm2b xmm0, k0.

If you want a vector of ASCII characters like '0' or '1', you could use _mm_mask_blend_epi8(mask, ones, zeroes). (That should be more efficient than a merge-masked add into a vector of set1(1), which would require an extra register copy, and also better than a sub between set1('0') and _mm_movm_epi8(mask16), which would take 2 instructions: one to turn the mask into a vector, and a separate vpsubb.)
AVX2 with bits in printing order (MSB at lowest address), bytes in mem order, as ASCII '0' / '1'
With [] delimiters and \t tabs, like this output format from this codereview Q&A:

[01000000] [01000010] [00001111] [00000000]

Obviously if you want all 16 or 32 ASCII digits contiguous, that's easier, and doesn't require shuffling the output to store each 8-byte chunk separately. Most of the reason for posting here is that it has the shuffle and mask constants in the right order for printing, and to show a version optimized for ASCII output, after it turned out that's what the question really wanted.

Using "How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?", this is basically a 256-bit version of the SSSE3 code.
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
#include <string.h>

// https://mcmap.net/q/14189/-how-to-perform-the-inverse-of-_mm256_movemask_epi8-vpmovmskb
void binary_dump_4B_avx2(const void *input)
{
    char buf[CHAR_BIT*4 + 2*4 + 3 + 1 + 1];   // bits, 4x [], 3x \t, \n, 0
    buf[0] = '[';
    for (int i=9 ; i<sizeof(buf) - 8 ; i+=11) {   // GCC strangely doesn't unroll this loop
        memcpy(&buf[i], "]\t[", 4);               // 4-byte store as a single; we overlap the 0 later
    }

    __m256i v = _mm256_castps_si256(_mm256_broadcast_ss(input));  // aliasing-safe load; use _mm256_set1_epi32 if you know you have an int
    const __m256i shuffle = _mm256_setr_epi64x(0x0000000000000000,   // low byte first, bytes in little-endian memory order
                              0x0101010101010101, 0x0202020202020202, 0x0303030303030303);
    v = _mm256_shuffle_epi8(v, shuffle);

//  __m256i bit_mask = _mm256_set1_epi64x(0x8040201008040201);  // low bits to low bytes
    __m256i bit_mask = _mm256_set1_epi64x(0x0102040810204080);  // MSB to lowest byte; printing order

    v = _mm256_and_si256(v, bit_mask);                    // x & mask == mask
//  v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());     // -1 / 0 bytes
//  v = _mm256_add_epi8(v, _mm256_set1_epi8('1'));        // '0' / '1' bytes
    v = _mm256_cmpeq_epi8(v, bit_mask);                   // 0 / -1 bytes
    v = _mm256_sub_epi8(_mm256_set1_epi8('0'), v);        // '0' / '1' bytes

    __m128i lo = _mm256_castsi256_si128(v);
    _mm_storeu_si64(buf+1, lo);
    _mm_storeh_pi((__m64*)&buf[1+8+3], _mm_castsi128_ps(lo));
    // TODO?: shuffle first and last bytes into the high lane initially to allow 16-byte vextracti128 stores, with later stores overlapping to replace garbage.
    __m128i hi = _mm256_extracti128_si256(v, 1);
    _mm_storeu_si64(buf+1+11*2, hi);
    _mm_storeh_pi((__m64*)&buf[1+11*3], _mm_castsi128_ps(hi));

//  buf[32 + 2*4 + 3] = '\n';
//  buf[32 + 2*4 + 3 + 1] = '\0';
//  fputs
    memcpy(&buf[32 + 2*4 + 2], "]", 2);   // including '\0'
    puts(buf);                            // appends a newline
    // appending our own newline and using fputs or fwrite is probably more efficient.
}

void binary_dump(const void *input, size_t bytecount) {
}
// not shown: portable version, see Godbolt, or my or @chux's answer on the codereview question

int main(void)
{
    int t = 1000000;
    binary_dump_4B_avx2(&t);
    binary_dump(&t, sizeof(t));
    t++;
    binary_dump_4B_avx2(&t);
    binary_dump(&t, sizeof(t));
}
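For the binary_dump stub left empty above, a portable scalar version can be sketched like this (my version, writing into a caller-supplied buffer instead of stdout so it's easy to test; not necessarily identical to the Godbolt one):

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Same format as binary_dump_4B_avx2: bytes in memory order, bits of
// each byte MSB-first, []-wrapped, tab-separated, trailing newline.
// Caller provides at least 11*bytecount + 1 bytes of buf.
void binary_dump_portable(const void *input, size_t bytecount, char *buf)
{
    const unsigned char *p = (const unsigned char *)input;
    for (size_t i = 0; i < bytecount; i++) {
        *buf++ = '[';
        for (int bit = CHAR_BIT - 1; bit >= 0; bit--)
            *buf++ = (char)('0' + ((p[i] >> bit) & 1));
        *buf++ = ']';
        *buf++ = (i + 1 == bytecount) ? '\n' : '\t';
    }
    *buf = '\0';
}
```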
Runnable Godbolt demo with gcc -O3 -march=haswell.

Note that GCC 10.3 and earlier are dumb and duplicate the AND/CMPEQ vector constant, once as bytes and once as qwords. (In that case, comparing against zero would be better, or using OR with an inverted mask and comparing against all-ones.) GCC 11.1 fixes that with a .set .LC1,.LC2, but still loads it twice, as memory operands instead of loading once into a register. Clang doesn't have either of these problems.
Fun fact: clang -march=icelake-client manages to turn the 2nd part of this into an AVX-512 masked blend between '0' and '1' vectors, but instead of just kmov it uses a broadcast-load, vpermb byte shuffle, then test-into-mask with the bitmask.