Bit/byte order: Unless noted, these follow the question, putting the LSB of the uint16_t in the least significant byte of the __uint128_t (lowest memory address on little-endian x86). This is what you want for an ASCII dump of a bitmap, for example, but it's the opposite of place-value printing order for the base-2 representation of a single 16-bit number.

The discussion of efficiently getting values (back) into RDX:RAX integer registers has no relevance for most normal use-cases; you'd just store to memory from vector registers, whether that's 0/1 byte integers or ASCII '0'/'1' digits (which you can get most efficiently without ever having 0/1 integers in a __m128i, let alone in an unsigned __int128).
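As a reference for what every version below computes, here's the trivial scalar loop (my sketch, not part of the original answer): bit i of the mask becomes byte i of the 16-byte result, LSB-first.

```cpp
#include <cstdint>

// Reference (slow) version: byte i of out = bit i of the mask,
// matching the LSB-first order used throughout this answer.
void mask_to_bytes_ref(uint16_t mask, uint8_t out[16]) {
    for (int i = 0; i < 16; i++)
        out[i] = (mask >> i) & 1;
}
```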
Table of contents:
- SSE2 / SSSE3 version: good if you want the result in a vector, e.g. for storing a char array. (SSE2 NASM version, shuffling into MSB-first printing order and converting to ASCII.)
- BMI2 pdep: good for scalar unsigned __int128 on Intel CPUs with BMI2, if you're going to make use of the result in scalar registers. Slow on AMD.
- Pure C++ with a multiply bithack: pretty reasonable for scalar.
- AVX-512: AVX-512 has masking as a first-class operation using scalar bitmaps. Possibly not as good as BMI2 pdep if you're using the result as scalar halves, otherwise even better than SSSE3.
- AVX2 printing order (MSB at lowest address) dump of a 32-bit integer.
- See also "is there an inverse instruction to the movemask instruction in intel avx2?" for other variations on element size and mask width. (SSE2 and multiply bithack were adapted from answers linked from that collection.)
With SSE2 (preferably SSSE3)

See @aqrit's "How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD" answer.

Adapting that to work with 16 bits -> 16 bytes, we need a shuffle that replicates the first byte of the mask to the first 8 bytes of the vector, and the 2nd mask byte to the high 8 vector bytes. That's doable with one SSSE3 pshufb, or with punpcklbw same,same + punpcklwd same,same + punpckldq same,same to finally duplicate things up to two 64-bit qwords.
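The punpck chain just described, spelled out as a compilable SSE2-only sketch (my adaptation; needs only baseline x86-64, no SSSE3):

```cpp
#include <immintrin.h>
#include <cstdint>

typedef unsigned __int128 u128;

// SSE2-only: broadcast each mask byte across 8 vector bytes with three
// unpacks instead of one SSSE3 pshufb, then isolate and booleanize bits.
u128 mask_to_u128_sse2(unsigned bitmap)
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);    // punpcklbw: b0 b0 b1 b1 ...
    v = _mm_unpacklo_epi16(v, v);   // punpcklwd: b0 x4, b1 x4
    v = _mm_unpacklo_epi32(v, v);   // punpckldq: b0 x8, b1 x8
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7),
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7) );
    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));   // non-zero -> 1 : 0 -> 0
    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;
}
```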
#include <immintrin.h>

typedef unsigned __int128 u128;

u128 mask_to_u128_SSSE3(unsigned bitmap)
{
    const __m128i shuffle = _mm_setr_epi32(0,0, 0x01010101, 0x01010101);
    __m128i v = _mm_shuffle_epi8(_mm_cvtsi32_si128(bitmap), shuffle);  // SSSE3 pshufb

    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7,
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1U<<7 );

    v = _mm_and_si128(v, bitselect);
    v = _mm_min_epu8(v, _mm_set1_epi8(1));   // non-zero -> 1 : 0 -> 0

    // return v;   // if you want a SIMD vector result

    alignas(16) u128 tmp;
    _mm_store_si128((__m128i*)&tmp, v);
    return tmp;   // optimizes to movq / pextrq (with SSE4)
}
(To get 0 / 0xFF instead of 0 / 1, replace _mm_min_epu8 with v = _mm_cmpeq_epi8(v, bitselect). If you want a string of ASCII '0' / '1' characters, do cmpeq and _mm_sub_epi8(_mm_set1_epi8('0'), v). That avoids the set1(1) vector constant.)
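That cmpeq + sub ASCII trick, as a compilable SSE2-only sketch (my adaptation, using the punpck broadcast so it doesn't need -mssse3):

```cpp
#include <immintrin.h>

// '0' - 0 = '0',  '0' - (-1) = '1':  cmpeq produces 0 / -1 bytes,
// then subtracting from set1('0') yields ASCII digits directly.
void mask_to_ascii16(unsigned bitmap, char out[16])
{
    __m128i v = _mm_cvtsi32_si128(bitmap);
    v = _mm_unpacklo_epi8(v, v);
    v = _mm_unpacklo_epi16(v, v);
    v = _mm_unpacklo_epi32(v, v);    // low mask byte x8, high mask byte x8
    const __m128i bitselect = _mm_setr_epi8(
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7),
        1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, (char)(1U<<7) );
    v = _mm_and_si128(v, bitselect);
    v = _mm_cmpeq_epi8(v, bitselect);            // 0 / -1
    v = _mm_sub_epi8(_mm_set1_epi8('0'), v);     // '0' / '1', LSB-first
    _mm_storeu_si128((__m128i*)out, v);
}
```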
Godbolt including test-cases. (For this and other non-AVX-512 versions.)
# clang -O3 for Skylake
mask_to_u128_SSSE3(unsigned int):
vmovd xmm0, edi # _mm_cvtsi32_si128
vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = xmm0[0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1]
vpand xmm0, xmm0, xmmword ptr [rip + .LCPI2_1] # 1<<0, 1<<1, etc.
vpminub xmm0, xmm0, xmmword ptr [rip + .LCPI2_2] # set1_epi8(1)
# done here if you return __m128i v or store the u128 to memory
vmovq rax, xmm0
vpextrq rdx, xmm0, 1
ret
BMI2 pdep: good on Intel, bad on AMD

BMI2 pdep is fast on Intel CPUs that have it (since Haswell), but very slow on AMD (over a dozen uops, high latency).
#include <stdint.h>

typedef unsigned __int128 u128;

inline u128 assemble_halves(uint64_t lo, uint64_t hi) {
    return ((u128)hi << 64) | lo;
}
// could replace this with __m128i using _mm_set_epi64x(hi, lo) to see how that compiles

#ifdef __BMI2__
#include <immintrin.h>
auto mask_to_u128_bmi2(unsigned bitmap) {
    // fast on Intel, slow on AMD
    uint64_t tobytes = 0x0101010101010101ULL;
    uint64_t lo = _pdep_u64(bitmap, tobytes);
    uint64_t hi = _pdep_u64(bitmap>>8, tobytes);
    return assemble_halves(lo, hi);
}
#endif
Good if you want the result in scalar registers (not one vector); otherwise probably prefer the SSSE3 way.
# clang -O3
mask_to_u128_bmi2(unsigned int):
movabs rcx, 72340172838076673 # 0x0101010101010101
pdep rax, rdi, rcx
shr edi, 8
pdep rdx, rdi, rcx
ret
# returns in RDX:RAX
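For reference, pdep scatters the low bits of its source into the positions of set bits in its mask, lowest first. A portable emulation (illustrative sketch only, far slower than the instruction) shows what the two _pdep_u64 calls above compute:

```cpp
#include <cstdint>

// Deposit the low bits of src into the set-bit positions of mask,
// from low to high -- same result as BMI2 pdep.
uint64_t pdep_emul(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t srcbit = 1; mask != 0; srcbit <<= 1) {
        if (src & srcbit)
            result |= mask & -mask;   // lowest remaining mask bit
        mask &= mask - 1;             // clear it
    }
    return result;
}
```

With mask = 0x0101010101010101, source bit i lands in the low bit of byte i, which is exactly the unpack this answer wants.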
Portable C++ with a magic multiply bithack

Not bad on x86-64; AMD since Zen has fast 64-bit multiply, and Intel's had that since Nehalem. Some low-power CPUs still have slowish imul r64, r64.

This version may be optimal for __uint128_t results, at least for latency on Intel without BMI2, and on AMD, since it avoids a round-trip to XMM registers. But for throughput it's quite a few instructions.

See @phuclv's answer on "How to create a byte out of 8 bool values (and vice versa)?" for an explanation of the multiply, and for the reverse direction. Use the algorithm from unpack8bools once for each 8-bit half of your mask.
auto mask_to_u128_magic_mul(uint32_t bitmap) {
    // 0x8040201008040201 places the 8 shifted copies of the input 9 bits
    // apart, so no copy can carry into another.  (The 7-bit-spaced
    // 0x0102040810204080 looks like it would give LSB -> LSB directly,
    // but its copies overlap: e.g. bitmap = 0x81 carries into the byte
    // for bit 1 and produces a wrong result.)
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x0101010101010101ULL;
    uint64_t lo = (MAGIC*(uint8_t)bitmap) >> 7;
    uint64_t hi = (MAGIC*(bitmap>>8)) >> 7;
    // The multiply maps bit i to byte 7-i (MSB-first printing order in a
    // char array after memcpy on little-endian).  bswap for LSB -> LSB:
    return assemble_halves(__builtin_bswap64(lo & MASK),
                           __builtin_bswap64(hi & MASK));
}

If you want MSB-first printing order in a char array after memcpy instead, just drop the two __builtin_bswap64 calls: the multiply already produces bit i in byte 7-i. Byte-swapping the result halves is cheap (one bswap instruction each, or folded into a store with movbe); note that unlike the SIMD versions you can't simply byte-reverse the magic constant to flip the order, because only the 9-bit-spaced constant keeps the shifted copies from carrying into each other.

Otherwise, if you truly want a 128-bit integer whose low bit matches the low bit of the u16 input, the version above already does that, independent of host memory order; there's no byte access to wider types.

Expected asm for x86-64 (hand-adjusted to the bswap version, not verbatim compiler output):

mask_to_u128_magic_mul(unsigned int):
        movzx   eax, dil
        movabs  rdx, 0x8040201008040201
        imul    rax, rdx
        shr     rax, 7
        shr     edi, 8
        imul    rdx, rdi
        shr     rdx, 7
        movabs  rcx, 0x0101010101010101
        and     rax, rcx
        and     rdx, rcx
        bswap   rax
        bswap   rdx
        ret
# returns in RDX:RAX
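A multiply-unpack can be verified exhaustively against the bit-by-bit definition. This sketch (my addition) uses the 9-bit-spaced constant 0x8040201008040201, whose shifted copies never overlap and so never carry into each other, then bswaps to LSB-first order:

```cpp
#include <cstdint>

// Carry-free unpack: bit i of b -> byte i of the result (0 or 1).
// 0x8040201008040201 places copies of b 9 bits apart (no overlap),
// leaving bit i in byte 7-i; bswap flips that to LSB-first.
uint64_t unpack8_lsb_first(uint8_t b)
{
    const uint64_t MAGIC = 0x8040201008040201ULL;
    const uint64_t MASK  = 0x0101010101010101ULL;
    return __builtin_bswap64(((MAGIC * b) >> 7) & MASK);
}

// Exhaustive check against the definition, all 256 inputs.
bool unpack8_matches_reference()
{
    for (unsigned b = 0; b < 256; b++) {
        uint64_t got = unpack8_lsb_first((uint8_t)b);
        for (int i = 0; i < 8; i++)
            if (((got >> (8 * i)) & 0xFF) != ((b >> i) & 1))
                return false;
    }
    return true;
}
```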
AVX-512

This is easy with AVX-512BW; you can use the mask for a zero-masked move of a repeated 0x01 constant.
__m128i bits_to_bytes_avx512bw(unsigned mask16) {
    __m128i v = _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1));
    return v;

    // Or for an unsigned __int128 result:
    // alignas(16) unsigned __int128 tmp;
    // _mm_store_si128((__m128i*)&tmp, v);  // should optimize into vmovq / vpextrq
    // return tmp;
}
Or avoid a memory constant (because compilers can do set1(-1) with just a vpcmpeqd xmm0,xmm0): do a zero-masked absolute-value of -1. The constant setup can be hoisted, same as with set1(1).

__m128i bits_to_bytes_avx512bw_noconst(unsigned mask16) {
    __m128i ones = _mm_set1_epi8(-1);   // extra instruction *off* the critical path
    return _mm_maskz_abs_epi8(mask16, ones);
}
But note that if doing further vector stuff, the result of maskz_mov might be able to optimize into other operations. For example vec += maskz_mov could optimize into a merge-masked add. But if not, vmovdqu8 xmm{k}{z}, xmm needs an ALU port like vpabsb xmm{k}{z}, xmm, but vpabsb can't run on port 5 on Skylake/Ice Lake. (A zero-masked vpsubb from a zeroed register would avoid that possible throughput problem, but then you'd be setting up 2 registers just to avoid loading a constant. In hand-written asm, you'd just materialize set1(1) using vpcmpeqd / vpabsb yourself if you wanted to avoid a 4-byte broadcast-load of a constant.)
(Godbolt compiler explorer with gcc and clang -O3 -march=skylake-avx512. Clang sees through the masked vpabsb and compiles it the same as the first version, with a memory constant.)
Even better if you can use a vector 0 / -1 instead of 0 / 1: use return _mm_movm_epi8(mask16). That compiles to just kmovd k0, edi / vpmovm2b xmm0, k0.

If you want a vector of ASCII characters like '0' or '1', you could use _mm_mask_blend_epi8(mask, ones, zeroes). (That should be more efficient than a merge-masked add into a vector of set1(1), which would require an extra register copy, and also better than a sub between set1('0') and _mm_movm_epi8(mask16), which would take 2 instructions: one to turn the mask into a vector, and a separate vpsubb.)
AVX2 with bits in printing order (MSB at lowest address), bytes in mem order, as ASCII '0' / '1'
With [] delimiters and \t tabs, like this output format from this codereview Q&A:

[01000000] [01000010] [00001111] [00000000]

Obviously if you want all 16 or 32 ASCII digits contiguous, that's easier, and doesn't require shuffling the output to store each 8-byte chunk separately. Most of the reason for posting here is that it has the shuffle and mask constants in the right order for printing, and to show a version optimized for ASCII output, after it turned out that's what the question really wanted.

Using "How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?", this is basically a 256-bit version of the SSSE3 code.
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>
#include <string.h>

// https://mcmap.net/q/14189/-how-to-perform-the-inverse-of-_mm256_movemask_epi8-vpmovmskb
void binary_dump_4B_avx2(const void *input)
{
    char buf[CHAR_BIT*4 + 2*4 + 3 + 1 + 1];   // bits, 4x [], 3x \t, \n, 0
    buf[0] = '[';
    for (int i=9 ; i<sizeof(buf) - 8 ; i+=11) {   // GCC strangely doesn't unroll this loop
        memcpy(&buf[i], "]\t[", 4);               // 4-byte store as a single; we overlap the 0 later
    }

    __m256i v = _mm256_castps_si256(_mm256_broadcast_ss(input));  // aliasing-safe load; use _mm256_set1_epi32 if you know you have an int
    const __m256i shuffle = _mm256_setr_epi64x(0x0000000000000000,   // low byte first, bytes in little-endian memory order
                              0x0101010101010101, 0x0202020202020202, 0x0303030303030303);
    v = _mm256_shuffle_epi8(v, shuffle);

//  __m256i bit_mask = _mm256_set1_epi64x(0x8040201008040201);  // low bits to low bytes
    __m256i bit_mask = _mm256_set1_epi64x(0x0102040810204080);  // MSB to lowest byte; printing order

    v = _mm256_and_si256(v, bit_mask);                    // x & mask == mask
//  v = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());     // -1 / 0 bytes
//  v = _mm256_add_epi8(v, _mm256_set1_epi8('1'));        // '0' / '1' bytes
    v = _mm256_cmpeq_epi8(v, bit_mask);                   // 0 / -1 bytes
    v = _mm256_sub_epi8(_mm256_set1_epi8('0'), v);        // '0' / '1' bytes

    __m128i lo = _mm256_castsi256_si128(v);
    _mm_storeu_si64(buf+1, lo);
    _mm_storeh_pi((__m64*)&buf[1+8+3], _mm_castsi128_ps(lo));
    // TODO?: shuffle first and last bytes into the high lane initially to allow 16-byte vextracti128 stores, with later stores overlapping to replace garbage.
    __m128i hi = _mm256_extracti128_si256(v, 1);
    _mm_storeu_si64(buf+1+11*2, hi);
    _mm_storeh_pi((__m64*)&buf[1+11*3], _mm_castsi128_ps(hi));

//  buf[32 + 2*4 + 3] = '\n';
//  buf[32 + 2*4 + 3 + 1] = '\0';
//  fputs
    memcpy(&buf[32 + 2*4 + 2], "]", 2);   // including '\0'
    puts(buf);                            // appends a newline
    // appending our own newline and using fputs or fwrite is probably more efficient.
}

void binary_dump(const void *input, size_t bytecount) {
}
// not shown: portable version, see Godbolt, or my or @chux's answer on the codereview question

int main(void)
{
    int t = 1000000;
    binary_dump_4B_avx2(&t);
    binary_dump(&t, sizeof(t));
    t++;
    binary_dump_4B_avx2(&t);
    binary_dump(&t, sizeof(t));
}
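For the binary_dump stub left empty above, a portable scalar version can be sketched like this (my version, writing into a caller-supplied buffer instead of stdout so it's easy to test; not necessarily identical to the Godbolt one):

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Same format as binary_dump_4B_avx2: bytes in memory order, bits of
// each byte MSB-first, []-wrapped, tab-separated, trailing newline.
// Caller provides at least 11*bytecount + 1 bytes of buf.
void binary_dump_portable(const void *input, size_t bytecount, char *buf)
{
    const unsigned char *p = (const unsigned char *)input;
    for (size_t i = 0; i < bytecount; i++) {
        *buf++ = '[';
        for (int bit = CHAR_BIT - 1; bit >= 0; bit--)
            *buf++ = (char)('0' + ((p[i] >> bit) & 1));
        *buf++ = ']';
        *buf++ = (i + 1 == bytecount) ? '\n' : '\t';
    }
    *buf = '\0';
}
```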
Runnable Godbolt demo with gcc -O3 -march=haswell.

Note that GCC 10.3 and earlier are dumb and duplicate the AND/CMPEQ vector constant, once as bytes and once as qwords. (In that case, comparing against zero would be better, or using OR with an inverted mask and comparing against all-ones.) GCC 11.1 fixes that with a .set .LC1,.LC2, but still loads it twice, as memory operands instead of loading once into a register. Clang doesn't have either of these problems.
Fun fact: clang -march=icelake-client manages to turn the 2nd part of this into an AVX-512 masked blend between '0' and '1' vectors, but instead of just kmov it uses a broadcast-load, vpermb byte shuffle, then test-into-mask with the bitmask.