How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?

Asked 7/2, 2014 at 7:55 Answered 27/8, 2015 at 11:8

The intrinsic:

int mask = _mm256_movemask_epi8(__m256i s1)

creates a mask, with its 32 bits corresponding to the most significant bit of each byte of s1. After manipulating the mask using bit operations (BMI2 for example) I would like to perform the inverse of _mm256_movemask_epi8, i.e., create a __m256i vector with the most significant bit of each byte containing the corresponding bit of the uint32_t mask.

What is the best way to do this?

Edit: I need to perform the inverse because the intrinsic _mm256_blendv_epi8 accepts only __m256i type mask instead of uint32_t. As such, in the resulting __m256i mask, I can ignore the bits other than the MSB of each byte.

Enlist answered 7/2, 2014 at 7:55 Comment(3)

with AVX512, you can use _mm256_mask_blend_epi8(__mmask32 k, __m256i a, __m256i b) using your integer as the mask – Nunhood 29/8, 2015 at 18:9

See also my answer on a possible duplicate question. Use a vpsllvd variable-shift to put different bits of the mask in the sign bit of each element. This is great for an element size of 32b, but not for 8b. – Anticipate 8/4, 2016 at 5:24

is there an inverse instruction to the movemask instruction in intel avx2? has a list of different versions, SSE and AVX, for different element sizes. – Anticipate 7/7, 2022 at 20:30

Here is an alternative to LUT or pdep instructions that might be more efficient:

Copy your 32-bit mask to both low bytes of some ymm register and bytes 16..19 of the same register. You could use temporary array and _mm256_load_si256. Or you could move single copy of 32-bit mask to low bytes of some ymm register, then broadcast it with VPBROADCASTD (_mm_broadcastd_epi32) or other broadcast/shuffle instructions.
Rearrange bytes of the register so that low 8 bytes (each) contain low 8 bits of your mask, next 8 bytes - next 8 bits, etc. This could be done with VPSHUFB (_mm256_shuffle_epi8) with control register containing '0' in low 8 bytes, '1' in next 8 bytes, etc.
Select proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256).
Set MSB of appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8). Compare each byte to 0xFF. If you want each bit of the mask toggled, use VPAND on previous step and compare to zero.

Additional flexibility of this approach is that you could choose different control register for step #2 and different mask for step #3 to shuffle bits of your bit mask (for example you could copy this mask to ymm register in reversed order).

Middlebreaker answered 7/2, 2014 at 10:41 Comment(3)

Just use _mm256_set1_epi32 and let the compiler do a broadcast-load with vpbroadcastd ymm, [mem] if it wants to. – Anticipate 8/4, 2016 at 5:23

After the shuffle, use VPAND and VPCMPEQB to implement bitmap & (1<<bit) == (1<<bit). You only need one vector constant. – Anticipate 8/11, 2017 at 22:13

If you want 0/1 instead of 0/0xff, use _mm256_min_epu8(and_result, _mm256_set1_epi8(1)) instead of cmpeq against the AND mask. Elements with a non-zero byte will have a min of 1, vs. min(0,1) = 0. (this trick from How to efficiently convert an 8-bit bitmap to array of 0/1 integers with x86 SIMD) – Anticipate 31/8, 2018 at 0:16

I have implemented the above three approaches on a Haswell machine. Evgeny Kluev's approach is the fastest (1.07 s), followed by Jason R's (1.97 s) and Paul R's (2.44 s). The code below was compiled with -march=core-avx2 -O3 optimization flags.

#include <immintrin.h>
#include <boost/date_time/posix_time/posix_time.hpp>

//t_icc = 1.07 s
//t_g++ = 1.09 s
__m256i get_mask3(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));
  const __m256i shuffle(_mm256_setr_epi64x(0x0000000000000000,
      0x0101010101010101, 0x0202020202020202, 0x0303030303030303));
  vmask = _mm256_shuffle_epi8(vmask, shuffle);
  const __m256i bit_mask(_mm256_set1_epi64x(0x7fbfdfeff7fbfdfe));
  vmask = _mm256_or_si256(vmask, bit_mask);
  return _mm256_cmpeq_epi8(vmask, _mm256_set1_epi64x(-1));
}

//t_icc = 1.97 s
//t_g++ = 1.97 s
__m256i get_mask2(const uint32_t mask) {
  __m256i vmask(_mm256_set1_epi32(mask));
  const __m256i shift(_mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0));
  vmask = _mm256_sllv_epi32(vmask, shift);
  const __m256i shuffle(_mm256_setr_epi64x(0x0105090d0004080c,
      0x03070b0f02060a0e, 0x0105090d0004080c, 0x03070b0f02060a0e));
  vmask = _mm256_shuffle_epi8(vmask, shuffle);
  const __m256i perm(_mm256_setr_epi64x(0x0000000000000004, 0x0000000100000005,
      0x0000000200000006, 0x0000000300000007));
  return _mm256_permutevar8x32_epi32(vmask, perm);
}

//t_icc = 2.44 s
//t_g++ = 2.45 s
__m256i get_mask1(uint32_t mask) {
  const uint64_t pmask = 0x8080808080808080ULL; // bit unpacking mask for PDEP
  uint64_t amask0, amask1, amask2, amask3; 
  amask0 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask1 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask2 = _pdep_u64(mask, pmask);
  mask >>= 8;
  amask3 = _pdep_u64(mask, pmask);
  return _mm256_set_epi64x(amask3, amask2, amask1, amask0);
}

int main() {
  __m256i mask;
  boost::posix_time::ptime start(
      boost::posix_time::microsec_clock::universal_time()); 
  for(unsigned i(0); i != 1000000000; ++i)
    { 
      mask = _mm256_xor_si256(mask, get_mask3(i));
    }
  boost::posix_time::ptime end(
      boost::posix_time::microsec_clock::universal_time());
  std::cout << "duration:" << (end-start) << 
    " mask:" << _mm256_movemask_epi8(mask) << std::endl;
  return 0;
}

Enlist answered 10/2, 2014 at 9:24 Comment(4)

+1 for following up on all three suggestions and providing a nice summary of the results! Out of interest, what compiler did you use? – Lowlife 10/2, 2014 at 17:36

Thanks! I used both icc and g++. I have updated the timings with optimization flags. – Enlist 11/2, 2014 at 4:16

FWIW I ran some benchmarks with clang here and got similar results. – Lowlife 11/2, 2014 at 7:39

clang results: get_mask3: 0.9968 ns, get_mask2: 1.7413 ns, get_mask1: (check = 0) 2.291 ns – Lowlife 11/2, 2014 at 8:28

Here is an alternative to LUT or pdep instructions that might be more efficient:

Copy your 32-bit mask to both low bytes of some ymm register and bytes 16..19 of the same register. You could use temporary array and _mm256_load_si256. Or you could move single copy of 32-bit mask to low bytes of some ymm register, then broadcast it with VPBROADCASTD (_mm_broadcastd_epi32) or other broadcast/shuffle instructions.
Rearrange bytes of the register so that low 8 bytes (each) contain low 8 bits of your mask, next 8 bytes - next 8 bits, etc. This could be done with VPSHUFB (_mm256_shuffle_epi8) with control register containing '0' in low 8 bytes, '1' in next 8 bytes, etc.
Select proper bit for each byte with VPOR (_mm256_or_si256) or VPAND (_mm256_and_si256).
Set MSB of appropriate bytes with VPCMPEQB (_mm256_cmpeq_epi8). Compare each byte to 0xFF. If you want each bit of the mask toggled, use VPAND on previous step and compare to zero.

Middlebreaker answered 7/2, 2014 at 10:41 Comment(3)

Just use _mm256_set1_epi32 and let the compiler do a broadcast-load with vpbroadcastd ymm, [mem] if it wants to. – Anticipate 8/4, 2016 at 5:23

After the shuffle, use VPAND and VPCMPEQB to implement bitmap & (1<<bit) == (1<<bit). You only need one vector constant. – Anticipate 8/11, 2017 at 22:13

My initial approach to this was similar to @Jason R's because that is how "normal" operations work, but most of these operations only care about the high bit -- ignoring all the other bits. Once I realized this, the _mm*_maskz_broadcast*_epi*(mask,__m128i) series of functions made the most sense. You will need to enable -mavx512vl and -mavx512bw (gcc)

To get a vector with the highest bit of each byte set according to a mask:

/* convert 16 bit mask to __m128i control byte mask */
_mm_maskz_broadcastb_epi8((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m256i control byte mask */
_mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0))
/* convert 64 bit mask to __m512i control byte mask */
_mm512_maskz_broadcastb_epi8((__mmask64)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each word set according to a mask:

/* convert 8 bit mask to __m128i control word mask */
_mm_maskz_broadcastw_epi16((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m256i control word mask */
_mm256_maskz_broadcastw_epi16((__mmask16)mask,_mm_set1_epi32(~0))
/* convert 32 bit mask to __m512i control word mask */
_mm512_maskz_broadcastw_epi16((__mmask32)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each double word set according to a mask:

/* convert 8 bit mask to __m256i control mask */
_mm256_maskz_broadcastd_epi32((__mmask8)mask,_mm_set1_epi32(~0))
/* convert 16 bit mask to __m512i control mask */
_mm512_maskz_broadcastd_epi32((__mmask16)mask,_mm_set1_epi32(~0))

To get a vector with the highest bit of each quad word set according to a mask:

/* convert 8 bit mask to __m512i control mask */
_mm512_maskz_broadcastq_epi64((__mmask8)mask,_mm_set1_epi32(~0))

The one specific to this question is: _mm256_maskz_broadcastb_epi8((__mmask32)mask,_mm_set1_epi32(~0)) but I include the others for reference/comparison.

Note that each byte/word/... will either be all ones or all zeroes according to the mask (not just the highest bit). This can also be useful for doing vectorized bit operations (&'ing with another vector for instance to zero out unwanted bytes/words).

Another note: each _mm_set1_epi32(~0) could/should be converted to a constant (either manually or by the compiler), so it should compile to just one fairly quick operation, though it may be slightly faster in testing than in real life since the constant will likely stay in a register. Then these are converted to VPMOVM2{b,w,d,q} instructions

Edit: In case your compiler doesn't support AVX512, the inline assembly version should look like:

inline __m256i dmask2epi8(__mmask32 mask){
  __m256i ret;
  __asm("vpmovm2b   %1, %0":"=x"(ret):"k"(mask):);
  return ret;
}

The other instructions are similar.

Nunhood answered 27/8, 2015 at 11:8 Comment(1)

If you want 0 / -1, use _mm256_movm_epi8(mask), not a zero-masked broadcast. Another option for a value other than -1 is _mm256_maskz_mov_epi8(mask32, _mm256_set1_epi8(1)). If not for vpmovm2b, broadcast would be interesting because 128-bit all-ones is slightly cheaper to create (vpcmpeqd same,same is special-cased as dep-breaking) than 512-bit (vpternlogd z,z,z, 0xff), but broadcasts are shuffles that can only run on port 5. See also the AVX-512 section of Convert 16 bits mask to 16 bytes mask (which mostly wants 0 / 1, not a normal 0 / -1) – Anticipate 1/5, 2021 at 3:36

The only reasonably efficient way I can think of is with an 8 bit LUT: do 4 x 8 bit lookups and then load the results into a vector, e.g.

static const uint64_t LUT[256] = { 0x0000000000000000ULL,
                                   ...
                                   0xffffffffffffffffULL };

uint64_t amask[4] __attribute__ ((aligned(32)));

uint32_t mask;
__m256i vmask;

amask[0] = LUT[mask & 0xff];
amask[1] = LUT[(mask >> 8) & 0xff];
amask[2] = LUT[(mask >> 16) & 0xff];
amask[3] = LUT[mask >> 24];
vmask = _mm256_load_si256((__m256i *)amask);

Alternatively you could use registers instead of the temporary array and see if your compiler can do something more efficient that doesn't involve going via memory:

static const uint64_t LUT[256] = { 0x0000000000000000ULL,
                                   ...
                                   0xffffffffffffffffULL };

uint64_t amask0, amask1, amask2, amask3;

uint32_t mask;
__m256i vmask;

amask0 = LUT[mask & 0xff];
amask1 = LUT[(mask >> 8) & 0xff];
amask2 = LUT[(mask >> 16) & 0xff];
amask3 = LUT[mask >> 24];
vmask = _mm256_set_epi64x(amask3, amask2, amask1, amask0);

Afterthought: an interesting challenge might be to use e.g. Haswell BMI instructions to perform the equivalent of the 8 -> 64 bit LUT operation and thereby get rid of the LUT. It looks like you could use PDEP for this, e.g.

const uint64_t pmask = 0x8080808080808080ULL; // bit unpacking mask for PDEP

uint64_t amask0, amask1, amask2, amask3;

uint32_t mask;
__m256i vmask;

amask0 = _pdep_u64(mask, pmask); mask >>= 8;
amask1 = _pdep_u64(mask, pmask); mask >>= 8;
amask2 = _pdep_u64(mask, pmask); mask >>= 8;
amask3 = _pdep_u64(mask, pmask);
vmask = _mm256_set_epi64x(amask3, amask2, amask1, amask0);

Lowlife answered 7/2, 2014 at 8:19 Comment(1)

Yes I want to avoid LUT if possible, they are very costly compared to the register-based operations I am performing. – Enlist 7/2, 2014 at 9:8

Here's another implementation that might work on AVX2 since you had that tag on your question (it is untested since I don't have a Haswell machine). It is similar to Evgeny Kluev's answer, but it might take fewer instructions. It requires two constant __m256i masks, though. If you're doing this many times in a loop, then the overhead of setting up those constants once ahead of time may be negligible.

Take your 32-bit mask and broadcast it to all 8 slots of a ymm register using _mm_broadcastd_epi32().
Create a __m256i holding 8 32-bit integers with values [0, 1, 2, 3, 4, 5, 6, 7] (from the least-significant to most-significant element).
Use that constant mask to rotate each of the 32-bit integers in your ymm register left by a different amount, using _mm256_sllv_epi32().
Now, if we view the ymm register as holding 8-bit integers and look at their MSBs, then the register now holds the MSBs for byte indices [7, 15, 23, 31, 6, 14, 22, 30, 5, 13, 21, 29, 4, 12, 20, 28, 3, 11, 19, 27, 2, 10, 18, 26, 1, 9, 17, 25, 0, 8, 16, 24] (from the least-significant to the most-significant element).
Use a bitwise-AND against a constant mask of [0x80, 0x80, 0x80, ...] to isolate the MSBs from each byte.
Use a sequence of shuffles and/or permutes to get the elements back in the order that you want. Unfortunately, there is no any-to-any permute for 8-bit integers like there are for floating-point values in AVX2.

Johny answered 7/2, 2014 at 13:54 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags