is there an inverse instruction to the movemask instruction in intel avx2?

The movemask instruction(s) take an __m256i and return an int32 where each bit (either the first 4, 8 or all 32 bits depending on the input vector element type) is the most significant bit of the corresponding vector element.

I would like to do the inverse: take an int32 (where only the 4, 8 or 32 least significant bits are meaningful), and get a __m256i where the most significant bit of each int64, int32 or int8 sized block is set to the original bit.

Basically, I want to go from a compressed bitmask to one that is usable as a mask by other AVX2 instructions (such as maskstore, maskload, mask_gather).

I couldn't quickly find an instruction that does it, so I am asking here. If there isn't one instruction with that functionality, is there a clever hack you can think of that achieves this in very few instructions?

My current method is to use a 256 element lookup table. I want to use this operation within a loop where not much else is happening, to speed it up. Note, I'm not too interested in long multi-instruction sequences or little loops that implement this operation.

Genarogendarme answered 7/4, 2016 at 23:1 Comment(3)
Possible duplicate of How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?Phaih
Many good answers over on that potential duplicate, but they're mostly considering the 8bit element case. My answer here only really covered the 32bit element case. (because variable-shifts don't exist for narrower elements)Phaih
Just curious, why did you accept no answer?Herzig

There is no single instruction in AVX2 or earlier. (AVX512 can use masks in bitmap form directly, and has an instruction to expand masks to vectors).



For your case, if you're loading the bitmap from memory, loading it straight into vector registers for an ALU strategy should work well even for 4-bit masks.

If you have the bitmap as a computation result, then it will be in an integer register where you can use it as a LUT index easily, so that's a good choice if you're aiming for 64-bit elements. Otherwise probably still go ALU for 32-bit elements or smaller, instead of a giant LUT or doing multiple chunks.


We'll have to wait for AVX-512's mask registers before cheap conversion from integer bitmasks to vector masks is possible. (With kmovw k1, r/m16, which compilers generate implicitly for int => __mmask16.) There's an AVX512 insn to set a vector from a mask (VPMOVM2D zmm1, k1, with versions for other element sizes; the intrinsics are _mm512_movm_epi8/16/32/64), but you generally don't need it since everything that used to use mask vectors now uses mask registers. Maybe if you want to count elements that meet some comparison condition? (Where you'd use pcmpeqd / psubd to generate and accumulate a vector of 0 or -1 elements.) But scalar popcnt on the mask results would be a better bet.

But note that vpmovm2d requires the mask to be in an AVX512 k0..7 mask register. Getting it there will take extra instructions unless it came from a vector compare result, and instructions that move into mask registers need a uop for port 5 on Intel Skylake-X and similar CPUs so this can be a bottleneck (especially if you do any shuffles). Especially if it starts in memory (loading a bitmap) and you only need the high bit of each element, you're probably still better off with a broadcast load + variable shift even if 256-bit and 512-bit AVX512 instructions are available.

Also possible (for a 0/1 result instead of 0/-1) is a zero-masking load from a constant like _mm_maskz_mov_epi8(mask16, _mm_set1_epi8(1)). https://godbolt.org/z/1sM8hY8Tj
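
For example, with AVX512VL + AVX512DQ available, expanding an 8-bit mask to dword elements is a single intrinsic (a minimal sketch; the function name is just for illustration, and the int -> __mmask8 conversion happens implicitly):

#include <immintrin.h>

// Requires AVX512VL + AVX512DQ.  vpmovm2d ymm, k:
// each 32-bit element becomes 0 or -1 according to the corresponding mask bit.
__m256i bitmap2vecmask_avx512(unsigned m) {
    return _mm256_movm_epi32((__mmask8)m);
}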


For 64-bit elements, the mask only has 4 bits, so a lookup table is reasonable. You can compress the LUT by loading it with VPMOVSXBQ ymm1, xmm2/m32. (_mm256_cvtepi8_epi64). This gives you a LUT size of (1<<4) = 16 * 4 bytes = 64B = 1 cache line. Unfortunately, pmovsx is inconvenient to use as a narrow load with intrinsics.

Especially if you already have your bitmap in an integer register (instead of memory), a vpmovsxbq LUT should be excellent inside an inner loop for 64-bit elements. Or if instruction throughput or shuffle throughput is a bottleneck, use an uncompressed LUT. This can let you (or the compiler) use the mask vector as a memory operand for something else, instead of needing a separate instruction to load it.
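
A sketch of that compressed LUT with intrinsics (untested; the table name is made up, and the memcpy + _mm_cvtsi32_si128 is just one way to express the narrow load since _mm256_cvtepi8_epi64 wants a __m128i — the compiler may or may not fold it into a memory-source vpmovsxbq):

#include <immintrin.h>
#include <string.h>

// 16 entries of 4 bytes: byte i of entry m is -1 if bit i of m is set.
// 64 bytes total = one cache line if 64-byte aligned.
static const signed char lut4_epi64[16][4] = {
    { 0, 0, 0, 0}, {-1, 0, 0, 0}, { 0,-1, 0, 0}, {-1,-1, 0, 0},
    { 0, 0,-1, 0}, {-1, 0,-1, 0}, { 0,-1,-1, 0}, {-1,-1,-1, 0},
    { 0, 0, 0,-1}, {-1, 0, 0,-1}, { 0,-1, 0,-1}, {-1,-1, 0,-1},
    { 0, 0,-1,-1}, {-1, 0,-1,-1}, { 0,-1,-1,-1}, {-1,-1,-1,-1},
};

__m256i bitmap2vecmask_epi64_lut(unsigned m) {
    int chunk;
    memcpy(&chunk, lut4_epi64[m & 0xf], 4);                 // 4-byte LUT load
    return _mm256_cvtepi8_epi64(_mm_cvtsi32_si128(chunk));  // vpmovsxbq: sign-extend each byte to a qword
}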


LUT for 32-bit elements: probably not optimal but here's how you could do it

With 32-bit elements, an 8-bit mask gives you 256 possible vectors, each 8 elements long. 256 * 8B = 2048 bytes, which is a pretty big cache footprint even for the compressed version (load with vpmovsxbd ymm, m64).

To work around this, you can split the LUT into 4-bit chunks. It takes about 3 integer instructions to split up an 8-bit integer into two 4-bit integers (mov/and/shr). Then with an uncompressed LUT of 128b vectors (for 32-bit element size), vmovdqa the low half and vinserti128 the high half. You could still compress the LUT, but I wouldn't recommend it because you'll need vmovd / vpinsrd / vpmovsxbd, which is 2 shuffles (so you probably bottleneck on uop throughput).

Or 2x vpmovsxbd xmm, [lut + rsi*4] + vinserti128 is probably even worse on Intel.
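
A sketch of the split-LUT idea with intrinsics (untested; the table name is made up, and you'd align the table and use _mm_load_si128 to get the vmovdqa described above):

#include <immintrin.h>

// 16 entries of four dwords: dword i of entry n is -1 if bit i of n is set.
// 16 * 16 bytes = 256 bytes total.
static const int lut4_epi32[16][4] = {
    { 0, 0, 0, 0}, {-1, 0, 0, 0}, { 0,-1, 0, 0}, {-1,-1, 0, 0},
    { 0, 0,-1, 0}, {-1, 0,-1, 0}, { 0,-1,-1, 0}, {-1,-1,-1, 0},
    { 0, 0, 0,-1}, {-1, 0, 0,-1}, { 0,-1, 0,-1}, {-1,-1, 0,-1},
    { 0, 0,-1,-1}, {-1, 0,-1,-1}, { 0,-1,-1,-1}, {-1,-1,-1,-1},
};

__m256i bitmap2vecmask_epi32_lut(unsigned m) {      // m = 8-bit mask
    __m128i lo = _mm_loadu_si128((const __m128i*)lut4_epi32[m & 0xf]);        // elements 0..3
    __m128i hi = _mm_loadu_si128((const __m128i*)lut4_epi32[(m >> 4) & 0xf]); // elements 4..7
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);        // vinserti128
}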


ALU alternative: good for 16/32/64-bit elements

When the whole bitmap fits in each element: broadcast it, AND with a selector mask, and VPCMPEQ against the same constant (which can stay in a register across multiple uses of this in a loop).

vpbroadcastd  ymm0,  dword [mask]            ; _mm256_set1_epi32
vpand         ymm0, ymm0,  setr_epi32(1<<0, 1<<1, 1<<2, 1<<3, ..., 1<<7)
vpcmpeqd      ymm0, ymm0,  [same constant]   ; _mm256_cmpeq_epi32
      ; ymm0 =  (mask & bit) == bit
      ; where bit = 1<<element_number
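
In intrinsics, the same thing might look like this (a sketch; the function name is made up, and a compiler should turn the set1 into vmovd + vpbroadcastd, or a plain broadcast load if the mask comes from memory):

#include <immintrin.h>

// Each dword element = -1 if the corresponding mask bit is set, else 0.
__m256i bitmap2vecmask_cmp_epi32(int m) {
    const __m256i bits = _mm256_setr_epi32(1<<0, 1<<1, 1<<2, 1<<3,
                                           1<<4, 1<<5, 1<<6, 1<<7);
    __m256i bcast = _mm256_set1_epi32(m);            // vpbroadcastd
    __m256i anded = _mm256_and_si256(bcast, bits);   // isolate one mask bit per element
    return _mm256_cmpeq_epi32(anded, bits);          // (mask & bit) == bit  ->  0 / -1
}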

The mask could come from an integer register with vmovd + vpbroadcastd, but a broadcast-load is cheap if it's already in memory, e.g. from a mask array to apply to an array of elements. We actually only care about the low 8 bits of that dword because 8x 32-bit elements = 32 bytes. (e.g. that you got from vmovmaskps). With a 16-bit mask for 16x 16-bit elements, you need vpbroadcastw. To get such a mask in the first place from 16-bit integer vectors, you might vpacksswb two vectors together (which preserves the sign bit of each element), vpermq to put the elements into sequential order after in-lane pack, then vpmovmskb.

For 8-bit elements, you will need to vpshufb the vpbroadcastd result to get the relevant bit into each byte. See How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?. But for 16-bit and wider elements, the number of elements is <= the element width, so a broadcast-load does this for free. (16-bit broadcast loads do cost a micro-fused ALU shuffle uop, unlike 32 and 64-bit broadcast loads which are handled entirely in the load ports.)

vpbroadcastd/q doesn't even cost any ALU uops, it's done right in the load port. (b and w are load+shuffle). Even if your masks are packed together (one per byte for 32 or 64-bit elements), it might still be more efficient to vpbroadcastd instead of vpbroadcastb. The x & mask == mask check doesn't care about garbage in the high bytes of each element after the broadcast. The only worry is cache-line / page splits.
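
For 16-bit elements the same pattern would look like this (a sketch, assuming you already have a 16-bit bitmap with one bit per word element, e.g. produced by the pack / vpermq / vpmovmskb sequence above; the function name is made up):

#include <immintrin.h>

// 16 mask bits -> 16 words of 0 / -1.
__m256i bitmap2vecmask_cmp_epi16(int m) {
    const __m256i bits = _mm256_setr_epi16(
        1<<0, 1<<1, 1<<2,  1<<3,  1<<4,  1<<5,  1<<6,  1<<7,
        1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, (short)0x8000);  // 1<<15 doesn't fit in a signed short
    __m256i bcast = _mm256_set1_epi16((short)m);     // vpbroadcastw (costs a shuffle uop)
    return _mm256_cmpeq_epi16(_mm256_and_si256(bcast, bits), bits);
}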


Variable shift (cheaper on Skylake) if you need just the sign bit

Variable blends and masked loads/stores only care about the sign bit of the mask elements.

This is only 1 uop (on Skylake) once you have the 8-bit mask broadcast to dword elements.

vpbroadcastd  ymm0, dword [mask]

vpsllvd       ymm0, ymm0, [vec of 24, 25, 26, 27, 28, 29, 30, 31]  ; high bit of each element = corresponding bit of the mask (counts listed in the same high-to-low order as _mm256_set_epi32 below: element 0 gets count 31)

;vpsrad        ymm0, ymm0, 31                          ; broadcast the sign bit of each element to the whole element
;vpsllvd + vpsrad has no advantage over vpand / vpcmpeqd, so don't use this if you need all the bits set.

vpbroadcastd is as cheap as a load from memory (no ALU uop at all on Intel CPUs and Ryzen). (Narrower broadcasts, like vpbroadcastb y,mem take an ALU shuffle uop on Intel, but maybe not on Ryzen.)

The variable-shift is slightly expensive on Haswell/Broadwell (3 uops, limited execution ports), but as cheap as immediate-count shifts on Skylake! (1 uop on port 0 or 1.) On AMD before Zen 3 they don't cost extra uops, but are slow (3c latency and 1/4 the throughput of a normal shift uop). On Zen 1 this is extra bad because 256-bit operations in general run as 2 uops. But it's not a disaster, especially if other uops can use other execution units on the same port while they're taking extra cycles (IDK if that's possible). On Zen 3 and later they perform as well as on Skylake, 1c latency with 0.5c throughput.

See the tag wiki for perf info, especially Agner Fog's insn tables and https://uops.info/.

For 64-bit elements, note that arithmetic right shifts are only available in 16 and 32-bit element sizes. Use a different strategy if you want the whole element set to all-zero / all-one for 4 bits -> 64-bit elements.
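
For that case, one fallback is the and/cmpeq strategy from above at qword element size (a sketch; the function name is made up):

#include <immintrin.h>

// 4-bit mask -> four qword elements of 0 / -1, without needing an
// arithmetic right shift (vpsraq doesn't exist in AVX2).
__m256i bitmap2vecmask_cmp_epi64(int m) {
    const __m256i bits = _mm256_setr_epi64x(1, 2, 4, 8);
    __m256i bcast = _mm256_set1_epi64x(m);           // vpbroadcastq
    return _mm256_cmpeq_epi64(_mm256_and_si256(bcast, bits), bits);  // vpcmpeqq
}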

With intrinsics, the variable-shift version:

// AVX2, most efficient on Skylake and Zen 3 and later
// if you just need the MSBs set.  Otherwise still use and/cmpeq
__m256i bitmap2vecmask(int m) {
    const __m256i vshift_count = _mm256_set_epi32(24, 25, 26, 27, 28, 29, 30, 31);
    __m256i bcast = _mm256_set1_epi32(m);
    __m256i shifted = _mm256_sllv_epi32(bcast, vshift_count);  // high bit of each element = corresponding bit of the mask
    return shifted;

    // if you need all the bits set, use _mm256_and_si256 + _mm256_cmpeq_epi32 instead of two shifts.
    // would work but not worth it: return _mm256_srai_epi32(shifted, 31);             // broadcast the sign bit to the whole element
}
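
Since variable blends and masked loads/stores only look at the sign bit, the result can be fed to them directly; for example, a usage sketch with the bitmap2vecmask above (the wrapper name is made up):

#include <immintrin.h>

// Store only the elements whose mask bit is set (vpmaskmovd).
void store_selected(int *dst, __m256i vals, int m) {
    _mm256_maskstore_epi32(dst, bitmap2vecmask(m), vals);
}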

Inside a loop, a LUT might be worth the cache footprint, depending on the instruction mix in the loop. Especially for 64-bit element size where it's not much cache footprint, but possibly even for 32-bit.


Another option, instead of variable shift, is to use BMI2 to unpack each bit to a byte with that mask element in the high bit, then vpmovsx:

; 8bit mask bitmap in eax, constant in rdi

pdep      rax, rax, rdi   ; rdi = 0b1000000010000000... repeating
vmovq     xmm0, rax
vpmovsxbd ymm0, xmm0      ; each element = 0xffffff80 or 0

; optional
;vpsrad    ymm0, ymm0, 8   ; arithmetic shift to get -1 or 0

If you already have masks in an integer register (where you'd have to vmovq / vpbroadcastd separately anyway), then this way is probably better even on Skylake where variable-count shifts are cheap.

If your masks start in memory, the other ALU method (vpbroadcastd directly into a vector) is probably better, because broadcast-loads are so cheap.

Note that pdep is 6 dependent uops on Zen 1 and Zen 2 (18c latency, 18c throughput, or worse depending on the bits), so this method is horrible on Ryzen even if your masks do start in integer regs. Zen 3 and later have dedicated pext/pdep hardware and run them as efficiently as Intel, as a single uop.

(Future readers, feel free to edit in an intrinsics version of this. It's easier to write asm because it's a lot less typing, and the asm mnemonics are easier to read (no stupid _mm256_ clutter all over the place).)
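
For what it's worth, a possible intrinsics version (untested sketch; the function name is made up, it needs BMI2 plus AVX2, and _mm_cvtsi64_si128 requires a 64-bit build):

#include <immintrin.h>
#include <stdint.h>

__m256i bitmap2vecmask_pdep(unsigned m) {
    uint64_t spread = _pdep_u64(m, 0x8080808080808080ULL);  // mask bit i -> bit 7 of byte i
    __m128i bytes  = _mm_cvtsi64_si128((long long)spread);  // vmovq
    __m256i v      = _mm256_cvtepi8_epi32(bytes);           // vpmovsxbd: each element = 0xffffff80 or 0
    // return _mm256_srai_epi32(v, 8);                      // optional: -1 or 0 in every bit
    return v;                                               // the sign bit is already correct
}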

Phaih answered 8/4, 2016 at 4:42 Comment(8)
"It's worse if your masks start in memory, since broadcast-loading into a vector is so cheap." - could you clarify this? What's worse and what's better? My masks start in memory (and I'm on Ryzen), so what should I use?Sheila
@SergeRogatch: Then both factors are in favour of the variable-shift method. (Or maybe the compressed-LUT since you have 64-bit elements.)Phaih
@PeterCordes: ALU alternative: good for 16/32/64-bit elements - I don't see how this can work for 16 shorts. Am I missing something?Leptosome
@DenisYaroshevskiy: I'm not sure what problem you think there would be, since you didn't mention one. _mm256_set1_epi16 repeats the 16-bit mask 16 times. A vector constant of _mm256_setr_epi16(1<<0, 1<<1, ..., 1<<15) can match one bit in each element because an element is at least as wide as the mask. vpbroadcastw, vpand and vpcmpeqw all exist in AVX2.Phaih
@PeterCordes sorry - what I meant is that you need two different parts of the mask for two pieces. High 128 bits will have top 16 bits and low 128 bits - low 16 bits. Another way to say it would be that the whole 32 bit mask won't fit into epi16. I might not understand something.Leptosome
@DenisYaroshevskiy: What 32-bit mask? A 32-byte YMM vector holds 16x 16-bit elements, so you only need a 16-bit mask. If you have 32 mask bits, then each 16-bit half can be expanded separately into separate __m256i vars. For 32-bit elements I used a vpbroadcastd load because it's cheaper than vpbroadcastb, and we only need the 8 mask bits in the bottom of each dword vector element.Phaih
@PeterCordes - movemask will return a 32 bit mask, right? Every second bit is duplicated but you need to deal with that.Leptosome
@DenisYaroshevskiy: That's not the case I'm talking about. My answer is for 1 bit per 2-byte element, where you did pack your bitmask. e.g. with vpacksswb +vpermq before vpmovmskb, to narrow vector elements preserving the sign bit. 32/64-bit elements are easier, just use vmovmskps/d. If you take a _mm256_movemask_epi8 result directly, it's still a byte mask for 8-bit elements and you have to unpack it as such. (Possibly some optimizations are possible when you know about the redundancy). I'll think about an update for this answer in case anyone else has the same misunderstanding.Phaih
