Here's what I use as AVX2 fallbacks for both vpcompressb and vpexpandb. It's based on the solutions posted in this previous question.
Note that Decompress() may load out of bounds and fault when the load crosses into an unmapped page. If this is an issue, the input can be copied into a temporary buffer first, as is done in Compress().
#include <immintrin.h>
#include <cstdint>

void MaskCompress_AVX2(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    alignas(64) uint8_t temp[64];
    int j = 0;
    for (int i = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each mask bit to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        uint64_t dest_idx_nib = _pext_u64(0xFEDC'BA98'7654'3210, extended_mask);
        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));
        // the store into temp never goes out of bounds because `j <= i` always holds
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)&src[i]), dest_idx);
        _mm_storeu_si128((__m128i*)&temp[j], values);
        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
    __builtin_memcpy(dest, temp, (uint32_t)j);
}
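To make the PDEP/PEXT index-building trick above more concrete, here is a small worked example for a single 16-byte chunk. The numbers and the function name PdepPextExample are my own, not part of the original answer:

#include <cassert>

// Worked example: mask = 0b1011 keeps src[0], src[1] and src[3]
void PdepPextExample() {
    // each mask bit widened to a nibble: 0b1011 -> 0xF0FF
    uint64_t ext = _pdep_u64(0b1011, 0x1111'1111'1111'1111) * 0xF;
    assert(ext == 0xF0FF);
    // gather the nibble indices of the kept positions: 0x310
    uint64_t idx = _pext_u64(0xFEDC'BA98'7654'3210, ext);
    assert(idx == 0x310);
    // the packed nibbles 0, 1, 3 are the source positions of the kept bytes,
    // in destination order; the pshufb then moves them to the front
}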
void MaskDecompress_AVX2(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (int i = 0, j = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each mask bit to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        uint64_t dest_idx_nib = _pdep_u64(0xFEDC'BA98'7654'3210, extended_mask);
        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));
        __m128i zero_mask = _mm_cvtsi64_si128((int64_t)~extended_mask);
        zero_mask = _mm_unpacklo_epi8(zero_mask, _mm_srli_epi16(zero_mask, 4));
        // shuffle_epi8 outputs zero for any lane whose high index bit is set
        dest_idx = _mm_or_si128(dest_idx, _mm_slli_epi16(zero_mask, 4));
        // `j <= i` always holds, so the load stays within a 64-byte source;
        // it can still read past the compressed bytes (see the note above)
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)&src[j]), dest_idx);
        _mm_storeu_si128((__m128i*)&dest[i], values);
        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
}
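Following the note at the top, a minimal sketch of the temporary-buffer approach for the decompress side (the wrapper name is mine, not from the original answer):

// Hypothetical wrapper: copy the popcount(mask) meaningful bytes of src into
// a 64-byte buffer so the unaligned 16-byte loads above cannot cross into an
// unmapped page.
void MaskDecompress_AVX2_Safe(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    alignas(64) uint8_t temp[64];
    __builtin_memcpy(temp, src, (uint32_t)_mm_popcnt_u64(mask));
    MaskDecompress_AVX2(temp, dest, mask);
}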
void MaskCompress_Scalar(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    // branchless: always write the current byte, but only advance the
    // output index when the corresponding mask bit is set
    for (uint32_t i = 0, j = 0; mask != 0; i++) {
        dest[j] = src[i];
        j += (mask & 1);
        mask >>= 1;
    }
}
void MaskDecompress_Scalar(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (uint32_t i = 0, j = 0; i < 64; i++) {
        // branchless: src[j] is read unconditionally, so this can read one byte
        // past the end of the compressed data when the upper mask bits are zero
        uint8_t v = src[j];
        dest[i] = (mask & 1) ? v : 0;
        j += (mask & 1);
        mask >>= 1;
    }
}
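The AVX-512 versions that the benchmarks below compare against aren't shown in the answer. A minimal sketch of what they might look like with vpcompressb/vpexpandb (requires AVX512_VBMI2 and AVX512BW; not necessarily the exact code that was measured):

void MaskCompress_AVX512(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    alignas(64) uint8_t temp[64];
    // vpcompressb: pack the bytes selected by mask to the front of the vector,
    // then copy only the meaningful bytes out
    _mm512_storeu_si512(temp, _mm512_maskz_compress_epi8(mask, _mm512_loadu_si512(src)));
    __builtin_memcpy(dest, temp, (uint32_t)_mm_popcnt_u64(mask));
}

void MaskDecompress_AVX512(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    // vpexpandb with a memory source: reads the compressed bytes and scatters
    // them to the positions selected by mask, zeroing the unselected positions
    _mm512_storeu_si512(dest, _mm512_maskz_expandloadu_epi8(mask, src));
}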
Some benchmarks from my Tiger Lake CPU (masks generated at random):
| ns/op |           op/s | err% | total | benchmark         |
|------:|---------------:|-----:|------:|:------------------|
|  2.69 | 371,063,085.72 | 1.9% |  0.04 | Compress AVX512   |
| 10.22 |  97,820,644.12 | 1.2% |  0.17 | Compress AVX2     |
| 24.36 |  41,053,932.84 | 0.7% |  0.39 | Compress Scalar   |
|  1.09 | 918,672,592.02 | 1.2% |  0.03 | Decompress AVX512 |
|  6.44 | 155,358,119.20 | 0.5% |  0.11 | Decompress AVX2   |
| 24.83 |  40,274,581.73 | 0.7% |  0.40 | Decompress Scalar |
compressstore is apparently twice as slow as compress+store, so it might be a good idea to avoid it when possible.
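For reference, a sketch of the two store strategies being compared for one 64-byte chunk (assuming the AVX512_VBMI2 byte variants; the function name and the pre-loaded vector `v` are mine):

void CompressStoreVariants(uint8_t* dest, __m512i v, uint64_t mask) {
    // 1) single instruction with a memory destination (vpcompressb to memory)
    _mm512_mask_compressstoreu_epi8(dest, mask, v);
    // 2) compress into a register, then a plain store; reportedly faster,
    //    but always writes a full 64 bytes (use a temp buffer if dest is smaller)
    _mm512_storeu_si512(dest, _mm512_maskz_compress_epi8(mask, v));
}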
From the comments:

…vpcompressd (and right-shift the mask each time). Then you only have to handle the 3 boundaries between 16-element chunks when recombining. (Presumably repack down to 1-byte chunks for that, or maybe 2-byte for vpermt2w.) Or even store/reload if the extra latency is ok from letting the store buffer recombine multiple overlapping stores for one vector reload. – Marijo

…k1 mask register. Because if you had to do this without AVX512, see "AVX2 what is the most efficient way to pack left based on a mask?" for basically emulating vpcompressd for dword elements with AVX2 + BMI2 pext/pdep. Although maybe an entirely different strategy would make more sense for bytes. – Marijo

vpermb (AVX512VBMI: Ice Lake), or splitting that up somehow into multiple pshufb lookups for small chunks like 8 elements at most. But vpcompressd can do 16 elements at once efficiently, so use that. Yes, it's expensive to left-pack 64 elements. – Marijo

vpcompressd provides the primitive operation you want. – Marijo