Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows will contain data.

I am looking for an instruction (or instructions) to do the same thing with a byte array in a zmm register as VPCOMPRESSQ does with a qword array in a zmm register.

Consider a situation where rows 3, 6, 7, 9 and 12 contain data and all the other rows (of the 64 or fewer rows in total) are empty. Mask register k1 then holds 64 bits with only bits 3, 6, 7, 9 and 12 set: 0001001101001000 ... 0, reading from bit 0 upward.

With the VPCOMPRESSQ instruction, I can use k1 to compress the zmm register so that non-zero elements are arranged contiguously from the start of the register, but there is no equivalent instruction for bytes.

PSHUFB appears as a candidate, but its shuffle-control mask has to be a vector of byte indices, not a bitmask. I can build an index array like 0 0 0 3 0 0 6, etc., but the zero elements are still interleaved, and I need the selected elements packed contiguously from the start of a zmm register.

So my question is, given a zmm register with a byte array where k1 is set as shown above, how can I get the non-zero elements arranged contiguously from the beginning of the zmm register, like VPCOMPRESSQ does for qwords?

AVX512_VBMI2 has VPCOMPRESSB, but that's only available on Ice Lake and later. What can be done on Skylake-AVX512 / Cascade Lake?
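
For reference, this is the single AVX512_VBMI2 intrinsic that does what I'm after, just to make the desired operation concrete (illustrative snippet only; it requires Ice Lake or later):

#include <immintrin.h>
#include <cstdint>

// vpcompressb: pack the bytes selected by k1 to the front of the result and
// zero the remaining lanes (illustration only; needs AVX512_VBMI2).
__m512i compress_bytes_vbmi2(__m512i bytes, __mmask64 k1) {
    return _mm512_maskz_compress_epi8(k1, bytes);
}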

Oxazine asked 10/5, 2020 at 19:28

Comments:
You could expand 1 vector of bytes to 4 vectors of dwords so you can use vpcompressd (and right-shift the mask each time). Then you only have to handle the 3 boundaries between 16-element chunks when recombining. (Presumably repack down to 1-byte chunks for that, or maybe 2-byte for vpermt2w.) Or even store/reload if the extra latency is OK, letting the store buffer recombine multiple overlapping stores for one vector reload. (A sketch of this approach follows these comments.) – Marijo
When you say AVX, you mean AVX512, right? You're still talking about actually using the k1 mask register. Because if you had to do this without AVX512, see "AVX2 what is the most efficient way to pack left based on a mask?" for basically emulating vpcompressd for dword elements with AVX2 + BMI2 pext/pdep. Although maybe an entirely different strategy would make more sense for bytes. – Marijo
Yes, AVX512. Your first comment sounds expensive, but I expect to find that most of the time no more than 16 rows will be filled, in which case it could work, using the 17th and later only when needed. I'll look at your link about doing this without AVX512. – Oxazine
The hardware left-packing options are dword with AVX512F, bit in a 64-bit register with BMI2, or looking up a shuffle-control vector from a table of masks. Unfortunately for your case, the last option would need a 64-bit index into a table of 2^64 x 64-byte control vectors for vpermb (AVX512VBMI: Ice Lake), or splitting that up somehow into multiple pshufb lookups for small chunks of at most 8 elements. But vpcompressd can do 16 elements at once efficiently, so use that. Yes, it's expensive to left-pack 64 elements. – Marijo
I just learned that there is now a VPCOMPRESSB instruction, which does not show up at felixcloutier.com/x86, so it must be fairly new, but it's in the Oct 2019 release of the Intel combined volumes set. – Oxazine
Oh yeah, github.com/HJLebbink/asm-dude/wiki/VPCOMPRESS says it's part of AVX512_VBMI2. I'd forgotten or never noticed that. So yes, if you have an Ice Lake CPU, you're all set! en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512. According to uops.info, it's 2 uops for port 5 / 2c throughput for the ZMM version: uops.info/…. Amusingly they tested it both with and without masking. Without masking it's just a copy :P – Marijo
See, that's the problem with a lot of AVX-512 instructions: coding for older hardware. For that, the best solution I can see now is your first suggestion of using the 32-bit (dword) VPCOMPRESSD instruction. That would work. At least Intel now has the VPCOMPRESSB instruction I need, even if it's not available on most existing chips. – Oxazine
That's the same problem as always, just the naming has changed. Like when AVX2 was brand-new, you'd still want an AVX1 or SSE4.2 version of something that could run on existing CPUs. At least vpcompressd provides the primitive operation you want. – Marijo
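
A rough sketch of the dword-expansion idea from the first comment (untested, hypothetical helper name; it needs only AVX-512F, so it runs on Skylake-AVX512 / Cascade Lake): widen each 16-byte chunk to dwords, left-pack it with vpcompressd, narrow back to bytes with vpmovdb, and append each piece at an offset advanced by that chunk's popcount.

#include <immintrin.h>
#include <cstdint>
#include <cstring>

void CompressBytes_AVX512F(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    alignas(64) uint8_t tmp[64];
    unsigned j = 0;                                   // running popcount = output offset
    for (int i = 0; i < 4; i++) {
        __mmask16 k = (__mmask16)(mask >> (16 * i));  // mask bits for this 16-byte chunk
        // widen 16 bytes to 16 dwords so vpcompressd (AVX-512F) can left-pack them
        __m512i d = _mm512_cvtepu8_epi32(_mm_loadu_si128((const __m128i*)&src[16 * i]));
        d = _mm512_maskz_compress_epi32(k, d);
        // narrow the packed dwords back to bytes (vpmovdb) and append the piece;
        // j <= 48 at this point, so the 16-byte store stays inside tmp[]
        _mm_storeu_si128((__m128i*)&tmp[j], _mm512_cvtepi32_epi8(d));
        j += (unsigned)_mm_popcnt_u32(k);
    }
    memcpy(dest, tmp, j);                             // only popcount(mask) bytes are valid
}

Compressing into tmp and copying out limits the writes to dest to popcount(mask) bytes; if the destination always has 64 bytes of room, the four 16-byte pieces could instead be stored straight to dest + j.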

Here's what I use as AVX2 fallbacks for both vpcompressb and vpexpandb. It's based on the solutions posted in this previous question.

Note that Decompress() may load out of bounds and fault when crossing a page. If this is an issue, the input can be copied into a temporary buffer first, as is done in Compress().

void MaskCompress_AVX2(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    alignas(64) uint8_t temp[64];
    int j = 0;
    for (int i = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each bit to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        uint64_t dest_idx_nib = _pext_u64(0xFEDC'BA98'7654'3210, extended_mask);

        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));

        // the load from &src[i] is trivially in bounds; the 16-byte store to
        // &temp[j] stays inside temp[] because j <= i
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)&src[i]), dest_idx);
        _mm_storeu_si128((__m128i*)&temp[j], values);

        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
    __builtin_memcpy(dest, temp, (uint32_t)j);
}
void MaskDecompress_AVX2(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (int i = 0, j = 0; i < 64; i += 16) {
        // ABC... -> AAAABBBBCCCC...: replicate each bit to fill a nibble
        uint64_t extended_mask = _pdep_u64(mask, 0x1111'1111'1111'1111) * 0xF;
        uint64_t dest_idx_nib = _pdep_u64(0xFEDC'BA98'7654'3210, extended_mask);

        __m128i dest_idx = _mm_cvtsi64_si128((int64_t)dest_idx_nib);
        dest_idx = _mm_unpacklo_epi8(dest_idx, _mm_srli_epi16(dest_idx, 4));
        dest_idx = _mm_and_si128(dest_idx, _mm_set1_epi8(15));

        __m128i zero_mask = _mm_cvtsi64_si128((int64_t)~extended_mask);
        zero_mask = _mm_unpacklo_epi8(zero_mask, _mm_srli_epi16(zero_mask, 4));
    
        // shuffle_epi8 outputs zeroes when high index bit is set
        dest_idx = _mm_or_si128(dest_idx, _mm_slli_epi16(zero_mask, 4));

        // this 16-byte load from &src[j] may read past the end of the
        // compressed data (see the note above about out-of-bounds loads)
        auto values = _mm_shuffle_epi8(_mm_loadu_si128((__m128i*)&src[j]), dest_idx);
        _mm_storeu_si128((__m128i*)&dest[i], values);

        j += _mm_popcnt_u32(mask & 0xFFFF);
        mask >>= 16;
    }
}

void MaskCompress_Scalar(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    for (uint32_t i = 0, j = 0; mask != 0; i++) {
        dest[j] = src[i];
        j += (mask & 1);
        mask >>= 1;
    }
}
void MaskDecompress_Scalar(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    for (uint32_t i = 0, j = 0; i < 64; i++) {
        uint8_t v = src[j];
        dest[i] = (mask & 1) ? v : 0;
        j += (mask & 1);
        mask >>= 1;
    }
}
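
As a quick sanity check of the fallbacks (my test harness, not part of the original code): compressing and then decompressing with the same mask should reproduce the selected bytes and zero everything else.

#include <cassert>
#include <cstdint>

// assumes the MaskCompress_AVX2 / MaskDecompress_AVX2 definitions above
int main() {
    alignas(64) uint8_t src[64], packed[64] = {}, out[64];
    for (int i = 0; i < 64; i++) src[i] = (uint8_t)i;
    uint64_t mask = 0x1248;                    // bits 3, 6, 9, 12 set
    MaskCompress_AVX2(src, packed, mask);      // packed now starts with 3, 6, 9, 12
    MaskDecompress_AVX2(packed, out, mask);
    for (int i = 0; i < 64; i++)
        assert(out[i] == (((mask >> i) & 1) ? src[i] : 0));
}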

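For comparison, the "AVX512" rows below are presumably single-intrinsic versions along these lines (a sketch assuming AVX512_VBMI2, which Tiger Lake has; the function names simply mirror the fallbacks above):

#include <immintrin.h>
#include <cstdint>

void MaskCompress_AVX512(const uint8_t src[64], uint8_t* dest, uint64_t mask) {
    // vpcompressb into a register, then a plain 64-byte store (so dest needs
    // 64 bytes of room). The alternative _mm512_mask_compressstoreu_epi8()
    // writes only popcount(mask) bytes, but it is the slower compress-store
    // form mentioned below the benchmark table.
    __m512i packed = _mm512_maskz_compress_epi8(mask, _mm512_loadu_si512(src));
    _mm512_storeu_si512(dest, packed);
}
void MaskDecompress_AVX512(const uint8_t* src, uint8_t dest[64], uint64_t mask) {
    // vpexpandb: scatter the first popcount(mask) bytes of src to the set bit
    // positions and zero the rest. As with the AVX2 fallback, the full 64-byte
    // load may read past the end of the compressed data.
    _mm512_storeu_si512(dest, _mm512_maskz_expand_epi8(mask, _mm512_loadu_si512(src)));
}
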
Some benchmarks from my Tiger Lake CPU (masks generated at random):

 ns/op            op/s   err%   total   benchmark
  2.69  371,063,085.72   1.9%    0.04   Compress AVX512
 10.22   97,820,644.12   1.2%    0.17   Compress AVX2
 24.36   41,053,932.84   0.7%    0.39   Compress Scalar
  1.09  918,672,592.02   1.2%    0.03   Decompress AVX512
  6.44  155,358,119.20   0.5%    0.11   Decompress AVX2
 24.83   40,274,581.73   0.7%    0.40   Decompress Scalar

The compress-store form (compressing directly to a memory destination) is apparently twice as slow as compressing into a register and then storing, so it might be a good idea to avoid it when possible.

Frustule answered 22/10, 2024 at 2:40
