Can AVX2 be used to implement faster LZCNT processing on a word array?

I need to bit-scan-reverse (LZCNT) an array of 16-bit words.

The throughput of LZCNT is 1 per clock on the latest-generation Intel processors. On AMD Ryzen the throughput appears to be 4 per clock.

I am trying to find a faster algorithm using the AVX2 instruction set.

I know AVX-512 has VPLZCNTD for 32-bit elements, so if I had AVX512CD I could unpack and use that.

With just the AVX2 instruction set, is it possible to code an algorithm faster than using the x86 LZCNT instruction?
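
For reference, a minimal sketch of the scalar baseline (hypothetical function and pointer names; results stored as one uint8 per element, as discussed in the comments below):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Scalar baseline: one LZCNT per 16-bit element.
// _lzcnt_u32 counts leading zeros of the zero-extended value, so subtract 16.
void lzcnt_u16_scalar(const uint16_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(_lzcnt_u32(src[i]) - 16);
    }
}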

Milan asked 15/5, 2019 at 15:43 Comment(10)
Do you need an array of results, one per element? Or are you doing one scan over a large array to find the highest set bit in the whole array? If the latter, yes use AVX2 vpcmpeqb to simply search for a non-zero byte, then bitscan it.Jarrell
What do you need to do with the result? Store it? If so, having the result in a vector is nice even on Ryzen. 4-per-clock lzcnt and 2-per-clock loads don't help if you're limited to 1-per-clock store.Jarrell
I doubt you'll be able to beat it without AVX-512. At 1 op per clock that's 16 clocks per AVX2 vector's worth of 16-bit elements on Intel, and about 4 clocks on AMD. The smallest algorithms I know for this type of thing require lookup tables and/or far more operations and branching, so you'd lose out moving to AVX2 without a dedicated instruction.Innovation
You might be able to build something branchless out of vpcmpgtb to check if any bits above the low 4 are set, then the low 2, then the low 1. Hmm no, you need to be able to produce a unique result for all 8 positions within a byte. (And then something to combine results from pairs of bytes into words). Even with 16x 16-bit elements per vector, scalar at 1 per clock may still come out ahead. Or maybe even better on Ryzen if you can shl eax, 16 / mov ax, cx to merge 2x 16-bit results into a 32-bit result to store both at once. No CPUs with AVX2 rename AX separately from RAX so no stallsJarrell
@user2927848: You can use vpshufb as a 4-bit LUT after unpacking, then use pmaxub to merge results from high/low halves of each byte.Jarrell
Hmm, might have to be vpmaxsb so we can use a negative value to indicate that this nibble is all zero. But yeah, similar to AVX2 popcnt, split into nibbles. Look up how that algorithm works. Except for this we probably want 2 separate LUTs, and we still have to vpaddb for the high bytes of each pair.Jarrell
One option could be to convert uint16 -> int32 -> float and extract the exponent (which needs to be adjusted, of course). Another problem here is handling the 0 case. If you don't use the result afterwards (in a vectorized way), I doubt that this is worth the effort ...Spotty
I need an array of results, one per element.Milan
@GuyB And do you just want to store the results, or do more operations on it? Do you want to store as uint8, uint16, uint32? And what shall be the result for 0 (or will 0 not happen as input)?Spotty
I need to store them as an array of uint8.Milan
#include <immintrin.h>

// Leading-zero count of each 16-bit element (16 for a zero element).
__m256i avx2_lzcnt_epi16(__m256i v) {
    // lut_lo[n] = lzcnt of a byte whose high nibble is 0 and low nibble is n;
    // lut_hi[n] = lzcnt of a byte whose high nibble is n.
    // 16 is a sentinel meaning "this nibble is zero".
    const __m256i lut_lo = _mm256_set_epi8(
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 16,
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 16
    );
    const __m256i lut_hi = _mm256_set_epi8(
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 16,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 16
    );
    const __m256i nibble_mask = _mm256_set1_epi8(0x0F);
    const __m256i byte_offset = _mm256_set1_epi16(0x0008);
    __m256i t;

    // Per-byte lzcnt: look up each nibble and keep the smaller count.
    t = _mm256_and_si256(nibble_mask, v);
    v = _mm256_and_si256(_mm256_srli_epi16(v, 4), nibble_mask);
    t = _mm256_shuffle_epi8(lut_lo, t);
    v = _mm256_shuffle_epi8(lut_hi, v);
    v = _mm256_min_epu8(v, t);

    // Merge the two bytes of each word: OR 8 into the low byte's count
    // (per-byte counts are 0..7 or 16, so this acts as +8), then take the
    // minimum with the high byte's count shifted down into the low byte.
    t = _mm256_srli_epi16(v, 8);
    v = _mm256_or_si256(v, byte_offset);
    v = _mm256_min_epu8(v, t);

    return v;
}

// 16 - lzcnt_u16(subwords): 1-based index of the most significant set bit (0 for a zero element).
__m256i avx2_ms1b_epi16(__m256i v) {
    // The LUTs give the bit index as if the byte were the upper byte of the word:
    // lut_lo covers a low nibble (bits 8..11 -> indices 9..12), lut_hi a high
    // nibble (bits 12..15 -> indices 13..16). 0 marks a zero nibble.
    const __m256i lut_lo = _mm256_set_epi8(
        12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 9, 0,
        12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 9, 0
    );
    const __m256i lut_hi = _mm256_set_epi8(
        16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 13, 0,
        16, 16, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 13, 0
    );
    const __m256i nibble_mask = _mm256_set1_epi8(0x0F);
    const __m256i adj = _mm256_set1_epi16(0x1F08);
    __m256i t;

    // Per-byte result: look up each nibble and keep the larger index.
    t = _mm256_and_si256(nibble_mask, v);
    v = _mm256_and_si256(_mm256_srli_epi16(v, 4), nibble_mask);
    t = _mm256_shuffle_epi8(lut_lo, t);
    v = _mm256_shuffle_epi8(lut_hi, v);
    v = _mm256_max_epu8(v, t);

    // Merge the two bytes of each word: subtract 8 from the low byte's index
    // (and push the high byte negative via 0x1F), then take the signed maximum
    // with the high byte's index shifted down into the low byte.
    t = _mm256_srli_epi16(v, 8);
    v = _mm256_sub_epi8(v, adj);
    v = _mm256_max_epi8(v, t);

    return v;
}

To pack the results into uint8, use _mm256_packs_epi16(). To get the packed results in the correct order, also use _mm256_permute4x64_epi64().
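
For example, a minimal sketch of a driver loop along those lines (the function name and the assumption that n is a multiple of 32 are mine):

#include <stdint.h>
#include <stddef.h>

// Builds on avx2_lzcnt_epi16() above; compile with AVX2 enabled.
void lzcnt_u16_to_u8(const uint16_t *src, uint8_t *dst, size_t n) {
    for (size_t i = 0; i < n; i += 32) {
        __m256i a = avx2_lzcnt_epi16(_mm256_loadu_si256((const __m256i *)(src + i)));
        __m256i b = avx2_lzcnt_epi16(_mm256_loadu_si256((const __m256i *)(src + i + 16)));
        __m256i packed = _mm256_packs_epi16(a, b); // 32x 8-bit counts, but 128-bit lanes interleaved
        packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0)); // restore element order
        _mm256_storeu_si256((__m256i *)(dst + i), packed);
    }
}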

Solution from r/SIMD. This solution was also described in the comments here.

Alidia answered 15/5, 2019 at 15:43 Comment(1)
+1, this is exactly what I was thinking of in comments on the question. (With all the fine details worked out like vpminub which I had wrong, and the LUT values worked out, which is farther than I got in my head.) At 9 total vector ALU instructions per input vector, this should run at about one vector of 16x 16-bit results per 3 cycles on Haswell/Skylake. (With some front-end bandwidth to spare for load/store / loop overhead.) vpacksswb + vpermq will probably bottleneck it on shuffle throughput, but still be much better than scalar 16-bit lzcnt. Also a win on Ryzen.Jarrell

Another possible solution, along the lines of this answer, using the convert-to-float hack. In my tests on Zen 4 it has slightly better performance and uses fewer registers.

__m256i avx2_bit_width_epu16(__m256i v)
{
    const __m256i mask = _mm256_set1_epi32(0x0000FFFF);
    __m256i t = _mm256_and_si256(mask, v); // even-index words, zero-extended to 32 bits
    v = _mm256_srli_epi32(v, 16); // odd-index words; the shifted values are exactly representable, so no rounding

    t = _mm256_castps_si256(_mm256_cvtepi32_ps(t));
    v = _mm256_castps_si256(_mm256_cvtepi32_ps(v)); // int -> float: the exponent now encodes the bit width

    t = _mm256_alignr_epi8(t, t, 2); // rotate each lane so the even results' exponent halves land in the even word slots
    v = _mm256_blend_epi16(t, v, 0b10101010); // merge: even words from t, odd words from v

    v = _mm256_srli_epi16(v, 23 - 16); // shift the exponent down to the low 8 bits
    v = _mm256_sub_epi16(v, _mm256_set1_epi16(126)); // undo the bias: biased exponent - 126 = bit width
    v = _mm256_max_epi16(v, _mm256_set1_epi16(0)); // a zero input gives -126; clamp it to 0

    return v;
}

UPD: updated to handle large values (1 << 15 and above); it now gives the correct result of 16.
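
A quick exhaustive sanity check along these lines (a minimal sketch; it assumes avx2_bit_width_epu16() from above and a scalar _lzcnt_u32 reference, compiled with AVX2 and LZCNT enabled):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t in[16], out[16];
    for (uint32_t base = 0; base <= 0xFFFF; base += 16) {
        for (int j = 0; j < 16; j++) in[j] = (uint16_t)(base + j);
        __m256i v = _mm256_loadu_si256((const __m256i *)in);
        _mm256_storeu_si256((__m256i *)out, avx2_bit_width_epu16(v));
        for (int j = 0; j < 16; j++) {
            uint16_t ref = (uint16_t)(32 - _lzcnt_u32(in[j])); // bit width; 0 for input 0
            if (out[j] != ref) {
                printf("mismatch at %u: got %u, want %u\n",
                       (unsigned)in[j], (unsigned)out[j], (unsigned)ref);
                return 1;
            }
        }
    }
    puts("all 65536 inputs OK");
    return 0;
}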

Linnet answered 2/3 at 16:6 Comment(2)
The max at the end can be avoided if you use subs_epu16 (unsigned-saturating subtract). That should be faster on Intel, and better throughput but equal latency on Zen 4. godbolt.org/z/8xzz1rE5K . You can also use _mm256_bsrli_epi128(t, 2) instead of _mm256_alignr_epi8(t,t,2) - simpler instruction, can run on more ports on Intel Ice Lake and later, and smaller machine-code size (2-byte VEX). Or 2 shifts that can run in parallel to feed the blend instead of shuffle -> blend -> shift, although in a loop on Intel that might bottleneck on ports 0 / 1 (also vcvtdq2ps.)Jarrell
My Godbolt link has a test main that checks it against 16 - std::countl_zero( (uint16_t)i ) for every i from 0 .. 0xFFFF, for the version with 2 shifts feeding a blend and vpsubusw instead of vpmaxuw.Jarrell
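
A minimal sketch of that variant, under my reading of the comment (the linked Godbolt has the commenter's actual version, so treat this as an assumption):

// Two parallel shifts feed the blend, and a saturating subtract replaces sub + max.
__m256i avx2_bit_width_epu16_v2(__m256i v)
{
    const __m256i mask = _mm256_set1_epi32(0x0000FFFF);
    __m256i t = _mm256_and_si256(mask, v); // even-index words
    v = _mm256_srli_epi32(v, 16); // odd-index words

    t = _mm256_castps_si256(_mm256_cvtepi32_ps(t));
    v = _mm256_castps_si256(_mm256_cvtepi32_ps(v));

    t = _mm256_srli_epi32(t, 23); // even words: exponent -> bits 0..7 of each dword
    v = _mm256_srli_epi32(v, 23 - 16); // odd words: exponent -> bits 16..23
    v = _mm256_blend_epi16(t, v, 0b10101010); // even words from t, odd words from v

    // vpsubusw: subtracting the bias saturates to 0 for a zero input, so no separate clamp
    v = _mm256_subs_epu16(v, _mm256_set1_epi16(126));
    return v;
}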
