The SIMD BSF strategy you picked is not efficient; taking advantage of other primitive operations that the CPU can do as a single instruction is better. Even a best-case implementation of that strategy needs a lot of different mask constants and many instructions per vector.
Your choice to implement *2 with _mm_mullo_epi16 instead of _mm_slli_epi16 by 1 (or _mm_add_epi16(same,same)) is particularly unfortunate. Some compilers will optimize the mullo by a constant into an add for you, but others like MSVC and ICC take intrinsics fairly literally and will actually use a hardware multiply, with its relatively high latency. And even with that fixed, the whole strategy still takes many more instructions than necessary.
There are multiple good strategies, with the best choice depending on the SIMD element width and the level of ISA extension available (many require SSSE3 for pshufb). And some micro-optimizations in the implementation details may depend on Intel vs. AMD, or on microarchitecture differences between generations from the same vendor.
With AVX-512 vpopcntb/w/d/q available: vpopcnt(~v & (v-1)) (vpadd by -1 / vpandn / vpopcnt), i.e. make a mask up to but not including the lowest set bit, and popcount it. ~v & (v-1) gives all-ones for an input of zero, so this produces all 17 possible output values for a 16-bit input, not needing any fixup to fully work for 0.
3 instructions, two of them very cheap. (And vpopcnt is cheap on CPUs that support it: Ice Lake and later, except Alder Lake, plus Zen 4. The d/q versions need AVX-512 VPOPCNTDQ; the b/w versions need AVX-512 BITALG.) Clang vectorizes __tzcnt_u16 this way if you use it in a loop.
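As a concrete illustration, a minimal intrinsics sketch of that 3-instruction sequence for 16-bit elements, assuming AVX-512BW plus AVX-512 BITALG (the function name is just for this example):

#include <immintrin.h>

// tzcnt of each 16-bit element: popcount the bits below the lowest set bit.
// Gives 16 for an all-zero element, with no fixup needed.
__m512i tzcnt_epu16_bitalg(__m512i v)
{
    __m512i vm1  = _mm512_sub_epi16(v, _mm512_set1_epi16(1));   // v - 1
    __m512i mask = _mm512_andnot_si512(v, vm1);                 // ~v & (v-1)
    return _mm512_popcnt_epi16(mask);                           // vpopcntw
}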
Note that v ^ (v-1), which makes a mask up to and including the lowest set bit like scalar blsmsk, would count one too many, and couldn't distinguish 0 from 0x8000: both produce 0xffff.
32 or 64-bit elements with AVX-512: vplzcntd/q is always available (all AVX-512 CPUs have AVX-512CD). tzcnt = 31 - lzcnt(v & -v) for non-zero inputs; an all-zero element gives -1. (So one final vpminud(tz, set1(32)) would clamp that UINT_MAX to 32 if you need that.)
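A similar minimal sketch for 32-bit elements, assuming only baseline AVX-512F + CD; the final clamp is the optional fixup just described (function name illustrative):

#include <immintrin.h>

__m512i tzcnt_epu32_lzcnt(__m512i v)
{
    __m512i blsi = _mm512_and_si512(v, _mm512_sub_epi32(_mm512_setzero_si512(), v)); // v & -v
    __m512i tz   = _mm512_sub_epi32(_mm512_set1_epi32(31), _mm512_lzcnt_epi32(blsi)); // 31 - lzcnt
    return _mm512_min_epu32(tz, _mm512_set1_epi32(32));   // clamp UINT_MAX (from v==0) to 32
}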
16-bit elements with SSSE3: a De Bruijn sequence multiply to generate a 4-bit value for a pshufb LUT is excellent, especially if you don't care about the input=0 case. This strategy doesn't work for 32 or 64-bit elements without AVX-512 VBMI vpermb for a wider LUT, in which case you'd normally also have vpopcnt.
5 single-uop instructions per vector (with AVX), 2 vector constants. (Or 7 or 8 instructions if you want full tzcnt behaviour, producing 16 for input=0; slightly cheaper if -1 is ok for that case.) pmullw (_mm_mullo_epi16) is single-uop on modern CPUs, unlike pmulld.

I think this strategy is better than aqrit's clever strategy of combining pshufb results with pminub (9 instructions with gcc or clang).
32-bit elements: @Soonts' FP strategy is very good, especially if you only want to assume SSE2. Converting to FP takes advantage of the hardware that computes an exponent field, and 32-bit is the natural width for packed SIMD int->float conversion. You do have to deal with the sign bit being set if the input had its MSB set, i.e. an extra and instruction after shifting the exponent down.
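A rough SSE2-only sketch of that idea (not Soonts' exact code; the function name and the lack of a fixup for v==0 are just for illustration):

#include <emmintrin.h>

__m128i tzcnt_epu32_fp(__m128i v)
{
    __m128i blsi = _mm_and_si128(v, _mm_sub_epi32(_mm_setzero_si128(), v)); // v & -v, a power of 2 (or 0)
    __m128  f    = _mm_cvtepi32_ps(blsi);                    // the exponent now encodes the bit position
    __m128i exp  = _mm_srli_epi32(_mm_castps_si128(f), 23);  // shift the exponent field down
    exp = _mm_and_si128(exp, _mm_set1_epi32(0xff));          // clear the sign bit for inputs with the MSB set
    return _mm_sub_epi32(exp, _mm_set1_epi32(127));          // unbias; note v==0 gives -127 here, needing a fixup if you care
}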
@aqrit's strategy of using 2x pshufb as a 4-bit LUT for each nibble of the original integer is interesting, too, but I think it needs an extra merging step vs. @Soonts', which doesn't have to split into low/high halves and merge.
@aqrit's SSE2-only strategy with _mm_avg_epu16(r, _mm_cmpeq_epi16(_mm_and_si128(x3333, v), x0000)) and so on looks slower than the FP strategy, especially for 32-bit where it would take more work; the FP strategy takes less work per vector.
64-bit elements: packed 64-bit integer -> FP conversion isn't available until AVX-512, and Skylake-X has AVX-512 but not AVX-512 VPOPCNTDQ. Even without direct support for SIMD popcount, the popcnt(~v & (v-1)) idea is probably still good: emulated SIMD popcount is a well-known technique, e.g. splitting into low/high nibbles for 2x vpshufb as a 4-bit LUT, then _mm_add_epi8 of those high/low halves and psadbw against 0 to sum bytes within qword chunks.
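A sketch of how those pieces fit together for 64-bit elements with just SSE2 + SSSE3 (the nibble-popcount LUT is the standard one; the function name is made up):

#include <tmmintrin.h>

__m128i tzcnt_epu64_nibble_popcnt(__m128i v)
{
    __m128i mask = _mm_andnot_si128(v, _mm_sub_epi64(v, _mm_set1_epi64x(1)));  // ~v & (v-1)

    const __m128i lut = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);     // popcount of each nibble
    const __m128i lo4 = _mm_set1_epi8(0x0f);
    __m128i lo  = _mm_shuffle_epi8(lut, _mm_and_si128(mask, lo4));
    __m128i hi  = _mm_shuffle_epi8(lut, _mm_and_si128(_mm_srli_epi16(mask, 4), lo4));
    __m128i cnt = _mm_add_epi8(lo, hi);                   // per-byte popcount
    return _mm_sad_epu8(cnt, _mm_setzero_si128());        // sum bytes within each 64-bit element; 64 for v==0
}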
(This is basically how clang auto-vectorizes sum += __tzcnt_u16(arr[i]) even without -march=icelake-client, but with some wasted shuffles and inefficient summing.)
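That is, a scalar reduction loop along these lines (the function and array names are just for this example; compile with BMI enabled, e.g. -mbmi or -march=haswell):

#include <immintrin.h>
#include <stddef.h>

unsigned sum_tzcnt16(const unsigned short *arr, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += __tzcnt_u16(arr[i]);   // clang can auto-vectorize this loop
    return sum;
}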
BSF for 16-bit elements with SSSE3
An answer on Position of least significant bit that is set can be adapted to 16-bit, and the 16-entry lookup table of 8-bit values can then be vectorized with SSSE3 pshufb.

A De Bruijn sequence has every 4-bit bit-pattern somewhere in it, overlapping. Multiplying it by a power of 2 (a single set bit) shifts one of those 4-bit patterns into the top n bits, and a right shift by type_width - n brings it to the bottom. So we get a 4-bit value at the bottom of a byte, ready for use as a LUT index.
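As a scalar illustration of the same constant and table used in the vector version below (assuming the input has already been reduced to a single set bit with v &= -v):

static const unsigned char debruijn_tab[16] = {
     0,  1,  2,  5,  3,  9,  6, 11,
    15,  4,  8, 10, 14,  7, 13, 12
};

unsigned bsf16_debruijn_scalar(unsigned short pow2)
{
    // 0x09AF is a 16-bit De Bruijn sequence: multiplying by a power of 2 shifts a
    // unique 4-bit pattern into the top 4 bits, which indexes the table.
    return debruijn_tab[(unsigned short)(pow2 * 0x09AFu) >> 12];
}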
SSE2 pmullw is fast on all modern CPUs, even Alder Lake E-cores: single uop, although latency is 5 cycles on P-cores (Haswell/Skylake/Ice Lake). But since SKL it has 2/clock throughput, running on port 0 or 1. Also fast on Zen 2, for example: 1/clock throughput, 3 cycle latency (https://uops.info/).

SIMD integer shifts (psrlw) compete for the same ports as pmullw, but fortunately that 2/clock throughput should be enough to avoid a bottleneck. pshufb runs on port 5 on Intel, not competing with shift / pmul.
#include <immintrin.h>   // SSE2 / SSSE3 intrinsics

__m128i bsf_epi16_debruijn(__m128i v)
{
    const __m128i debruijn_magic = _mm_set1_epi16( 0x09AF );
    const __m128i bit_table = _mm_setr_epi8(  0,  1,  2,  5,  3,  9,  6, 11,
                                             15,  4,  8, 10, 14,  7, 13, 12 );

    __m128i blsi = _mm_sub_epi16(_mm_setzero_si128(), v);
    blsi = _mm_and_si128(blsi, v);        // v &= -v;  a power of 2, so multiplying by it is like a shift
    __m128i idx = _mm_mullo_epi16(blsi, debruijn_magic);
    idx = _mm_srli_epi16(idx, 12);        // leaving a 4-bit index from the selected position in the De Bruijn sequence
        // maybe TODO: avoid the shift with PMULHW, with a De Bruijn sequence and table crafted to use the bits "shifted" into the high half?
        // But then we'd need to mask before pshufb without AVX-512VBMI vpermb xmm
        // And if we have that (Ice Lake) we normally have AVX-512 BITALG for vpopcntw(~v & (v-1)), i.e. vpopcntw(pandn(v, v-1))  (vpaddw / vpandn)
    __m128i bsf = _mm_shuffle_epi8(bit_table, idx);   // the high half of each word looks up to 0, so no fixup needed

    // input = 0 produces output = 0, same as input = 1, unless we fix up the result
#if 1
    // optional: produce -1 or 16 for input==0, instead of 0
    __m128i was_zero = _mm_cmpeq_epi16(v, _mm_setzero_si128());
    //bsf = _mm_blendv_epi8(bsf, _mm_set1_epi16(16), was_zero);  // single-uop on AMD, 2 uops on Intel; 3 on Alder Lake P-cores and 4 on E-cores.  Single uop for the legacy-SSE version, though.
    // was_zero = _mm_and_si128(was_zero, _mm_set1_epi16(16));
    bsf = _mm_or_si128(bsf, was_zero);    // return -1 for v==0.  (Or, with the &16 uncommented, return 16.)
    // alternative: bsf = _mm_sub_epi16(bsf, _mm_slli_epi16(was_zero, 4));  // subtract (-1<<4) or 0.  Avoids a constant.
#endif
    return bsf;
}
I generated the 16-bit De Bruijn sequence and lookup table using the program from https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator, with the compile error fixed by commenting out the 2 lines with an if involving is_mulshift, as that's not defined in the program. Also g++ -O2 -fpermissive to silence other warnings.
On Godbolt: this, the original, and (my tweak to) Soonts' answer, plus aqrit's answers; also a scalar loop that clang auto-vectorizes.
bsf_epi16_debruijn(long long __vector(2)):         # @bsf_epi16_debruijn(long long __vector(2))
        vpxor    xmm1, xmm1, xmm1                  # constant can be hoisted out of loops
        vpsubw   xmm2, xmm1, xmm0
        vpand    xmm2, xmm2, xmm0
        vpmullw  xmm2, xmm2, xmmword ptr [rip + .LCPI5_0]
        vpsrlw   xmm2, xmm2, 12
        vmovdqa  xmm3, xmmword ptr [rip + .LCPI5_1] # xmm3 = [0,1,2,5,3,9,6,11,15,4,8,10,14,7,13,12]
        vpshufb  xmm2, xmm3, xmm2
        vpcmpeqw xmm0, xmm0, xmm1                  # fixup for v==0
        vpor     xmm0, xmm2, xmm0                  # fixup for v==0
        ret
So not counting instructions that set registers to a constant (since those can get hoisted out of loops, with AVX allowing non-destructive use of them), this is 5 instructions for the main work: two for the multiply/shift ports, two simple integer ops that can run on any port, and one shuffle that Intel CPUs only run on port 5.
And 2 more instructions for this fixup strategy, which gives -1 for elements that were 0, instead of the output of 0 you'd get without a fixup. (That's why we can just OR instead of using vpblendvb, even if we want to set it to 16 rather than -1: the LUT already produces 0 for an input of 0, so ORing in 16 gives 16. For the -1 case, -1 | anything == -1, so that would work even if the LUT didn't produce 0 for an input of 0.)
This trivially widens to 256-bit vectors (AVX2) or 512-bit (AVX-512BW). I haven't tried writing it scalar to see if GCC or clang will auto-vectorize the shift and LUT lookup; I'm not optimistic but wouldn't rule it out.
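For example, a 256-bit version might look like this (an untested sketch assuming AVX2; vpshufb ymm does two in-lane 16-byte lookups, which works here because the LUT is broadcast into both lanes and the indices are only 0..15):

#include <immintrin.h>

__m256i bsf_epi16_debruijn_avx2(__m256i v)
{
    const __m256i debruijn_magic = _mm256_set1_epi16(0x09AF);
    const __m256i bit_table = _mm256_broadcastsi128_si256(
        _mm_setr_epi8(  0,  1,  2,  5,  3,  9,  6, 11,
                       15,  4,  8, 10, 14,  7, 13, 12 ));

    __m256i blsi = _mm256_and_si256(v, _mm256_sub_epi16(_mm256_setzero_si256(), v)); // v & -v
    __m256i idx  = _mm256_srli_epi16(_mm256_mullo_epi16(blsi, debruijn_magic), 12);
    __m256i bsf  = _mm256_shuffle_epi8(bit_table, idx);
    __m256i was_zero = _mm256_cmpeq_epi16(v, _mm256_setzero_si256());
    return _mm256_or_si256(bsf, was_zero);   // -1 for v==0, same fixup as the 128-bit version
}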
"There is no BSF instruction for 16-bit ints on x86."

Incorrect: bsf allows operand-sizes of 16, 32, or 64-bit, and so does BMI1 tzcnt. Intrinsics and builtins for BSF are not well standardized across compilers (and AFAIK there aren't intrinsics for 16-bit bsf), but Intel does document _tzcnt_u16. GCC only supports __tzcnt_u16 (two leading underscores), not Intel's name, but clang supports both names (one and two underscores).
That's fine; bsf with a zero input produces a garbage value (intrinsics for it don't expose the asm behaviour of leaving the destination register unmodified, behaviour which AMD documents but both Intel and AMD implement). And for non-zero 16-bit inputs, the bits above the low 16 don't affect the value. So having a 16-bit bsf wouldn't help, but a 16-bit tzcnt does let you get a 16 when the input is zero, without having to do _tzcnt_u32(0x10000 | x) to let a 32-bit tzcnt find a set bit at the position you want.
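In scalar code that trick looks like this (the function name is just for illustration; needs BMI1, e.g. -mbmi):

#include <immintrin.h>

unsigned tz16_via_u32(unsigned short x)
{
    // Setting bit 16 as a sentinel makes a 32-bit tzcnt return 16 when x == 0,
    // matching what a 16-bit tzcnt would give.
    return _tzcnt_u32(0x10000u | x);
}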
Assorted comments from the thread:

Your return statement is an unreadable mess of nested intrinsics. Intrinsic names are much longer than operators, and are prefix not infix; trying to match the actual formatting of the original is a mistake. Also, _mm_cmpneq_epi16 isn't a real intrinsic; it has to invert a cmpeq, so you should try to optimize that, working with 0/1 instead of 0/-1 rather than doing abs separately, e.g. andn(cmp, set1(1)), or with set1(2) for the first one to avoid shifting. Also, mullo_epi16 is not a fast way to double an integer! Shift by 1, or add it to itself. – Unarm

pshufb as a nibble LUT could at least shortcut the 0xaaaaaaaa and 0xcccccccc steps, although that might mean shifting and masking both ways and doing two pshufb per input vector. AVX-512 has SIMD vplzcntd/q, so a bit-reverse (with some vpshufb as nibble LUT and a byte shuffle) would be best if you can use that. For 16-bit chunks, I guess you'd want to unpack to 32-bit as you reverse for vplzcntd and re-pack. – Unarm

MultiplyDeBruijnBitPosition for pshufb (a 4-bit LUT of byte values). Conveniently, the odd bytes would already be 0, thus look up to 0. – Unarm

v &= -v. Then convert to float and shift the exponent field down to an integer, and unbias it. (Powers of 2 convert exactly to float; INT_MIN has the same magnitude as unsigned, but the sign bit is set so you'd have to mask.) Unfortunately there's no packed int16 -> fp16 until AVX-512 FP16 or BF16, so you'd have to unpack to 32-bit. So the De Bruijn sequence method is probably better for uint16_t, but the FP bithack might win for uint32_t, where a 4-bit LUT of bytes doesn't work for the 32 possible results. – Unarm

Consider just using tzcnt instead of debugging / optimizing this attempt. If you want it to still be a debugging question, rewrite the code to be readable and do some actual debugging (intermediate values), not just looking at the final results. Agreed, though, that it's probably not worth pursuing overall, except as a code review and an exercise in vectorizing the pieces of this algorithm. (A compiler can probably auto-vectorize, BTW.) – Unarm

bsf and tzcnt support 16, 32, and 64-bit. Maybe you're only looking at intrinsics? And of course you can just zero-extend into a 32-bit register and use bsf r32, r32; the result of BSF is only well-defined for non-zero inputs, so high zeros don't matter. You can even emulate 16-bit tzcnt using 32-bit bsf, as bts reg, 16 / bsf reg, reg, to get 16 if the low 16 bits are all zero. – Unarm