The SIMD BSF strategy you picked is not efficient; taking advantage of other primitive operations that the CPU can do as a single instruction is better. Even a best-case implementation of that strategy needs a lot of different mask constants and many instructions per vector.
Your choice to implement *2 with _mm_mullo_epi16 instead of _mm_slli_epi16 by 1 (or _mm_add_epi16(same,same)) is particularly unfortunate. Some compilers will optimize the mullo by a constant into an add for you, but others like MSVC and ICC take intrinsics fairly literally and will actually use a hardware multiply, with its relatively high latency. And even with that fixed, the whole strategy still takes many more instructions than necessary.
There are multiple good strategies, with the best choice depending on the SIMD element width and the level of ISA extension available (many require SSSE3 for pshufb). And some micro-optimizations in the implementation details may depend on Intel vs. AMD, or on microarchitecture differences between generations from the same vendor.
With AVX-512 vpopcntb/w/d/q available: vpopcnt(~v & (v-1)) (vpadd by -1 / vpandn / vpopcnt), i.e. make a mask up to but not including the lowest set bit, and popcount it. ~v & (v-1) gives all-ones for an input of zero, so this produces all 17 possible output values for a 16-bit input, not needing any fixup to fully work for 0.
3 instructions, two of them very cheap. (And vpopcnt is cheap on CPUs that support it: Ice Lake and later, except Alder Lake, plus Zen 4. The d/q versions need AVX-512 VPOPCNTDQ; the b/w versions need AVX-512 BITALG.) Clang vectorizes __tzcnt_u16 this way if you use it in a loop.
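As a concrete illustration, a minimal intrinsics sketch of that 3-instruction sequence for 16-bit elements, assuming AVX-512BW plus AVX-512 BITALG (the function name is just for this example):

#include <immintrin.h>

// tzcnt of each 16-bit element: popcount the bits below the lowest set bit.
// Gives 16 for an all-zero element, with no fixup needed.
__m512i tzcnt_epu16_bitalg(__m512i v)
{
    __m512i vm1  = _mm512_sub_epi16(v, _mm512_set1_epi16(1));   // v - 1
    __m512i mask = _mm512_andnot_si512(v, vm1);                 // ~v & (v-1)
    return _mm512_popcnt_epi16(mask);                           // vpopcntw
}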
Note that v ^ (v-1), which makes a mask up to and including the lowest set bit like scalar blsmsk, would count one too many, and couldn't distinguish 0 from 0x8000: both produce 0xffff.
32 or 64-bit elements with AVX-512: vplzcntd/q is always available (all AVX-512 CPUs have AVX-512CD). tzcnt = 31 - lzcnt(v & -v) for non-zero inputs; an all-zero element gives -1. (So one final vpminud(tz, set1(32)) would clamp that UINT_MAX to 32 if you need that.)
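A similar minimal sketch for 32-bit elements, assuming only baseline AVX-512F + CD; the final clamp is the optional fixup just described (function name illustrative):

#include <immintrin.h>

__m512i tzcnt_epu32_lzcnt(__m512i v)
{
    __m512i blsi = _mm512_and_si512(v, _mm512_sub_epi32(_mm512_setzero_si512(), v)); // v & -v
    __m512i tz   = _mm512_sub_epi32(_mm512_set1_epi32(31), _mm512_lzcnt_epi32(blsi)); // 31 - lzcnt
    return _mm512_min_epu32(tz, _mm512_set1_epi32(32));   // clamp UINT_MAX (from v==0) to 32
}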
16-bit elements with SSSE3: a De Bruijn sequence multiply to generate a 4-bit value for a pshufb LUT is excellent, especially if you don't care about the input=0 case. This strategy doesn't work for 32 or 64-bit elements without AVX-512 VBMI vpermb for a wider LUT, in which case you'd normally also have vpopcnt.
5 single-uop instructions per vector (with AVX), 2 vector constants. (Or 7 or 8 instructions if you want full tzcnt behaviour, producing 16 for input=0; slightly cheaper if -1 is ok for that case.) pmullw (_mm_mullo_epi16) is single-uop on modern CPUs, unlike pmulld.

I think this strategy is better than aqrit's clever strategy of combining pshufb results with pminub (9 instructions with gcc or clang).
32-bit elements: @Soonts' FP strategy is very good, especially if you only want to assume SSE2. Converting to FP takes advantage of the hardware that computes an exponent field, and 32-bit is the natural width for packed SIMD int->float conversion. You do have to deal with the sign bit being set if the input had its MSB set, i.e. an extra and instruction after shifting the exponent down.
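A rough SSE2-only sketch of that idea (not Soonts' exact code; the function name and the lack of a fixup for v==0 are just for illustration):

#include <emmintrin.h>

__m128i tzcnt_epu32_fp(__m128i v)
{
    __m128i blsi = _mm_and_si128(v, _mm_sub_epi32(_mm_setzero_si128(), v)); // v & -v, a power of 2 (or 0)
    __m128  f    = _mm_cvtepi32_ps(blsi);                    // the exponent now encodes the bit position
    __m128i exp  = _mm_srli_epi32(_mm_castps_si128(f), 23);  // shift the exponent field down
    exp = _mm_and_si128(exp, _mm_set1_epi32(0xff));          // clear the sign bit for inputs with the MSB set
    return _mm_sub_epi32(exp, _mm_set1_epi32(127));          // unbias; note v==0 gives -127 here, needing a fixup if you care
}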
@aqrit's strategy of using 2x pshufb as a 4-bit LUT for each nibble of the original integer is interesting, too, but I think it needs an extra merging step vs. @Soonts', which doesn't have to split into low/high halves and merge.
@aqrit's SSE2-only strategy with _mm_avg_epu16(r, _mm_cmpeq_epi16(_mm_and_si128(x3333, v), x0000)) and so on looks slower than the FP strategy, especially for 32-bit where it would take more work; the FP strategy takes less work per vector.
64-bit elements: packed 64-bit integer -> FP conversion isn't available until AVX-512, and Skylake-X has AVX-512 but not AVX-512 VPOPCNTDQ. Even without direct support for SIMD popcount, the popcnt(~v & (v-1)) idea is probably still good: emulated SIMD popcount is a well-known technique, e.g. splitting into low/high nibbles for 2x vpshufb as a 4-bit LUT, then _mm_add_epi8 of those high/low halves and psadbw against 0 to sum bytes within qword chunks.
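A sketch of how those pieces fit together for 64-bit elements with just SSE2 + SSSE3 (the nibble-popcount LUT is the standard one; the function name is made up):

#include <tmmintrin.h>

__m128i tzcnt_epu64_nibble_popcnt(__m128i v)
{
    __m128i mask = _mm_andnot_si128(v, _mm_sub_epi64(v, _mm_set1_epi64x(1)));  // ~v & (v-1)

    const __m128i lut = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);     // popcount of each nibble
    const __m128i lo4 = _mm_set1_epi8(0x0f);
    __m128i lo  = _mm_shuffle_epi8(lut, _mm_and_si128(mask, lo4));
    __m128i hi  = _mm_shuffle_epi8(lut, _mm_and_si128(_mm_srli_epi16(mask, 4), lo4));
    __m128i cnt = _mm_add_epi8(lo, hi);                   // per-byte popcount
    return _mm_sad_epu8(cnt, _mm_setzero_si128());        // sum bytes within each 64-bit element; 64 for v==0
}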
(This is basically how clang auto-vectorizes sum += __tzcnt_u16(arr[i]) even without -march=icelake-client, but with some wasted shuffles and inefficient summing.)
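That is, a scalar reduction loop along these lines (the function and array names are just for this example; compile with BMI enabled, e.g. -mbmi or -march=haswell):

#include <immintrin.h>
#include <stddef.h>

unsigned sum_tzcnt16(const unsigned short *arr, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += __tzcnt_u16(arr[i]);   // clang can auto-vectorize this loop
    return sum;
}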
BSF for 16-bit elements with SSSE3
An answer on Position of least significant bit that is set can be adapted to 16-bit, and the 16-entry lookup table of 8-bit values can then be vectorized with SSSE3 pshufb.

A De Bruijn sequence has every 4-bit bit-pattern somewhere in it, overlapping. Multiplying it by a power of 2 (a single set bit) shifts one of those 4-bit patterns into the top n bits, and a right shift by type_width - n brings it to the bottom. So we get a 4-bit value at the bottom of a byte, ready for use as a LUT index.
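As a scalar illustration of the same constant and table used in the vector version below (assuming the input has already been reduced to a single set bit with v &= -v):

static const unsigned char debruijn_tab[16] = {
     0,  1,  2,  5,  3,  9,  6, 11,
    15,  4,  8, 10, 14,  7, 13, 12
};

unsigned bsf16_debruijn_scalar(unsigned short pow2)
{
    // 0x09AF is a 16-bit De Bruijn sequence: multiplying by a power of 2 shifts a
    // unique 4-bit pattern into the top 4 bits, which indexes the table.
    return debruijn_tab[(unsigned short)(pow2 * 0x09AFu) >> 12];
}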
SSE2 pmullw is fast on all modern CPUs, even Alder Lake E-cores: single uop, although latency is 5 cycles on P-cores (Haswell/Skylake/Ice Lake). But since SKL it has 2/clock throughput, running on port 0 or 1. Also fast on Zen 2, for example: 1/clock throughput, 3 cycle latency (https://uops.info/).

SIMD integer shifts (psrlw) compete for the same ports as pmullw, but fortunately that 2/clock throughput should be enough to avoid a bottleneck. pshufb runs on port 5 on Intel, not competing with shift / pmul.
#include <immintrin.h>   // SSE2 / SSSE3 intrinsics

__m128i bsf_epi16_debruijn(__m128i v)
{
    const __m128i debruijn_magic = _mm_set1_epi16( 0x09AF );
    const __m128i bit_table = _mm_setr_epi8(  0,  1,  2,  5,  3,  9,  6, 11,
                                             15,  4,  8, 10, 14,  7, 13, 12 );

    __m128i blsi = _mm_sub_epi16(_mm_setzero_si128(), v);
    blsi = _mm_and_si128(blsi, v);        // v &= -v;  a power of 2, so multiplying by it is like a shift
    __m128i idx = _mm_mullo_epi16(blsi, debruijn_magic);
    idx = _mm_srli_epi16(idx, 12);        // leaving a 4-bit index from the selected position in the De Bruijn sequence
        // maybe TODO: avoid the shift with PMULHW, with a De Bruijn sequence and table crafted to use the bits "shifted" into the high half?
        // But then we'd need to mask before pshufb without AVX-512VBMI vpermb xmm
        // And if we have that (Ice Lake) we normally have AVX-512 BITALG for vpopcntw(~v & (v-1)), i.e. vpopcntw(pandn(v, v-1))  (vpaddw / vpandn)
    __m128i bsf = _mm_shuffle_epi8(bit_table, idx);   // the high half of each word looks up to 0, so no fixup needed

    // input = 0 produces output = 0, same as input = 1, unless we fix up the result
#if 1
    // optional: produce -1 or 16 for input==0, instead of 0
    __m128i was_zero = _mm_cmpeq_epi16(v, _mm_setzero_si128());
    //bsf = _mm_blendv_epi8(bsf, _mm_set1_epi16(16), was_zero);  // single-uop on AMD, 2 uops on Intel; 3 on Alder Lake P-cores and 4 on E-cores.  Single uop for the legacy-SSE version, though.
    // was_zero = _mm_and_si128(was_zero, _mm_set1_epi16(16));
    bsf = _mm_or_si128(bsf, was_zero);    // return -1 for v==0.  (Or, with the &16 uncommented, return 16.)
    // alternative: bsf = _mm_sub_epi16(bsf, _mm_slli_epi16(was_zero, 4));  // subtract (-1<<4) or 0.  Avoids a constant.
#endif
    return bsf;
}
I generated the 16-bit De Bruijn sequence and lookup table using the program from https://sites.google.com/site/sydfhd/articles-tutorials/de-bruijn-sequence-generator, with the compile error fixed by commenting out the 2 lines with an if involving is_mulshift, as that's not defined in the program. Also g++ -O2 -fpermissive to silence other warnings.
On Godbolt: this, the original, and (my tweak to) Soonts' answer, plus aqrit's answers; also a scalar loop that clang auto-vectorizes.
bsf_epi16_debruijn(long long __vector(2)):         # @bsf_epi16_debruijn(long long __vector(2))
        vpxor    xmm1, xmm1, xmm1                  # constant can be hoisted out of loops
        vpsubw   xmm2, xmm1, xmm0
        vpand    xmm2, xmm2, xmm0
        vpmullw  xmm2, xmm2, xmmword ptr [rip + .LCPI5_0]
        vpsrlw   xmm2, xmm2, 12
        vmovdqa  xmm3, xmmword ptr [rip + .LCPI5_1] # xmm3 = [0,1,2,5,3,9,6,11,15,4,8,10,14,7,13,12]
        vpshufb  xmm2, xmm3, xmm2
        vpcmpeqw xmm0, xmm0, xmm1                  # fixup for v==0
        vpor     xmm0, xmm2, xmm0                  # fixup for v==0
        ret
So not counting instructions that set registers to a constant (since those can get hoisted out of loops, with AVX allowing non-destructive use of them), this is 5 instructions for the main work: two for the multiply/shift ports, two simple integer ops that can run on any port, and one shuffle that Intel CPUs only run on port 5.
And 2 more instructions for this fixup strategy, which gives -1 for elements that were 0, instead of the output of 0 you'd get without a fixup. (That's why we can just OR instead of using vpblendvb, even if we want to set it to 16 rather than -1: the LUT already produces 0 for an input of 0, so ORing in 16 gives 16. For the -1 case, -1 | anything == -1, so that would work even if the LUT didn't produce 0 for an input of 0.)
This trivially widens to 256-bit vectors (AVX2) or 512-bit (AVX-512BW). I haven't tried writing it scalar to see if GCC or clang will auto-vectorize the shift and LUT lookup; I'm not optimistic but wouldn't rule it out.
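For example, a 256-bit version might look like this (an untested sketch assuming AVX2; vpshufb ymm does two in-lane 16-byte lookups, which works here because the LUT is broadcast into both lanes and the indices are only 0..15):

#include <immintrin.h>

__m256i bsf_epi16_debruijn_avx2(__m256i v)
{
    const __m256i debruijn_magic = _mm256_set1_epi16(0x09AF);
    const __m256i bit_table = _mm256_broadcastsi128_si256(
        _mm_setr_epi8(  0,  1,  2,  5,  3,  9,  6, 11,
                       15,  4,  8, 10, 14,  7, 13, 12 ));

    __m256i blsi = _mm256_and_si256(v, _mm256_sub_epi16(_mm256_setzero_si256(), v)); // v & -v
    __m256i idx  = _mm256_srli_epi16(_mm256_mullo_epi16(blsi, debruijn_magic), 12);
    __m256i bsf  = _mm256_shuffle_epi8(bit_table, idx);
    __m256i was_zero = _mm256_cmpeq_epi16(v, _mm256_setzero_si256());
    return _mm256_or_si256(bsf, was_zero);   // -1 for v==0, same fixup as the 128-bit version
}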
"There is no BSF instruction for 16-bit ints on x86."

Incorrect: bsf allows operand-sizes of 16, 32, or 64-bit, and so does BMI1 tzcnt. Intrinsics and builtins for BSF are not well standardized across compilers (and AFAIK there aren't intrinsics for 16-bit bsf), but Intel does document _tzcnt_u16. GCC only supports __tzcnt_u16 (two leading underscores), not Intel's name, but clang supports both names (one and two underscores).
That's fine; bsf with a zero input produces a garbage value (intrinsics for it don't expose the asm behaviour of leaving the destination register unmodified, behaviour which AMD documents but both Intel and AMD implement). And for non-zero 16-bit inputs, the bits above the low 16 don't affect the value. So having a 16-bit bsf wouldn't help, but a 16-bit tzcnt does let you get a 16 when the input is zero, without having to do _tzcnt_u32(0x10000 | x) to let a 32-bit tzcnt find a set bit at the position you want.
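In scalar code that trick looks like this (the function name is just for illustration; needs BMI1, e.g. -mbmi):

#include <immintrin.h>

unsigned tz16_via_u32(unsigned short x)
{
    // Setting bit 16 as a sentinel makes a 32-bit tzcnt return 16 when x == 0,
    // matching what a 16-bit tzcnt would give.
    return _tzcnt_u32(0x10000u | x);
}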
Assorted comments from the thread:

Your return statement is an unreadable mess of nested intrinsics. Intrinsic names are much longer than operators, and are prefix not infix; trying to match the actual formatting of the original is a mistake. Also, _mm_cmpneq_epi16 isn't a real intrinsic; it has to invert a cmpeq, so you should try to optimize that, working with 0/1 instead of 0/-1 rather than doing abs separately, e.g. andn(cmp, set1(1)), or with set1(2) for the first one to avoid shifting. Also, mullo_epi16 is not a fast way to double an integer! Shift by 1, or add it to itself. – Unarm

pshufb as a nibble LUT could at least shortcut the 0xaaaaaaaa and 0xcccccccc steps, although that might mean shifting and masking both ways and doing two pshufb per input vector. AVX-512 has SIMD vplzcntd/q, so a bit-reverse (with some vpshufb as nibble LUT and a byte shuffle) would be best if you can use that. For 16-bit chunks, I guess you'd want to unpack to 32-bit as you reverse for vplzcntd and re-pack. – Unarm

MultiplyDeBruijnBitPosition for pshufb (a 4-bit LUT of byte values). Conveniently, the odd bytes would already be 0, thus look up to 0. – Unarm

v &= -v. Then convert to float and shift the exponent field down to an integer, and unbias it. (Powers of 2 convert exactly to float; INT_MIN has the same magnitude as unsigned, but the sign bit is set so you'd have to mask.) Unfortunately there's no packed int16 -> fp16 until AVX-512 FP16 or BF16, so you'd have to unpack to 32-bit. So the De Bruijn sequence method is probably better for uint16_t, but the FP bithack might win for uint32_t, where a 4-bit LUT of bytes doesn't work for the 32 possible results. – Unarm

Consider just using tzcnt instead of debugging / optimizing this attempt. If you want it to still be a debugging question, rewrite the code to be readable and do some actual debugging (intermediate values), not just looking at the final results. Agreed, though, that it's probably not worth pursuing overall, except as a code review and an exercise in vectorizing the pieces of this algorithm. (A compiler can probably auto-vectorize, BTW.) – Unarm

bsf and tzcnt support 16, 32, and 64-bit. Maybe you're only looking at intrinsics? And of course you can just zero-extend into a 32-bit register and use bsf r32, r32; the result of BSF is only well-defined for non-zero inputs, so high zeros don't matter. You can even emulate 16-bit tzcnt using 32-bit bsf, as bts reg, 16 / bsf reg, reg, to get 16 if the low 16 bits are all zero. – Unarm