(Work in progress, power just went out here; posting now to save my work.)
Crusty old x86 CPUs have very slow C++20 std::countl_zero / GNU C __builtin_clz. (386 bsr = Bit Scan Reverse actually finds the position of the highest set bit, like 31-clz, and is weird for an input of 0, so you need to branch on that.) For CPUs before Pentium Pro / Pentium II, Harold's answer is your best bet, generating a mask directly instead of a count.
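As a minimal sketch of that zero-input problem (the helper name clz32_safe is mine, not from any library): when the underlying primitive is bsr-flavoured, the zero case needs its own branch, because __builtin_clz, like bsr, is not defined for an input of 0.

#include <cstdint>

// Sketch only: a zero-safe 32-bit leading-zero count built on a bsr-style bit-scan.
unsigned clz32_safe(uint32_t x){
    if (x == 0)
        return 32;              // bsr leaves its destination unmodified for 0, so branch
    return __builtin_clz(x);    // compiles to bsr (with the 31-clz flip) or lzcnt
}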
(Before 386, shifting by large counts might be better done with partial-register shenanigans like mov al, ah / mov ah, 0 instead of shr ax, 8, since 286 and earlier didn't have a barrel shifter for constant-time shifts. But in C++, that's something for the compiler to figure out. Shift by 16 is free, since a 32-bit integer can only be kept in a pair of 16-bit registers on 286 or earlier.)
8086 to 286: no instruction available.
386: bsf / bsr: 10+3n cycles. Worst case: 10+3*31 = 103 cycles.
486: bsf (16 or 32-bit registers): 6-42 cycles; bsr: 7-104 cycles (1 cycle less for 16-bit regs).
P5 Pentium: bsf: 6-42 cycles (6-34 for 16-bit); bsr: 7-71 cycles (or 7-39 for 16-bit). Non-pairable.
Intel P6 and later: bsf / bsr: 1 uop with 1 cycle throughput, 3 cycle latency. (PPro / PII and later.)
AMD K7/K8/K10/Bulldozer/Zen: bsf / bsr are slowish for a modern CPU, e.g. K10: 3 cycle throughput, 4 cycle latency, 6 / 7 m-ops respectively.
Intel Haswell / AMD K10: lzcnt introduced (as part of BMI1 for Intel, or with its own feature bit for AMD, before tzcnt and the rest of BMI1).
For an input of 0, they return the operand-size, so they fully implement C++20 std::countl_zero / countr_zero respectively, unlike bsr / bsf, which leave the destination unmodified on input=0. AMD documents this; Intel implements it in practice on current CPUs at least, but documents the destination register as having "undefined" contents. Perhaps some older Intel CPUs are different; otherwise it's just annoying that they don't document the behaviour so software can take advantage of it.
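A tiny illustration (mine, not part of the answer) of that defined-for-zero behaviour in the C++20 functions, which matches lzcnt / tzcnt semantics rather than bsr / bsf:

#include <bit>
#include <cstdint>

// countl_zero / countr_zero return the operand width for an input of 0.
static_assert(std::countl_zero(uint32_t{0}) == 32);
static_assert(std::countr_zero(uint32_t{0}) == 32);
static_assert(std::countl_zero(uint32_t{1}) == 31);
static_assert(std::countr_zero(uint32_t{8}) == 3);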
On AMD, they're fast: a single uop for lzcnt, with tzcnt taking one more (probably a bit-reverse to feed the lzcnt execution unit), so a nice win vs. bsf / bsr. This is why compilers typically use rep bsf for countr_zero / __builtin_ctz, so it will run as tzcnt on CPUs that support it, but as bsf on older CPUs. They produce the same results for non-zero inputs, unlike bsr / lzcnt.
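For instance (my example, not from the answer), a plain trailing-zero count lets the compiler pick that encoding:

#include <cstdint>

// Compilers typically emit "rep bsf" (the tzcnt encoding) for this: it decodes as
// tzcnt on BMI1 CPUs and as plain bsf on older ones. Like bsf, __builtin_ctz is
// undefined for x == 0.
unsigned trailing_zero_count(uint32_t x){
    return __builtin_ctz(x);
}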
On Intel, same fast performance as bsf / bsr, even including the output dependency, until Skylake fixed that: it's a true dependency for bsf / bsr, but a false dependency for tzcnt / lzcnt and popcnt.
Fast algorithm with a bit-scan building block
But on P6 (Pentium Pro) and later, a bit-scan for the highest set bit is likely to be a useful building block for an even faster strategy than log2(width) shift/or operations, especially for uint64_t on a 64-bit machine. (Or maybe even more so for uint64_t on a 32-bit machine, where each shift would require shifting bits across the gap.)
Cycle counts are from https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html, which has instruction timings for 8088 through Pentium. (But not counting the instruction-fetch bottleneck which usually dominates 8086 and especially 8088 performance.)
bsr (index of highest set bit) is fast on modern x86: 1 cycle throughput on P6 and later, not bad on AMD. On even more recent x86, BMI1 lzcnt is 1 cycle on AMD as well, and avoids an output dependency (on Skylake and newer). It also works for an input of 0 (producing the type width aka operand size), unlike bsr which leaves the destination register unmodified.
I think the best version of this (if BMI2 is available) is one inspired by Ted Lyngmo's answer, but changed to shift left / right instead of generating a mask. ISO C++ before C++20 doesn't guarantee that >> is an arithmetic right shift on signed integer types, but all sane compilers choose that as their implementation-defined behaviour. (For example, GNU C documents it.)
https://godbolt.org/z/hKohn8W8a has that idea, which indeed is great if we don't need to handle x==0.
Also an idea with BMI2 bzhi, if we're considering what's efficient with BMI2 available: like x | ~_bzhi_u32(-1, 32-lz). Unfortunately it requires two inversions, the 32-lzcnt and the ~. We have BMI1 andn, but not an equivalent orn. And we can't just use neg because bzhi doesn't mask the count; that's the whole point, it has unique behaviour for 33 different inputs. Will probably post these as an answer tomorrow.
#include <climits>   // for CHAR_BIT

int set_leading_zeros(int x){
    int lz = __builtin_clz(x|1);                // clamp the lzcount to 31 at most (clz(0) is undefined)
    int tmp = (x<<lz);                          // shift out leading zeros, leaving the highest set bit at the MSB (or 0 if x==0)
    tmp |= 1ULL<<(CHAR_BIT * sizeof(tmp) - 1);  // set the MSB in case x==0
    return tmp>>lz;                             // sign-extend with an arithmetic right shift
}
#include <immintrin.h>
#include <cstdint>

uint32_t set_leading_zeros_bmi2(uint32_t x){
    int32_t lz = _lzcnt_u32(x);              // returns 0 to 32
    uint32_t mask = _bzhi_u32(-1, 32-lz);    // handles all 33 possible values, producing 0 for lz=32
    return x | ~mask;                        // ~mask = the bits above the highest set bit
}
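A quick sanity check (my addition, assuming both functions above are in the same file and compiled with something like -O2 -march=haswell): both versions should agree.

#include <cstdio>

int main(){
    uint32_t tests[] = {0, 1, 0x00F0FFFFu, 0x7FFFFFFFu, 0x80000000u};
    for (uint32_t t : tests)
        std::printf("%08x -> %08x / %08x\n", t,
                    (uint32_t)set_leading_zeros((int)t), set_leading_zeros_bmi2(t));
}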
On x86-64 you can combine this with BMI2 shlx / sarx for single-uop variable-count shifts even on Intel CPUs.
With efficient shifts (BMI2, or non-Intel such as AMD), it's maybe better to do (x << lz) >> lz to sign-extend, except if lz is the type width; if you need to handle that, generating a mask is probably more efficient.
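As a sketch (my own 64-bit variant of the set_leading_zeros function above, relying on the same arithmetic-right-shift assumption), the shift-left / shift-right idea for uint64_t:

#include <cstdint>

uint64_t set_leading_zeros_u64(uint64_t x){
    int lz = __builtin_clzll(x | 1);     // clamp the lzcount to 63 at most (clzll(0) is undefined)
    int64_t tmp = (int64_t)(x << lz);    // highest set bit now at the MSB (or 0 if x==0)
    tmp |= INT64_MIN;                    // set the MSB so x==0 produces all-ones
    return (uint64_t)(tmp >> lz);        // arithmetic right shift sign-extends
}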
Unfortunately shl/sar reg, cl costs 3 uops on Sandybridge-family (because of x86 legacy baggage where shifts don't set FLAGS if the count happens to be zero), so you need BMI2 shlx / sarx for it to be better than bsr ecx, dst / mov tmp, -2 / shl tmp, cl / or dst, tmp.
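In C, that mask-generating alternative might look something like this (my sketch; the function name is made up, and like the asm sequence above it doesn't handle x == 0):

#include <cstdint>

uint32_t set_leading_zeros_nobmi2(uint32_t x){
    unsigned idx = 31 - __builtin_clz(x);   // bsr: index of the highest set bit; x must be non-zero
    uint32_t mask = (uint32_t)-2 << idx;    // -2 << idx sets exactly the bits above that index
    return x | mask;
}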
Related: clear_leading_ones(x) = ~set_leading_zeros(~x).