It it safe to assume that x64 builds can use TZCNT without checking its support through cpu flags?
No, certainly not! x86-64 was new in late 2003 (AMD K8), with only the legacy bsf
and bsr
bit-scan instructions, and none of the rest of BMI1.
The first Intel CPU to support BMI1 was Haswell in 2013. (Also introducing BMI2.)
The first AMD CPU to support BMI1 was Piledriver in 2012.
AMD ABM (Advanced Bit Manipulation) in K10 and later AMD CPUs only added popcnt
and lzcnt
, not tzcnt
.
Wikipedia Bit Manipulation Instruction Sets: Supporting CPUs. Note that Celeron/Pentium branded CPUs don't decode VEX prefixes, so they have AVX and BMI1/BMI2 disabled because BMI1 and 2 each include some VEX-coded instructions like andn
and blsr
. This sucks; BMI1/2 are most useful when compilers can use it everywhere throughout an executable for more efficient variable-count shifts, and peepholes, so still selling new CPUs without BMI1/2 is not getting us closer to being able to treat them as baseline like we do for P6 cmov
in 32-bit mode.
TZCNT decoding on old CPUs
Since you mention tzcnt
specifically, its machine-code encoding is rep bsf
so older CPUs will execute it as BSF. This produces the same result as tzcnt
if the input is non-zero. i.e. tzcnt
"works" on all x86 CPUs (since 386) when the input is non-zero.
But when it is zero, tzcnt
would produce the operand-size (e.g. 64), but bsf
leaves the destination register unmodified. tzcnt
sets FLAGS based on the result, bsf
based on the input. AMD documents the dst-unmodified behaviour in their ISA reference manual. Intel only documents it as "undefined value" but implements the same behaviour as AMD, at least in existing CPUs.
(This is why bsf
/ bsr
have an output dependency on all CPUs, like add
. Unfortunately tzcnt
/ lzcnt
also have a false dependency on Intel Sandybridge-family before Skylake: Why does breaking the "output dependency" of LZCNT matter?. And why popcnt
does on SnB-family before Cannon / Ice Lake, because it shares the same execution unit.)
tzcnt
is significantly faster on AMD, so compilers tuning for "generic" or AMD CPUs will often use tzcnt
instead of bsf
without checking for CPU features.
e.g. for GNU C __builtin_ctz
. That intrinsic has undefined behaviour for input=0 so it's allowed to just use bsf
without checking for 0. And thus also allowed to use tzcnt
because the result in that case is not guaranteed by anything.
Why does TZCNT work for my Sandy Bridge processor?
No such backward / forward compat exists for lzcnt
. Having it decode as rep bsr
with the meaningless rep
prefix ignored would give you 31 - lzcnt(x)
, the bit-index. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/
One handy trick is ctz( x | 0x80000000 )
because OR is cheap1, and guarantees there's always a non-zero bit for bsf
to find. It doesn't change the result for any non-zero x
because it's the last bit bsf
will look at. This trick also works for __builtin_clz(x|1)
/ bsr
, where it's even better because or reg, imm8
is even shorter than imm32
.
Footnote 1: or reg, imm32
works for a 32-bit constant; bts reg,63
is less cheap on some CPUs to implement x|(1ULL<<63)
for a 64-bit input.
© 2022 - 2024 — McMap. All rights reserved.
tzcnt
not faulting on previous CPUs, or saw a compiler using it without-mbmi1
or-march=haswell
and doubted other sources, so I went into detail about that in my answer. But on the face of it, this question about BMI1 in general could certainly be downvoted for lack of research effort. (I considered it but didn't.) – Purposebsf
for x86 (they check for 0 anyways) and use tzcnt for x64 builds. But wasn't able to find clear info if all x64 CPUs support it or not. – Turbotrep bsf
=tzcnt
unconditionally (including 32-bit mode), so it's faster on AMD, same on Intel, at the cost of 1 instruction byte. – Purpose_tzcnt_u32
. Err wait, that's in an ifdef _MSC_VER so it's never used with compilers that would require-mbmi1
to be enabled for_tzcnt_u32
. So yes, they should use_tzcnt_u32
. I'm curious how horrible this is for MSVC whenx86_cpu_has_tzcnt
isn't a compile-time constant. – Purposex86_cpu_has_tzcnt
isn't compile time constant (if it was, compiler would obviously remove it). It's set at runtime if CPU supports BMI1 – Turbot