Does x64 support imply BMI1 support?

No, certainly not! x86-64 was new in late 2003 (AMD K8), with only the legacy bsf and bsr bit-scan instructions, and none of the rest of BMI1.

The first Intel CPU to support BMI1 was Haswell in 2013. (Also introducing BMI2.)
The first AMD CPU to support BMI1 was Piledriver in 2012.
AMD ABM (Advanced Bit Manipulation) in K10 and later AMD CPUs only added popcnt and lzcnt, not tzcnt.

Wikipedia Bit Manipulation Instruction Sets: Supporting CPUs. Note that Celeron/Pentium branded CPUs don't decode VEX prefixes, so they have AVX and BMI1/BMI2 disabled because BMI1 and 2 each include some VEX-coded instructions like andn and blsr. This sucks; BMI1/2 are most useful when compilers can use it everywhere throughout an executable for more efficient variable-count shifts, and peepholes, so still selling new CPUs without BMI1/2 is not getting us closer to being able to treat them as baseline like we do for P6 cmov in 32-bit mode.

TZCNT decoding on old CPUs

Since you mention tzcnt specifically, its machine-code encoding is rep bsf so older CPUs will execute it as BSF. This produces the same result as tzcnt if the input is non-zero. i.e. tzcnt "works" on all x86 CPUs (since 386) when the input is non-zero.

But when it is zero, tzcnt would produce the operand-size (e.g. 64), but bsf leaves the destination register unmodified. tzcnt sets FLAGS based on the result, bsf based on the input. AMD documents the dst-unmodified behaviour in their ISA reference manual. Intel only documents it as "undefined value" but implements the same behaviour as AMD, at least in existing CPUs.

(This is why bsf / bsr have an output dependency on all CPUs, like add. Unfortunately tzcnt / lzcnt also have a false dependency on Intel Sandybridge-family before Skylake: Why does breaking the "output dependency" of LZCNT matter?. And why popcnt does on SnB-family before Cannon / Ice Lake, because it shares the same execution unit.)

tzcnt is significantly faster on AMD, so compilers tuning for "generic" or AMD CPUs will often use tzcnt instead of bsf without checking for CPU features.

e.g. for GNU C __builtin_ctz. That intrinsic has undefined behaviour for input=0 so it's allowed to just use bsf without checking for 0. And thus also allowed to use tzcnt because the result in that case is not guaranteed by anything.

Why does TZCNT work for my Sandy Bridge processor?

No such backward / forward compat exists for lzcnt. Having it decode as rep bsr with the meaningless rep prefix ignored would give you 31 - lzcnt(x), the bit-index. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/

One handy trick is ctz( x | 0x80000000 ) because OR is cheap¹, and guarantees there's always a non-zero bit for bsf to find. It doesn't change the result for any non-zero x because it's the last bit bsf will look at. This trick also works for __builtin_clz(x|1) / bsr, where it's even better because or reg, imm8 is even shorter than imm32.

Footnote 1: or reg, imm32 works for a 32-bit constant; bts reg,63 is less cheap on some CPUs to implement x|(1ULL<<63) for a 64-bit input.

TZCNT decoding on old CPUs

Recommended topics

Hot tags