Does x64 support imply BMI1 support?
Asked Answered
T

1

1

It it safe to assume that x64 builds can use TZCNT without checking its support through cpu flags?

Turbot answered 25/4, 2020 at 8:18 Comment(5)
I'm curious why you even have to ask. Surely it's easy enough to google BMI1 and see that it was new in Haswell for Intel. I'm guessing you ran into tzcnt not faulting on previous CPUs, or saw a compiler using it without -mbmi1 or -march=haswell and doubted other sources, so I went into detail about that in my answer. But on the face of it, this question about BMI1 in general could certainly be downvoted for lack of research effort. (I considered it but didn't.)Purpose
I saw that zlib-ng uses tzcount conditionally at runtime. I wanted to suggest them to always use bsf for x86 (they check for 0 anyways) and use tzcnt for x64 builds. But wasn't able to find clear info if all x64 CPUs support it or not.Turbot
If they rule out input=0 already, then they should use rep bsf = tzcnt unconditionally (including 32-bit mode), so it's faster on AMD, same on Intel, at the cost of 1 instruction byte.Purpose
Oh, well it's C not asm so they can't just use _tzcnt_u32. Err wait, that's in an ifdef _MSC_VER so it's never used with compilers that would require -mbmi1 to be enabled for _tzcnt_u32. So yes, they should use _tzcnt_u32. I'm curious how horrible this is for MSVC when x86_cpu_has_tzcnt isn't a compile-time constant.Purpose
It's bad because x86_cpu_has_tzcnt isn't compile time constant (if it was, compiler would obviously remove it). It's set at runtime if CPU supports BMI1Turbot
P
12

No, certainly not! x86-64 was new in late 2003 (AMD K8), with only the legacy bsf and bsr bit-scan instructions, and none of the rest of BMI1.

The first Intel CPU to support BMI1 was Haswell in 2013. (Also introducing BMI2.)
The first AMD CPU to support BMI1 was Piledriver in 2012.
AMD ABM (Advanced Bit Manipulation) in K10 and later AMD CPUs only added popcnt and lzcnt, not tzcnt.

Wikipedia Bit Manipulation Instruction Sets: Supporting CPUs. Note that Celeron/Pentium branded CPUs don't decode VEX prefixes, so they have AVX and BMI1/BMI2 disabled because BMI1 and 2 each include some VEX-coded instructions like andn and blsr. This sucks; BMI1/2 are most useful when compilers can use it everywhere throughout an executable for more efficient variable-count shifts, and peepholes, so still selling new CPUs without BMI1/2 is not getting us closer to being able to treat them as baseline like we do for P6 cmov in 32-bit mode.


TZCNT decoding on old CPUs

Since you mention tzcnt specifically, its machine-code encoding is rep bsf so older CPUs will execute it as BSF. This produces the same result as tzcnt if the input is non-zero. i.e. tzcnt "works" on all x86 CPUs (since 386) when the input is non-zero.

But when it is zero, tzcnt would produce the operand-size (e.g. 64), but bsf leaves the destination register unmodified. tzcnt sets FLAGS based on the result, bsf based on the input. AMD documents the dst-unmodified behaviour in their ISA reference manual. Intel only documents it as "undefined value" but implements the same behaviour as AMD, at least in existing CPUs.

(This is why bsf / bsr have an output dependency on all CPUs, like add. Unfortunately tzcnt / lzcnt also have a false dependency on Intel Sandybridge-family before Skylake: Why does breaking the "output dependency" of LZCNT matter?. And why popcnt does on SnB-family before Cannon / Ice Lake, because it shares the same execution unit.)


tzcnt is significantly faster on AMD, so compilers tuning for "generic" or AMD CPUs will often use tzcnt instead of bsf without checking for CPU features.

e.g. for GNU C __builtin_ctz. That intrinsic has undefined behaviour for input=0 so it's allowed to just use bsf without checking for 0. And thus also allowed to use tzcnt because the result in that case is not guaranteed by anything.

Why does TZCNT work for my Sandy Bridge processor?

No such backward / forward compat exists for lzcnt. Having it decode as rep bsr with the meaningless rep prefix ignored would give you 31 - lzcnt(x), the bit-index. https://fgiesen.wordpress.com/2013/10/18/bit-scanning-equivalencies/

One handy trick is ctz( x | 0x80000000 ) because OR is cheap1, and guarantees there's always a non-zero bit for bsf to find. It doesn't change the result for any non-zero x because it's the last bit bsf will look at. This trick also works for __builtin_clz(x|1) / bsr, where it's even better because or reg, imm8 is even shorter than imm32.

Footnote 1: or reg, imm32 works for a 32-bit constant; bts reg,63 is less cheap on some CPUs to implement x|(1ULL<<63) for a 64-bit input.

Purpose answered 25/4, 2020 at 9:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.