You should check for all the CPU features you actually depend on, just in case of future weird CPUs or VMs, or (unlikely) features disabled due to CPU bugs and microcode updates. But if you're wondering whether to write two AVX2 versions of your function, one with and one without BMI1/2 instructions: no, unless it's with/without pdep/pext. Checking for BMI2 as well won't stop any real CPUs from running your AVX2 version.
All real hardware with AVX2 has also had BMI2
AMD Zen 2 and earlier have unusably slow pdep/pext, so you'll want to check for those CPU models instead of availability of BMI2 if you're doing CPU detection to set up function pointers, for functions that use either instruction inside loops. Other BMI2 instructions are fine if supported.
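A minimal sketch of that model check, assuming GCC or Clang (for `<cpuid.h>`) and assuming that "AMD before Zen 3" can be approximated as vendor string `AuthenticAMD` with a display family below 0x19. The function names are hypothetical, and the result is only meaningful in combination with a BMI2 feature check, since older AMD families without BMI2 also match:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Display family from CPUID leaf 1 EAX: base family (bits 11:8),
   plus extended family (bits 27:20) when the base family is 0xF. */
static unsigned cpu_display_family(uint32_t leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xF;
    unsigned ext  = (leaf1_eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Assumption: AMD families below 0x19 (Zen 3) have microcoded pdep/pext
   whenever they report BMI2 at all. */
static bool amd_before_zen3(const char *vendor, unsigned family)
{
    return strcmp(vendor, "AuthenticAMD") == 0 && family < 0x19;
}

/* Queries the running CPU; returns false on non-x86 builds. */
static bool running_cpu_has_slow_pdep(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return false;
    /* The vendor string is stored in EBX, EDX, ECX order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return amd_before_zen3(vendor, cpu_display_family(eax));
#else
    return false;
#endif
}
```

A dispatcher would then pick the pdep/pext version only when BMI2 is reported and `running_cpu_has_slow_pdep()` is false.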
Almost all AVX2 hardware has FMA as well, but not quite (see footnote 1).
BMI1/2 and FMA3 are part of the -march=x86-64-v3 feature level (essentially Haswell, but without TSX, AES-NI, rdrand and some other stuff; see https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels).
MSVC's /arch:AVX2 is like GCC/Clang -march=x86-64-v3, also enabling FMA3 and BMI1/2.
It's fairly likely all future CPUs will have both AVX2+BMI2, or neither, at least in commercially-relevant mainstream CPUs, although pdep and pext do need a significant amount of transistors for an execution unit separate from anything else needed for any other instruction. (A bitwise version of AVX-512 vpcompressb/vpexpandb.) Or slow microcode.
AVX2 and BMI2 have separate feature bits so an emulator or VM could disable BMI2 while leaving AVX2 enabled, so it's a good idea to check both. (And that the OS has enabled AVX: xgetbv after using CPUID to check that xgetbv is supported). An emulator might even fault if you try to run BMI2 instructions (unlike a VM: there's no control-register bit that will make the CPU hardware fault on BMI2 instructions it normally supports, unlike SSE/AVX/AVX-512.)
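The check sequence described above could look roughly like this, assuming GCC or Clang on x86 (for `<cpuid.h>` and GNU inline asm). The helper names are hypothetical; the bit positions are the documented ones (CPUID.1:ECX bit 27 OSXSAVE and bit 28 AVX, CPUID.(EAX=7,ECX=0):EBX bit 5 AVX2 and bit 8 BMI2, XCR0 bits 1 and 2 for SSE and AVX state):

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (upper YMM) state.
   The OS must enable both before AVX instructions can be used. */
static bool xcr0_enables_avx(uint64_t xcr0)
{
    return (xcr0 & 0x6) == 0x6;
}

/* CPUID leaf 1 ECX: bit 27 = OSXSAVE (xgetbv usable), bit 28 = AVX. */
static bool leaf1_has_osxsave_and_avx(uint32_t ecx)
{
    const uint32_t need = (1u << 27) | (1u << 28);
    return (ecx & need) == need;
}

/* CPUID leaf 7 subleaf 0 EBX: bit 5 = AVX2, bit 8 = BMI2. */
static bool leaf7_has_avx2(uint32_t ebx) { return (ebx >> 5) & 1; }
static bool leaf7_has_bmi2(uint32_t ebx) { return (ebx >> 8) & 1; }

/* True only if the CPU reports AVX2 and BMI2 and the OS has enabled
   AVX state. Returns false on non-x86 builds. */
static bool avx2_and_bmi2_usable(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    uint32_t lo, hi;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    if (!leaf1_has_osxsave_and_avx(ecx))
        return false;          /* xgetbv would fault without OSXSAVE */

    /* Read XCR0 (ECX = 0) into EDX:EAX. */
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    if (!xcr0_enables_avx(((uint64_t)hi << 32) | lo))
        return false;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return leaf7_has_avx2(ebx) && leaf7_has_bmi2(ebx);
#else
    return false;
#endif
}
```

In practice GCC and Clang's `__builtin_cpu_supports` wraps this kind of logic, so the manual version is mostly useful when you need control over exactly which bits are consulted.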
You don't need a separate AVX2-without-BMI2 version of your functions, unless you wanted to use pdep/pext inside a loop. If someone sets up a weird emulator or VM that stops your code from using its AVX2 functions because it lacks BMI2, that's their problem, and is unlikely to happen by accident.
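As an illustration of that single-AVX2-version approach, here is a sketch of function-pointer dispatch that requires both AVX2 and BMI2 and otherwise falls back to a baseline version. It assumes GCC or Clang (`__builtin_cpu_supports` and the `target` attribute); `sum_scalar`, `sum_avx2_bmi2`, `select_sum` and `resolve_sum` are hypothetical names, and the "AVX2" body is just a plain loop that the compiler is allowed to vectorize:

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#define TARGET_AVX2_BMI2 __attribute__((target("avx2,bmi2")))
#else
#define TARGET_AVX2_BMI2
#endif

typedef uint64_t (*sum_fn)(const uint32_t *, size_t);

/* Baseline version: safe on any CPU the program can start on. */
static uint64_t sum_scalar(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a hand-tuned version: the compiler may use AVX2 and
   BMI2 instructions anywhere in this function. */
TARGET_AVX2_BMI2
static uint64_t sum_avx2_bmi2(const uint32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* One AVX2 version only: it is chosen when BOTH features are present. */
static sum_fn select_sum(int have_avx2, int have_bmi2)
{
    return (have_avx2 && have_bmi2) ? sum_avx2_bmi2 : sum_scalar;
}

/* Run once at startup and cache the result in a function pointer. */
static sum_fn resolve_sum(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    return select_sum(__builtin_cpu_supports("avx2"),
                      __builtin_cpu_supports("bmi2"));
#else
    return select_sum(0, 0);
#endif
}
```

A caller would do something like `sum_fn sum = resolve_sum();` once and then call `sum(data, n)` in hot paths. A hypothetical AVX2-without-BMI2 machine simply gets the baseline version.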
CPUs so far
- Intel Haswell: introduced AVX2 and BMI2. (Also Intel's first BMI1 CPU).
- Intel Gracemont (Alder Lake E-cores): AVX2 and BMI2. First low-power Silvermont-family core with AVX1 or BMI1.
- AMD Excavator: AMD's first AVX2 CPU was also their first BMI2 CPU. (With horribly slow microcoded pdep / pext)
- AMD Zen 3: the first AMD with usable pdep / pext (same as Intel, 1 uop with 3c latency, 1c throughput).
- VIA Nano C QuadCore C4650 (Isaiah) from 2015: AVX2 + BMI2. (Notably without FMA3; see footnote 1). I think this was VIA's first AVX2 CPU.
- ZHAOXIN KaiXian ZX-C+ C4580: AVX2 + BMI2 (slow pdep / pext, but maybe not as bad as AMD? InstLatx64 doesn't say what inputs they tested with, and this might just be a very special case like 0). Based on VIA Nano C.
- Centaur CNS: AVX512, AVX2, BMI2 (fast pdep/pext)
Unusably slow pdep / pext on AMD Zen 2 and earlier
AMD before Zen 3 (so Excavator, Zen 1, and Zen 2) have disastrously slow pdep and pext where the number of uops depends on the data, e.g. https://uops.info/ measured 64-bit pext at 133 uops on Zen 1&2, with a throughput of one per 52 cycles.
All other BMI/BMI2 instructions are fast on CPUs that support them, at most 2 uops for stuff like blsr on AMD before Zen 4, or single-uop on Intel.
See also "What is a fast fallback algorithm which emulates PDEP and PEXT in software?" re: options for fallbacks. If you were using it with a constant mask as a way to avoid some shift/OR work, just don't, unless you also make a version tuned for AVX2-without-fast-pdep for such CPUs, or you don't care much about non-current CPUs (e.g. you know what cloud servers you'll run on).
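To make the trade-off concrete, here is a sketch of a naive bit-at-a-time software pext/pdep (not the algorithm from the linked Q&A, just the simplest portable form), plus the shift/OR equivalent for one constant mask. The names `pext_soft`, `pdep_soft` and `extract_odd_bytes`, and the example mask 0xFF00FF00, are made up for illustration:

```c
#include <stdint.h>

/* Naive pext: gather the bits of src selected by mask into the low
   bits of the result. One loop iteration per set bit in mask. */
static uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* isolate lowest set bit */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                      /* clear lowest set bit */
    }
    return result;
}

/* Naive pdep: scatter the low bits of src to the positions selected
   by mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t in_bit = 1; mask != 0; in_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & in_bit)
            result |= lowest;
        mask &= mask - 1;
    }
    return result;
}

/* With a compile-time-constant mask, a couple of shifts and ANDs do
   the same job as pext(x, 0xFF00FF00) and are fast on every CPU. */
static uint64_t extract_odd_bytes(uint32_t x)
{
    return ((x >> 8) & 0xFFu) | ((x >> 16) & 0xFF00u);
}
```

For a constant mask like this one, the shift/OR form is only a few cheap instructions, which is why it is the safer default when you can't rule out CPUs with slow pdep/pext.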
AVX1 implies popcnt
AVX1 implies SSE4.2, and SSE4.2 at least de-facto implies popcnt.
popcnt does have its own feature bit so CPUs can have popcnt without SSE4.2 support, but in practice the opposite (SSE4.2 without popcnt) hasn't happened. And enough software assumes that SSE4.2 implies popcnt that if a CPU violated that assumption, it would be the CPU's fault, not the software's. It's not really a plausible situation; popcnt is cheap to implement compared to SSE4.2 string instructions.
Footnote 1: Mysticial commented
The VIA Isaiah C4650 has AVX2 but not FMA3. Breaks a lot of programs that assume FMA3 in the presence of AVX2
Btw, I spoke to one of the VIA architects at Hot Chips about it. And he was pissed that they allowed that to happen. IIRC, he hinted that they should've either turned off the CPUID for AVX2 or microcoded the FMA.