Do all CPUs that support AVX2 also support BMI2 or popcnt?
Asked Answered
P

2

5

From here, I learned that the support of AVX doesn't imply the support of BMI1. So how about AVX2: Do all CPUs that support AVX2 also support BMI2? Further, does the support of AVX2 imply the support of popcnt?

Searched all over Google and cannot locate a definite answer. The closest thing I got is Does AVX support imply BMI1 support?.

Palermo answered 8/6, 2023 at 1:33 Comment(2)
I found some Zhaoxin processors that support BMI2 without AVX2 (unexpectedly) but none of the reverse. Theoretically there is no such promise AFAIKChurlish
@harold: Interesting. I didn't know where to look for conveniently decoded CPUID results for Zhaoxin CPUs; users.atw.hu/instlatx64 has a couple, but many lack even raw CPUID dumps.Vena
V
7

You should check for all the CPU features you actually depend on just in case of future weird CPUs or VMs, or (unlikely) features disabled due to CPU bugs and microcode updates. But if you're wondering whether to write two AVX2 versions of your function, one with and one without BMI1/2 instructions: no unless it's with/without pdep/pext. Checking for BMI2 as well won't stop any real CPUs from running your AVX2 version.

All real hardware with AVX2 has also had BMI2

AMD Zen 2 and earlier have unusably slow pdep/pext, so you'll want to check for those CPU models instead of availability of BMI2 if you're doing CPU detection to set up function pointers, for functions that use either instruction inside loops. Other BMI2 instructions are fine if supported.

Almost all AVX2 hardware has FMA as well, but not quite1.

BMI1/2 and FMA3 are part of the -march=x86-64-v3 feature level (essentially Haswell, but without TSX, AES-NI, rdrand and some other stuff. https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels).

MSVC's /arch:AVX2 is like GCC/Clang -march=x86-64-v3, also enabling FMA3 and BMI1/2.


It's fairly likely all future CPUs will have both AVX2+BMI2, or neither, at least in commercially-relevant mainstream CPUs, although pdep and pext do need a significant amount of transistors for an execution unit separate from anything else needed for any other instruction. (A bitwise version of AVX-512 vpcompressb/vpexpandb.) Or slow microcode.

AVX2 and BMI2 have separate feature bits so an emulator or VM could disable BMI2 while leaving AVX2 enabled, so it's a good idea to check both. (And that the OS has enabled AVX: xgetbv after using CPUID to check that xgetbv is supported). An emulator might even fault if you try to run BMI2 instructions (unlike a VM: there's no control-register bit that will make the CPU hardware fault on BMI2 instructions it normally supports, unlike SSE/AVX/AVX-512.)

You don't need a separate AVX2-without-BMI2 version of your functions, unless you wanted to use pdep/pext inside a loop. If someone sets up a weird emulator or VM that stops your code from using its AVX2 functions because it lacks BMI2, that's their problem, and is unlikely to happen by accident.

CPUs so far

  • Intel Haswell: introduced AVX2 and BMI2. (Also Intel's first BMI1 CPU).
  • Intel Gracemont (Alder Lake E-cores): AVX2 and BMI2. First low-power silvermont-family with AVX1 or BMI1.
  • AMD Excavator: AMD's first AVX2 CPU was also their first BMI2 CPU. (With horribly slow microcoded pdep / pext)
  • AMD Zen 3: the first AMD with usable pdep / pext (same as Intel, 1 uop with 3c latency, 1c throughput).
  • VIA Nano C QuadCore C4650 (Isiah) from 2015: AVX2 + BMI2. (Notably without FMA31). I think this was VIA's first AVX2 CPU.
  • ZHAOXIN KaiXian ZX-C+ C4580: AVX2 + BMI2 (slow pdep / pext, but maybe not as bad as AMD? InstLatx64 doesn't say what inputs they tested with, and this might just be a very special case like 0). Based on VIA Nano C.
  • Centaur CNS: AVX512, AVX2, BMI2 (fast pdep/pext)

Unusably slow pdep / pext on AMD Zen 2 and earlier

AMD before Zen 3 (so Excavator, Zen 1, and Zen 2) have disastrously slow pdep and pext where the number of uops depends on the data, e.g. https://uops.info/ measured 64-bit pext at 133 uops on Zen 1&2 with one per 52 cycle throughput.

All other BMI/BMI2 instructions are fast on CPUs that support them, at most 2 uops for stuff like blsr on AMD before Zen 4, or single-uop on Intel.

See also What is a fast fallback algorithm which emulates PDEP and PEXT in software? re: options for fallbacks. If you were using it with a constant mask as a way to avoid some shift/OR work, just don't unless you also make a version tuned for AVX2-without-fast-pdep for such CPUs, or if you don't care much about non-current CPUs. (e.g. you know what cloud servers you'll run on.)


AVX1 implies popcnt

AVX1 implies SSE4.2, and SSE4.2 at least de-facto implies popcnt.

popcnt does have its own feature bit so CPUs can have popcnt without SSE4.2 support, but in practice the opposite hasn't happened. And enough software assumes that SSE4.2 implies popcnt that if a CPU violated that assumption, it would be the CPUs fault, not software. It's not really a plausible situation; popcnt is cheap to implement compared to SSE4.2 string instructions.


Footnote 1: Mysticial commented

The VIA Isaiah C4650 has AVX2 but not FMA3. Breaks a lot of programs that assume FMA3 in the presence of AVX2

Btw, I spoke to one of the VIA architects at Hot Chips about it. And he was pissed that they they allowed that to happen. IIRC, he hinted that they should've either turned off the CPUID for AVX2 or microcoded the FMA.

Vena answered 8/6, 2023 at 5:44 Comment(0)
P
2

I have a laptop with an Intel Core i5 10210U processor that supports AVX and AVX2, but neither BMI1, BMI2, nor POPCNT.

Click here a for CPU-Z screenshot.

I suspect most recent hardware with AVX2 will support BMI2 and POPCNT, but clearly not all hardware with AVX2 does.

My apologies for the misleading information. Using this code shows the following support for an Intel Core i5 10210U (irrelevant features snipped):

* Vendor    = GenuineIntel
* Processor = Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
...
YES - POPCNT        (Advanced Bit Manipulation - Bit Population Count Instruction)
...
YES - AVX2          (Advanced Vector Extensions 2)
YES - BMI1          (Bit Manipulations Instruction Set 1)
YES - BMI2          (Bit Manipulations Instruction Set 2)
YES - ADX           (Multi-Precision Add-Carry Instruction Extensions)
...

I'm not terribly upset to be wrong here as it means I definitely have access to a decent PEXT/PDEP implementation without needing different hardware.

As a general answer to these types of questions (ie. do all CPUs that support feature X also support feature Y?), don't make any assumptions and explicitly check using CPUID that the features you want to rely upon are actually available because there are no guarantees.

Parlour answered 21/1, 2024 at 7:37 Comment(18)
P.s. I've just read Peter's answer and he states "SSE4.2 implies popcnt". I'll confirm whether or not this is the case for my laptop within the next day or so, and update my answer if needed.Parlour
popcnt isn't shown separately by cpuz anywayChurlish
I'll write a small program to properly test it out, but I suspect you are correct.Parlour
Does CPU-Z show non-SIMD features at all? It doesn't show ADX (adcx/adox instructions), RDSEED, MOVBE, or ERMSB (weakly-ordered stores from rep movsb/stosb) either, but your CPU has those, too. It's not plausible that a Comet Lake i5 wouldn't support BMI1/BMI2, unless you're running in a VM with a custom setup that filters some CPUID bits.Vena
@PeterCordes it seems CPU-Z does focus on the SIMD related features, but there are also other sources I checked to cross reference BMI1/2 support and those didn't mention them either. I guess the lesson here is to check the hardware directly using software that fully reports CPUID info rather than relying upon documentation and software that selectively reports features.Parlour
Linux /proc/cpuinfo is usually good. IDK what to recommend under Windows. Any surprising absence should be verified by running a test program to see if it actually faults, e.g. #include <immintrin.h> int main(int argc, char **argv) { return _pdep_u32(argc, argc+1); } can't optimize away the pdep since we pass it a runtime variable and return the result.Vena
explicitly check using CPUID that the features you want to rely upon are actually available - 100% agreed, updated my answer to say that, good point. Questions like this are mostly useful for deciding whether it's worth writing a version of a function for the AVX2 without BMI2 case. Until/unless some future CPU bug leads to a microcode update disabling BMI2 but not AVX2 or something, making your 32-byte-vector code also check for and require BMI2 isn't missing out on speedups on any CPUs.Vena
I recently came across some errata for 12th Gen Intel Core processers that says not to rely on faults: Problem: BMI1, BMI2, LZCNT, ADXC, and ADOX instructions will not generate an #UD fault, even though the respective CPUID feature flags do not enumerate them as supported instructions. Implication: Software that relies on BMI1, BMI2, LZCNT, ADXC, and ADOX instructions to generate an #UD fault, may not work correctly. Workaround: None identified. Software should check CPUID reported instructions availability and not rely on the #UD fault behavior.Parlour
@PeterCordes what do you make of the above errata? Is the implication that these processors simply don't support BMI1 & BMI2, and the processors aren't throwing a fault as they should be?Parlour
@Parlour half of those instructions use a funny encoding with a repurposed REP prefix, making them potentially impossible to detect as invalid, because they were never invalid - they used to mean something else than they do now. For example LZCNT is supposed to be executed as BSR by processors that don't support LZCNT, it would never #UD to begin with. But there are processors that claim not to support LZCNT, but then execute it as LZCNT anyway instead of as BSR, which is one of those errata.Churlish
@harold as an example, Peter suggested above using a _pdep_u32 call and checking if it faults to verify that BMI2 was indeed not available. The errata is a little confusing, because we seem to have reached the conclusion here that BMI2 would most likley be a feature on most recent processors, yet this errata seems to indicate that it's not and a fault won't be thrown as it should. Perhaps I'm missing some context.Parlour
@harold: OMFG, is Intel still making CPUs without BMI1/BMI2? I hoped that had stopped with Ice Lake when even the Pentium / Celeron models had AVX2+FMA so had to decode VEX prefixes. Are Alder Lake Pentium/Celeron back to being crippled without AVX or BMI? So we're still not getting any closer to a world where -march=x86-64-v3 can be baseline? A CPUID dump for a "1C+4c Intel Pentium 8505 (Alder Lake-P)" users.atw.hu/instlatx64/GenuineIntel/… shows leaf 7 EBX=0x239C27EB which does include BMI1, BMI2, and AVX2 so Pentiums aren't the problemVena
@PeterCordes I don't know actually, I couldn't find which CPU is affected by that 12th gen erratum that dpldgr mentioned. Maybe one of the Celerons? Intel is really stingy with their informationChurlish
They state "No fix" for all processor lines here: edc.intel.com/content/www/us/en/design/ipla/…Parlour
@harold: I found a discussion of the ADL004 erratum on realworldtech.com/forum/?threadid=205239&curpostid=205258 - a user reports that even Celeron-branded models are supposed to have BMI1/BMI2. (Others replied with worries that the E cores don't actually support BMI2, but we know they do.) I wonder if the erratum is just discussing a hypothetical case where they want to disable BMI1/2 in a microcode update in the future, not any real CPUs that exist now?Vena
@dpldgr: IDK what to make of that "no fix" across the board. We know most models in each of those segments have fully working BMI1/BMI2 with the CPUID flag bits set, too. If there are any real ADL CPUs without BMI2, IDK if they'd filter the columns by what currently existed in the market, since they might always release a high-power-mobile later. I'm still hoping and leaning toward this being a hypothetical erratum: if there ever was an ADL without those CPUID feature bits, they don't have a way for a microcode update to make the VEX instructions fault?Vena
The fact that LZCNT will never #UD (it runs as bsr with the rep prefix ignored on older CPUs) also makes me think this erratum is nonsense. If it was real, not just hypothetical, they would hopefully have spent more effort describing it accurately. LZCNT even has a separate feature bit (AMD's ABM) by which CPUs can advertize its presence without the rest of BMI1.Vena
I guess until we actually see some evidence that these features have been disabled on real hardware (and not simply filtered out by a hypervisor) we can presume they're available on all recent processors, but it'd still be wise to double check using CPUID that the relevant bits are actually set before using them.Parlour

© 2022 - 2025 — McMap. All rights reserved.