Are there any still-relevant CPUs (Intel/AMD/Atom) which don't support SSSE3 instructions?
What's the most recent CPU without SSSE3?
The most recent CPUs without SSSE3 are based on the AMD K10 microarchitecture:
K10 CPUs support SSE3 (FP instructions like movddup and haddps), and AMD-only SSE4a. Some early K8 cores only have SSE2, but later K8 also had SSE3.
Notice that AMD CPUs listed in https://en.wikipedia.org/wiki/SSSE3#CPUs_with_SSSE3 only start at Bulldozer, but do include AMD's low-power Bobcat / Jaguar CPUs.
If you google AMD Phenom II ssse3, you'll find some pages about games removing an SSSE3 requirement so they can work on Phenom II.
On Intel you have to go back as far as Pentium M / Core, because SSSE3 was introduced with Core 2. (First-gen Core 2 (Conroe/Merom) only has 64-bit wide shuffle execution units, so pshufb is relatively slow. But so is SSE2 pshufd. See Fastest way to do horizontal float vector sum on x86.)
I think even first-gen Atom has SSSE3. https://en.wikipedia.org/wiki/Intel_Atom.
There are CPUs like AMD Geode that don't have SSE at all, but I think the point of the question is CPUs that do have SSE2/3 but not SSSE3.
There are no new mainstream CPUs being made that don't have SSE4.2, but some Phenom II CPUs are probably still in use even in 2018. The older they are, the more it's expected that new software might not work on them.
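For software that wants to keep running on SSE3-only CPUs like Phenom II, the usual approach is to check the CPUID feature bits at runtime and pick a code path. Here is a minimal sketch in C, assuming GCC or clang for the optional `<cpuid.h>` part. The struct and function names (`x86_features`, `decode_features`, `query_features`) are made up for illustration; the bit positions are the documented ones for CPUID leaf 1 (ECX) and leaf 7 subleaf 0 (EBX).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the feature flags discussed in this answer. */
struct x86_features {
    bool sse3, ssse3, sse4_2, avx, bmi1, bmi2;
};

/* Decode raw CPUID register values.
   leaf1_ecx: ECX returned by CPUID leaf 1.
   leaf7_ebx: EBX returned by CPUID leaf 7, subleaf 0 (pass 0 if unsupported). */
struct x86_features decode_features(uint32_t leaf1_ecx, uint32_t leaf7_ebx)
{
    struct x86_features f;
    f.sse3   = (leaf1_ecx >> 0)  & 1u;
    f.ssse3  = (leaf1_ecx >> 9)  & 1u;
    f.sse4_2 = (leaf1_ecx >> 20) & 1u;
    f.avx    = (leaf1_ecx >> 28) & 1u;  /* CPU bit only; the OS must also enable YMM state */
    f.bmi1   = (leaf7_ebx >> 3)  & 1u;
    f.bmi2   = (leaf7_ebx >> 8)  & 1u;
    return f;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query the CPU this code is running on. Returns false if leaf 1 is unavailable. */
bool query_features(struct x86_features *out)
{
    unsigned int eax, ebx, ecx, edx;
    unsigned int a7, b7, c7, d7;
    unsigned int leaf7_ebx = 0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    if (__get_cpuid_count(7, 0, &a7, &b7, &c7, &d7))
        leaf7_ebx = b7;

    *out = decode_features(ecx, leaf7_ebx);
    return true;
}
#endif
```

A K10-class CPU would report the SSE3 bit but not the SSSE3 bit, so a dispatcher built on this would fall back to its SSE2/SSE3 path there. With GCC or clang, `__builtin_cpu_supports("ssse3")` is a shorter way to get the same answer.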
There are unfortunately still brand-new mainstream CPUs being made without AVX and BMI: Intel's Pentium and Celeron models, even for Skylake / Kaby Lake. Presumably when a die has defects in the upper 128 bits of its vector ALUs, e.g. the large FMA units, they fuse it off and disable decoding of VEX prefixes, and label it as a Pentium or Celeron (see footnote 1). (This is presumably why Pentium/Celeron models don't support BMI1/BMI2 either, since most of those instructions are VEX-encoded; other than pext/pdep they take trivial die area.)
So we're not getting any closer to BMI1/BMI2 being baseline at some point in the future, which is really unfortunate because BMI2 is required for single-uop variable-count shifts on Intel CPUs. (shl reg, cl is 3 uops because of the cl=0 no-flag-update case being possible; SHLX / SHRX are 1 uop.) BMI1/2 is most useful when used throughout your whole code, not just in a couple of functions.
Footnote 1: Certainly some fully-working chips get this treatment, too, especially once yields improve for a new process, but for consistency / market-segmentation they're still crippled.
But I think rep movs / rep stos (ERMSB) still work with 256-bit loads/stores, so the FP register file, load/store units, and bypass forwarding network would all still need to support full width. (And ERMSB becomes much more attractive vs. vector loops because it can use twice the width.)
I wonder if there's a way for the CPU to be rewired with fuses so it can use any 2 of the 4 128-bit lanes of FMA units that are working. We know Skylake-AVX512 can mix and match FMA units with ports 0, 1, and 5, only powering up the p5 FMA (if available) for 512-bit vectors, and combining the 256-bit FMA units on p0 and p1 as one 512-bit FMA unit. Statically doing something like that with fuses could let Intel use chips that had a defect affecting both lanes of what would have been one FMA unit.
Anyway, this is pure guesswork. It seems likely, but I don't know if we have any reliable source that Intel actually ever did this as a way to sell chips with FMA defects. We do know that chips with defects in a whole physical core get sold as lower core-count SKUs, like a dual-core chip from a quad-core die. And quad-core i5 CPUs with only 6MB of L3 cache instead of 8MB have one of their 4 slices of L3 cache disabled, again probably for salvaging defects.
rep movs/stos are full width. So probably just the FMA unit, which takes significant die area and could plausibly have an isolated defect. (Or possibly other ALU, but we know pretty much everything significant runs on the FMA, including integer multiply and shift.) I'm just guessing at this, I don't remember reading any confirmation. Note that I'm not claiming that all Pentium/Celeron chips actually do have bad upper halves, just that they want(ed?) that option. –
Anode
pshufb is the first variable-shuffle, instead of an immediate control operand. So you can build a 4-bit LUT out of it, etc.) That's why I thought the question was worth answering, and why I mentioned some games removing SSSE3 from their baseline requirement. – Anode
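To make the 4-bit LUT idea from the comment above concrete, here is a rough sketch in portable C. It is a scalar model of what the 128-bit pshufb does per byte, not the instruction itself, and the function names (`pshufb_model`, `popcount_bytes`) are made up for illustration. The example uses a 16-entry table as a nibble lookup to count the set bits in each byte.

```c
#include <stdint.h>

/* Scalar model of 128-bit pshufb: each control byte selects one of the 16
   table bytes by its low 4 bits, or produces 0 if its high bit is set. */
void pshufb_model(uint8_t dst[16], const uint8_t table[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16];  /* temporary, so dst may alias table or ctrl */
    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : table[ctrl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Per-byte population count of 16 bytes using two nibble lookups. */
void popcount_bytes(uint8_t out[16], const uint8_t in[16])
{
    /* Number of set bits in each possible 4-bit value. */
    static const uint8_t nibble_popcnt[16] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
    };
    uint8_t lo[16], hi[16], cnt_lo[16], cnt_hi[16];

    for (int i = 0; i < 16; i++) {
        lo[i] = in[i] & 0x0F;  /* low nibble as a LUT index */
        hi[i] = in[i] >> 4;    /* high nibble as a LUT index */
    }
    pshufb_model(cnt_lo, nibble_popcnt, lo);
    pshufb_model(cnt_hi, nibble_popcnt, hi);
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(cnt_lo[i] + cnt_hi[i]);
}
```

On a CPU with SSSE3, each `pshufb_model` call would correspond to one `_mm_shuffle_epi8` from `<tmmintrin.h>`, doing all 16 lookups at once. Without SSSE3 there is no single-instruction equivalent, which is the reason a pshufb-based algorithm usually needs a separate fallback path.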