Most recent processor without support of SSSE3 instructions? [closed]
C

1

6

Are there any still-relevant CPUs (Intel/AMD/Atom) which don't support SSSE3 instructions?

What's the most recent CPU without SSSE3?

Casaubon answered 17/10, 2018 at 15:30 Comment(9)
Atom is a series of x86 CPUs made by Intel. And obviously all x86 CPUs made before the advent of SSSE3 can't support those instructions. - Cocker
@phuclv, it is obvious that processors designed before SSSE3 don't support SSSE3. I want to know which processors those are. - Casaubon
There might be hundreds or thousands of CPUs like that. There's no reason to know their names. - Cocker
I don't understand the downvote or the close request for this question. I need a processor that doesn't support SSSE3 instructions to perform some experiments. So, @Cocker, your last comment, that there are thousands of such processors but no reason to know their names, is ridiculous. At least name some of the latest processors that don't support SSSE3 instructions. - Casaubon
This is not about programming, so it should be closed. And why not just have a look at Wikipedia? It says SSSE3 was introduced in the Core microarchitecture (Merom), so any prior CPUs like Pentium 4, Pentium 3, II... didn't support that. This is likely an XY problem; why care about such decades-old CPUs anyway? - Cocker
No one needs to know the CPU name to test; they just need to know which instruction sets a CPU/microarchitecture supports, which can be checked at run time or read from the specs on Intel's website. It might be better to use a virtual machine or an emulator like SDE and disable those instruction sets. - Cocker
@phuclv: Usually this sort of thing starts with "I have highly optimized code for SSE3; is it worth my time and effort writing, testing and optimizing new code to support CPUs without SSE3?". Note that there's some funky CPUs intended for embedded systems and HPC which don't follow the evolution of mainstream CPUs. Specifically, Intel's "Quark SE 1000" (released in 2015) does not support SSE at all. - Queer
@phuclv: Knowing the CPU names isn't useful for actually doing CPU detection, but knowing which uarches do/don't have what feature can influence your choice of whether to spend time writing an optimized implementation that avoids SSSE3. (And SSSE3 is an important one; pshufb is the first variable-shuffle, instead of an immediate control operand. So you can build a 4-bit LUT out of it, etc.) That's why I thought the question was worth answering, and why I mentioned some games removing SSSE3 from their baseline requirement. - Anode
@phuclv: for the OP's purpose, testing software for running on a machine without SSSE3, yes, SDE can do that regardless of the actual host. (Disabling AVX2 in CPU for testing purposes shows how). Unless you need to evaluate performance on an actual old CPU, especially for software that has realtime requirements for usability (like a game, or video / audio editing). - Anode
A
15

The most recent CPUs without SSSE3 are based on the AMD K10 microarchitecture:

  • AMD Phenom II, the last-generation K10 socketed desktop CPUs before Bulldozer-family. They were produced from 2008 to 2012.
  • AMD Llano APUs, introduced June 2011. (Bulldozer-based APUs were introduced Oct 2012, IDK when the last Llano APUs were made / sold). Also based on K10 cores, but reporting CPUID "family" = 12h.

K10 CPUs support SSE3 (FP instructions like movddup and haddps), and AMD-only SSE4a. Some early K8 cores only have SSE2, but later K8 also had SSE3.

Notice that AMD CPUs listed in https://en.wikipedia.org/wiki/SSSE3#CPUs_with_SSSE3 only start at Bulldozer, but do include AMD's low-power Bobcat / Jaguar CPUs.

If you google AMD Phenom II ssse3, you'll find some pages about some games removing an SSSE3 requirement so they can work on Phenom II.


On Intel you have to go back as far as Pentium M / Core, because SSSE3 was introduced with Core 2. (First-gen core2 (Conroe/Merom) only has 64-bit wide shuffle execution units, so pshufb is relatively slow. But so is SSE2 pshufd. See Fastest way to do horizontal float vector sum on x86.)
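
To make the appeal of pshufb concrete, here is a minimal sketch in portable C that models the instruction's byte-selection rule and uses it as a 4-bit lookup table to count set bits per byte. This is a scalar model for illustration, not the intrinsic itself; the names `pshufb_model` and `popcount_bytes` are hypothetical. Real SSSE3 code would do each lookup with a single `pshufb`.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of 128-bit pshufb: each output byte is picked from `table`
   by the low 4 bits of the matching control byte, or forced to zero when
   the control byte has its top bit set. `out` must not alias the inputs. */
void pshufb_model(uint8_t out[16], const uint8_t table[16],
                  const uint8_t control[16])
{
    for (size_t i = 0; i < 16; i++)
        out[i] = (control[i] & 0x80) ? 0 : table[control[i] & 0x0F];
}

/* Per-byte popcount of 16 bytes: split every byte into two nibbles and
   look each nibble up in a 16-entry table of bit counts. */
void popcount_bytes(uint8_t counts[16], const uint8_t in[16])
{
    static const uint8_t nibble_bits[16] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    uint8_t lo[16], hi[16], lo_cnt[16], hi_cnt[16];

    for (size_t i = 0; i < 16; i++) {
        lo[i] = in[i] & 0x0F;   /* low nibble as a table index  */
        hi[i] = in[i] >> 4;     /* high nibble as a table index */
    }
    pshufb_model(lo_cnt, nibble_bits, lo);
    pshufb_model(hi_cnt, nibble_bits, hi);
    for (size_t i = 0; i < 16; i++)
        counts[i] = (uint8_t)(lo_cnt[i] + hi_cnt[i]);
}
```

Without SSSE3 there is no single instruction that does the table lookup step, which is why a fallback path for this kind of code usually needs a different algorithm rather than a mechanical translation.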

I think even first-gen Atom has SSSE3. https://en.wikipedia.org/wiki/Intel_Atom.

There are CPUs like AMD Geode that don't have SSE at all, but I think the point of the question is CPUs that do have SSE2/3 but not SSSE3.
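
If the goal is to tell these categories apart at run time rather than by CPU name, a rough sketch is to decode the feature bits from CPUID leaf 1 (EDX bit 25 = SSE, EDX bit 26 = SSE2, ECX bit 0 = SSE3, ECX bit 9 = SSSE3). The `simd_level` enum and the function names below are hypothetical; the hardware query assumes GCC or Clang on x86 and uses `__get_cpuid` from `<cpuid.h>`. It does not check OS support for saving SSE state.

```c
#include <stdint.h>

/* Feature bits reported by CPUID leaf 1. */
#define CPUID1_EDX_SSE   (1u << 25)
#define CPUID1_EDX_SSE2  (1u << 26)
#define CPUID1_ECX_SSE3  (1u << 0)
#define CPUID1_ECX_SSSE3 (1u << 9)

/* Hypothetical classification: highest level in the SSE..SSSE3 chain. */
typedef enum { SIMD_NONE, SIMD_SSE, SIMD_SSE2, SIMD_SSE3, SIMD_SSSE3 } simd_level;

/* Pure decoding step, so it can be exercised with made-up register values. */
simd_level classify_leaf1(uint32_t ecx, uint32_t edx)
{
    if (!(edx & CPUID1_EDX_SSE))   return SIMD_NONE;
    if (!(edx & CPUID1_EDX_SSE2))  return SIMD_SSE;
    if (!(ecx & CPUID1_ECX_SSE3))  return SIMD_SSE2;
    if (!(ecx & CPUID1_ECX_SSSE3)) return SIMD_SSE3;
    return SIMD_SSSE3;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Query the CPU this code is running on (GCC/Clang on x86 only). */
simd_level detect_simd_level(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return SIMD_NONE;   /* leaf 1 not available */
    return classify_leaf1(ecx, edx);
}
#endif
```

Going by the feature lists above, a Phenom II or Llano should come out as `SIMD_SSE3` and a Core 2 or later as `SIMD_SSSE3`. With GCC or Clang, `__builtin_cpu_supports("ssse3")` is a shorter way to get the same yes/no answer.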


There are no new mainstream CPUs being made that don't have SSE4.2, but some Phenom II CPUs are probably still in use even in 2018. The older they are, the more it's expected that new software might not work on them.

There are unfortunately still brand-new mainstream CPUs being made without AVX and BMI: Intel's Pentium and Celeron models, even for Skylake / Kaby Lake. Presumably when a die has defects in the upper 128 bits of its vector ALUs, e.g. the large FMA units, they fuse it off and disable decoding of VEX prefixes, and label it as a Pentium or Celeron (see footnote 1). (This is presumably why Pentium/Celeron models don't support BMI1/BMI2 either: those instructions are VEX-encoded, and other than pext/pdep they take trivial die area.)

So we're not getting any closer to BMI1/BMI2 being baseline at some point in the future, which is really unfortunate because BMI2 is required for single-uop variable-count shifts on Intel CPUs. (shl reg, cl is 3 uops because the cl=0 case must leave flags unmodified; SHLX / SHRX are 1 uop.) BMI1/2 is most useful when used throughout your whole code, not just in a couple of functions.
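
As a small illustration of why this matters for whole-program code generation, the sketch below is ordinary C with variable-count shifts. The function names are hypothetical. GCC and Clang can typically emit SHLX / SHRX for code like this when BMI2 is enabled (for example with `-mbmi2`), and fall back to the legacy shift-by-cl forms otherwise; the exact output depends on the compiler and options.

```c
#include <stdint.h>

/* Masking the count to 0..63 matches what 64-bit x86 shifts do in hardware
   and keeps the C expression well defined for any n. */
uint64_t shl_var(uint64_t x, unsigned n)
{
    return x << (n & 63);
}

uint64_t shr_var(uint64_t x, unsigned n)
{
    return x >> (n & 63);
}
```

Because shifts like these appear all over typical code, the benefit only shows up if the whole binary is built for a BMI2 baseline, not just a few dispatched functions.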


Footnote 1: Certainly some fully-working chips get this treatment, too, especially once yields improve for a new process, but for consistency / market-segmentation they're still crippled.

But I think rep movs/rep stos ERMSB still work with 256-bit loads/stores, so the FP register file, load/store units, and bypass forwarding network would all still need to support full width. (And ERMSB becomes much more attractive vs. vector loops because it can use twice the width.)

I wonder if there's a way for the CPU to be rewired with fuses so it can use any 2 of the 4 128-bit lanes of FMA units that are working. We know Skylake-AVX512 can mix and match FMA units with ports 0, 1, and 5, only powering up the p5 FMA (if available) for 512-bit vectors, and combining the 256-bit FMA units on p0 and p1 as one 512-bit FMA unit. Statically doing something like that with fuses could let Intel use chips that had a defect affecting both lanes of what would have been one FMA unit.

Anyway, this is pure guesswork. It's plausible, but I don't know if we have any reliable source that Intel actually ever did this as a way to sell chips with FMA defects. We do know that chips with defects in a whole physical core get sold as lower core-count SKUs, like a dual-core chip from a quad-core die. And quad-core i5 CPUs with only 6MB of L3 cache instead of 8MB have one of their 4 slices of L3 cache disabled, again probably for salvaging defects.

Anode answered 17/10, 2018 at 17:59 Comment(6)
You may also find this blog post useful. - Mincing
Some suggestions for minor corrections: 1) The most recent CPUs without SSSE3 are AMD A-Series APUs based on the Llano core. They are based on the same cores as K10 but have a different family id (12h instead of 10h). en.wikipedia.org/wiki/… 2) There is no K10 without SSE3. The original K8 lacked SSE3, but in 2005 AMD added SSE3 in revision E of K8 cores. K10 only came much later. - Geriatrician
Do you have any reference for this "errors in the upper bits sold as celeron" theory? Even if it was true at some point, yields probably improve to a point where you have to sell good chips as celerons, or restrict the market. Don't the rep mov/stos instructions work at full width on these chips? If so, it implies no errors in at least the part of the pipeline that they use, which probably (?) includes the register file. - Fauman
@BeeOnRope: Yeah good point about the physical register file; I don't know if we have specific evidence, but probably rep movs/stos are full width. So probably just the FMA unit, which takes significant die area and could plausibly have an isolated defect. (Or possibly other ALU, but we know pretty much everything significant runs on the FMA, including integer multiply and shift.) I'm just guessing at this, I don't remember reading any confirmation. Note that I'm not claiming that all Pentium/Celeron chips actually do have bad upper halves, just that they want(ed?) that option. - Anode
I think only SIMD integer multiplication runs on the FMA unit, if I'm not mistaken? I think the 64x64->128 scalar multiplier is a separate unit. - Fauman
@BeeOnRope: yes, that's what I meant; sorry about the ambiguity. - Anode
