Is HyperThreading / SMT a flawed concept?

Asked 15/4, 2014 at 8:48 Answered 9/1, 2021 at 12:58

Solved multithreading cpu-architecture hyperthreading

The primary idea behind HT/SMT was that when one thread stalls, another thread on the same core can co-opt the rest of that core's idle time and run with it, transparently.

In 2013 Intel dropped SMT in favor of out-of-order execution for its Silvermont processor cores, as they found this gave better performance.

ARM no longer supports SMT (for energy reasons). AMD never supported it. In the wild, we still have various processors that support it.

From my perspective, if data and algorithms are built to avoid cache misses and subsequent processing stalls at all costs, surely HT is a redundant factor in multi-core systems? While I appreciate that there is low overhead to the context-switching involved since the two HyperThreads' discrete hardware exists within the same physical core, I cannot see that this is better than no context switching at all.

I'm suggesting that any need for HyperThreading points to flawed software design. Is there anything I am missing here?

Cattalo asked 15/4, 2014 at 8:48 Comment(2)

"if data and algorithms are built to avoid cache misses and subsequent processing stalls at all costs, surely HT is a redundant factor in multi-core systems?" In a perfect world, sure, but that's not the world we live in. – Hatteras 15/4, 2014 at 8:57

Just a note: HT isn't "context switching"; both logical cores are truly running at the same time. (Alternating cycles in the front-end when neither is stalled, mixing execution in the out-of-order back-end. HT is fine-grained SMT. en.wikipedia.org/wiki/Simultaneous_multithreading) – Hanukkah 13/6, 2021 at 14:44

Whether hyper-threading helps, and by how much, depends very much on what the threads are doing. It isn't just about doing work in one thread while the other thread waits on I/O or a cache miss, although that is a big part of the rationale. It is about efficiently using the CPU resources to increase total system throughput. Suppose you have two threads:

One has lots of data cache misses (poor spatial locality) and does not use floating point. The poor spatial locality is not necessarily because the programmer didn't do a good job; some workloads are inherently so.
The other is streaming data from memory and doing floating point calculations.

With hyper-threading these two threads can share the same CPU core: one is doing integer operations, getting cache misses and stalling, while the other is using the floating point unit with the data prefetcher running well ahead, anticipating the sequential data from memory. The system throughput is better than if the O/S alternately scheduled both threads on the same CPU core.
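As a rough illustration of the throughput argument above, here is a toy model in Python. It assumes a hypothetical 1-wide core, a miss-heavy thread that stalls for a fixed 20 cycles after every 10 instructions, and a streaming thread that never stalls. The function name `simulate` and all the numbers are made up for illustration and do not describe any real CPU.

```python
def simulate(threads):
    """Cycles needed to retire all work on a toy 1-wide core.

    threads: list of (work, stall_every, stall_len) tuples.
      work        -- instructions the thread must retire
      stall_every -- stall after this many instructions (0 = never stalls)
      stall_len   -- length of each stall in cycles (e.g. a cache miss)
    Each cycle the core issues one instruction from the first thread
    that still has work and is not stalled (fixed priority, for simplicity).
    """
    remaining = [work for work, _, _ in threads]
    since_stall = [0] * len(threads)
    blocked_until = [0] * len(threads)
    cycle = 0
    while any(remaining):
        for i, (_, stall_every, stall_len) in enumerate(threads):
            if remaining[i] and blocked_until[i] <= cycle:
                remaining[i] -= 1
                since_stall[i] += 1
                if stall_every and since_stall[i] == stall_every:
                    since_stall[i] = 0
                    blocked_until[i] = cycle + 1 + stall_len
                break  # only one issue slot per cycle in this model
        cycle += 1
    return cycle


# Assumed workloads, purely illustrative.
MISS_HEAVY = (100, 10, 20)   # integer thread with frequent cache misses
STREAMING = (200, 0, 0)      # prefetch-friendly thread that never stalls

if __name__ == "__main__":
    time_sliced = simulate([MISS_HEAVY]) + simulate([STREAMING])
    smt = simulate([MISS_HEAVY, STREAMING])
    print("one after the other:", time_sliced, "cycles")
    print("sharing the core:   ", smt, "cycles")
```

In this model the time-sliced total is simply the sum of the two solo runs (context-switch cost is ignored, and the O/S is assumed not to switch threads on a cache miss), whereas the shared run fills the first thread's stall cycles with the second thread's work. A real SMT core is multi-issue and the threads also compete for caches and execution ports, so the actual gain is smaller and workload dependent.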

Intel chose not to include hyper-threading in Silvermont, but that doesn't mean it will do away with it in high-end Xeon server processors, or even in processors targeted at laptops. Choosing the micro-architecture for a processor involves trade-offs, and there are many considerations:

What is the target market (what kind of applications will run)?
What is the target transistor technology?
What is the performance target?
What is the power budget?
What is the target die size (affects yield)?
Where does it fit in the spectrum of price/performance for the company's future products?
What is the target launch date?
How many resources are available to implement and verify the design? Adding micro-architectural features adds complexity that does not grow linearly: there are subtle interactions with other features, and the goal is to identify as many bugs as possible before the first "tapeout" to minimize how many "steppings" are needed before you have a working chip.

Silvermont's per-core die size budget and power budget precluded having both out-of-order execution and hyper-threading, and out-of-order execution gives better single-threaded performance. Here's Anandtech's assessment:

If I had to describe Intel’s design philosophy with Silvermont it would be sensible scaling. We’ve seen this from Apple with Swift, and from Qualcomm with the Krait 200 to Krait 300 transition. Remember the design rule put in place back with the original Atom: for every 2% increase in performance, the Atom architects could at most increase power by 1%. In other words, performance can go up, but performance per watt cannot go down. Silvermont maintains that design philosophy, and I think I have some idea of how.

Previous versions of Atom used Hyper Threading to get good utilization of execution resources. Hyper Threading had a power penalty associated with it, but the performance uplift was enough to justify it. At 22nm, Intel had enough die area (thanks to transistor scaling) to just add in more cores rather than rely on HT for better threaded performance so Hyper Threading was out. The power savings Intel got from getting rid of Hyper Threading were then allocated to making Silvermont an out-of-order design, which in turn helped drive up efficient use of the execution resources without HT. It turns out that at 22nm the die area Intel would’ve spent on enabling HT was roughly the same as Silvermont’s re-order buffer and OoO logic, so there wasn’t even an area penalty for the move.

Tantara answered 15/4, 2014 at 9:23 Comment(1)

+1 Good point on the decision being per Silvermont only. I've updated my question to reflect that. – Cattalo 15/4, 2014 at 9:33

Not all programmers have the knowledge, time, and other resources needed to write efficient, cache-friendly programs. Many of them don't even know about LTO, PGO, or even which flags to use when compiling distributable binaries. Most of the time only the critical parts are optimized, and only when needed. The other parts may have lots of cache misses.
Even if a program was written with cache efficiency in mind, it may not eliminate cache misses completely. Cache availability is dynamic information known only at runtime, so neither the programmer nor the compiler can rely on it to optimize memory access.
- Cache unpredictability is one of the reasons Itanium failed: while compilers can reorder arithmetic operations, they cannot predict cache behavior in a multithreaded environment well enough to reorder memory loads/stores efficiently.
- Each time there's a cache miss, hundreds of cycles are wasted that could be used for other purposes. Some CPUs do out-of-order execution (OoO), but even OoO execution has its limits, and you'll be blocked at some point. During that time, while waiting for the outstanding memory accesses to complete, the core can switch to another hardware thread and keep running.
As Peter Cordes said, there are other unavoidable stalls, like branch mispredictions or simply low instruction-level parallelism, where OoO doesn't help. There's no way to solve them before runtime.
It's not only Intel that uses SMT. AMD Bulldozer has module multithreading, which is a kind of partial SMT, and AMD moved to full SMT in the Zen microarchitecture. There are still lots of other architectures that use SMT, such as SPARC, MIPS, PowerPC... For example ARM Cavium ThunderX has 4-way SMT. ARM Helios Neoverse E1 and Cortex A65AE also support SMT. IBM z13 has SMT2. There are even CPUs with 8 or 16 threads per core, like the 12-core 96-thread POWER8 CPUs or the SPARC T3. See https://en.wikipedia.org/wiki/Simultaneous_multithreading#Modern_commercial_implementations
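The low-ILP case mentioned above can be sketched with a few lines of Python. This is a minimal model, assuming an idealized machine with unlimited issue width and a made-up 4-cycle load latency; `critical_path` is a hypothetical helper, not part of any library.

```python
def critical_path(latency, deps):
    """Finish time of the last instruction on an ideal, infinitely wide core.

    latency[i] -- cycles instruction i takes once its inputs are ready
    deps[i]    -- indices (all smaller than i) that instruction i depends on
    """
    finish = []
    for i, lat in enumerate(latency):
        # An instruction can start only when all of its inputs are done.
        start = max((finish[d] for d in deps[i]), default=0)
        finish.append(start + lat)
    return max(finish, default=0)


N = 8
LOAD = [4] * N  # assumed 4-cycle load latency, illustrative only

# Linked-list walk: every load needs the pointer produced by the previous load.
linked_list = [[]] + [[i - 1] for i in range(1, N)]

# Loads from addresses known up front (e.g. array elements): no dependencies.
independent = [[] for _ in range(N)]

if __name__ == "__main__":
    print("linked list:", critical_path(LOAD, linked_list), "cycles")
    print("independent:", critical_path(LOAD, independent), "cycles")
```

With the dependent chain, even unlimited hardware finishes only one load per latency period, so most issue slots sit empty no matter how good the OoO engine is. Those empty slots are what a second hardware thread can use.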

Br answered 15/4, 2014 at 10:2 Comment(1)

AMD Bulldozer-family isn't really SMT. It's two separate integer cores sharing the front-end and FPU. It was sometimes described as CMT (Clustered Multi-Threading). The key difference is that it can't use all of its execution resources on a single integer thread, when there isn't enough thread-level parallelism. It's permanently divided, unable to take advantage of lots of ILP in single-threaded integer code. – Hanukkah 13/6, 2021 at 14:57

Regardless of how well your code is written and how well it runs on the machine, there will be relatively long periods of CPU idle time where the CPU is just waiting for something to happen. Cache misses are a subset of the problem; waiting for I/O, user input, etc. can all lead to lengthy stalls in the CPU during which progress can still be made on the second set of registers. Also, there are several causes of cache misses that you can't plan for or around (an example is fetching new instructions after a branch, since your executable probably doesn't all fit into the Level 3 cache).

One of the main reasons that Silvermont moved away from HT is that at 22 nm you have a lot of die area (relatively) to play with. As a result, you can get away with more physical cores for increased parallelism.

ARM and AMD have not implemented hyper threading because it is Intel's proprietary technology.

Disapprobation answered 15/4, 2014 at 9:0 Comment(6)

"ARM and AMD have not implemented hyper threading because it is Intel's proprietary technology". ARM has implemented SMT. There is nothing proprietary about SMT, which is a general architectural concept. The info about the die is interesting, as are your remarks on unavoidable stalls... fair play. +1. – Cattalo 15/4, 2014 at 9:2

AMD has been using SMT for quite a few years. Other architectures also use SMT, most notably SPARC and PowerPC – Br 27/10, 2019 at 5:28

SMT in general is not proprietary to Intel. IBM notably uses it in their POWER CPUs. The first commercial CPU designed for SMT was Alpha EV8 (en.wikipedia.org/wiki/…). (It was cancelled before it was finished, never made it to silicon, but papers about it did still get presented at ISSCC 2002 because there was so much interest in it. See realworldtech.com/ev8-mckinley/.) – Hanukkah 13/6, 2021 at 15:13

To be fair, Intel hired most of the EV8 design team, and they worked on IA-64 Itanium chips for a while. (See the RWT link in my last comment). So if there were patents, Intel probably bought them. Or maybe Intel had their own plans? patents.google.com/patent/US6658447B2/en was filed by Intel in 1997! Years before any SMT silicon was ever released by anyone. (But it cites Tullsen, et al.'s 1996 conference paper on SMT, so maybe the core idea was open...) But anyway, it's clear from the existence of IBM's SMT since POWER5 (2004) that any necessary patents could be licensed. – Hanukkah 13/6, 2021 at 15:19

Anyway, re: the rest of your answer: waiting for I/O, user input, etc - The CPU doesn't busy-wait for those to happen! The OS will actually software context-switch and run something else until the I/O completes, or there is some user input, or whatever, not sit in a busy-wait loop polling the device. (Taking an interrupt does stall the CPU for a long time, though, and I think the other hyperthread can keep executing while that happens.) – Hanukkah 13/6, 2021 at 15:24

Branch mispredicts are one of the best examples of unavoidable slowdowns, along with cache misses. Also simply low amounts of instruction-level parallelism, e.g. in code that traverses a linked list, or naive FP code that has only one dependency chain. – Hanukkah 13/6, 2021 at 15:27

As far as I know, and as I have experienced as a developer in the field of heavy-throughput calculations, SMT/HT has only one useful application; in all others, at best, it doesn't make things worse:

In virtualization, SMT/HT helps reduce the cost of (thread) context switching and thus greatly reduces latency when multiple VMs share the same cores.

But regarding throughput, I have never encountered anything in practice where SMT/HT didn't make things slower. Theoretically, it would be neither slower nor faster if the OS scheduled the processes optimally, but in practice the OS ends up scheduling two demanding processes on the same core because of SMT, which reduces throughput.

So on all machines that are used for high performance calculations we disable HT and SMT. In all our tests they slow down calculation by around 10-20%.
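A less drastic alternative to disabling SMT in firmware, sketched here under the assumption of a Linux machine, is to restrict the heavy processes to one logical CPU per physical core. The sysfs file `thread_siblings_list` and `os.sched_setaffinity` are real Linux/Python interfaces; the helper names are hypothetical, and whether this recovers the lost throughput for a given workload is something to measure rather than assume.

```python
import os
from pathlib import Path


def parse_cpu_list(text):
    """Parse a Linux cpulist string such as '0,4' or '0-1,8-9' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists):
    """Pick the lowest-numbered logical CPU from each group of SMT siblings."""
    return sorted({parse_cpu_list(s)[0] for s in sibling_lists if s.strip()})


def read_sibling_lists(root="/sys/devices/system/cpu"):
    """Read every cpuN/topology/thread_siblings_list file (Linux only)."""
    paths = Path(root).glob("cpu[0-9]*/topology/thread_siblings_list")
    return [p.read_text() for p in paths]


def pin_to_physical_cores():
    """Restrict this process to one logical CPU per core; returns the CPUs used."""
    allowed = os.sched_getaffinity(0)
    targets = [c for c in one_cpu_per_core(read_sibling_lists()) if c in allowed]
    if targets:  # leave affinity untouched if topology could not be read
        os.sched_setaffinity(0, targets)
    return targets
```

Calling `pin_to_physical_cores()` before starting worker processes means the children inherit the restricted affinity, so two demanding workers can no longer land on sibling hyperthreads of the same core.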

If somebody has a real-world example (throughput, not latency) where SMT/HT actually didn't slow things down, I would be very curious.

Aftershock answered 9/1, 2021 at 12:58 Comment(2)

It gives approximately 15% speedup with x265 video encoding (-preset slow at 1080p) on Skylake i7-6700k, DDR4-2666. It's a pretty memory bandwidth intensive workload, but having two threads sharing a core doesn't increase cache misses so much that it actually hurts. (And splitting the work into more threads doesn't lead to much more total work because it scales well.) – Hanukkah 13/6, 2021 at 15:4

It's well-known that HPC code often scales negatively with SMT, if using optimized stuff like BLAS matmuls that are already high-IPC enough to saturate a core with one thread per core, not stalling much. And when competition for limited cache space just makes everything worse. Code that isn't so well tuned, and/or isn't so bottlenecked on cache / memory, can often benefit significantly. e.g. code that stalls a lot on branch misses or latency of dependency chains can leave a lot of unused execution resources every clock cycle. For example, big compile jobs, like make -j... scale well. – Hanukkah 13/6, 2021 at 15:7

After using the 8-core Atoms with virtualization, I salivate over the prospect of such a chip with HT. I agree that most workloads may not benefit, but with ESXi? You get truly impressive use of HT. The low power consumption just seals the deal on them for me. If you could get 16 logical cores on ESXi, the price/performance would be truly through the roof. I mean, there is no way to afford the current Intel chips with 8 cores and HT, and because of the way vSphere and products for vSphere are licensed per processor, dual-processor hosts just don't make sense anymore cost-wise for true small businesses.

Momentum answered 19/6, 2015 at 16:38 Comment(1)

Welcome to the site and thanks for your interest. However, you should have left this as a comment, since that's all it is. It is not an answer to the question posited. – Cattalo 19/6, 2015 at 19:47
