Is duplication of state resources considered optimal for hyper-threading?

This question has an answer that says:

Hyper-threading duplicates internal resources to reduce context switch time. Resources can be: Registers, arithmetic unit, cache.

Why did CPU designers end up with duplication of state resources for simultaneous multithreading (or hyper-threading on Intel)?

Why wouldn't tripling (quadrupling, and so on) those same resources give us three logical cores and, therefore, even higher throughput?

Is the duplication that researchers arrived at in some sense optimal, or is it just a reflection of current possibilities (transistor size, etc.)?

Nihhi asked 2/3, 2016 at 13:16. Comments (7):
Yeah, engineers at Intel asked themselves the same questions ten years ago. (Aurelio)
And then they would have run their simulations, etc., and figured out which of the design alternatives would give the best performance. Can we get any real insight into this? No! It would be highly commercially sensitive information. (Vancevancleave)
Ideal thread count per core is workload-dependent. Intel's Xeon Phi (which targets HPC workloads otherwise targeted by GPGPU) provides four threads per core. Oracle's M5 (targeting server workloads, especially databases) provides eight threads per core, as does IBM's POWER8 (which has more robust ILP exploitation). Intel's mainstream processors (non-Atom/non-Phi) still place significant emphasis on personal-computer workloads. Current hardware and software interfaces also limit the benefit of higher thread counts (in addition to inherent tradeoffs in size, complexity, sharing, etc.). (Outride)
The lower sharing of multicore provides several advantages (avoiding cache conflicts, communication overhead [by assuming communication is not the common case, whereas multithreaded architectures are more optimized for frequent communication], etc.). (Outride)
@PaulA.Clayton I guess it follows from what you say that having more than two threads per core, as in the processors you mention, means that they do have 4, 8, etc. copies of state resources to support the thread count? (Nihhi)
The architectural state must be replicated, but it does not need to have uniform access characteristics. For example, POWER8 uses the main register file as a cache of sorts, where some register state is spilled/filled to/from additional storage, and Itanium used "temporal banking"/a 3D register file to support Switch-on-Event-MultiThreading. Making a distinction between a virtual processor and a thread allows some reduction in architectural state per thread. Another technique (not yet used, AFAIK) would be using otherwise unused FP/SIMD register storage. The design space is huge. (Outride)
OK, makes sense, thanks. Also, "...reduction in architectural state per thread" is an illuminating point (that it's a possibility, and therefore a goal). (Nihhi)

The answer you're quoting sounds wrong. Hyperthreading competitively shares the existing ALUs, cache, and physical register file.

Running two threads at once on the same core lets it find more parallelism to keep those execution units fed with work instead of sitting idle waiting for cache misses, latency, and branch mispredictions. (See Modern Microprocessors A 90-Minute Guide! for very useful background, and a section on SMT. Also this answer for more about how modern superscalar / out-of-order CPUs find and exploit instruction-level parallelism to run more than 1 instruction per clock.)
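
To make that concrete, here's a minimal sketch (mine, not from the linked guide) of the kind of latency-bound loop SMT is designed around: a dependent pointer chase spends most of its time stalled on cache misses, leaving the core's execution units idle for a sibling hyperthread to use.

    #include <stdint.h>

    struct node { struct node *next; long pad[7]; };  /* one node per 64-byte line */

    /* Dependent pointer chase: each load's address comes from the previous
     * load, so nothing overlaps. On a cache miss the out-of-order core has
     * almost nothing to do for hundreds of cycles -- exactly the idle
     * capacity a second hyperthread can soak up. */
    long chase(struct node *p, long steps) {
        while (steps-- > 0)
            p = p->next;
        return (long)(uintptr_t)p;  /* keep the result live */
    }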

Only a few things need to be physically replicated or partitioned to track the architectural state of two CPUs in one core, and it's mostly in the front-end (before the issue/rename stage). David Kanter's Haswell writeup shows how Sandybridge always partitioned the IDQ (decoded-uop queue that feeds the issue/rename stage), but IvyBridge and Haswell can use it as one big queue when only a single thread is active. He also describes how cache is competitively shared between threads. For example, a Haswell core has 168 physical integer registers, but the architectural state of each logical CPU only needs 16. (Out-of-order execution for each thread of course benefits from lots of registers; that's why register renaming onto a big physical register file is done in the first place.)

Some things are statically partitioned, like the ROB, to stop one thread from filling up the back-end with work dependent on a cache-miss load.


Modern Intel CPUs have so many execution units that you can only barely saturate them with carefully tuned code that doesn't have any stalls and runs 4 fused-domain uops per clock. This is very rare in practice, outside something like a matrix multiply in a hand-tuned BLAS library.

Most code benefits from HT because it can't saturate a full core on its own, so the existing resources of a single core can run two threads at faster than half speed each. (Usually significantly faster than half).
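
If you want to see that on your own machine, here's a rough sketch (mine; Linux-specific, and it assumes logical CPUs 0 and 1 are SMT siblings of one physical core, which you should verify in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list before trusting any numbers):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* A serial chain of integer multiply-adds: one thread alone can't come
     * close to saturating a modern core's execution units. */
    static void *worker(void *arg) {
        (void)arg;
        volatile unsigned long x = 1;
        for (unsigned long i = 0; i < 2000000000UL; i++)
            x = x * 2862933555777941757UL + 3037000493UL;
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) {
            pthread_create(&t[i], NULL, worker, NULL);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(i, &set);  /* pin to logical CPU i: 0 and 1 assumed siblings */
            pthread_setaffinity_np(t[i], sizeof set, &set);
        }
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Time it against a single-thread run: with a dependency-bound loop like this, the two-thread version should take barely longer than the single-thread one, i.e. each thread runs at well above half speed.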

But when only a single thread is running, the full power of a big core is available for that thread. This is what you lose out on if you design a multicore CPU that has lots of small cores. If Intel CPUs didn't implement hyperthreading, they would probably not include quite so many execution units for a single thread. It helps for a few single-thread workloads, but helps a lot more with HT. So you could argue that it is a case of replicating ALUs because the design supports HT, but it's not essential.

Pentium 4 didn't really have enough execution resources to run two full threads without losing more than you gained. Part of this might be the trace cache, but it also didn't have nearly as many execution units. P4 with HT made it useful to use prefetch threads that do nothing but prefetch data from an array the main thread is looping over, as described/recommended in What Every Programmer Should Know About Memory (which is otherwise still useful and relevant). A prefetch thread has a small trace-cache footprint and fetches into the L1D cache used by the main thread. This is what happens when you implement HT without enough execution resources to really make it good.
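
A prefetch thread looks roughly like this (my sketch of the idea, not Drepper's code; a real version would stay only a bounded distance ahead of the consumer instead of free-running):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stddef.h>

    #define ARRAY_BYTES (64u << 20)  /* 64 MiB: far larger than any cache level */
    static char data[ARRAY_BYTES];
    static atomic_int done;

    /* Helper thread: does nothing but prefetch. Its code footprint in the
     * P4's trace cache is tiny, while its loads warm the L1D/L2 shared with
     * the main thread on the same physical core. */
    static void *prefetcher(void *unused) {
        (void)unused;
        while (!atomic_load_explicit(&done, memory_order_relaxed))
            for (size_t i = 0; i < ARRAY_BYTES; i += 64)  /* one line per step */
                __builtin_prefetch(&data[i], 0, 1);
        return NULL;
    }

    long sum(void) {
        pthread_t helper;  /* ideally pinned to the main thread's HT sibling */
        pthread_create(&helper, NULL, prefetcher, NULL);
        long s = 0;
        for (size_t i = 0; i < sizeof data; i++)
            s += data[i];
        atomic_store(&done, 1);
        pthread_join(helper, NULL);
        return s;
    }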


HT doesn't help at all for code that achieves very high throughput with a single thread per physical core. For example, saturating the front-end bandwidth of 4 uops / clock cycle without ever stalling.

Or if your code only bottlenecks on a core's peak FMA throughput or something (keeping 10 FMAs in flight with 10 vector accumulators). It can even hurt for code that ends up slowing down a lot from extra cache misses caused by competing for space in the L1D and L2 caches with another thread. (And also the uop cache and L1I cache).

Saturating the FMAs and doing something with the results typically takes some instructions other than vfma... so high-throughput FP code is often close to saturating the front-end as well.
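
For concreteness, here's a hedged sketch of that kind of loop (my example, using AVX2/FMA intrinsics; compile with something like -O2 -march=haswell):

    #include <immintrin.h>
    #include <stddef.h>

    /* Dot product with 10 independent accumulators. One accumulator would
     * serialize on FMA latency (5 cycles on Haswell); 10 chains = latency (5)
     * times throughput (2 per clock) keeps both FMA units busy, at which
     * point the 2 loads per FMA make the load ports the next bottleneck.
     * n must be a multiple of 80 floats in this sketch. */
    float dot(const float *a, const float *b, size_t n) {
        __m256 acc[10];
        for (int k = 0; k < 10; k++)
            acc[k] = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 80)  /* 10 vectors of 8 floats */
            for (int k = 0; k < 10; k++)
                acc[k] = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8*k),
                                         _mm256_loadu_ps(b + i + 8*k),
                                         acc[k]);
        for (int k = 1; k < 10; k++)  /* reduce the independent chains */
            acc[0] = _mm256_add_ps(acc[0], acc[k]);
        __m128 v = _mm_add_ps(_mm256_castps256_ps128(acc[0]),
                              _mm256_extractf128_ps(acc[0], 1));
        v = _mm_hadd_ps(v, v);
        v = _mm_hadd_ps(v, v);
        return _mm_cvtss_f32(v);
    }

A loop like this leaves almost nothing for a second hyperthread: the FMA units, load ports, and front-end are all close to saturated already.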

Agner Fog's microarch pdf says the same thing about very carefully tuned code not benefiting from HT, or even being hurt by it.

Paul Clayton's comments on the question also make some good points about SMT designs in general.


If you have different threads doing different things, SMT can still be helpful. E.g., high-throughput FP code sharing a core with a thread that does mostly integer work and stalls a lot on branch mispredicts and cache misses could gain significant overall throughput. The low-throughput thread leaves most of the core unused most of the time, so running another thread that uses the other 80% of a core's front-end and back-end resources can be very good.

Shut answered 11/10, 2016 at 18:21
