Hyperthreading vs. Superscalar execution - McMap



Asked 11/4, 2019 at 15:48 Answered 12/4, 2019 at 10:19

Solved cpu hyperthreading superscalar


Imagine a CPU (or core) that is superscalar (multiple execution units) and also has hyperthreading (SMT) support.

Why is the number of software threads the CPU can truly execute in parallel typically given by the number of logical cores (i.e. so-called hardware threads) it possesses, and not the total number of execution units it has?
If my understanding is correct, SMT doesn't actually enable true parallel execution; it instead simply makes context switching much faster/more efficient by duplicating certain parts of the CPU (those that store the architectural state, but not the main execution resources). On the other hand, superscalar architecture allows true simultaneous execution of multiple instructions per clock cycle, because the CPU has multiple execution units, i.e. multiple parallel pipelines which can each process a separate thread, in true parallel fashion.

So for example, if a CPU has 2 cores, and each core has 2 execution units, shouldn't its hardware concurrency (the number of threads it can truly execute in parallel) be 4? Why is its hardware concurrency instead given by the number of logical cores, when SMT doesn't actually enable true parallel execution?

Amritsar answered 11/4, 2019 at 15:48 Comment(0)


You can't just slam instructions into the execution units.
If you want 2-way SMT you need to keep two architectural states and fetch two instruction streams.

If a company has 100 developers but only two project managers, it can only develop two projects in parallel (but it can concurrently develop more if it makes the PMs switch projects each day or so).

If a CPU can fetch only from two instruction streams (keeping only two thread contexts) you can assign it only two threads to execute in parallel.
You can however make a time-division and execute more threads concurrently.
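
To illustrate (a minimal sketch, not part of the original answer): the number software normally sees is the count of logical CPUs, i.e. hardware thread contexts, not execution units. In Python, `os.cpu_count()` reports that count. The `task` function and the factor of 2 below are made up for the example. Nothing stops you from creating more software threads than that; the OS time-slices them over the available contexts. Note that CPython's GIL serializes the bytecode here, so this only shows that oversubscription is allowed, not any speedup.

```python
import os
import threading

# Number of logical CPUs (hardware thread contexts) visible to the OS.
# os.cpu_count() may return None if it cannot be determined.
logical_cpus = os.cpu_count() or 1

results = {}

def task(n):
    # Trivial stand-in for real work (hypothetical).
    results[n] = n * n

# Deliberately start twice as many software threads as logical CPUs.
# They all run concurrently; at most `logical_cpus` can run at any instant.
threads = [threading.Thread(target=task, args=(n,))
           for n in range(2 * logical_cpus)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"logical CPUs: {logical_cpus}, software threads run: {len(threads)}")
```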

Software has no direct access to the execution units; that would be circular (the software needs the EUs to execute, but the EUs need the software to execute).
The CPU will try to use as many of the EUs as possible, exploiting out-of-order execution and speculating on anything it can.
Actually, hyper-threading is just a way to keep all the resources busy (like sharing a developer with another PM when they have little to do).

But if all else fails and an EU goes unused, that potential unit of work is simply wasted.

Puiia answered 12/4, 2019 at 10:19 Comment(9)

Thanks for the quick answer. Are you saying an EU is not a complete, stand-alone execution pipeline for the CPU? Sadly, I haven't been able to find a clear explanation of what an EU is precisely. Some sources I've read say it's a single internal unit such as an ALU or FPU, while others say it refers to a group of internal components (e.g. an internal control sequence unit, registers, an ALU+FPU, and so on; in other words, a complete pipeline into which instructions can be slammed). What's your take on this? Thanks again. – Amritsar 13/4, 2019 at 14:51

@Amritsar Both definitions are fundamentally equivalent. An EU is a circuit that takes inputs (operands) and produces a result. In both cases the inputs of an EU are the operands (constants, architectural registers), not instructions; that's why you still need everything else around the EUs. There's no precise definition of an EU: the part of the CPU that actually transforms inputs into outputs is called an EU. Being a synchronous network of gates, an EU can have internal (invisible) registers, sparse control logic and so on, and it is usually pipelined. But it cannot fetch, decode and retire instructions. – Puiia 13/4, 2019 at 19:50
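
A toy sketch of that distinction (my own illustration, not from the thread): the "execution units" below are plain functions from operands to a result, and everything that deals with instructions (fetch, decode, write-back) lives outside them. The instruction format, the register names and the `run` helper are all hypothetical.

```python
import operator

# Hypothetical execution units: each maps operands to a result.
# They know nothing about instructions, only about values.
EXECUTION_UNITS = {"add": operator.add, "mul": operator.mul}

def run(program, registers):
    """Fetch, decode, execute and retire; only the execute step uses an EU."""
    for instr in program:                              # fetch
        op, dst, src1, src2 = instr.split()            # decode
        result = EXECUTION_UNITS[op](registers[src1],  # execute on an EU
                                     registers[src2])
        registers[dst] = result                        # retire (write back)
    return registers

regs = run(["add r2 r0 r1", "mul r3 r2 r2"], {"r0": 2, "r1": 3})
print(regs)
```

Adding more entries to `EXECUTION_UNITS` would not let `run` follow a second program; that would need a second instruction stream and a second register set, which is what an extra hardware thread provides.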

Ok, sounds like EU refers to just an ALU/FPU/etc., instead of full-blown pipeline comprising ALU/FPU/etc but also several other things outside said ALU/FPU, and which can process instructions. Your definition would also explain why an EU doesn't = an additional hardware thread, which was my original Q. Still, it's interesting that the # logical cores is the measure of hardware concurrency, since SMT doesn't enable true simultaneous execution. I guess it's close enough, since there's actual hw supporting a hw thread (allowing storage of archi state and lightning fast context switch)? – Amritsar 14/4, 2019 at 21:54

@Amritsar Actually it is. SMT is defined either as parallel execution or whole parallel pipelines. On Intel's CPUs only the first stages of the front-end (probably the fetch and the pre-decode only) are not truly parallel (a mix of fine-grained Vertical Threading and coarse-grained threading). Later stages are processing the instructions of both threads in parallel. This is thanks to the various queues in the pipeline (e.g. the IQ and the IDQ for the front-end). What is happening is that a single 4/6/7-way pipeline is handling the instructions of two threads independently of the source. – Puiia 15/4, 2019 at 14:11

Interesting...but if SMT can process both threads in true parallel fashion, in other words have two whole parallel pipelines as you say, then why is SMT typically still significantly slower than simply having more physical cores? I.e. 4 logical cores across 2 physical cores is always slower than 4 physical cores with no SMT. Thanks again. – Amritsar 16/4, 2019 at 22:46

@Amritsar Hyperthreading doesn't have two full pipelines; it just uses a single pipeline for both threads. In a superscalar CPU a single pipeline can execute more than one instruction in parallel. With hyperthreading, the instructions executed in parallel can come only from the first thread, only from the second thread, or from both. – Puiia 17/4, 2019 at 10:37
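
A very simplified model of that idea (an illustrative sketch only, with made-up numbers, not how any real scheduler works): one core with a fixed issue width picks ready instructions each cycle from whichever hardware threads it holds. The per-cycle "ready" counts, the width of 4 and the helper name are assumptions for the example; the model just caps each cycle at the width and does not carry leftover work forward.

```python
def issued_per_cycle(ready_counts, width):
    """Issue slots filled each cycle by one core.

    ready_counts: one list per hardware thread; entry i is how many
    independent instructions that thread could issue in cycle i.
    width: maximum instructions the single shared pipeline issues per cycle.
    """
    return [min(width, sum(cycle)) for cycle in zip(*ready_counts)]

WIDTH = 4  # hypothetical 4-wide core

# Made-up per-cycle ready counts; 0 models a stall (e.g. a cache miss).
thread_a = [1, 0, 3, 1, 0, 2]
thread_b = [2, 2, 0, 1, 3, 4]

alone_a = issued_per_cycle([thread_a], WIDTH)        # core runs only thread A
alone_b = issued_per_cycle([thread_b], WIDTH)        # core runs only thread B
smt = issued_per_cycle([thread_a, thread_b], WIDTH)  # 2-way SMT on one core

total_slots = WIDTH * len(thread_a)
print("A alone:", sum(alone_a), "of", total_slots, "slots")
print("B alone:", sum(alone_b), "of", total_slots, "slots")
print("SMT    :", sum(smt), "of", total_slots, "slots")
```

In this model the SMT run fills more slots than either thread alone, yet in the last cycle the two threads together have 6 ready instructions and only 4 can issue. That contention for one shared pipeline is, roughly, why two logical cores on one physical core do not match two separate physical cores.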

Ah ok, I misunderstood your comment. I think I understand it correctly now. With SMT, a single pipeline can "see" and choose from both threads (due to having both archi states) when cramming instructions into it, and this makes superscalar CPUs more efficient b/c it allows the CPU to do more each cycle. In other words, more (almost all) of the CPU's internal parts/resources can be utilized per cycle than without SMT. – Amritsar 17/4, 2019 at 14:5

The diagram in Paul Jakubik's answer on quora is pretty illustrative imo, which shows SMT as filling in the CPU's "bubbles", or "things to do" each cycle. Now that I better understand what an EU is, I imagine one of the "things to do" might be using an EU, e.g. calculating the result of a mathematical operation by using its ALU (roughly speaking). Thanks for your help! – Amritsar 17/4, 2019 at 14:10

Thank you for your answers. Most important part that removed all my confusion: "On Intel's CPUs only the first stages of the front-end ... are not truly parallel ... Later stages are processing the instructions of both threads in parallel." – Renewal 1/12, 2021 at 13:46
