Imagine a CPU (or core) that is superscalar (multiple execution units) and also has hyperthreading (SMT) support.
Why is the number of software threads the CPU can truly execute in parallel typically given by the number of logical cores (i.e. so-called hardware threads) it possesses, and not the total number of execution units it has?
If my understanding is correct, SMT doesn't actually enable true parallel execution; instead, it simply makes context switching much faster and more efficient by duplicating certain parts of the CPU (those that store the architectural state, but not the main execution resources). On the other hand, a superscalar architecture allows true simultaneous execution of multiple instructions per clock cycle, because the CPU has multiple execution units, i.e. multiple parallel pipelines which can each process a separate thread, in true parallel fashion.
So for example, if a CPU has 2 cores, and each core has 2 execution units, shouldn't its hardware concurrency (the number of threads it can truly execute in parallel) be 4? Why is its hardware concurrency instead given by the number of logical cores, when SMT doesn't actually enable true parallel execution?
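To make the two counts I'm comparing concrete, here is a minimal Python sketch. `os.cpu_count()` is the standard-library call that reports the number of logical CPUs the OS sees; the helper names and the 2-core, 2-unit, 2-way-SMT figures are my own illustrative assumptions, not properties of any particular CPU.

```python
import os


def logical_cpu_count() -> int:
    """Number of logical CPUs reported by the OS (falls back to 1 if unknown)."""
    count = os.cpu_count()  # may return None on some platforms
    return count if count is not None else 1


def count_by_execution_units(cores: int, units_per_core: int) -> int:
    """The figure I would have expected: one thread per execution unit."""
    return cores * units_per_core


def count_by_logical_cores(cores: int, smt_ways: int) -> int:
    """The figure that is actually reported: cores times hardware threads per core."""
    return cores * smt_ways


if __name__ == "__main__":
    # Hypothetical CPU from the example above: 2 cores, 2 execution units each.
    print("expected from execution units:", count_by_execution_units(2, 2))
    # Same hypothetical CPU, assuming 2-way SMT on each core.
    print("expected from logical cores:  ", count_by_logical_cores(2, 2))
    # What this machine actually reports (varies by machine).
    print("reported by the OS:           ", logical_cpu_count())
```

In this made-up case both formulas happen to give 4, but as far as I can tell the reported number follows the second formula and is unaffected by how many execution units each core has, which is exactly what I don't understand.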