Why would a fully CPU bound process work better with hyperthreading?

Given:

  • An entirely CPU-bound, long-running job (i.e., one taking far more than a few CPU cycles), and
  • A CPU with four physical cores and eight logical cores in total,

should 8, 16, or 28 threads perform better than 4? I understand that four threads would have fewer context switches to perform and less overhead than 8, 16, or 28 threads on a four-physical-core machine. However, the timings are:

Threads    Time Taken (in seconds)
   4         78.82
   8         48.58
   16        51.35
   28        52.10

The code used to get the timings is given in the Original Question section below. The CPU specifications are also given at the bottom.


Original Question

What does it mean when we say

Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system

?

This question was asked on SO today, and it tests the performance of multiple threads doing the same work. It has the following code:

private static void Main(string[] args)
{
    int threadCount;
    if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
        threadCount = Environment.ProcessorCount;

    int load;
    if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
        load = 1;

    Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);
    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < threadCount; i++)
    {
        int i1 = i;
        threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
    }

    var timer = Stopwatch.StartNew();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    timer.Stop();

    Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds/1000.0);
}

static void DoWork(int seed, int threadCount, int load)
{
    var mtx = new double[3,3];
    for (var i = 0; i < ((10000000 * load) / threadCount); i++)
    {
        // Note: the snippet as posted used an undefined variable j here;
        // the loop index i is presumably what was intended.
        mtx = new double[3,3];
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                mtx[k, l] = Math.Sin(i + (k * 3) + l + seed);
    }
}

(I have cut out a few braces to bring the code to a single page for quick readability.)

I ran this code on my machine to replicate the issue. My machine has four physical cores and eight logical ones. The method DoWork() in the code above is completely CPU bound. I expected hyper-threading to contribute maybe a 30% speedup (because we have as many CPU-bound threads as physical cores, i.e., 4), but it attains nearly a 64% performance gain. When I ran this code with four threads, it took about 82 seconds; with 8, 16, or 28 threads, it ran in about 50 seconds in every case.

To summarize the timings:

Threads    Time Taken (in seconds)
   4         78.82
   8         48.58
   16        51.35
   28        52.10

I could see that CPU usage was ~50% with four threads. Shouldn't it be ~100%? After all, my processor has only four physical cores. And the CPU usage was ~100% for 8 and 16 threads.

I am trying to understand: why would a fully CPU-bound process work better with hyper-threading?


For the sake of completeness:

  • I have an Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 Core(s), 8 Logical Processor(s).
  • I ran the code in Release mode.
  • I know that the way the timings are measured is bad; it only gives the time of the slowest thread. I took the code as-is from the other question. (A sketch of per-thread timing follows below.)
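
For completeness, a minimal sketch (mine, not from the original post) of a Main that times each thread individually, reusing the DoWork method above:

private static void Main(string[] args)
{
    int threadCount = Environment.ProcessorCount;
    int load = 1;

    var times = new double[threadCount];
    var threads = new List<Thread>();
    for (int i = 0; i < threadCount; i++)
    {
        int i1 = i;
        threads.Add(new Thread(() =>
        {
            // Each thread records its own completion time.
            var sw = Stopwatch.StartNew();
            DoWork(i1, threadCount, load);
            times[i1] = sw.Elapsed.TotalSeconds;
        }));
    }

    foreach (var t in threads) t.Start();
    foreach (var t in threads) t.Join();

    for (int i = 0; i < threadCount; i++)
        Console.WriteLine("Thread {0}: {1:F2} seconds", i, times[i]);
}
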
Vet answered 11/9, 2015 at 19:48 Comment(21)
This looks like a well-written question; won't be surprised if it reaches the hot questions listDiscommodity
When you ran this with 4 threads, did you try binding the threads to different physical cores?Benyamin
@Mysticial: Would the OS really be so incompetent at load distribution that it maps the 4 threads to only 2 cores, especially when the other two are idle?Vet
@Vet Don't underestimate the stupidity of Windows. :D In all seriousness, I don't believe Windows was aware of HT until Win7. And even then, the scheduler bounces things around so much that it's hard to see if it's actually doing the right thing. So it's worth giving it a shot. Then you have power cycling. A core that's unused will clock down, and it takes time to clock back up. On my 5960X (8 cores), if I don't bind single-threaded benchmarks to a particular core, they run 30% slower. Multi-threaded stuff is unaffected since it pegs all cores at 100%.Benyamin
And how did you measure cpu utilization (that 50%, 100%)?Revanchism
If you're on Windows, all the even core numbers are on different cores. 0, 2, 4, 6 are on different cores. (0 and 1 are the same core, 2 and 3 are the same core, etc...)Benyamin
@Evk: I eyeballed Windows Task Manager myself: for ~52 seconds in the 8-, 16-, and 28-thread scenarios and for ~80 seconds in the 4-thread scenario.Vet
Asked because I wanted to say that it will not report more than 50% in your test, but Peter Duniho already explained that.Revanchism
Dependent FPU operations have non-zero latencies, and there may not be enough independent instructions for the out-of-order engine to fill the latencies and saturate the execution units. Therefore, the more threads there are, the better the pipeline saturation, even for 100% CPU-bound loads.Lower
Can you add a table of the results you got for each thread count? It is really confusing to gather the information from all those places in this question.Foumart
Context switches are rather cheap. Compared to a typical scheduler quantum of 10-40ms the switch is extremely cheap (3 microseconds?!). The savings that HT brings in such situations do not come from less scheduling.Foumart
@usr: I have put the exact timings that I got in the question now, if that helps.Vet
The results make sense to me. A typical HT outcome. HT usually provides more than 0 and less than 100% gain. It depends on the workload how much it is. You can contrive workloads to achieve 0 and workloads to achieve high numbers.Foumart
@usr: What I'm curious about is that with 4 physical cores and the end-to-end CPU bound jobs, how can we get 60% speed-up? If it takes t time to run 4 jobs on 4 physical cores, shouldn't it take 4t time to run 16 such jobs on 4 physical cores?Vet
It's less than 4t because of hyper threading. That's the point of HT. Throughput still increases between 4 and 8. After that no gains but added scheduling overhead.Foumart
@usr: It should be less than 4t is understandable to some extent. What explains the fact that it is less than t?Vet
It's less than t because you weren't saturating the processor cores in the first place. You had idle computational resources in the 4-thread test, you have increased only the number of threads working and not the workload (note that the DoWork() method scales the workload for each method invocation inversely according to the number of threads), and so adding new threads results in shorter run times. Note that once you go past the number of logical cores (8), runtime increases again; you are adding scheduling overhead without adding any new computational resources in that case.Patchwork
@PeterDuniho: Now that you have pointed it out, it seems so obvious. Thanks. I'll mark your answer as accepted.Vet
I have extended my answer with a context-switch explanation. In fact, the earlier answer by @Xirema already explains why 8 threads on 4 physical cores do not need context switches, though that answer is not entirely correct (both threads are active at once with HT).Aglow
@usr: You said that savings in HT do not come from less scheduling. What is then the primary source of increase in efficiency in HT?Vet
@Vet better use of execution units in the CPU when mixed workloads are merged that way. Also more outstanding memory loads, which helps with latency and memory utilization.Foumart

I could see that CPU usage was ~50% with 4 threads. Shouldn't it be ~100%?

No, it shouldn't.

what is the justification for 50% CPU usage when running 4 CPU bound threads on a 4 physical core machine?

This is simply how CPU utilization is reported in Windows (and on at least some other OSes, by the way). A hyper-threaded core shows up as two cores to the operating system and is reported as such.

Thus, Windows sees an eight-core machine when you have four HT cores. You'll see eight different CPU graphs on the "Performance" tab in Task Manager, and total CPU utilization is computed with 100% being the full utilization of those eight cores.

If you are only using four threads, then these threads cannot fully utilize the available CPU resources and that explains the timings. They can, at most, use four of the eight cores available and so of course your utilization will max out at 50%. Once you go past the number of logical cores (8), runtime increases again; you are adding scheduling overhead without adding any new computational resources in that case.
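
A quick way to see the denominator Windows uses is Environment.ProcessorCount, which counts logical processors:

// On a 4-physical-core machine with hyper-threading, this prints 8.
Console.WriteLine("Logical processors: {0}", Environment.ProcessorCount);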


By the way…

HyperThreading has improved quite a lot from the old days of shared cache and other limitations, but it will still never provide the same throughput benefit that a full CPU could, as there remains some contention within the CPU. So even ignoring OS overhead, your 35% improvement in speed seems pretty good to me. I often see no more than a 20% speed-up adding the extra HT cores to a computationally-bottlenecked process.

Patchwork answered 11/9, 2015 at 20:50 Comment(4)
Your last para is my question. Why am I seeing a gain as much as 64% when I shouldn't be getting something close to 20%? Am I wrong in some calculations or what?Vet
Improvement can vary greatly according to the exact nature of the workload. That said, I definitely disagree with the idea that the only source of speedup is avoiding context switching. HT CPUs are doing real work, adding to true parallelism in computations by taking advantage of parallelized CPU components.Patchwork
You apparently have a scenario where your process is better able to make use of what would be otherwise unused CPU resources; different threads could be making use of different stages in the same core, or it could be that resources normally used for speculative execution are able to be used for the extra thread instead.Patchwork
To be fair, Windows definitely knows it is running a 4 physical core and 8 logical core box - rather than "just seeing 8 cores". The logical thread topology is readily available to the OS and indeed is critical for making good scheduling decisions (e.g., using all physical cores before scheduling a second thread on the same core). Windows (and other OSes) just choose to report CPU use with a denominator of logical cores. This often gives a misleading value, but it's hard to see a better way to do it.Knout

CPU pipeline

Each instruction has to go through several steps in the pipeline to be fully executed. At the very least, it must be decoded, sent to an execution unit, and actually executed there. There are several execution units on modern CPUs, and they can execute instructions fully in parallel. The execution units are not interchangeable, by the way: some operations can only be done on a single execution unit. For example, memory loads are usually specialized to one or two units, memory stores are exclusively sent to another unit, and all calculations are done by yet other units.

Knowing about the pipeline, we may wonder: how can the CPU work so fast if we write purely sequential code and each instruction has to go through so many pipeline stages? Here is the answer: the processor executes instructions out of order. It has a large reorder buffer (e.g., around 200 instructions) and pushes many instructions through its pipeline in parallel. If at any moment some instruction cannot be executed for any reason (it waits for data from slow memory, depends on another instruction not yet finished, whatever), it is delayed for some cycles. During this time the processor executes newer instructions, located after the delayed ones in our code, provided they do not depend on the delayed instructions in any way.

Now we can see the problem of latency. Even if an instruction is decoded and all of its inputs are already available, it still takes several cycles to execute completely. This delay is called instruction latency. However, we know that at that moment the processor can execute many other, independent instructions, if there are any.

If an instruction loads data from the L2 cache, it has to wait about 10 cycles for the data to arrive. If the data is located only in RAM, it takes hundreds of cycles to load into the processor. In this case we say the instruction has high latency. For maximum performance, it is important to have some other, independent operations to execute in the meantime. This is sometimes called latency hiding.

In the end, we have to admit that most real code is sequential in nature. It has some independent instructions to execute in parallel, but not many. Having no instructions to execute causes pipeline bubbles and leads to inefficient use of the processor's transistors. On the other hand, instructions of two different threads are automatically independent in almost all cases. This leads us directly to the idea of hyper-threading.
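
To make the latency-hiding point concrete, here is a small C# sketch (my own illustration, not part of the original answer). Both loops perform the same additions, but the second keeps four independent dependency chains in flight, so the out-of-order core can overlap the add latencies and typically finishes noticeably faster:

using System;
using System.Diagnostics;

class LatencyDemo
{
    static void Main()
    {
        var a = new double[4096];
        for (int i = 0; i < a.Length; i++) a[i] = Math.Sin(i);
        const int reps = 100000;

        // One accumulator: every addition depends on the previous one,
        // so throughput is bounded by the latency of a double add.
        var sw = Stopwatch.StartNew();
        double s = 0;
        for (int r = 0; r < reps; r++)
            for (int i = 0; i < a.Length; i++)
                s += a[i];
        Console.WriteLine("Dependent:   {0} ms (sum={1})", sw.ElapsedMilliseconds, s);

        // Four independent accumulators: the out-of-order engine can keep
        // several additions in flight at once, hiding most of the latency.
        sw.Restart();
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int r = 0; r < reps; r++)
            for (int i = 0; i < a.Length; i += 4)
            {
                s0 += a[i];
                s1 += a[i + 1];
                s2 += a[i + 2];
                s3 += a[i + 3];
            }
        Console.WriteLine("Independent: {0} ms (sum={1})", sw.ElapsedMilliseconds, s0 + s1 + s2 + s3);
    }
}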

P.S. You might want to read Agner Fog's manual to better understand internals of modern CPUs.

Hyper-threading

When two threads execute in hyper-threading mode on a single core, the processor can interleave their instructions, filling bubbles from the first thread with instructions from the second. This allows better utilization of the processor's resources, especially in the case of ordinary programs. Note that HT may help not only when you have a lot of memory accesses, but also in heavily sequential code. A well-optimized computational code may fully utilize all resources of the CPU, in which case you will see no profit from HT (e.g., the dgemm routine from a well-optimized BLAS).

P.S. You might want to read Intel's detailed explanation of hyper-threading, including info about which resources are duplicated or shared, and discussion about performance.

Context switches

The context is the internal state of the CPU, which at the very least includes all the registers. When the executing thread changes, the OS has to do a context switch (detailed description here). According to this answer, a context switch takes about 10 microseconds, while the scheduler's time quantum is 10 milliseconds or more (see here). So context switches do not affect total time much, because they happen rarely enough. Note that competition for CPU caches between threads can increase the effective cost of switches in some cases.

However, in the case of hyper-threading, each core has two states internally: two sets of registers, but shared caches and one set of execution units. As a result, the OS has no need to do any context switches when you run 8 threads on 4 physical cores. When you run 16 threads on a quad-core, the context switches are performed, but they take a small part of the overall time, as explained above.

Process manager

As for the CPU utilization you see in the process manager: it does not measure the internals of the CPU pipeline. Windows can only notice when a thread returns execution to the OS in order to sleep, wait for a mutex, wait for the HDD, or do other slow things. As a result, it considers a core fully used if there is a thread working on it that does not sleep or wait for anything. For instance, you can check that running an endless loop while (true) {} leads to full reported utilization of the CPU.
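
For example, this sketch (my illustration) pegs every logical core with such a loop; Task Manager should report ~100% total utilization while it runs:

using System;
using System.Threading;

class Peg
{
    static void Main()
    {
        // One busy-spinning background thread per logical processor.
        for (int i = 0; i < Environment.ProcessorCount; i++)
            new Thread(() => { while (true) { } }) { IsBackground = true }.Start();

        Console.WriteLine("Spinning on all logical cores; press Enter to quit.");
        Console.ReadLine(); // background threads die when the process exits
    }
}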

Aglow answered 12/9, 2015 at 14:36 Comment(1)
Thanks for the detailed explanation. I was going to ask another question which would call for this answer. It's very helpful that you added this info here.Vet

I can't explain the sheer magnitude of the speed-up you observed: 100% seems way too much of an improvement for hyper-threading. But I can explain the principles at play.

The main benefit to Hyperthreading is when a processor has to switch between threads. Whenever there are more threads than there are CPU cores (true 99.9997% of the time) and the OS decides to switch to a different thread, it has to perform (most of) the following steps:

  1. Save the state of the current thread: this includes the stack, the state of the registers, and the program counter. Where they get saved depends on the architecture, but generally speaking they'll either get saved in cache or in memory. Either way, this step takes time.
  2. Put the Thread into "Ready" state (as opposed to "Running" state).
  3. Load the state of the next thread: again, including the stack, the registers, and the program counter, which once again, is a step that takes time.
  4. Flip the Thread into "Running" state.

In a normal (non-HT) CPU, the number of cores is the number of processing units. Each of these contains registers, a program counter (a register), a stack pointer (a register), (usually) individual cache, and a complete processing unit. So if a normal CPU has 4 cores, it can run 4 threads simultaneously. When a thread is done (or the OS decides it's taking too much time and needs to wait its turn to start again), the CPU needs to follow those four steps to unload the thread and load the new one before execution of the new one can begin.

In a hyper-threading CPU, on the other hand, the above holds true, but in addition each core has a duplicated set of registers, program counters, stack pointers, and (sometimes) cache. What this means is that a 4-core CPU can still only have 4 threads running simultaneously, but the CPU can have threads "preloaded" onto the duplicated registers. So 4 threads are running, but 8 threads are loaded onto the CPU: 4 active, 4 inactive. Then, when it's time for the CPU to switch threads, instead of having to perform the loading/unloading at the moment the threads need to switch, it simply "toggles" which thread is active and performs the unloading/loading in the background on the newly "inactive" registers. Remember the two steps I noted take time? In a hyper-threaded system, steps 2 and 4 are the only ones that need to be performed in real time, whereas steps 1 and 3 are performed in the background in the hardware (divorced from any concept of threads, processes, or CPU cores).

Now, this process doesn't completely speed up multithreaded software, but in an environment where threads often have extremely small workloads that they perform very frequently, the sheer number of thread switches can be expensive. Even in environments that don't conform to that paradigm, there can be benefits from hyper-threading.
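
One way to probe this on Windows is to pin two busy threads either to one physical core or to two different cores and compare run times. This is a hedged sketch of my own (not from the original answer); it uses the Win32 SetThreadAffinityMask API via P/Invoke and assumes the usual Windows numbering, noted in the comments above, where logical CPUs 0 and 1 share a physical core:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;

class AffinityProbe
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    static extern IntPtr SetThreadAffinityMask(IntPtr hThread, IntPtr mask);

    static void Spin(int logicalCpu)
    {
        // Pin the current thread to a single logical processor.
        SetThreadAffinityMask(GetCurrentThread(), (IntPtr)(1 << logicalCpu));
        double x = 0;
        for (int i = 0; i < 100000000; i++) x += Math.Sin(i);
        if (x == double.MaxValue) Console.Write(x); // keep the loop from being optimized away
    }

    static double Run(int cpuA, int cpuB)
    {
        var t1 = new Thread(() => Spin(cpuA));
        var t2 = new Thread(() => Spin(cpuB));
        var sw = Stopwatch.StartNew();
        t1.Start(); t2.Start();
        t1.Join(); t2.Join();
        return sw.Elapsed.TotalSeconds;
    }

    static void Main()
    {
        Console.WriteLine("Same physical core (0,1): {0:F2} s", Run(0, 1));
        Console.WriteLine("Different cores (0,2):    {0:F2} s", Run(0, 2));
    }
}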

Let me know if you need any clarifications. It's been a few years since CS250, so I may be mixing up terminology here-or-there; let me know if I'm using the wrong terms for something. I'm 99.9997% certain that everything I'm describing is accurate in terms of the logic of how it all works.

Keeling answered 11/9, 2015 at 20:57 Comment(7)
Steps 2 and 4 don't happen at all with hyperthreading: As far as the OS is concerned, the alternate thread IS running on a core, even when the thread is stalled because the core's resources are being used by the primary thread.Goulder
Thanks for your comprehensive answer. It's very helpful. Can HT result in >50% performance gain when all threads have no I/O involved? I realized that by mistake I reported the performance gain as 100%. It's actually close to 64%. But even this gain is over the top.Vet
"a 4-core CPU can still only have 4 threads running simultaneously" -- this is a misleading oversimplification of how HT works. Gains come from the CPU being able to do actual work, either concurrently (if two different threads don't at the moment need the same execution resources) or by doing work in one thread while the other is stalled (e.g. waiting on a memory fetch). A 4-CPU HT chip, i.e. with 8 logical cores, can in some scenarios actually execute 8 threads concurrently. It is only the fact that it can't do so 100% of the time that prevents full scalability on concurrent algorithms.Patchwork
@Vet Difficult to say. The thing with multithreading is that you can observe great performance gains on one system and poor gains on another, even if they have the same CPU. It often depends on what's happening in the background.Keeling
@PeterDuniho It's been a few years since I learned about hyperthreading, so the tech could have evolved in that time. My understanding is that Hyperthreading is /only/ the duplication of process registers. You're right that HT allows a dormant thread to use the execution resources if the primary thread is stalled due to memory fetching. However, I don't think it would be accurate to say that it therefore is equivalent to 8 threads running simultaneously. From a user level viewpoint it would be semantically correct. But at a hardware level, that's not quite what's happening.Keeling
Execution resources aren't duplicated for the sake of the extra logical core, but they are already duplicated even in a single-core CPU for other reasons, e.g. speculative execution. As an example, if branch prediction determines that resources that would normally be used for speculative execution are unneeded, those duplicated execution resources can be used in a different thread. HT involves a lot of different techniques together; it's why really it's off-topic to even try to explain it on Stack Overflow, as a correct explanation is far too broad for this forum.Patchwork
Your comment tries to convince a reader that context switches are the main evil in multithreading, and that HT only reduces the overhead caused by this evil. This is not true; the main savings of HT are due to interleaving in the pipelines.Aglow

Hyper-threading works by interleaving instructions in the processor's execution pipeline. While the processor is performing read/write operations for one "thread", it is performing logical evaluation for the other "thread", keeping them separate and giving you a perceived doubling in performance.

The reason you get such a big speedup is because there is no branching logic in your DoWork method. It is all a big loop with a very predictable execution sequence.

A processor execution pipeline has to go through several clock cycles to execute a single calculation. The processor attempts to optimise the performance by pre-loading the execution buffer with the next few instructions. If the instruction loaded is actually a conditional jump (such as an if statement), this is bad news, because the processor has to flush the entire pipeline and fetch instructions from a different part of memory.

You may find that if you put if statements in your DoWork method, you will not get 100% speedup...
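
The classic sorted-versus-unsorted experiment illustrates this (a sketch of my own, not from the original answer). The branch is taken the same number of times in both runs, but only sorted data makes it predictable:

using System;
using System.Diagnostics;

class BranchDemo
{
    static long CountBig(int[] data)
    {
        long sum = 0;
        for (int r = 0; r < 100; r++)
            for (int i = 0; i < data.Length; i++)
                if (data[i] >= 128)   // the conditional branch under test
                    sum += data[i];
        return sum;
    }

    static void Main()
    {
        var rnd = new Random(0);
        var data = new int[32768];
        for (int i = 0; i < data.Length; i++) data[i] = rnd.Next(256);

        var sw = Stopwatch.StartNew();
        long a = CountBig(data);
        Console.WriteLine("Unsorted (unpredictable): {0} ms", sw.ElapsedMilliseconds);

        Array.Sort(data);             // same data, but now the branch is predictable
        sw.Restart();
        long b = CountBig(data);
        Console.WriteLine("Sorted (predictable):     {0} ms, sums equal: {1}", sw.ElapsedMilliseconds, a == b);
    }
}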

Honorarium answered 11/9, 2015 at 20:54 Comment(3)
There are two aspects of your answer that I didn't quite get: 1. There is no read/write in the threads, so there should be no "free" CPU time in which to perform logical evaluation on the other thread, and thus no speed-up expected from this. 2. An if may reduce the gains, but even without the if, should I get such high gains? If both of these reasons are not valid in this context, what is the secret behind the 64% performance increase?Vet
I don't mean thread as in software threads. I am just recycling the word from 'hyperthreading' which has a different meaning in this context. Bad name imo. I'm not an expert, just paraphrasing from my general understanding. I'm sure there is an authoritative text somewhere that can do a much better job of explaining.Honorarium
@displayName: Check my long answer. Sequential code does not fully utilize the processor's pipeline, even if there are no memory accesses.Aglow
