Given:
- An entirely CPU-bound, very large job (i.e., one taking far more than a few CPU cycles), and
- A CPU with four physical cores and a total of eight logical cores,
should 8, 16, or 28 threads perform better than 4? I understand that four threads would have fewer context switches to perform and less overhead than 8, 16, or 28 threads would on a four-physical-core machine. However, the timings are:
Threads    Time taken (in seconds)
4          78.82
8          48.58
16         51.35
28         52.10
The code used to get these timings is given in the Original Question section below. The CPU specifications are also given at the bottom.
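A quick side note on the core counts (this snippet is mine, added only for illustration; it is not part of the benchmark): Environment.ProcessorCount, which the benchmark below uses as its default thread count, reports logical processors, so it should print 8 on this machine:

using System;

class LogicalProcessorCheck
{
    static void Main()
    {
        // Environment.ProcessorCount returns the number of logical processors
        // (8 on a 4-core CPU with hyper-threading enabled), not physical cores.
        Console.WriteLine("Logical processors: {0}", Environment.ProcessorCount);
    }
}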
Original Question
What does it mean when we say the following?

Hyper-threading works by duplicating certain sections of the processor—those that store the architectural state—but not duplicating the main execution resources. This allows a hyper-threading processor to appear as the usual "physical" processor and an extra "logical" processor to the host operating system.
This question was asked on SO today, and it tests the performance of multiple threads doing the same amount of work. It has the following code:
private static void Main(string[] args)
{
    // First argument: number of threads (defaults to the logical processor count).
    int threadCount;
    if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
        threadCount = Environment.ProcessorCount;

    // Second argument: load multiplier (defaults to 1).
    int load;
    if (args == null || args.Length < 2 || !int.TryParse(args[1], out load))
        load = 1;

    Console.WriteLine("ThreadCount:{0} Load:{1}", threadCount, load);

    List<Thread> threads = new List<Thread>();
    for (int i = 0; i < threadCount; i++)
    {
        int i1 = i; // capture the loop variable for the closure
        threads.Add(new Thread(() => DoWork(i1, threadCount, load)));
    }

    var timer = Stopwatch.StartNew();
    foreach (var thread in threads) thread.Start();
    foreach (var thread in threads) thread.Join();
    timer.Stop();

    Console.WriteLine("Time:{0} seconds", timer.ElapsedMilliseconds / 1000.0);
}

static void DoWork(int seed, int threadCount, int load)
{
    // The total work (10,000,000 * load iterations) is split evenly across the threads.
    var mtx = new double[3, 3];
    for (var i = 0; i < ((10000000 * load) / threadCount); i++)
    {
        mtx = new double[3, 3];
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                mtx[k, l] = Math.Sin(i + (k * 3) + l + seed);
    }
}
(I have cut out a few braces to bring the code to a single page for quick readability.)
I ran this code on my machine to replicate the issue. My machine has four physical cores and eight logical ones. The DoWork() method in the code above is completely CPU bound. I expected hyper-threading to contribute at most around a 30% speedup (because the four-thread run already has as many CPU-bound threads as there are physical cores), yet it attains nearly a 64% performance gain. When I ran this code with four threads, it took about 82 seconds; with 8, 16, or 28 threads, it finished in about 50 seconds in every case.
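Just to spell out where the 64% figure comes from, it is the ratio of the rounded timings above (a throwaway snippet of mine, not part of the benchmark):

using System;

class SpeedupMath
{
    static void Main()
    {
        double fourThreads = 82.0;   // rounded 4-thread timing, seconds
        double eightThreads = 50.0;  // rounded 8/16/28-thread timing, seconds
        // 82 / 50 = 1.64, i.e. the hyper-threaded runs are roughly 64% faster.
        Console.WriteLine("Gain: {0:P0}", fourThreads / eightThreads - 1);
    }
}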
To summarize the timings:
Threads    Time taken (in seconds)
4          78.82
8          48.58
16         51.35
28         52.10
I could see that CPU usage was ~50% with four threads. Shouldn't it be ~100%, given that my processor has only four physical cores? With 8 and 16 threads, CPU usage was ~100%.
I am trying to understand why a fully CPU-bound process would work better with hyper-threading.
For the sake of completeness:
- I have an Intel Core i7-4770 CPU @ 3.40 GHz, 3401 MHz, 4 core(s), 8 logical processor(s).
- I ran the code in Release mode.
- I know that the way the timings are measured is bad: it only gives the time of the slowest thread. I took the code as-is from the other question. (A rough per-thread timing variant is sketched just below.)
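Since the Stopwatch in the code above effectively reports only the slowest thread, here is a minimal per-thread timing sketch. It is my own rework, not code from either question; the PerThreadTiming class name, the elapsed array, and the fixed load of 1 are my choices. Each thread starts its own Stopwatch around its DoWork() call:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading;

class PerThreadTiming
{
    static void Main(string[] args)
    {
        int threadCount;
        if (args == null || args.Length < 1 || !int.TryParse(args[0], out threadCount))
            threadCount = Environment.ProcessorCount;

        var elapsed = new double[threadCount];   // per-thread wall-clock seconds
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            int id = i; // capture the loop variable for the closure
            threads.Add(new Thread(() =>
            {
                var sw = Stopwatch.StartNew();
                DoWork(id, threadCount, 1);      // load fixed at 1 for this sketch
                sw.Stop();
                elapsed[id] = sw.Elapsed.TotalSeconds;
            }));
        }

        foreach (var thread in threads) thread.Start();
        foreach (var thread in threads) thread.Join();

        for (int i = 0; i < threadCount; i++)
            Console.WriteLine("Thread {0}: {1:F2} s", i, elapsed[i]);
    }

    // Same CPU-bound loop as in the question, split evenly across the threads.
    static void DoWork(int seed, int threadCount, int load)
    {
        var mtx = new double[3, 3];
        for (var i = 0; i < (10000000 * load) / threadCount; i++)
        {
            mtx = new double[3, 3];
            for (int k = 0; k < 3; k++)
                for (int l = 0; l < 3; l++)
                    mtx[k, l] = Math.Sin(i + (k * 3) + l + seed);
        }
    }
}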
(Comment) The DoWork() method scales the workload of each invocation inversely with the number of threads, and so adding new threads results in shorter run times. Note that once you go past the number of logical cores (8), runtime increases again; you are adding scheduling overhead without adding any new computational resources in that case. – Patchwork