Multicore + Hyperthreading - how are threads distributed?

I was reading a review of the new Intel Atom 330, where they noted that Task Manager shows 4 cores - two physical cores, plus two more simulated by Hyperthreading.

Suppose you have a program with two threads. Suppose also that these are the only threads doing any work on the PC, everything else is idle. What is the probability that the OS will put both threads on the same core? This has huge implications for program throughput.

If the answer is anything other than 0%, are there any mitigation strategies other than creating more threads?

I expect there will be different answers for Windows, Linux, and Mac OS X.


Using sk's answer as Google fodder, then following the links, I found the GetLogicalProcessorInformation function in Windows. It speaks of "logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios." This implies that jalf is correct, but it's not quite a definitive answer.
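For anyone who wants to poke at this directly, here is a rough sketch of calling that function from C# via P/Invoke (my own illustration, so treat it as a sketch rather than reference code). Each RelationProcessorCore entry describes one physical core; if the printed mask has more than one bit set, those logical processors are hyperthread siblings:

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

class CoreTopology
{
    enum LOGICAL_PROCESSOR_RELATIONSHIP
    {
        RelationProcessorCore = 0,
        RelationNumaNode = 1,
        RelationCache = 2,
        RelationProcessorPackage = 3
    }

    [StructLayout(LayoutKind.Sequential)]
    struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION
    {
        public UIntPtr ProcessorMask;                        // bit mask of logical processors
        public LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
        public ulong Reserved0;                              // the 16-byte union at the end of the native struct
        public ulong Reserved1;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetLogicalProcessorInformation(IntPtr buffer, ref uint returnedLength);

    static void Main()
    {
        uint length = 0;
        GetLogicalProcessorInformation(IntPtr.Zero, ref length);   // first call only reports the required buffer size
        IntPtr buffer = Marshal.AllocHGlobal((int)length);
        try
        {
            if (!GetLogicalProcessorInformation(buffer, ref length))
                throw new Win32Exception();

            int entrySize = Marshal.SizeOf(typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
            for (int offset = 0; offset + entrySize <= length; offset += entrySize)
            {
                var info = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION)Marshal.PtrToStructure(
                    IntPtr.Add(buffer, offset), typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
                if (info.Relationship == LOGICAL_PROCESSOR_RELATIONSHIP.RelationProcessorCore)
                    Console.WriteLine("physical core -> logical CPU mask 0x{0:X}",
                                      info.ProcessorMask.ToUInt64());
            }
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}

On the Atom 330 above, this should print two lines, each with two bits set.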
Mastership answered 11/12, 2008 at 18:7 Comment(3)
I'd just like to comment that the optimal policy is not always to run the two tasks on different cores; for instance, if you have two tasks which share memory and perform many non-overlapping operations, running them on the same core may provide higher performance because the reduction in cache misses offsets the slightly slower runtime of occasionally having to share the processor (remember, in this scenario both threads will usually run in parallel even on one core because they're using different logical units).Ganda
Just as an FYI: if you're looking for raw performance, you may want to disable hyperthreading - unless, that is, Intel has finally made it work well. In the past (the last time I measured was on a 2-processor P4 Xeon box with hyperthreading, yielding 4 logical processors to the OS), running 4 computationally intensive threads with hyperthreading enabled yielded lower net performance than running 2 threads with hyperthreading disabled. Obviously, you'd want to test this yourself with the latest hardware - it may no longer be the case. But, be aware...Toots
Running threads on the same core is EXACTLY what you want, sometimes. If you're running lock-free data structures, for example; when you have threads on separate physical cores, the cache line swapping between cores DECIMATES performance.Manly

Linux has quite a sophisticated thread scheduler that is HT-aware. Some of its strategies include:

Passive load balancing: if a physical CPU is running more than one task, the scheduler will attempt to run any new tasks on a second physical processor.

Active load balancing: if there are 3 tasks, 2 on one physical CPU and 1 on the other, then when the second physical processor goes idle the scheduler will attempt to migrate one of the tasks to it.

It does this while attempting to keep thread affinity, because when a thread migrates to another physical processor it has to refill all levels of cache from main memory, causing a stall in the task.

So, to answer your question (on Linux at least): given 2 threads on a dual-core hyperthreaded machine, each thread will run on its own physical core.
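If you want to see this from code rather than from top, here is a quick sketch of the kind of test I mean (my own illustration; it assumes a CLR such as Mono on Linux and glibc's sched_getcpu). Two busy threads periodically report which CPU the kernel currently has them on, and on a dual-core HT box you can then check whether those CPU numbers belong to different physical cores:

using System;
using System.Runtime.InteropServices;
using System.Threading;

class WhichCpu
{
    [DllImport("libc")]
    static extern int sched_getcpu();   // glibc 2.6+: returns the CPU the calling thread is running on

    static void Burn(object id)
    {
        double x = 1.0;
        for (int round = 0; round < 10; ++round)
        {
            for (int i = 0; i < 100000000; ++i)              // keep the thread CPU-bound
                x = (x * 1.0000001) % 1000.0;
            Console.WriteLine("thread {0} is on CPU {1} (x = {2:F3})", id, sched_getcpu(), x);
        }
    }

    static void Main()
    {
        var t1 = new Thread(Burn);
        var t2 = new Thread(Burn);
        t1.Start(1);
        t2.Start(2);
        t1.Join();
        t2.Join();
    }
}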

Alsatia answered 11/12, 2008 at 19:4 Comment(2)
I don't see that happening on my machine. Running stress -c 2 on my i5-2520M, it sometimes schedules (and keeps) the two threads onto HT cores 1 and 2, which map to the same physical core. Even if the system is idle otherwise. (I found the HT->physical core assignment with egrep "processor|physical id|core id" /proc/cpuinfo | sed 's/^processor/\nprocessor/g'.)Lumper
I made this problem more concrete with this question.Lumper

A sane OS will try to schedule computationally intensive tasks on their own cores, but problems arise when you start context switching them. Modern OSes still have a tendency to schedule things onto whichever core has no work at scheduling time, and this can result in processes in parallel applications getting swapped from core to core fairly liberally. For parallel apps, you do not want this, because you lose the data the process may have built up in the caches of its core. People use processor affinity to control this, but on Linux the semantics of sched_setaffinity() can vary a lot between distros/kernels/vendors, etc.

If you're on Linux, you can portably control processor affinity with the Portable Linux Processor Affinity Library (PLPA). This is what OpenMPI uses internally to make sure processes get scheduled to their own cores in multicore and multisocket systems; they've just spun off the module as a standalone project. OpenMPI is used at Los Alamos among a number of other places, so this is well-tested code. I'm not sure what the equivalent is under Windows.
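To give a flavour of what PLPA wraps, here is a rough sketch of my own (not PLPA's API) of calling the underlying Linux system call directly via P/Invoke; pid 0 means "the calling thread":

using System;
using System.Runtime.InteropServices;

static class LinuxAffinity
{
    // int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
    [DllImport("libc", SetLastError = true)]
    static extern int sched_setaffinity(int pid, UIntPtr cpusetsize, byte[] mask);

    // Pin the calling thread to a single logical CPU.
    public static void PinCallingThread(int cpu)
    {
        var mask = new byte[128];                   // 1024-bit cpu_set_t, as glibc defines it
        mask[cpu / 8] |= (byte)(1 << (cpu % 8));    // set the bit for the requested CPU
        if (sched_setaffinity(0, new UIntPtr((uint)mask.Length), mask) != 0)
            throw new InvalidOperationException("sched_setaffinity failed, errno = "
                                                + Marshal.GetLastWin32Error());
    }
}

PLPA's value is largely that it hides the differences in how kernels and glibc versions expose this call.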

Renatarenate answered 11/12, 2008 at 19:10 Comment(0)

I have been looking for some answers on thread scheduling on Windows, and have some empirical information that I'll post here for anyone who may stumble across this post in the future.

I wrote a simple C# program that launches two threads. On my quad core Windows 7 box, I saw some surprising results.

When I did not force affinity, Windows spread the workload of the two threads across all four cores. There are two lines of code that are commented out: one that binds a thread to a CPU, and one that suggests an ideal CPU. The suggestion seemed to have no effect, but setting thread affinity did cause Windows to run each thread on its own core.

To see the results best, compile this code using the freely available compiler csc.exe that comes with the .NET Framework 4.0 client, and run it on a machine with multiple cores. With the processor affinity line commented out, Task Manager showed the threads spread across all four cores, each running at about 50%. With affinity set, the two threads maxed out two cores at 100%, with the other two cores idling (which is what I expected to see before I ran this test).

EDIT: I initially found some differences in performance with these two configurations. However, I haven't been able to reproduce them, so I edited this post to reflect that. I still found the thread affinity interesting since it wasn't what I expected.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class Program
{
    [DllImport("kernel32")]
    static extern int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Task task1 = Task.Factory.StartNew(() => ThreadFunc(1));
        Task task2 = Task.Factory.StartNew(() => ThreadFunc(2));
        Stopwatch time = Stopwatch.StartNew();
        Task.WaitAll(task1, task2);
        Console.WriteLine(time.Elapsed);
    }

    static void ThreadFunc(int cpu)
    {
        int cur = GetCurrentThreadId();
        var me = Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Where(t => t.Id == cur).Single();
        //me.ProcessorAffinity = (IntPtr)cpu;     //uncommenting this binds each thread to one core (the value is a bit mask: 1 = CPU 0, 2 = CPU 1)
        //me.IdealProcessor = cpu;                //only a hint to the scheduler; seemed to have no effect

        //do some CPU / memory bound work
        List<int> ls = new List<int>();
        ls.Add(10);
        for (int j = 1; j != 30000; ++j)
        {
            ls.Add((int)ls.Average());
        }
    }
}
Nutgall answered 28/7, 2010 at 22:22 Comment(2)
You should be aware that if you are using Task Manager to look at the usage, Task Manager itself can be very disruptive to the system because it generally runs with a boosted priority. Try forcing Task Manager to Low Priority and see if the pattern changes.Satyriasis
Can you share your run times under the different configurations?Mastership

The probability that the OS won't utilize as many physical cores as possible is essentially 0%. Your OS isn't stupid. Its job is to schedule everything, and it knows full well what cores it has available. If it sees two CPU-intensive threads, it will make sure they run on two physical cores.

Edit: Just to elaborate a bit - for high-performance stuff, once you get into MPI or other serious parallelization frameworks, you definitely want to control what runs on each core.

The OS will make a sort of best-effort attempt to utilize all cores, but it doesn't have the long-term information that you do, such as "this thread is going to run for a very long time" or "we're going to have this many threads executing in parallel". So it can't make perfect decisions, which means that your thread will get assigned to a new core from time to time, which means you'll run into cache misses and similar, which costs a bit of time. For most purposes, it's good enough, and you won't even notice the performance difference. And it also plays nice with the rest of the system, if that matters. (On someone's desktop system, that's probably fairly important. In a grid with a few thousand CPUs dedicated to this task, you don't particularly want to play nice, you just want to use every clock cycle available.)

So for large-scale HPC stuff, yes, you'll want each thread to stay on one core, fixed. But for most smaller tasks, it won't really matter, and you can trust the OS's scheduler.

Silverweed answered 11/12, 2008 at 18:31 Comment(9)
I'd like to believe that too, but a little evidence would be useful.Mastership
Evidence of what? Create a program which runs two threads in an infinite loop, and check CPU usage. You'll find that any sane OS assigns a thread to each core. Do you think it's a problem the OS designers haven't considered? Of course not. It's a fundamental issue that an OS has to handle.Silverweed
I don't have such a system at hand to test, otherwise that's not a bad suggestion.Mastership
jalf: there are still performance issues when these things context-switch and get juggled. We see this at the national labs, and all the runtimes on parallel machines set affinity to make sure processes stay on their cores. See open-mpi.org/projects/plpa and my answer below.Renatarenate
Yep, I know it's not 100% optimal, but for most purposes, it comes close enough. My point was simply that the OS isn't so dumb that it'll try to schedule all the CPU-heavy threads on the same core, leaving others totally unused. Of course for MPI or similar, yes, you want full control. :)Silverweed
Removed comments, see summary below in my [alternate] answer. You guys pretty much covered it in your comments anyway; the scheduler can't be perfect in all cases because it has no knowledge of the threads it is scheduling. Optimization is usually possible in specific situations.Photopia
"If it sees two CPU-intensive threads, it will make sure they run on two physical cores." This is by no means invariably optimal. If the threads operate on the same memory, they would benefit HUGELY being on the same physical core. Also, there's a difference between CPU intensive and cache-use intensive. If the threads are heavily utilizing cache and are operating on different memory, THEN they benefit from being on separate cores (which is to say, really, being on separate caches).Manly
@Blank: Yes? How does that contradict what I said? Where did I claim that the OS would schedule optimally? And apart from that, I have to say I'm skeptical about your "HUGELY" claim. Unless you force your OS into context switching far too frequently (in which case you have more serious problems), threads will run on the same core long enough for the perf hit from cache misses when a thread is moved to a different core to be fairly minor. You're not going to see performance triple just from forcing thread affinity.Silverweed
@Jalf: the use case I had in mind for 'hugely' was lock-free data structures. You see performance fall off a cliff once you start running on separate physical cores - all the cache line swapping, since every CAS write invalidates the cache line for every other physical core. Context switching isn't the problem.Manly

This is a very good and relevant question. As we all know, a hyper-threaded core is not a real CPU/core. Instead, it is a virtual CPU/core (from now on I'll say core). The Windows CPU scheduler as of Windows XP is supposed to be able to distinguish hyperthreaded (virtual) cores from real cores. You might imagine then that in this perfect world it handles them 'just right' and it is not an issue. You would be wrong.

Microsoft's own guidance for optimizing a Windows 2008 BizTalk server recommends disabling HyperThreading. This suggests, to me, that the handling of hyper-threaded cores isn't perfect and sometimes threads get a time slice on a hyper-threaded core and suffer the penalty (a fraction of the performance of a real core, 10% I'd guess, and Microsoft guesses 20-30%).

Microsoft article reference where they suggest disabling HyperThreading to improve server efficiency: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx

It is the SECOND recommendation, right after updating the BIOS; that is how important they consider it. They say:

FROM MICROSOFT:

"Disable hyper-threading on BizTalk Server and SQL Server computers

It is critical hyper-threading be turned off for BizTalk Server computers. This is a BIOS setting, typically found in the Processor settings of the BIOS setup. Hyper-threading makes the server appear to have more processors/processor cores than it actually does; however hyper-threaded processors typically provide between 20 and 30% of the performance of a physical processor/processor core. When BizTalk Server counts the number of processors to adjust its self-tuning algorithms; the hyper-threaded processors cause these adjustments to be skewed which is detrimental to overall performance. "

Now, they do say it is due to it throwing off the self-tuning algorithms, but then go on to mention contention problems (suggesting it is a larger scheduling issue, at least to me). Read it as you will, but I think it says it all. HyperThreading was a good idea back when we were stuck with single-CPU systems, but it is now just a complication that can hurt performance in this multi-core world.

Instead of completely disabling HyperThreading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to virtual CPUs.

So.... I don't think anyone really knows just how well the Windows CPU Scheduler handles virtual CPUs, but I think it is safe to say that XP handles it worst, and they've gradually improved it since then, but it still isn't perfect. In fact, it may NEVER be perfect because the OS doesn't have any knowledge of what threads are best to put on these slower virtual cores. That may be the issue there, and why Microsoft recommends disabling HyperThreading in server environments.

Also remember even WITHOUT HyperThreading, there is the issue of 'core thrashing'. If you can keep a thread on a single core, that's a good thing, as it reduces the core change penalties.
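As a rough illustration of what such tools do under the hood, the sketch below (mine, not Process Lasso's) restricts the current process to every other logical CPU. On many Windows machines with HyperThreading, logical CPUs 0/1, 2/3, ... are siblings of the same physical core, but that pairing is an assumption you should verify (e.g. with GetLogicalProcessorInformation) before relying on it:

using System;
using System.Diagnostics;

class KeepOffHyperthreads
{
    static void Main()
    {
        long mask = 0;
        for (int cpu = 0; cpu < Environment.ProcessorCount; cpu += 2)
            mask |= 1L << cpu;                      // keep logical CPUs 0, 2, 4, ...

        Process me = Process.GetCurrentProcess();
        me.ProcessorAffinity = (IntPtr)mask;        // this process's threads will now avoid the odd-numbered CPUs
        Console.WriteLine("Process affinity mask set to 0x{0:X}", mask);
    }
}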

Photopia answered 26/11, 2010 at 3:17 Comment(0)

You can make sure both threads get scheduled for the same execution units by giving them a processor affinity. This can be done in either Windows or Unix, via either an API (so the program can ask for it) or via administrative interfaces (so an administrator can set it). E.g. in WinXP you can use the Task Manager to limit which logical processor(s) a process can execute on.
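For the API route on Windows, here is a minimal sketch (my own) of a thread pinning itself to logical CPU 0 with the native call:

using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

class PinSelf
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();        // pseudo-handle for the calling thread

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr dwThreadAffinityMask);

    static void Main()
    {
        // Bit 0 set => this thread may only run on logical processor 0.
        UIntPtr previous = SetThreadAffinityMask(GetCurrentThread(), new UIntPtr(1u));
        if (previous == UIntPtr.Zero)
            throw new Win32Exception();             // zero means the call failed

        Console.WriteLine("Thread pinned to logical CPU 0.");
    }
}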

Otherwise, the scheduling will be essentially random and you can expect about 25% usage on each logical processor.

Thetisa answered 11/12, 2008 at 18:14 Comment(1)
While I've never been one who likes to leave things up to the OS, setting a thread's affinity mask can be detrimental to performance if things get busy. Would SetThreadIdealProcessor() be a better option?Stringy

I don't know about the other platforms, but in the case of Intel, they publish a lot of info on threading on their Intel Software Network. They also have a free newsletter (The Intel Software Dispatch) you can subscribe to via email, and it has had a lot of such articles lately.

Curson answered 11/12, 2008 at 18:32 Comment(0)

The chance that the OS will dispatch 2 active threads to the same core is zero unless the threads were tied to a specific core (thread affinity).

The reasons behind this are mostly HW related:

  • The OS (and the CPU) wants to use as little power as possible, so it will run the tasks as efficiently as possible in order to enter a low power state ASAP.
  • Running everything on the same core will cause it to heat up much faster. In pathological conditions, the processor may overheat and reduce its clock to cool down. Excessive heat also causes CPU fans to spin faster (think laptops) and create more noise.
  • The system is never actually idle. ISRs and DPCs run every millisecond (on most modern OSes).
  • Performance degradation due to threads hopping from core to core is negligible in 99.99% of workloads.
  • In all modern processors the last-level cache is shared, so switching cores isn't so bad.
  • For multi-socket (NUMA) systems, the OS will minimize hopping from socket to socket so that a process stays "near" its memory controller. This is a complex domain when optimizing for such systems (tens/hundreds of cores).

BTW, the way the OS knows the CPU topology is via ACPI - an interface provided by the BIOS.

To sum things up, it all boils down to system power considerations (battery life, power bill, noise from cooling solution).

Illiquid answered 6/1, 2014 at 22:39 Comment(2)
I wasn't asking for a list of reasons why it shouldn't, I think we can all agree on that. I was asking if the OS had enough information to prevent it and if the schedulers were smart enough to use the information. The only part of your answer relevant to that is the mention of ACPI.Mastership
My answer provided the "why" and "how" schedulers behave as they do and also whether they have this information. Are you looking for code snippets from a kernel as an answer? If so, the Linux and Darwin kernels are open source...Illiquid
