Does the M:N threading model really utilize CPU cores?

There are several threading models that are used for scheduling threads within applications:

  • 1:1 (kernel-level threading): Each thread created by an application is mapped to its own scheduled thread in the kernel.
  • N:1 (user-level threading): All the threads created within a single application are mapped to one scheduled kernel thread.
  • M:N (hybrid threading): M threads created by an application are mapped to N kernel threads.

User-level threads are considered faster than kernel-level threads because a context switch at the kernel level is more expensive than one in user space. One big disadvantage of user-level threads is that they cannot exploit multiprocessor systems, since they all run on a single kernel-level thread.

Some articles suggest that the M:N threading model is best used with N equal to the number of CPU cores (here is an example). This way we can get the advantages of both the 1:1 and N:1 threading models.
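
As a concrete illustration (my example, not from the articles above): Go's runtime is an M:N scheduler whose N defaults to the logical CPU count, and you can query both:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Go multiplexes all goroutines (the M) onto at most GOMAXPROCS
	// simultaneously running kernel threads (the N), which defaults
	// to the number of logical CPUs.
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:  ", runtime.GOMAXPROCS(0)) // an argument of 0 queries without changing
}
```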

My questions are:

  1. When we use kernel-level threads we also get 'extra' time slices during execution (as opposed to user-level threads), so doesn't that make up for the slower context switches?
  2. Why is the number of CPU cores even relevant here? As I understand it, the number of CPU cores is pretty transparent here: even when we use exactly that many kernel threads, nothing guarantees us that they really execute simultaneously, because other cores can be executing threads from other processes, and 'our' threads might still be context-switched afterwards. So they get context-switched regardless of how many CPU cores we have. Am I wrong here?
Vincenza answered 1/12, 2019 at 23:16 Comment(0)

Alan Cox once said, before multi-core architectures were commonplace: "A computer is a state machine. Threads are for people who can't program state machines."

Kernel threads that get scheduled, at least potentially, on different cores make sense. In my experience, in the vast majority of situations user threads are nothing more than a needlessly costly abstraction designed to let you avoid thinking explicitly about managing the state machine that is a CPU core.

Which is fine, of course. We don't always, perhaps don't even usually, need top performance. If your scenario does not require top performance as a first-priority consideration, I would not bother worrying about threading models and would just use the simplest one. If you do care, I'd go with 1:1 kernel threading and handle single-core multiplexing explicitly.

Hard answered 2/12, 2019 at 6:20 Comment(0)

Go(lang), as a case in point, uses this model for its concurrency. A goroutine may return from a system call on a different kernel thread than the one it dispatched from. Go aims, and claims, to be both efficient and to achieve a high degree of utilization.

One problem is that the thread (hammer) has many applications (nail-like objects). Concurrent programming, a form of program organization that matches the operating characteristics of the system, is quite different from parallel programming, which aims to reduce execution time by maximizing resource utilization. There is considerable overlap, since concurrent programs are often faster and more responsive than their sequential equivalents.

(1): Extra time slices. There may be some schedulers where this is true, but what you are describing is an overloaded system -- it has more work to do than resources available to do it. The overload may be transient, but making this a design choice puts you at the mercy of a scheduler upgrade that balances between processes/jobs/sessions/??? rather than threads; your extra slice is an implementation detail.

(2): Yes, you are more right than wrong here. Just creating N kernel threads is not sufficient unless your machine has some form of co-scheduling, and even then those threads could be victimized by a system call that needs to synchronize on real I/O (e.g. read(2)). At the risk of being a Go fanboi, the Go scheduler circumvents this by keeping a slush fund of parked kernel threads for system calls, in addition to one thread per execution unit; so it really has an L:M:N thread model.
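
To make the multiplexing visible, here is a minimal Go sketch (my own illustration, not code from Go's scheduler): it launches far more goroutines than there are kernel threads and reports both counts via the runtime's threadcreate profile:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ { // M = 10000 goroutines
		wg.Add(1)
		go func() {
			defer wg.Done()
			// CPU-bound busy work; the scheduler spreads it over ~GOMAXPROCS threads.
			sum := 0
			for j := 0; j < 1000000; j++ {
				sum += j
			}
			_ = sum
		}()
	}
	fmt.Println("goroutines (M):", runtime.NumGoroutine())
	fmt.Println("kernel threads created (~N):", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```

On a typical machine this prints thousands of goroutines against a handful of threads.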

Fireback answered 2/12, 2019 at 18:4 Comment(3)
So if I understand correctly, there are 3 usual cases in which we want multithreading. 1 is for blocking IO; in that case we should only use kernel threads. 2 is for parallelism (to run faster); in that case we would use kernel threads up to the number of logical cores we have, because more than that would not really give us more CPU time (usually). And 3 is for concurrency; in that case we want the M:N model (where N = logical cores) for faster execution as well as more threads than cores. Is that correct?Vincenza
and what do you mean by L:M:N thread model? I thought N is kernel threads and M is goroutines, but what is L?Vincenza
I probably shouldn't invent new notations, but the goroutines are the L, whereas the Go scheduler maintains M actual threads over N virtual CPUs. So when Go is set up to use 4 cores, it launches, say, 7 threads to accommodate some blocked-in-kernel calls and multiplexes them among L goroutines. That is a very coarse approximation of a much more nuanced scheduling system.Fireback

User-level threads are considered faster than kernel-level threads

Both can be faster depending on workload, OS, hardware, and green threads implementation.

doesn't it make up for the slow context-switch?

Sometimes. Usually no. Kernel threads have a stack; when you have thousands of them, they consume gigabytes of RAM, and when they context switch you're guaranteed to have plenty of cache misses. That's assuming your workload is IO-heavy and you context switch a lot.
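
A hedged Go sketch of this cost (my illustration; runtime.LockOSThread is the real API, the counts are machine-dependent): pinning each goroutine to its own kernel thread forces one thread, and one thread stack, per task, which is exactly what green threads avoid:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	before := pprof.Lookup("threadcreate").Count()
	for i := 0; i < 100; i++ {
		go func() {
			runtime.LockOSThread() // wire this goroutine to a dedicated kernel thread
			select {}              // block forever; the pinned thread (and its stack) stays allocated
		}()
	}
	time.Sleep(time.Second) // crude wait for the runtime to create the threads
	fmt.Println("extra kernel threads:", pprof.Lookup("threadcreate").Count()-before)
}
```

Without the LockOSThread call, the same 100 blocked goroutines would simply be parked by the scheduler and hold no kernel threads at all.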

Why is the number of CPU cores even relevant here?

Irrelevant. You should use the number of hardware threads; many modern CPUs have 2 hardware threads per core.

other cores can execute other threads from other processes

If they take considerable time, it means you have 2 processes loading the system. In that case, a better approach might be to use 50% of the hardware threads. When people design resource-demanding software, they usually assume it'll be the main workload of the computer.

creating a new kernel-level thread (on a system with a single hardware thread) usually won't give me extra CPU time because of the context switch

If there are other processes that also want 100% CPU, then with 2 threads you will indeed get extra CPU time. But that's a rare edge case; the user is going to hit the reset button on an unresponsive system. Generally, unless blocking IO or security is involved, there's little point in creating more kernel threads than hardware threads.

Garnish answered 1/12, 2019 at 23:53 Comment(7)
So if I understand your second answer correctly, creating a new kernel-level thread (on a system with a single hardware thread) usually won't give me extra CPU time because of the context switch? The only 'real' benefit of a new thread is when I use blocking operations?Vincenza
Also, your fourth answer referred to resource-demanding software, but resource-demanding software is software that uses a lot of resources, so if it only uses as many threads as there are hardware threads it is not really resource demanding.Vincenza
@Vincenza (1) see the update; (2) CPU time is just one type of resource. Also, theoretically, when you only have a single process running, having kernel threads = hardware threads gives you maximum performance due to zero context switches. Creating more kernel threads will reduce performance because you introduce context switches, the associated cache misses, and scheduler overhead.Garnish
OK I got it now, for some reason I forgot that the CPU is usually not at full capacity... So while the CPU is not at 100% it is pointless to create another thread because it won't really give us extra time slice (because the extra time slice will be at our expense anyhow). Am I correct?Vincenza
You said " unless blocking IO or security are involved, there's little point in creating more kernel threads than hardware threads". What about concurrency (like in a web server)? Did you mean that it is best to use user threads in that case because kernel threads have no advantage over them? Or did you mean to not use concurrency at all (for some reason)? And what is the 'security' that can be involved?Vincenza
Another thing that bothers me, you have mentioned that if some other processes also want 100% CPU it would make an unresponsive system. Is that always so? I thought it would just mean that everything would be a little slower but not unresponsive (unless there are a LOT of processes that want 100%)Vincenza
@YAYAdest: Correct, having more tasks ready to run than physical cores will just make your system feel slower, not unresponsive. Preemptive multi-tasking solved this problem decades ago, back when most machines were single-core. Unresponsive tends to happen when you're running out of RAM, thrashing the swap file, so forward progress is many thousands of times slower.Endlong

It is about Little's Law: L = A x W.

  • A = Throughput
  • L = Capacity
  • W = Wait time

So throughput will be A = L / W.

We can increase throughput by either increasing the L (CPU cores/threads) or by reducing W.
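
A worked example (my numbers, purely for illustration): with L = 8 cores and W = 10 ms of wait per unit of work, A = 8 / 0.010 = 800 units per second; halving W to 5 ms doubles A to 1600, just as doubling L to 16 cores would.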

In 1:1 threading, A = (core count) / W. All platform threads do both the waiting and the work. Platform threads have more overhead and require more resources than user threads, so it can be very wasteful to let them just sit and wait. Creating and destroying platform threads involves a lot of system calls, context switches, pipeline stalls, and memory.

In M:1 threading we get A = (1 core) / W. The M lightweight user threads are there to reduce the waiting (latency) as much as possible so that our single core isn't sitting and waiting for I/O but is kept busy with real work. Creating, destroying, and context-switching user threads is much cheaper.

In M:N threading we have A = (core count) / W. The maximum throughput of the system is governed by the number of CPU cores we have. If we can eliminate all the waiting with the M lightweight user threads, then it is optimal to use N platform threads to keep N CPU cores at 100%. If more than N platform threads are used, they introduce unnecessary overhead and resource consumption: all the extra platform threads (N > core count) just wait in the CPU scheduler for a chance to run. The N platform threads are typically kept in a thread pool, so (re)creating and destroying them is not necessary.
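
A minimal sketch of that shape in Go (my illustration; the channel is a stand-in for a user-level scheduler's run queue): N platform-thread-backed workers drain M queued units of work:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	n := runtime.NumCPU() // N: one worker per core keeps every core busy
	m := 1000             // M: many small units of work

	tasks := make(chan int, m)
	for i := 0; i < m; i++ {
		tasks <- i
	}
	close(tasks)

	results := make([]int, m)
	var wg sync.WaitGroup
	for w := 0; w < n; w++ { // no more workers than cores: extras would just wait in the scheduler
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				results[t] = t * t // stand-in for real CPU-bound work
			}
		}()
	}
	wg.Wait()
	fmt.Println("processed", m, "tasks on", n, "workers")
}
```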

In summary: M is the waiting threads, N is the worker threads. In the special case where all tasks are CPU intensive, M:N threading has no advantage over 1:1 threading; in fact, the user-space scheduler adds a little overhead.

Instil answered 24/1 at 8:55 Comment(0)
