Haskell lightweight threads overhead and use on multicores

I've been reading the "Real World Haskell" book, the chapter on concurrency and parallelism. My questions are as follows:

  • Since Haskell threads are really just multiple "virtual" threads inside one "real" OS thread, does this mean that creating a lot of them (say, 1000) will not have a drastic impact on performance? I.e., can we say that the overhead of creating a Haskell thread with forkIO is (almost) negligible? Please give practical examples if possible.

  • Doesn't the concept of lightweight threads prevent us from using the benefits of multicore architectures? As I understand it, two Haskell threads cannot execute concurrently on two separate cores, because they are really a single thread from the operating system's point of view. Or does the Haskell runtime do some clever tricks to ensure that multiple CPUs can be used?

Snuck answered 1/5, 2011 at 9:43 Comment(2)
See also #3064152Minesweeper
And also #1921305Minesweeper

GHC's runtime provides an execution environment that supports billions of sparks and thousands of lightweight threads, which may be distributed over multiple hardware cores. Compile with -threaded and run with +RTS -N4 (or -N<n> in general) to set the desired number of cores.

(Diagram: sparks / threads / workers / cores)
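
To make the spark layer concrete, here is a minimal sketch using par and pseq from the parallel package. The file name and the naive Fibonacci workload are just illustrations, not part of the original answer; compiled with -threaded and run with +RTS -N4, the sparked branch may be evaluated on another core:

    -- Sketch: sparking pure work with par/pseq (requires the parallel package).
    -- Compile: ghc -threaded -rtsopts -O2 Sparks.hs
    -- Run:     ./Sparks +RTS -N4 -s    (-s prints spark/GC statistics)
    import Control.Parallel (par, pseq)

    -- Naive Fibonacci; the left branch is sparked so it can run in parallel.
    pfib :: Int -> Integer
    pfib n
      | n < 2     = fromIntegral n
      | otherwise = l `par` (r `pseq` (l + r))
      where
        l = pfib (n - 1)
        r = pfib (n - 2)

    main :: IO ()
    main = print (pfib 30)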

Specifically:

does this mean that creating a lot of them (like 1000) will not have a drastic impact on performance?

Well, creating 1,000,000 of them is certainly possible; 1000 is so cheap it won't even show up. You can see from thread-creation benchmarks, such as "thread-ring", that GHC is very, very good at this.
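
As a rough illustration (a sketch, not a benchmark from the answer; the file name and the trivial squaring workload are made up), forking 1000 threads with forkIO and collecting their results is only a few milliseconds of work on ordinary hardware:

    -- Sketch: fork 1000 lightweight threads and wait for all of them.
    -- Compile: ghc -threaded -rtsopts -O2 Forks.hs
    -- Run:     ./Forks +RTS -N4
    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM)

    main :: IO ()
    main = do
      dones <- forM [1 .. 1000 :: Int] $ \i -> do
        done <- newEmptyMVar
        _ <- forkIO (putMVar done (i * i))   -- trivial work per thread
        return done
      results <- mapM takeMVar dones
      print (sum results)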

Doesn't the concept of lightweight threads prevent us from using the benefits of multicore architectures?

Not at all. GHC has been running on multicores since 2004. The current status of the multicore runtime is tracked here.

How does it do it? The best place to read up on this architecture is in the paper, "Runtime Support for Multicore Haskell":

The GHC runtime system supports millions of lightweight threads by multiplexing them onto a handful of operating system threads, roughly one for each physical CPU. ...

Haskell threads are executed by a set of operating system threads, which we call worker threads. We maintain roughly one worker thread per physical CPU, but exactly which worker thread may vary from moment to moment ...

Since the worker thread may change, we maintain exactly one Haskell Execution Context (HEC) for each CPU. The HEC is a data structure that contains all the data that an OS worker thread requires in order to execute Haskell threads.
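
In user code these HECs show up as "capabilities". As a sketch (assuming a reasonably recent GHC; the file name is hypothetical), you can pin a Haskell thread to a particular capability with forkOn and ask which capability a thread is running on with threadCapability:

    -- Sketch: pinning threads to capabilities (the per-core HECs above).
    -- Compile: ghc -threaded -rtsopts -O2 Pin.hs
    -- Run:     ./Pin +RTS -N2
    import Control.Concurrent
      (forkOn, myThreadId, threadCapability, getNumCapabilities)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM, forM_)

    main :: IO ()
    main = do
      n <- getNumCapabilities
      vars <- forM [0 .. n - 1] $ \cap -> do
        v <- newEmptyMVar
        _ <- forkOn cap $ do               -- run this thread on capability 'cap'
          tid <- myThreadId
          (c, pinned) <- threadCapability tid
          putMVar v (c, pinned)
        return v
      forM_ vars $ \v -> takeMVar v >>= print   -- e.g. (0,True), (1,True), ...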

You can monitor your threads being created, and where they are executing, via ThreadScope. Here, for example, is a run of the binary-trees benchmark:

(ThreadScope screenshot of the binary-trees benchmark)
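
To produce a trace like this for your own program, a sketch (assuming GHC and ThreadScope are installed; the file name and the staggered dummy threads are illustrative) is to compile with -eventlog, run with +RTS -l, and open the resulting .eventlog file in ThreadScope:

    -- Sketch: a tiny program instrumented for ThreadScope.
    -- Compile: ghc -threaded -eventlog -rtsopts -O2 Trace.hs
    -- Run:     ./Trace +RTS -N4 -l      (writes Trace.eventlog)
    -- View:    threadscope Trace.eventlog
    import Control.Concurrent (forkIO, threadDelay)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM)

    main :: IO ()
    main = do
      vars <- forM [1 .. 8 :: Int] $ \i -> do
        v <- newEmptyMVar
        _ <- forkIO $ do
          threadDelay (i * 100000)   -- stagger the threads so they show up separately
          putMVar v ()
        return v
      mapM_ takeMVar vars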

Minesweeper answered 1/5, 2011 at 15:54 Comment(1)
Many thanks for the comprehensive answer and especially for referencing the paper on multicore Haskell.Snuck
  • The Warp webserver uses these lightweight threads extensively to get really good performance. Note that the other Haskell web servers also smoke the competition: this is more of a "Haskell is good" than "Warp is good."

  • Haskell provides a multithreaded runtime which can distribute lightweight threads across multiple system threads. It works very well for up to 4 cores. Past that, there are some performance issues, though those are being actively worked on.

Spica answered 1/5, 2011 at 10:37 Comment(7)
Do you have any references regarding the performance issues on >4 cores that you mention?Educator
Nothing published, no. I know of the problem from personal experience, and believe I've heard Johan mention that they're working on it. Sorry to be so vague.Spica
I'm a bit skeptical. See e.g. the speedups in Simon's recent paper: i.imgur.com/rWb7l.png -- from this .pdf research.microsoft.com/en-us/um/people/simonpj/papers/parallel/… -- similar results in the concurrent collections and data parallel papers are reported (scaling some problems up to the 32 or 48 core mark).Minesweeper
I'd be thrilled to hear that there is no issue scaling to >4 cores. I just know that Warp (and I believe Snap and Happstack) showed lower req/sec when providing a -N value greater than 3 (or 4, depending on the test). If recent changes in GHC mean this is not the case anymore, I'll happily eat my words.Spica
I believe the problem Michael is referring to is that the architecture of the IO manager isn't well suited to scaling beyond a few cores. In particular, it should probably use one IO manager thread per core. So this isn't a runtime issue, and it doesn't affect scaling for CPU-intensive workloads, but IO-intensive applications (such as web servers) might encounter the bottleneck. As far as I know nobody has done a thorough analysis yet.Keffer
Yes, Snap has also seen this problem. It has gotten better recently, but still seems to fall off after maybe 5 or 6 threads. Last time I did benchmarks Warp had faster absolute performance than Snap, but Snap scaled better beyond 4 cores. That's probably not specific enough to infer anything concrete, but I thought it was interesting.Sayyid
What is the status quo of the performance degradation with an increased number of cores (beyond 4-5) as of today?Cassis

Creating 1000 threads is relatively lightweight; don't worry about doing it. As for performance, you should just benchmark it.

As has been pointed out before, multiple cores work just fine. Several Haskell threads can run at the same time by being scheduled on different OS threads.
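
For completeness, a small sketch (assuming a reasonably recent GHC; the file name is made up) showing how the number of capabilities, i.e. the OS worker threads executing Haskell code, can be inspected and changed at runtime:

    -- Sketch: inspecting and changing the number of capabilities at runtime.
    -- Compile: ghc -threaded -rtsopts -O2 Caps.hs
    -- Run:     ./Caps +RTS -N2
    import Control.Concurrent (getNumCapabilities, setNumCapabilities)

    main :: IO ()
    main = do
      n <- getNumCapabilities
      putStrLn ("Started with " ++ show n ++ " capabilities")
      setNumCapabilities 4          -- allow Haskell threads to use four cores
      n' <- getNumCapabilities
      putStrLn ("Now running with " ++ show n' ++ " capabilities")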

Adena answered 1/5, 2011 at 12:46 Comment(0)
