Why are threads so expensive that event-driven non-blocking IO is better on benchmarks?

I recently started learning about Node.js, a JavaScript runtime built on top of V8 that is known for its non-blocking IO and incredible speed.

To my understanding, Node does not wait for IO to respond, but runs an event loop (similar to a game loop) that keeps checking unfinished operations and continues/completes them as soon as the IO responds. Node's performance was compared to Apache HTTPD, with Node being significantly faster while using less memory.
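
To make the contrast concrete, here is a minimal sketch using Node's built-in fs module (the file path is only a placeholder):

    const fs = require('fs');

    // Blocking: the calling thread is parked until the kernel has returned the data.
    const data = fs.readFileSync('/etc/hosts');

    // Non-blocking: the call returns immediately; the callback is queued and
    // run by the event loop once the read has completed.
    fs.readFile('/etc/hosts', (err, contents) => {
      if (err) throw err;
      console.log('read finished after other work was done');
    });

    console.log('this line runs before the asynchronous read has finished');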

Now if you read about Apache, you learn that it uses one thread per user, which supposedly slows it down significantly, and this is where my question comes in:

If you compare threads to what Node does internally in its event loop, you start seeing the similarities: both are abstractions of an unfinished operation that is waiting for a resource to respond, and both regularly check whether the operation has made progress and otherwise stay off the CPU for a while (at least I think a good blocking API sleeps for a few milliseconds before rechecking).

Now where is that striking, critical difference that makes threads so much worse?

Carnatic answered 28/10, 2012 at 3:10 Comment(0)

The difference here is context switching. Swapping threads by the OS requires:

  • saving the instruction pointer (done by the CPU)
  • saving the CPU registers (may not be necessary if the thread has made a blocking call, but is necessary if it is preempted)
  • swapping the call stacks. Even if the stacks reside in the same virtual memory space, this is at least one write and some reads, and it even applies to micro-threads (fibers).
  • in the case of switching to a different process, switching to kernel mode, updating the virtual memory tables and returning to user mode.

In the case of an event queue:

  • the state is updated. This needs to happen in any case.
  • the event handler returns. Instead of swapping the call stacks, the current call stack is popped.
  • the event queue is checked for pending requests. Only if there is no pending request, the application waits. This can be done by sleeping repeatedly (as suggested by the OP) or (better) by making a blocking call to the event queue. If the event queue (e.g. a set of TCP sockets) is managed by the OS, then the OS is responsible for notifying the application on a new event (a socket can accept more data).
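
As an illustration, here is a minimal sketch of this model in Node (the port number is arbitrary): the single JavaScript thread only runs handlers to completion, while the runtime (libuv) makes the blocking call to the OS event queue when nothing is pending.

    const net = require('net');

    // One process, one JavaScript thread. Each connection is just some state plus
    // entries in the event queue; no per-connection call stack is kept alive.
    const server = net.createServer((socket) => {
      socket.on('data', (chunk) => {
        // The handler runs to completion and returns: the current call stack
        // is popped instead of being swapped for another thread's stack.
        socket.write(chunk); // simple echo
      });
    });

    // When nothing is pending, the runtime blocks on the OS event queue
    // (epoll/kqueue/IOCP) and is woken only when a socket becomes readable.
    server.listen(8080);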

If the server is highly loaded, the only overhead of an event queue is a handler return, reading the queue and a handler call. In the threaded approach, there is additional overhead from swapping the threads.

Also, as mentioned by PST, the threaded approach introduces the need for locking. Locking itself is cheap, but waiting for a resource to be released by some other thread requires additional context switching, as the waiting thread cannot continue. It is even possible for a thread to be swapped in to acquire a lock only to be swapped out a few clock cycles later because it needs to lock another resource as well. Compare how much is done by the OS (reading the thread queue and swapping call stacks, at the very least) to how much is done by the event handler (returning from a call and making another call).

Lewandowski answered 28/10, 2012 at 3:44 Comment(0)

In one respect, it does depend on the threading implementation of the particular language. In general, however, it is the creation of a thread that is the costly part, not the running of it. As such, some platforms (like .NET) keep a pool of threads lying around, so you can grab one that is essentially already created, keeping costs down.
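
.NET's ThreadPool is the usual example; in today's Node the same reuse idea can be sketched with the worker_threads module (the './task.js' worker script and the pool size are placeholders):

    const { Worker } = require('worker_threads');

    // Creating a Worker spins up an OS thread plus a fresh V8 isolate, which is
    // the expensive part. Creating a small pool once at startup amortises that
    // cost, so every job reuses an already-running thread.
    const pool = Array.from({ length: 4 }, () => new Worker('./task.js'));
    let next = 0;

    function dispatch(job) {
      // Naive round-robin: hand the job to an existing worker instead of
      // spawning a new thread for every request.
      pool[next].postMessage(job);
      next = (next + 1) % pool.length;
    }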

The problem with threads, from what a professor told me, is also that every language has the equivalent of a Thread.Yield() function, but hardly anyone actually makes use of it; so every thread you encounter is extremely aggressive in its scheduling, which sets up all sorts of wars between mutexes and semaphores, and some threads, due to the level of aggression involved, never actually run, which is a problem in and of itself.

The benefit of threads is that they offload work from other loops, like the GUI loop, at the cost of increased complexity. Events, to the best of my knowledge, still run in a single thread (unless specifically told to do otherwise).

Garling answered 28/10, 2012 at 3:32 Comment(6)
I disagree. The issue with threads often isn't with the creation - that is, assume that a basic HTTP request takes "significantly longer" than the creation of a thread to handle it - but rather contention, context switching and the resource usage needed to maintain the thread states. So the issue generally isn't with 10 or even 200 threads, really, but with n threads, where n is above some threshold. Scalability papers often compare, say, 10,000 "native" threads to 10,000 continuations. Of course, considering that some thread implementations are really "green", then .. apples and oranges. – Monte
Those issues can be significant, I grant you, and within a particular usage scenario, such as HTTP (which the author does allude to with his mention of JavaScript), they could be the top factor in determining their usage. However, when it comes to the general usage of threads, it really does come down to creation costs. developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/… – Garling
No, it does not come down to creation costs. 90 microseconds? Look at the other numbers, and then we can stop worrying about the creation "cost". – Monte
Again, that comes down to the language you are developing in. C/C++ will in general be remarkably fast, as will Python (which provides a thin wrapper around pthreads), while other languages will have significantly more wrapping / functions that can make those costs climb. And 90 microseconds is a fairly lengthy period of time, from the computer's standpoint. It's been a while, but I believe it averages 30 microseconds per function (statistics, yes). – Garling
And again .. creation is not the factor. Even if creation took 5 milliseconds (or 5000 microseconds), it would only be 10% of a 50 ms request. The problem is usually scalability when dealing with many concurrent threads. (Of course, for very small operations - e.g. each successively smaller branch in a mergesort - the cost of creation plays a significant role.) – Monte
Hmm, that linked table is somewhat confusing, as it lists "running" costs together with "creation" costs. – Monte
