Is non-blocking I/O really faster than multi-threaded blocking I/O? How?

I searched the web for technical details about blocking I/O and non-blocking I/O, and I found several people stating that non-blocking I/O would be faster than blocking I/O, for example in this document.

If I use blocking I/O, then of course the thread that is currently blocked can't do anything else... because it's blocked. But as soon as a thread starts being blocked, the OS can switch to another thread and not switch back until there is something to do for the blocked thread. So as long as there is another thread on the system that needs CPU and is not blocked, there should not be any more CPU idle time compared to an event-based non-blocking approach, should there?

Besides reducing the time the CPU is idle, I see one more option to increase the number of tasks a computer can perform in a given time frame: reduce the overhead introduced by switching threads. But how can this be done? And is the overhead large enough to show measurable effects? Here is how I picture it working:

  1. To load the contents of a file, an application delegates this task to an event-based I/O framework, passing a callback function along with a filename.
  2. The event framework delegates to the operating system, which programs a DMA controller of the hard disk to write the file directly to memory.
  3. The event framework allows further code to run.
  4. Upon completion of the disk-to-memory copy, the DMA controller causes an interrupt.
  5. The operating system's interrupt handler notifies the event-based I/O framework about the file being completely loaded into memory. How does it do that? Using a signal??
  6. The code that is currently running within the event I/O framework finishes.
  7. The event-based I/O framework checks its queue, sees the operating system's message from step 5, and executes the callback it got in step 1.

Is that how it works? If not, how does it work? Does that mean the event system can work without ever having to explicitly touch the stack (unlike a real scheduler, which would need to back up the stack and copy the stack of another thread into memory while switching threads)? How much time does this actually save? Is there more to it?
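
To make this concrete, here is a minimal sketch of such a delegate-and-callback interface using POSIX AIO (aio_read), which matches the shape of steps 1 to 7: you hand off a read and get notified on completion via a signal or a callback. Whether the kernel services the request directly or a library emulates it with background threads is exactly the implementation detail I am asking about.

    // Minimal sketch using POSIX AIO (compile with: g++ aio_demo.cpp -lrt).
    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    static char buffer[4096];

    // Step 7: the callback, invoked after the read has completed.
    static void on_read_done(sigval sv) {
        aiocb *cb = static_cast<aiocb *>(sv.sival_ptr);
        if (aio_error(cb) == 0)
            printf("read %zd bytes\n", aio_return(cb));
    }

    int main() {
        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) return 1;

        aiocb cb{};                       // step 1: describe the request
        cb.aio_fildes = fd;
        cb.aio_buf    = buffer;
        cb.aio_nbytes = sizeof buffer;
        cb.aio_offset = 0;
        // Step 5: request a callback on completion (SIGEV_SIGNAL would
        // deliver a signal instead).
        cb.aio_sigevent.sigev_notify          = SIGEV_THREAD;
        cb.aio_sigevent.sigev_notify_function = on_read_done;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        aio_read(&cb);                    // step 2: hand the read to the system
        // Step 3: this thread is free to do other work here...
        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);                 // a real program would not poll
        close(fd);
    }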

Shluh answered 17/12, 2011 at 16:45 Comment(9)
Short answer: it's more about the overhead of having a thread per connection. Non-blocking I/O lets one avoid having a thread per connection.Wellestablished
Blocking I/O is expensive on a system where you cannot create as many threads as there are connections. On the JVM you can create a few thousand threads, but what if you have over 100,000 connections? Then you have to stick to an asynchronous solution. However, there are languages where threads are cheap (green threads), like Go/Erlang/Rust, where it is not a problem to have 100,000 threads. When the number of threads can be large, I believe blocking I/O yields faster response times. But that is something I would also have to ask the experts about, whether it holds true in reality.Evette
@OliverPlow, I think so too, because blocking IO usually means we let the system handle the "parallel management", instead of doing it ourselves using task queues and such.Drier
@DanD., And what if the overhead of having threads is equal to the overhead of performing non-blocking I/O? (usually true in the case of green threads)Drier
"copying the stack" doesn't happen. Different threads have their stacks at different addresses. Each thread has its own stack pointer, along with other registers. A context switch saves/restores just architectural state (including all the registers), but not memory. Between threads in the same process, the kernel doesn't even have to change the page tables.Brigandage
@Evette I can imagine having 100,000 total connections per hour. But 100,000 simultaneous open connections in a single OS process? If someone used non-blocking I/O for such a huge system, what about debugging and stack traces? I thought non-blocking systems had much more cryptic stack traces...Induna
@Induna No cryptic thing. Each thread has its OWN stack, that's all.Excision
Languages like Go and Elixir can solve this with lightweight threads, without requiring async-style coding.Ravens
@Evette Those languages work by using non-blocking I/O.Mastery

The biggest advantage of non-blocking or asynchronous I/O is that your thread can continue its work in parallel. Of course, you could achieve this with an additional thread as well. But as you stated, for the best overall (system) performance I guess it would be better to use asynchronous I/O rather than multiple threads (thus reducing thread switching).

Let's look at possible implementations of a network server program that must handle 1000 clients connected in parallel:

  1. One thread per connection (can be blocking I/O, but can also be non-blocking I/O).
    Each thread requires memory resources (also kernel memory!), which is a disadvantage, and every additional thread means more work for the scheduler.
  2. One thread for all connections.
    This reduces the load on the system because we have fewer threads. But it also prevents you from using the full performance of your machine, because you might end up driving one processor core to 100% while all the other cores idle around.
  3. A few threads where each thread handles some of the connections.
    This reduces the load on the system because there are fewer threads, and it can still use all available processors. On Windows this approach is supported by the Thread Pool API (a sketch follows this list).
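
On Linux, a minimal sketch of the third approach could look like the following. This is only an illustration (it uses epoll rather than the Windows Thread Pool API, and error handling, non-blocking socket mode and the actual protocol work are omitted): a few worker threads, each sleeping in its own epoll loop and woken only when one of its connections is ready.

    // Sketch of approach 3 on Linux: a few worker threads, each running its
    // own epoll loop over a share of the connections.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>
    #include <vector>

    static void worker(int epfd) {
        epoll_event events[64];
        char buf[4096];
        for (;;) {
            // Sleep until one of this worker's sockets becomes readable.
            int n = epoll_wait(epfd, events, 64, -1);
            for (int i = 0; i < n; ++i) {
                int fd = events[i].data.fd;
                ssize_t r = read(fd, buf, sizeof buf);
                if (r <= 0) { close(fd); continue; }
                // ... parse the request and queue a response here ...
            }
        }
    }

    int main() {
        const int kWorkers = 4;            // "a few threads"
        std::vector<int> epfds;
        std::vector<std::thread> threads;
        for (int i = 0; i < kWorkers; ++i) {
            epfds.push_back(epoll_create1(0));
            threads.emplace_back(worker, epfds.back());
        }

        int listener = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);
        bind(listener, reinterpret_cast<sockaddr *>(&addr), sizeof addr);
        listen(listener, SOMAXCONN);

        // Distribute incoming connections round-robin over the workers.
        for (int next = 0;; next = (next + 1) % kWorkers) {
            int conn = accept(listener, nullptr, nullptr);
            epoll_event ev{};
            ev.events = EPOLLIN;
            ev.data.fd = conn;
            epoll_ctl(epfds[next], EPOLL_CTL_ADD, conn, &ev);
        }
    }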

Of course, having more threads is not per se a problem. As you might have recognized, I chose quite a high number of connections/threads. I doubt that you'll see any difference between the three possible implementations if we are talking about only a dozen threads (this is also what Raymond Chen suggests in the MSDN blog post Does Windows have a limit of 2000 threads per process?).

On Windows, using unbuffered file I/O means that writes must be of a size that is a multiple of the sector size (and buffers and file offsets must be aligned accordingly). I have not tested it, but it sounds like such alignment could also positively affect write performance for buffered synchronous and asynchronous writes.

Steps 1 to 7 that you describe give a good idea of how it works. On Windows the operating system will inform you about completion of an asynchronous I/O operation (WriteFile with an OVERLAPPED structure) using an event or a callback. Callback functions are only invoked, for example, when your code calls WaitForMultipleObjectsEx with bAlertable set to TRUE.
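
As a rough sketch of the callback variant (assuming a file opened with FILE_FLAG_OVERLAPPED; error handling omitted): the completion routine passed to WriteFileEx only runs while the issuing thread is in an alertable wait.

    // Sketch: asynchronous write with a completion routine on Windows.
    #include <windows.h>
    #include <cstdio>

    static bool g_done = false;

    // Runs on the issuing thread, but only during an alertable wait.
    static VOID CALLBACK OnWriteDone(DWORD error, DWORD bytes, LPOVERLAPPED) {
        printf("write finished: error=%lu, bytes=%lu\n", error, bytes);
        g_done = true;
    }

    int main() {
        HANDLE file = CreateFileA("test.bin", GENERIC_WRITE, 0, nullptr,
                                  CREATE_ALWAYS,
                                  FILE_FLAG_OVERLAPPED,   // enable async I/O
                                  nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        static const char data[] = "hello";
        OVERLAPPED ov{};                    // write at offset 0
        WriteFileEx(file, data, sizeof data, &ov, OnWriteDone);

        // ... do other work here; the write proceeds in the background ...

        while (!g_done)
            SleepEx(INFINITE, TRUE);        // alertable: callbacks run here
        CloseHandle(file);
    }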

Cabbageworm answered 17/12, 2011 at 17:11 Comment(9)
From the web point of view, common knowledge (Internet, comments from experts) suggests that greatly increasing the maximum number of request threads is a bad thing in blocking I/O (it makes processing of requests even slower) due to increased memory use and context-switching time. But isn't async I/O doing the same thing when it defers the job to another thread? Yes, you can serve more requests now, but you have the same number of threads in the background... what's the real benefit of that?Loyalty
@Loyalty You seem to believe that if n threads do async file I/O another n threads will be created to do blocking file I/O? This is not true. The OS has async file I/O support and it does not need to block while waiting for I/O to complete. It can queue I/O requests, and when a hardware (e.g. DMA) interrupt occurs it can mark the request as done and set an event that signals the caller's thread. Even if an extra thread were required, the OS would be able to use that thread for multiple I/O requests from multiple threads.Cabbageworm
Thanks, it makes sense involving the OS async file I/O support, but when I write code for an actual implementation of this (from the web point of view), say with Java Servlet 3.0 NIO, I still see a thread for the request and a background thread (async) looping to read a file, database or whatever.Loyalty
@Loyalty This looks like an implementation detail of the Java runtime.Cabbageworm
@WernerHenze: You wrote, "But depending on the number of threads you'll surely not see any difference between asynchronous IO and using only a few write threads." What is the reason behind this? I can't figure out why, if my number of threads is small, doing non-blocking I/O on them would not be of much use.Goingson
@Goingson I rewrote my answer. I hope it is clearer now.Cabbageworm
Yep, the picture is clearer. So it means that if the I/O work is actually heavy, significant performance can be gained by using non-blocking I/O. If it's not that heavy, then we won't see much of a performance difference compared to a few threads doing the blocking I/O in the background. Am I right to think this way? There is a question asked by me at #34878205. I could somehow relate the answer through this post. Can you please help me look for the exact answer there?Goingson
On Windows using asynchronous file I/O means that writes must be of a size which is a multiple of the page size. - no, it doesn't. You're thinking of unbuffered I/O. (They are often used together, but they don't have to be.)Gallery
@HarryJohnston You are right. Thanks for pointing that out. I adjusted my answer.Cabbageworm

I/O includes multiple kinds of operations, like reading and writing data from hard drives, accessing network resources, calling web services or retrieving data from databases. Depending on the platform and on the kind of operation, asynchronous I/O will usually take advantage of any hardware or low-level system support for performing the operation. This means that it will be performed with as little impact as possible on the CPU.

At application level, asynchronous I/O prevents threads from having to wait for I/O operations to complete. As soon as an asynchronous I/O operation is started, it releases the thread on which it was launched and a callback is registered. When the operation completes, the callback is queued for execution on the first available thread.

If the I/O operation is executed synchronously, it keeps its running thread doing nothing until the operation completes. The runtime doesn't know when the I/O operation completes, so it will periodically provide some CPU time to the waiting thread, CPU time that could otherwise have been used by other threads that have actual CPU-bound operations to perform.

So, as @user1629468 mentioned, asynchronous I/O does not provide better performance but rather better scalability. This is obvious when running in contexts that have a limited number of threads available, as is the case with web applications. Web applications usually use a thread pool from which they assign threads to each request. If requests are blocked on long-running I/O operations, there is the risk of depleting the pool and making the web application freeze or slow to respond.

One thing I have noticed is that asynchronous I/O isn't the best option when dealing with very fast I/O operations. In that case the benefit of not keeping a thread busy while waiting for the I/O operation to complete is not very important, and the fact that the operation is started on one thread and completed on another adds overhead to the overall execution.

You can read more detailed research I have recently done on the topic of asynchronous I/O vs. multithreading here.

Bizerte answered 13/5, 2013 at 18:38 Comment(6)
I wonder if it would be worth making a distinction between I/O operations that are expected to complete, and things that might not [e.g. "get next character that arrives on a serial port", in cases where the remote device may or may not send anything]. If an I/O operation is expected to complete within a reasonable time, one might delay cleanup of the related resources until the operation completes. If the operation might never complete, however, such delay would be unreasonable.Selfrighteous
@Selfrighteous the scenario you are describing is used in lower-level applications and libraries. Servers rely on it, since they continuously wait for incoming connections. Async I/O as described above cannot fit this scenario because it is based on starting a specific operation and registering a callback for its completion. In the case you are describing you need to register a callback on a system event and process every notification. You are continuously processing input rather than performing operations. As said, this is usually done at a low level, almost never in your apps.Bizerte
The pattern is pretty common with applications that come with various types of hardware. Serial ports aren't as common as they used to be, but USB chips which emulate serial ports are pretty popular in the design of specialized hardware. Characters from such things are handled at the application level, since the OS will have no way of knowing that a sequence of input characters means e.g. a cash drawer was opened and a notification should be sent somewhere.Selfrighteous
I don't think the part about the CPU cost of blocking I/O is accurate: when in the blocked state, a thread that triggered blocking I/O is put on wait by the OS and does not cost CPU cycles until the I/O has fully completed; only then does the OS (notified by interrupts) resume the blocked thread. What you described (busy waiting by long polling) is not how blocking I/O is implemented in almost any runtime/compiler.Garget
The threads you mention above are kernel-space threads, I assume, right?Veliz
@FlorinDumitrescu The link is dead.Bein

To presume a speed improvement due to any form of multi-computing, you must presume either that multiple CPU-based tasks are being executed concurrently upon multiple computing resources (generally processor cores), or that not all of the tasks rely on the concurrent usage of the same resource -- that is, some tasks may depend on one system subcomponent (disk storage, say), some on another (receiving communication from a peripheral device), and still others may require usage of processor cores.

The first scenario is often referred to as "parallel" programming. The second scenario is often referred to as "concurrent" or "asynchronous" programming, although "concurrent" is sometimes also used to refer to the case of merely allowing an operating system to interleave execution of multiple tasks, regardless of whether such execution must take place serially or if multiple resources can be used to achieve parallel execution. In this latter case, "concurrent" generally refers to the way that execution is written in the program, rather than from the perspective of the actual simultaneity of task execution.

It's very easy to speak about all of this with tacit assumptions. For example, some are quick to make a claim such as "asynchronous I/O will be faster than multi-threaded I/O." This claim is dubious for several reasons. First, it could be the case that some given asynchronous I/O framework is implemented precisely with multi-threading, in which case they are one and the same, and it doesn't make sense to say one concept "is faster than" the other.

Second, even in the case when there is a single-threaded implementation of an asynchronous framework (such as a single-threaded event loop), you must still make an assumption about what that loop is doing. For example, one silly thing you can do with a single-threaded event loop is ask it to asynchronously complete two different purely CPU-bound tasks. If you did this on a machine with only an idealized single processor core (ignoring modern hardware optimizations), then performing the tasks "asynchronously" wouldn't really perform any differently than performing them with two independently managed threads, or with just one lone process -- the difference might come down to thread context switching or operating system scheduler optimizations, but if both tasks are going to the CPU it would be similar in either case.

It is useful to imagine a lot of the unusual or stupid corner cases you might run into.

"Asynchronous" does not have to be concurrent, for example just as above: you "asynchronously" execute two CPU-bound tasks on a machine with exactly one processor core.

Multi-threaded execution doesn't have to be concurrent: you spawn two threads on a machine with a single processor core, or ask two threads to acquire any other kind of scarce resource (imagine, say, a network database that can only establish one connection at a time). The threads' execution might be interleaved however the operating system scheduler sees fit, but their total runtime cannot be reduced (and will be increased from the thread context switching) on a single core (or more generally, if you spawn more threads than there are cores to run them, or have more threads asking for a resource than what the resource can sustain). This same thing goes for multi-processing as well.

So neither asynchronous I/O nor multi-threading has to offer any performance gain in terms of run time. They can even slow things down.
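
To make that single-core corner case concrete, here is a small C++ sketch with a hypothetical CPU-bound workload: pinned to one core, the "two threads" version takes about as long as the serial one, and any gain you do measure on a normal machine comes from the extra cores, not from the asynchrony itself.

    // Two purely CPU-bound tasks: serial vs. two threads. On a single core
    // (e.g. run under `taskset -c 0` on Linux) the timings are about equal.
    #include <chrono>
    #include <cstdio>
    #include <future>

    static volatile long g_sink;        // keeps the optimizer honest

    static long burn() {                // a purely CPU-bound task
        long x = 0;
        for (long i = 0; i < 200000000; ++i) x += i % 7;
        return x;
    }

    template <class F> static double seconds(F f) {
        auto t0 = std::chrono::steady_clock::now();
        f();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        double serial = seconds([] { g_sink = burn(); g_sink = burn(); });
        double threaded = seconds([] {
            auto a = std::async(std::launch::async, burn);
            auto b = std::async(std::launch::async, burn);
            g_sink = a.get(); g_sink = b.get();
        });
        printf("serial: %.2f s, two threads: %.2f s\n", serial, threaded);
    }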

If you define a specific use case, however, like a specific program that both makes a network call to retrieve data from a network-connected resource like a remote database and also does some local CPU-bound computation, then you can start to reason about the performance differences between the two methods given a particular assumption about hardware.

The questions to ask: How many computational steps do I need to perform and how many independent systems of resources are there to perform them? Are there subsets of the computational steps that require usage of independent system subcomponents and can benefit from doing so concurrently? How many processor cores do I have and what is the overhead for using multiple processors or threads to complete tasks on separate cores?

If your tasks largely rely on independent subsystems, then an asynchronous solution might be good. If the number of threads needed to handle them were so large that context switching became non-trivial for the operating system, then a single-threaded asynchronous solution might be better.

Whenever the tasks are bound by the same resource (e.g. multiple tasks need to concurrently access the same network or local resource), then multi-threading will probably introduce unsatisfactory overhead, and while single-threaded asynchrony may introduce less overhead, in such a resource-limited situation it too cannot produce a speed-up. In such a case, the only option (if you want a speed-up) is to make multiple copies of that resource available (e.g. multiple processor cores if the scarce resource is CPU; a better database that supports more concurrent connections if the scarce resource is a connection-limited database; etc.).

Another way to put it is: allowing the operating system to interleave the usage of a single resource for two tasks cannot be faster than merely letting one task use the resource while the other waits, then letting the second task finish serially. Further, the scheduler cost of interleaving means that in any real situation it actually creates a slowdown. It doesn't matter if the interleaved usage occurs on the CPU, a network resource, a memory resource, a peripheral device, or any other system resource.

Toggle answered 1/6, 2015 at 0:22 Comment(0)

The main reason to use AIO is scalability. When viewed in the context of a few threads, the benefits are not obvious. But when the system scales to thousands of threads, AIO will offer much better performance. The caveat is that the AIO library should not introduce further bottlenecks.

Adinaadine answered 4/1, 2013 at 10:01 Comment(0)

One possible implementation of non-blocking I/O is exactly what you said: a pool of background threads that do blocking I/O and notify the originator of the I/O via some callback mechanism. In fact, this is how the AIO module in glibc works. Here are some vague details about the implementation.

While this is a good solution that is quite portable (as long as you have threads), the OS is typically able to service non-blocking I/O more efficiently. This Wikipedia article lists possible implementations besides the thread pool.
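
The pattern itself fits in a few lines. This is only an illustration of the idea, not glibc's actual code (which queues requests and reuses a bounded pool of helper threads):

    // Emulating an asynchronous read: a helper thread performs a blocking
    // read and then notifies the originator through a callback. A real
    // implementation would hand the callback back to the originator's
    // event loop instead of running it on the helper thread.
    #include <functional>
    #include <thread>
    #include <unistd.h>

    void async_read(int fd, void *buf, size_t n,
                    std::function<void(ssize_t)> on_done) {
        std::thread([=] {
            ssize_t r = read(fd, buf, n);   // blocks only this helper thread
            on_done(r);                     // notify the caller
        }).detach();
    }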

Apophyge answered 17/12, 2011 at 17:10 Comment(0)

I am currently in the process of implementing asynchronous I/O on an embedded platform using protothreads. Non-blocking I/O makes the difference between running at 16,000 fps and 160 fps. The biggest benefit of non-blocking I/O is that you can structure your code to do other things while the hardware does its thing. Even initialization of devices can be done in parallel.

Martin

Babby answered 20/2, 2015 at 20:27 Comment(0)

In Node, multiple threads are being launched, but it's a layer down in the C++ run-time.

"So Yes NodeJS is single threaded, but this is a half truth, actually it is event-driven and single-threaded with background workers. The main event loop is single-threaded but most of the I/O works run on separate threads, because the I/O APIs in Node.js are asynchronous/non-blocking by design, in order to accommodate the event loop. "

https://codeburst.io/how-node-js-single-thread-mechanism-work-understanding-event-loop-in-nodejs-230f7440b0ea

"Node.js is non-blocking which means that all functions ( callbacks ) are delegated to the event loop and they are ( or can be ) executed by different threads. That is handled by Node.js run-time."

https://itnext.io/multi-threading-and-multi-process-in-node-js-ffa5bb5cde98 

The "Node is faster because it's non-blocking..." explanation is a bit of marketing and this is a great question. It's efficient and scaleable, but not exactly single threaded.

Vidette answered 27/1, 2019 at 17:39 Comment(0)

Let me give you a counterexample where asynchronous I/O does not work. I am writing a proxy similar to the one below, using boost::asio. https://github.com/ArashPartow/proxy/blob/master/tcpproxy_server.cpp

However, in my scenario, incoming messages (from the client side) are fast while outgoing messages (to the server side) are slow for one session. To keep up with the incoming speed, or to maximize the total proxy throughput, we have to use multiple sessions under one connection.

Thus this async I/O framework alone does not work anymore. We do need a thread pool to send to the server, assigning each thread a session.

Mcfarlane answered 11/2, 2019 at 5:45 Comment(0)

The improvement, as far as I know, is that asynchronous I/O uses (I'm talking about MS systems, just to clarify) so-called I/O completion ports. By making asynchronous calls, the framework leverages such architecture automatically, and this is supposed to be much more efficient than the standard threading mechanism. As a personal experience I can say that your application feels noticeably more responsive if you prefer async calls over blocking threads.
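
For reference, the completion-port pattern boils down to a loop like this (a bare sketch with error handling omitted): handles are associated with one port, and a small pool of worker threads blocks in GetQueuedCompletionStatus, waking only when an I/O operation completes.

    // Bare sketch of an I/O completion port.
    #include <windows.h>
    #include <cstdio>

    int main() {
        // One port for the process; 0 = one active thread per core.
        HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
                                             nullptr, 0, 0);

        HANDLE file = CreateFileA("test.bin", GENERIC_READ, 0, nullptr,
                                  OPEN_EXISTING, FILE_FLAG_OVERLAPPED,
                                  nullptr);
        CreateIoCompletionPort(file, iocp, /*completion key*/ 1, 0);

        static char buf[4096];
        OVERLAPPED ov{};
        // Returns FALSE with ERROR_IO_PENDING while the read proceeds.
        ReadFile(file, buf, sizeof buf, nullptr, &ov);

        // Worker loop (normally one such loop per worker thread):
        DWORD bytes; ULONG_PTR key; LPOVERLAPPED pov;
        if (GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE))
            printf("key %llu completed, %lu bytes\n",
                   (unsigned long long)key, bytes);
        CloseHandle(file);
        CloseHandle(iocp);
    }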

Vanquish answered 17/12, 2011 at 17:2 Comment(0)
