I/O completion port's advantages and disadvantages
Why do many people say the I/O completion port is a fast and nice model?
What are the I/O completion port's advantages and disadvantages?

I want to know what makes the I/O completion port faster than other approaches.

If you can explain it in comparison with other models (select, epoll, traditional multithread/multiprocess), that would be even better.

Hoffert answered 12/3, 2011 at 14:18 Comment(1)
For those still interested: Multithreaded Asynchronous I/O & I/O Completion Ports - Dr. Dobb'sFeune
I/O completion ports are awesome. There's no better word to describe them. If anything in Windows was done right, it's completion ports.

You can create some number of threads (it does not really matter how many) and make them all block on one completion port until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives. Then the completion port will wake one thread to handle each event, up to the concurrency limit that you specified. If you didn't specify anything, it will assume "up to the number of CPU cores", which is really nice.

If more threads than the maximum limit are already active, it will wait until one of them is done and then hand the event to a thread as soon as it enters the wait state. Also, it will always wake threads in LIFO order, so chances are that caches are still warm.

In other words, completion ports are a no-fuss "poll for events" as well as "fill CPU as much as you can" solution.

You can throw file reads and writes at a completion port, sockets, or anything else that's waitable. And, you can post your own events if you want. Each custom event has at least one integer and one pointer worth of data (if you use the default structure), but you are not really limited to that as the system will happily accept any other structure too.

Also, completion ports are fast, really really fast. Once upon a time, I needed to notify one thread from another. As it happened, that thread already had a completion port for file I/O, but it didn't pump messages. So, I wondered if I should just bite the bullet and use the completion port for simplicity, even though posting a thread message would obviously be much more efficient. I was undecided, so I benchmarked. Surprise, it turned out completion ports were about 3 times faster. So... faster and more flexible, the decision was not hard.

Willywilly answered 12/3, 2011 at 18:33 Comment(16)
I/O wait in general is done extremely well in Windows, whether it's IOCP, overlapped with events, overlapped with completion routines, or waiting for all of the above simultaneously with a semaphore, completion of a child process, and UI messages.Dactylic
I asked why completion ports are so fast. The LIFO wake-up order is one of the points I wanted, but the other sentences are not. Anyway, +1 for LIFOHoffert
One nice thing to know is that even with ReadFile/WriteFile you can extend the OVERLAPPED structure arbitrarily. Just embed it in a bigger struct and use CONTAINING_RECORD to retrieve your extra data.Argybargy
"... until an event (either one you post manually, or an event from a timer or asynchronous I/O, or whatever) arrives." Are you sure you can poll a timer with IOCP? The answer to stackoverflow.com/questions/3239080 says otherwise.Churning
@Joey Adams: I was going to say "of course" and link to a working example that you could try for yourself, but embarrassingly enough that working example doesn't work. Which is doubly embarrassing for me, because it means not only that the above statement about timers is wrong, but also that I have to explain to my boss that I built shit into software that shipped 6 months ago and didn't notice, and no one in QA noticed either (this was to guarantee a deadline in case no other events come in; it must have worked accidentally during QA because enough other events were always available).Willywilly
As for the bugfix, in case you need to combine a timer and a completion port, one can work around this under Vista/7/8 by using the callback function with the timer (which can, again, manually post an event to the completion port). Unfortunately, this needs GetQueuedCompletionStatusEx to be alertable, so no 2000/XP support.Willywilly
@Damon: Thanks for the follow-ups. Another option would be to pass a timeout to GetQueuedCompletionStatus. If your app needs to time things out in multiple places, you could use a priority queue to track expirations, and have a worker thread call GetQueuedCompletionStatus over and over with the nearest expiration as the timeout. Wake the worker thread using PostQueuedCompletionStatus.Churning
@Damon: Is there any way to limit the size of an IOCP queue? I have numerous read operations that I want to kick off, but I don't want to kick them off all at once (it requires too much memory) -- instead I want to limit it to only N items in flight at a time.Gros
@Mehrdad: The size of the queue is (or at least seems to be?) infinite, I've successfully pushed tens of millions of items without blocking or getting an error. There's no way to configure this to my knowledge. On the other hand, having only N items in flight is a different thing, and possible (even automatic). The IOCP will only allow N (as specified at creation, defaulting to the number of CPUs) threads to pass, much like a semaphore. It will let another one pass if one of the workers blocks, so you may occasionally have N+1 or so threads running, but in general it's pretty rock-solid.Willywilly
@Mehrdad: If you want to limit the number of actual reads (not the number of workers handling their completion) because of memory constraints, you can however push tasks that initiate these onto the IOCP (wrap an OVERLAPPED in a struct with parameters and a function pointer). That'll work. I've abused IOCPs for pure inter-thread communication/synchronization (which is basically what that is), which works just fine, also with actual I/O on the same IOCP at the same time.Willywilly
@Damon: Thanks for the reply. That's an interesting trick (I have to think about it more to see whether it would still be high-throughput), but just a couple of minutes ago I realized I can simply use a semaphore :) it's perfect for this. The worker threads simply call ReleaseSemaphore and the master thread that queues the reads waits for the semaphore with WaitForSingleObject. This limits the number of in-flight reads to whatever I want.Gros
@Mehrdad: Semaphore is the other way around, unless I understand your intent wrong. The "master" releases the semaphore and the worker waits on it. As for high-throughput on IOCP, I've benchmarked that about 5-6 years ago and found that IOCP easily pushed through 300k events per second. Which, at least for me, clearly ruled it out as limiting factor, speed-wise (in fact, a hand-written lockfree queue hardly does more than 2-3 times better under heavy load, in a non-cheating non-artificial benchmark, so seeing how something that already works existed I was quite happy with using IOCP for that).Willywilly
@Damon: No I think you misunderstood, the workers release the semaphore and the master waits on it. This ensures that the master never pushes more than N items on the queue, which ensures the system is performing no more than N reads simultaneously.Gros
@Damon: I have another question if you don't mind: once I've kicked off all the reads, what's the proper way for me to tell the worker threads that I'm "done" and that they should stop pulling from the queue once the existing reads have been processed?Gros
Normally you have M tasks and N+X workers where N is the number of cores, X is a little extra (like 1 to 3) in case a worker blocks, M≫N. You post (release) a semaphore from the controlling thread so you have N +/- 1 threads running at all times while there remain tasks. The IOCP does that semi-automatically. To have threads exit when all work is done, for IOCP you can simply post a completion message with some recognizable magic number (I use all zeroes for key, len, and overlapped*). With a semaphore, the easiest way is to check a global bool every time the wait returns.Willywilly
Too bad IOCP fails hard for stdin, stdout, and anonymous pipes.Generatrix
By using IOCP, we can overcome the "one-thread-per-client" problem. It is commonly known that performance decreases heavily if the software does not run on a true multiprocessor machine. Threads are system resources that are neither unlimited nor cheap.

IOCP provides a way to have a few (I/O worker) threads handle multiple clients' input/output "fairly". The threads are suspended and don't use CPU cycles until there is something to do.

You can also read more in this nice book: http://www.amazon.com/Windows-System-Programming-Johnson-Hart/dp/0321256190

Boodle answered 12/3, 2011 at 14:27 Comment(4)
OVERLAPPED I/O overcomes "one-thread-per-client" very well, thank you. What IOCP brings to the table is (as you have correctly mentioned) sharing the load between multiple threads. For most applications, using OVERLAPPED I/O from a single thread is simpler and more efficient. Only high-volume application servers should consider IOCP.Dactylic
@Ben: To be honest, I find that IOCP programming leads to a more understandable programming style than you get using overlapped I/O. This is especially true if you've got multiple operations going on simultaneously (like you would when copying a file).Ardellearden
@Larry: I don't know how you handle overlapped I/O, but I use MsgWaitForMultipleObjectsEx and completion routines. Same state-machine programming style as IOCP and no need for thread synchronization.Dactylic
@Ben Voigt +1 Fully agree with you; it's needed only for high-volume application serversBoodle
I/O completion ports are provided by the OS for asynchronous I/O operations, which means the I/O occurs in the background (usually in hardware). The system does not waste any resources (e.g. threads) waiting for the I/O to complete. When the I/O is complete, the hardware sends an interrupt to the OS, which then wakes up the relevant process/thread to handle the result. WRONG: IOCP does NOT require hardware support (see comments below)

Typically a single thread can wait on a large number of I/O completions while taking up very little resources when the I/O has not returned.

Other async models that are not based on I/O completion ports usually employ a thread pool and have threads wait for I/O to complete, thereby using more system resources.

The flip side is that I/O completion ports usually require hardware support, and so they are not generally applicable to all async scenarios.

Thuggee answered 12/3, 2011 at 14:29 Comment(11)
What hardware support? I don't think the IOCP mechanism is related to hardware.Hoffert
Async I/O does not use a thread pool, ever. Some widely-used frameworks have put a thread pool wrapper around synchronous I/O calls, but that isn't async I/O.Dactylic
IOCP does not require hardware support.Mota
@Benjamin, @Paul Betts, you are both right. +1 to both comments. Hardware support is not necessary for IOCP. I was wrongly under the impression that IOCP is used mostly for disk or device I/O and that it requires hardware interrupt support for notifying of I/O completion.Thuggee
The phrasing here was bad, but the objections are similarly misleading. The point here is that hardware can perform operations (e.g. posting network buffers, etc) without the involvement of a userland thread. The program's thread need only wake after the hardware signals that an event of interest has occurred.Foolish
@BenVoigt what am I missing? Of course, if you mean thread pools as introduced with Vista, you'd be formally right. But I/O completion ports typically use something that amounts to a thread pool; whether the Vista thread pools share implementation details is beyond me, though.Capsulize
@0xC0000022L: I mean that IOCP work gets done on the thread where you wait on the completion port. Asynchronous I/O using callbacks runs the APC on the thread that started the I/O (when it enters an alertable wait). Neither one dispatches work to a thread pool. But it's even worse than that, because you would think that even if the I/O doesn't automatically use the thread pool, you could put I/O work on thread pool threads yourself. Not so... when a thread exits, I/O started from that thread gets cancelledDactylic
That means if you tried to combine asynchronous I/O and the thread pool, and the thread pool manager decides there are unnecessary threads in the pool, I/O will get cancelled. For this reason, thread pool threads can only safely use synchronous I/O. (Which is what the answer describes, "employ a thread pool and have threads wait for I/O to complete"... but that is a classic example of synchronous I/O)Dactylic
I see, so you were referring not to the concept of thread pooling, but to the facility provided since Vista. Understood.Capsulize
@0xC0000022L: If by "provided since Vista" you mean "still available since Vista" (but introduced long before). But no, it's not limited to the OS thread pool. Any thread pool that ends threads based on demand will break in-flight I/O.Dactylic
@Capsulize for clarification of Ben's last comment, Windows already had a threadpool API before Vista; Vista just offered a significant enhancement. If Windows overlapped I/O does indeed use a threadpool at all, this threadpool is not attached to the user process and presumably exists in the kernel. Implementations of asynchronous file I/O in other kernels (i.e. Linux and FreeBSD) do use threadpools, as do most userspace AIO implementations, so it's a reasonable assumption in my book.Harmonize
