Explanation for tiny reads (overlapped, buffered) outperforming large contiguous reads?

(apologies for the somewhat lengthy intro)

During development of an application which prefaults an entire large file (>400MB) into the buffer cache to speed up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefit over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.

The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
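
For reference, the purge itself is nothing more than briefly opening the file unbuffered. A minimal sketch (path is a placeholder), relying on the observed-but-undocumented behaviour discussed in the comments below:

    // Opening a file with FILE_FLAG_NO_BUFFERING and closing the handle again
    // evicts that file's pages from the system cache (observed under XP SP3,
    // not promised by the documentation).
    HANDLE purge = CreateFileW(path, GENERIC_READ,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
    if (purge != INVALID_HANDLE_VALUE)
        CloseHandle(purge);
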
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The hard disk used is somewhat elderly but fully functional; NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I also used a larger file; the numbers below are for reading 1GB.
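
For the record, the test loop is essentially just the following; a minimal sketch, not the actual harness (chunkSize, chunkCount, and the error handling are simplified):

    #include <windows.h>
    #include <vector>

    // Open overlapped (but buffered), fire all requests up front into the
    // same target buffer, then wait for the asynchronous completions.
    void PrefaultFile(const wchar_t* path, DWORD chunkSize, DWORD chunkCount)
    {
        HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
        std::vector<char> buffer(chunkSize);     // same target buffer for every request
        std::vector<OVERLAPPED> ov(chunkCount);

        for (DWORD i = 0; i < chunkCount; ++i) {
            ULONGLONG offset = (ULONGLONG)i * chunkSize;
            ov[i] = OVERLAPPED{};
            ov[i].Offset     = (DWORD)(offset & 0xFFFFFFFFu);
            ov[i].OffsetHigh = (DWORD)(offset >> 32);
            ov[i].hEvent     = CreateEventW(nullptr, TRUE, FALSE, nullptr);
            // TRUE means the request was served synchronously; FALSE plus
            // ERROR_IO_PENDING means it was queued and completes later.
            if (!ReadFile(h, buffer.data(), chunkSize, nullptr, &ov[i]) &&
                GetLastError() != ERROR_IO_PENDING)
                break;                           // genuine error, bail out
        }
        for (DWORD i = 0; i < chunkCount; ++i) { // asynchronous completion phase
            DWORD transferred = 0;
            GetOverlappedResult(h, &ov[i], &transferred, TRUE);
            CloseHandle(ov[i].hEvent);
        }
        CloseHandle(h);
    }
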

The results were really surprising:

4MB x 256    : 5ms per request,    completion 25.8s @ ~40 MB/s
1MB x 1024   : 11.7ms per request, completion 23.3s @ ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s @ ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s @ ~75 MB/s

So, this suggests that submitting tens of thousands of requests two clusters in length is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) goes up substantially as the number of requests grows, but the asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quad-core, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU -- apparently the OS does a non-negligible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz, that's quite a few cycles for doing "nothing".

Any idea how this can be explained? Maybe someone here has a deeper insight into the inner workings of Windows overlapped IO? Or, is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?

I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.

Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.

Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs @ ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.
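
In code, the prefault loop is roughly the following; a sketch, with fileSize and the hard-coded 4kB page size being assumptions of this example:

    // Slide a 16MB view across the file and touch one byte per page.
    HANDLE file    = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                 OPEN_EXISTING, 0, nullptr);
    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);

    const ULONGLONG viewSize = 16 * 1024 * 1024;  // multiple of the 64kB mapping granularity
    volatile char sum = 0;                        // keeps the reads from being optimized away
    for (ULONGLONG offset = 0; offset < fileSize; offset += viewSize) {
        SIZE_T len = (SIZE_T)(fileSize - offset < viewSize ? fileSize - offset : viewSize);
        const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                                      (DWORD)(offset >> 32),
                                                      (DWORD)(offset & 0xFFFFFFFFu), len);
        for (SIZE_T i = 0; i < len; i += 4096)    // fault one byte per 4kB page
            sum += view[i];
        UnmapViewOfFile(view);                    // remapping takes practically zero time
    }
    CloseHandle(mapping);
    CloseHandle(file);
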

It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).
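
That observation corresponds to a one-line change in the inner loop of the sketch above (again assuming 4kB pages):

    for (SIZE_T i = 0; i < len; i += 8 * 4096)  // fault only every 8th page;
        sum += view[i];                         // readahead pulls in the other seven
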

(Possibly very, very distantly related to Memory-Mapped File is Faster on Huge Sequential Read?)

To make it a bit more illustrative:
[image: performance graph for the tested block sizes]

Update:

Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller block sizes arguably seems a little more "even", but there is no difference in runtime.

[image: performance graph with FILE_FLAG_SEQUENTIAL_SCAN]

I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).

When accounting for actual asynchronous vs. "immediate" asynchronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the block size, the more requests are being served "immediately", even when the data is not available immediately (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a block size of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.
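
The accounting itself is straightforward, since ReadFile tells you which path it took; a sketch (the counter names are made up):

    // Distinguish "immediate" (inline) completion from truly asynchronous completion.
    if (ReadFile(h, buffer.data(), chunkSize, nullptr, &ov[i])) {
        ++servedSynchronously;    // completed inside the ReadFile call itself
    } else if (GetLastError() == ERROR_IO_PENDING) {
        ++servedAsynchronously;   // queued; completes later via the event
    } else {
        // genuine failure
    }
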

For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).

Update 2:

Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky" with maximums in the 90s of GB/s, but with harsh minimums too; the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.

Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIo does not work any more, at least not in the sense of cancelling IO. With unbuffered in-flight requests, CancelIo will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIo will simply cut the rope, and all in-flight operations terminate immediately, as one would expect.

Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.

Eldoree answered 6/5, 2011 at 9:21. Comments (5):
Well, what exactly would you like to see? The "crucial bit" of code is basically not much more than a for(...) ReadFile(...);. Or, in the case of memory mapping, a for(...) sum += data[i];. That, plus a CloseHandle(CreateFile(... NO_BUFFERING)) at the beginning to purge the caches. – Eldoree
I wondered if you were doing anything 'interesting' whilst issuing your async reads... – Orren
"The buffer cache was purged before running the tests" - how? (Only the write cache can be flushed directly, so do you reboot or read in a huge file between testing the cached file?) Also, which Windows version did you use? Pre-Vista can only do 64kB of I/O per kernel call (iirc), bumped to (iirc) 4MB for Vista and 64MB for 2008 server. – Thalamus
@snemarch: Opening a file with the NO_BUFFERING flag will throw away the entire cache contents for that file. Not for this handle, but for everyone, globally (at least under XP, I have not tested this under Win 7). This is absolutely not in accordance with the documentation, but it is what happens anyway (a nasty surprise if one doesn't expect it). My system is Windows XP SP3 (Professional). – Eldoree
wow, this is the longest SO question I think I've ever read, nice! ;) – Condescending

I'm not a filesystem expert, but I think there are a couple of things going on here. First off, w.r.t. your comment about memory mapping being the winner: this isn't totally surprising, since the NT cache manager is based on memory mapping - by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.

When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.

Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.
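
For illustration, the hint is just another flag at open time; a sketch (path is a placeholder):

    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
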

This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.
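
A sketch of what that looks like. The important part is the alignment requirement: with FILE_FLAG_NO_BUFFERING, the buffer address, transfer size, and file offset all have to be multiples of the volume sector size, which a page-aligned VirtualAlloc allocation satisfies:

    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, nullptr);
    // VirtualAlloc returns page-aligned memory, which satisfies sector alignment.
    void* buf = VirtualAlloc(nullptr, chunkSize, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    // ... issue sector-aligned overlapped ReadFile calls as before ...
    VirtualFree(buf, 0, MEM_RELEASE);
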

One other thing to remember: when you're doing asynchronous I/O, it's important to ensure that you always have an I/O request outstanding - otherwise you lose potential time between when the last I/O completes and the next I/O is started.
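
One way to keep the pipeline full is to treat the requests as a small ring: as soon as a slot completes, reissue it at the next unread offset. A sketch, where the queue depth of 16 is arbitrary, h/buffer/chunkSize/fileSize are as in the earlier sketches, and the priming and draining steps are elided:

    // Keep QUEUE_DEPTH requests in flight so the disk never goes idle:
    // whenever one completes, immediately reissue that slot at the next offset.
    const DWORD QUEUE_DEPTH = 16;
    HANDLE     events[QUEUE_DEPTH];   // one manual-reset event per slot
    OVERLAPPED ov[QUEUE_DEPTH];
    ULONGLONG  nextOffset = 0;

    // ... prime the pipeline: issue QUEUE_DEPTH reads, one per slot ...

    while (nextOffset < fileSize) {
        DWORD slot = WaitForMultipleObjects(QUEUE_DEPTH, events, FALSE, INFINITE)
                     - WAIT_OBJECT_0;                  // some request just completed
        ResetEvent(events[slot]);
        ov[slot] = OVERLAPPED{};
        ov[slot].hEvent     = events[slot];
        ov[slot].Offset     = (DWORD)(nextOffset & 0xFFFFFFFFu);
        ov[slot].OffsetHigh = (DWORD)(nextOffset >> 32);
        ReadFile(h, buffer, chunkSize, nullptr, &ov[slot]);  // reissue immediately
        nextOffset += chunkSize;
    }
    // ... drain the remaining QUEUE_DEPTH in-flight requests afterwards ...
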

Paramedic answered 6/5, 2011 at 14:48. Comments (16):
I did not try FILE_FLAG_SEQUENTIAL_SCAN because a sequential hint usually has the connotation "prefetch aggressively, read once, throw away", and my intent was rather "don't care, don't care, keep cached" -- but after reading through MSDN, this doesn't seem to be the case (at least not explicitly) for Windows. I will try whether that makes any difference later this evening, just out of curiosity (I don't think it will really make that much of a difference). As for outstanding requests, the requests were submitted as fast as possible, one following the other. So, the 4MB x 256 example would have 255 ... – Eldoree
... outstanding requests before the first one finishes. The 16kB x 65536 version (which runs twice as fast) would of course have a lot more outstanding requests, but I'd figure that several hundred should be sufficient. As for FILE_FLAG_NO_BUFFERING, that is the exact opposite of what I want. :-) It is true that FILE_FLAG_NO_BUFFERING directly DMAs from disk into your application, but I'm not interested in the data at all. Plus, to make it worse, FILE_FLAG_NO_BUFFERING not only doesn't use buffers, but it also discards any of that file's data already in the buffers (globally). – Eldoree
What is intended in my case is to read data from disk into the buffers (it only accidentally copies to application space too, as an unavoidable side effect). This sounds odd, I know :-) The reasoning is that the user gives some input at the beginning, which takes 5-8 seconds, and then the application has to load a large quasi-random amount of this file. If the file is in the buffer cache, the load is "instantaneous", whereas otherwise it takes something like 15-20 seconds. Hence the idea that the app could "speculatively pre-fault" the buffers while the user is typing. – Eldoree
It just baffles me that basically telling the OS "I need these few huge blocks of contiguous data" is slower than telling it "I need these thousands of tiny bits (which might be scattered all about the disk)", and that is still slower than just causing more or less unpredictable page faults... Well, we shall see what FILE_FLAG_SEQUENTIAL_SCAN gives when I'm at home, it's interesting to see in any case :-) – Eldoree
It shouldn't be surprising. When you ask for large reads (especially without specifying FILE_FLAG_SEQUENTIAL_SCAN), you essentially are fighting with the cache manager - the cache manager is optimized for relatively small (64K, 256K) I/Os, and when you issue the large I/Os things can get confusing. That's why going to unbuffered I/O can be such a huge win - the cache manager gets out of the way. – Paramedic
Of course going to unbuffered I/O can be a problem - part of the reason that Vista's file copy performance sucked was that they switched to using unbuffered I/O, which killed the copy performance on small files (even though it helped a lot on large files). – Paramedic
@Larry Osterman: updated Q for FILE_FLAG_SEQUENTIAL_SCAN (see above). – Eldoree
Interesting. Btw, the reason the returns are synchronous on cached I/O is because the request can be completely satisfied from the cache. If the request can be satisfied from the cache, there's no point in making the request asynchronous (it would be slower). – Paramedic
@Larry Osterman: Yes, that is how it's intended, and it makes sense too when the data is in the cache. Funnily, it still runs synchronously, as mentioned in the above case, when the data is very clearly not in the cache (this explains why submitting requests takes several seconds too). But... it is still faster than running asynchronously. It "kind of" backwards-explains why memory mapping is faster too (because that obviously is synchronous as well, the process blocks in the page fault). Or maybe, one should more correctly look at it this way: sync reads are nothing but page faults in disguise. – Eldoree
Have you tried repeating this experiment with overlapped unbuffered I/O? As I said in an earlier response, it's my understanding that you'll get the best perf on large files with that. – Paramedic
@Larry Osterman: For "scientific curiosity", I did (though like I said, unbuffered IO is the exact opposite of what I want; it is really unsuited for 99.9% of all people). Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request testcases somewhat higher and more "spiky" with maximums in the 90s of GB/s, but with harsh minimums too; the overall runtime for 1GB is within +/- 0.5s of the buffered run (the requests with smaller buffer sizes complete significantly faster, however, because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned). – Eldoree
@Damon: You're right that unbuffered is unsuited for certain workloads. But buffered I/O adds an additional copy to the mix, and that additional copy can potentially hurt performance. Unbuffered I/O has the filesystem copy the file directly into the application buffer without first going through the cache. On the other hand, for certain large read sizes, I believe that the I/O is effectively turned into unbuffered I/O because the overhead of caching is too much. – Paramedic
Have you also measured CPU usage during these operations? For small reads, I would expect the CPU usage to be higher (since you're copying from the cache) and for large reads, I would expect the CPU usage to be zero (since you're spending all the time waiting for I/O to complete). – Paramedic
@Larry Osterman: Measured CPU usage is zero in all unbuffered cases, which is unsurprising. The large block reads use DMA, and the small block reads fail, so they do nothing. The main reason why I see unbuffered IO as unsuitable is that it is, well, unbuffered. Buffered IO does extra copies, true, but those are insignificant. There seems to be some strange fixed overhead effects (for which I blame locking, in lack of a better explanation), but apart from that, the best possible thing to happen is that your data is in the buffers. The only exception is really the rare case that you want to ... – Eldoree
... stream with as little delay as possible and with as little CPU involvement as possible (possibly from a more "inflexible" medium), and you really, really only ever need the data once and never again. Playing a movie from a DVD would be one such thing. For every other situation, buffers are a big win. Reading from buffers takes "no time". That's the reason why I implemented that prefaulting thing at all. Asserting that data is in buffers is a kind of trivial optimization that makes your data-heavy application 20-25 times faster without changing one line of the actual code. – Eldoree
@Larry Osterman: Copied some of the above comments into the Q to make later reading easier; also, I found another two interesting details on unbuffered IO. – Eldoree
