In the oflag=direct
case:
- You are giving the kernel the ability to write data out straight away rather than filling a buffer and waiting for a threshold/timeout to be hit (which in turn means that data is less likely to be held up behind a sync of unrelated data).
- You are saving the kernel work (no extra copies from userland to the kernel, no need to perform most buffer cache management operations).
- In some cases, dirtying buffers faster than they can be flushed will result in the program generating the dirty buffers being made to wait until pressure on arbitrary limits is relieved (see SUSE's "Low write performance on SLES 11/12 servers with large RAM").
More generally, that giant block size (1 MByte) is likely bigger than the RAID's block size so the I/O will be split up within the kernel and those smaller pieces submitted in parallel, thus big enough that the coalescing you get from buffered writeback with tiny I/Os won't be worth much (the exact point that the kernel will start splitting I/Os depends on a number of factors. Further, while RAID stripe sizes can be larger than 1 MByte, the kernel isn't always aware of this for hardware RAID. In the case of software RAID the kernel can sometimes optimize for stripe size - e.g. the kernel I'm on knows the md0
device has a 4 MByte stripe size and express a hint that it prefers I/O in that size via /sys/block/md0/queue/optimal_io_size
).
Given all the above, IF you were maxing out a single CPU during the original buffered copy AND your workload doesn't benefit much from caching/coalescing BUT the disk could handle more throughput THEN doing the O_DIRECT
copy should go faster as there's more CPU time available for userspace/servicing disk I/Os due to the reduction in kernel overhead.
So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache?
It's not just an extra memcpy per I/O that is involved - think about all the extra cache machinery that must be maintained. There is a nice explanation about how copying a buffer to the kernel isn't instantaneous and how page pressure can slow things down in an answer to the Linux async (io_submit) write v/s normal (buffered) write question. However, unless your program can generate data fast enough AND the CPU is so overloaded it can't feed the disk quickly enough then it usually doesn't show up or matter.
Is this atypical?
No, your result is quite typical with the sort of workload you were using. I'd imagine it would be a very different outcome if the blocksize were tiny (e.g. 512 bytes) though.
Let's compare some of fio's output to help us understand this:
$ fio --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M_no_fsync
buffered_1M_no_fsync: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2511MiB/s][r=0,w=2510 IOPS][eta 00m:00s]
buffered_1M_no_fsync: (groupid=0, jobs=1): err= 0: pid=25408: Sun Aug 25 09:10:31 2019
write: IOPS=2100, BW=2100MiB/s (2202MB/s)(20.0GiB/9752msec)
[...]
cpu : usr=2.08%, sys=97.72%, ctx=114, majf=0, minf=11
[...]
Disk stats (read/write):
md0: ios=0/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
So using buffering we wrote at about 2.1 GBytes/s but used up a whole CPU to do so. However, the block device (md0
) says it barely saw any I/O (ios=0/3
- only three write I/Os) which likely means most of the I/O was cached in RAM! As this particular machine could easily buffer 20 GBytes in RAM we shall do another run with end_fsync=1
to force any data that may only have been in the kernel's RAM cache at the end of the run to be pushed to disk thus ensuring we record the time it took for all the data to actually reach non-volatile storage:
$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M
buffered_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
buffered_1M: (groupid=0, jobs=1): err= 0: pid=41884: Sun Aug 25 09:13:01 2019
write: IOPS=1928, BW=1929MiB/s (2023MB/s)(20.0GiB/10617msec)
[...]
cpu : usr=1.77%, sys=97.32%, ctx=132, majf=0, minf=11
[...]
Disk stats (read/write):
md0: ios=0/40967, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2561, aggrmerge=0/2559, aggrticks=0/132223, aggrin_queue=127862, aggrutil=21.36%
OK now the speed has dropped to about 1.9 GBytes/s and we still use all a CPU but the disks in the RAID device claim they had capacity to go faster (aggrutil=21.36%
). Next up direct I/O:
$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_1M
direct_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=3242MiB/s][r=0,w=3242 IOPS][eta 00m:00s]
direct_1M: (groupid=0, jobs=1): err= 0: pid=75226: Sun Aug 25 09:16:40 2019
write: IOPS=2252, BW=2252MiB/s (2361MB/s)(20.0GiB/9094msec)
[...]
cpu : usr=8.71%, sys=38.14%, ctx=20621, majf=0, minf=83
[...]
Disk stats (read/write):
md0: ios=0/40966, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5120, aggrmerge=0/0, aggrticks=0/1283, aggrin_queue=1, aggrutil=0.09%
Going direct we use just under 50% of a CPU to do 2.2 GBytes/s (but notice how I/Os weren't merged and how we did far more userspace/kernel context switches). If we were to push more I/O per syscall things change:
$ fio --bs=4M --size=20G --rw=write --filename=zeroes --name=buffered_4M_no_fsync
buffered_4M_no_fsync: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2390MiB/s][r=0,w=597 IOPS][eta 00m:00s]
buffered_4M_no_fsync: (groupid=0, jobs=1): err= 0: pid=8029: Sun Aug 25 09:19:39 2019
write: IOPS=592, BW=2370MiB/s (2485MB/s)(20.0GiB/8641msec)
[...]
cpu : usr=3.83%, sys=96.19%, ctx=12, majf=0, minf=1048
[...]
Disk stats (read/write):
md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%
$ fio --end_fsync=1 --bs=4M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_4M
direct_4M: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=5193MiB/s][r=0,w=1298 IOPS][eta 00m:00s]
direct_4M: (groupid=0, jobs=1): err= 0: pid=92097: Sun Aug 25 09:22:39 2019
write: IOPS=866, BW=3466MiB/s (3635MB/s)(20.0GiB/5908msec)
[...]
cpu : usr=10.02%, sys=44.03%, ctx=5233, majf=0, minf=12
[...]
Disk stats (read/write):
md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%
With a massive block size of 4 MBytes buffered I/O became bottlenecked at "just" 2.3 GBytes/s (even when we didn't force the cache to be flushed) due to the fact that there's no CPU left. Direct I/O used around 55% of a CPU and managed to reach 3.5 GBytes/s so it was roughly 50% faster than buffered I/O.
Summary: Your I/O pattern doesn't really benefit from buffering (I/Os are huge, data is not being reused, I/O is streaming sequential) so you're in an optimal scenario for O_DIRECT
being faster. See these slides by the original author of Linux's O_DIRECT
(longer PDF document that contains an embedded version of most of the slides) for the original motivation behind it.
O_DIRECT
doesn't just save some memcpys I would hesitate to say that it goes out of its way to do chunkier reads and writes. If anything, using buffering (i.e. when you aren't usingO_DIRECT
) will result in chunkier I/O down to disk when the data read/written is tiny but sequential because you're more likely to get coalescing (although memory fragmentation and device limits control just how chunky I/O can be)... – Owings