Why is dd with the 'direct' (O_DIRECT) flag so dramatically faster?

I have a server with a RAID50 configuration of 24 drives (two groups of 12), and if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct

I get:

2047868928 bytes (2.0 GB) copied, 0.805075 s, 2.5 GB/s

But if I run:

dd if=/dev/zero of=ddfile2 bs=1M count=1953

I get:

2047868928 bytes (2.0 GB) copied, 2.53489 s, 808 MB/s

I understand that O_DIRECT causes the page cache to be bypassed. But as I understand it, bypassing the page cache basically means avoiding a memcpy. Testing on my desktop with the bandwidth tool, I measured a worst-case sequential memory write bandwidth of 14 GB/s, and I imagine on the newer, much more expensive server the bandwidth must be even better. So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache? Is this atypical?

Recognizor answered 2/11, 2015 at 19:12 Comment(15)
Not atypical (see thesubodh.com/2013/07/what-are-exactly-odirect-osync-flags.html). Not only memcpy but cache management also...Pulling
OT, but 12-disk RAID 5? 11 data disks? That's going to cause some real nasty read-modify-write operations. See the Read-modify-write section: infostor.com/index/articles/display/107505/articles/infostor/… RAID-5 (and RAID-6) work best with a power-of-two number of data disks where you match your write block size to the block size that will write an entire stripe across all the RAID data disks. Good controllers can hide the problem, but under extreme load you'll see it.Flame
1. Is it hardware or software RAID? 2. Do you flush the RAID's buffer (and all Linux buffers) before the test? Depending on this, the answer may differ significantly.Pyrimidine
@Pyrimidine -- I don't flush the Linux cache because on the input side /dev/zero is artificial and can't be in cache and on the output side you're writing a new file so it can't be in cache either. Also it's hardware RAID, Adaptec controller, would have to dig to find model info but can if requested.Recognizor
@AndrewHenle I didn't do the original configuration, but good point. This is my first time trying to heavily optimize disk I/O so I was unaware but what you say makes sense. So you're saying stripelength * numdrives = blocksize is best?Recognizor
@joseph-garvin I mean that tests may be wrong when you test without O_DIRECT, since Linux buffers data before writing it to the RAID. So re-testing may change the numbers. Moreover, it may affect the O_DIRECT test, since in the driver the kernel will have to arbitrate between O_DIRECT requests and page-cache writeback that hasn't finished yet. Also, the driver for your hardware RAID may be naive and write the page cache out in 4K pieces.Pyrimidine
@JosephGarvin Yes, writing in blocks equal to the number of data disks times the size of the chunk written to each disk is best. There are a lot of different terms used for that - segment size, stripe width, etc. The disk partitions also have to be aligned properly with the underlying RAID volume - if the stripe width is 1 MB, for example, you don't want to start /dev/sdb2 128 KB into the RAID volume. And as socketpair points out, the driver for your hardware RAID can mess things up anyway, although higher-end ones are usually pretty good. Again, though, good controllers can hide this quite well.Flame
Biggest thing in my opinion, though, is not to get too hung up on exact RAID configurations unless you need to really push the design limits of your hardware. If what you have is fast enough, reliability and ease of management can be a lot more important than reading your email in 8 ms instead of 13 or even 23. But if you do have to average 80-90% of your hardware's design bandwidth for long periods of time just to meet your data processing requirements...Flame
Your guess about single-threaded memory bandwidth on your server is probably wrong. Counter-intuitively, single-threaded memory BW is limited by max_concurrency / latency, and a single desktop core has the same number of line-fill buffers as a core in a big Xeon, but the big Xeon has higher latency (more hops on the ring bus) between the core and DRAM or L3. See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?. Max aggregate throughput is huge, but it takes more cores than on a desktop to hit the same BW.Greco
I disagree with closing this question. Passing the O_DIRECT flag to open() is a possible dramatic optimization for anyone writing IO code. I think we're trying too hard to squeeze everything into separate categories here.Recognizor
O_DIRECT does wayyy more than just avoid some memcpys. It changes the size of the I/O requests sent to your storage subsystem. When you do a 1MB read/write with O_DIRECT, Linux will go out of its way to actually do a single MB read/write. When you read/write 1MB globs without O_DIRECT it will typically break those into much smaller chunks which could be substantially slower for your I/O subsystem. O_DIRECT benefits have much more to do with saving caching RAM and doing more efficient I/O than saving CPU.Opus
@MarceloPacheco while I agree that O_DIRECT doesn't just save some memcpys I would hesitate to say that it goes out of its way to do chunkier reads and writes. If anything, using buffering (i.e. when you aren't using O_DIRECT) will result in chunkier I/O down to disk when the data read/written is tiny but sequential because you're more likely to get coalescing (although memory fragmentation and device limits control just how chunky I/O can be)...Owings
@MarceloPacheco I can tell you with certainty that if the app does 1MB writes/reads with O_DIRECT it will beat the Linux page cache handily. By a long shot.Opus
@MarceloPacheco I am not saying that big I/Os are bad... I guess I'm saying it might not be correct to say that writing I/O through the page cache results in smaller I/Os than those you get when writing O_DIRECTly when you look at the I/O sent to the disk via iostat... I would be especially interested in the comparison seen if you send small (say 4K) sequential I/Os O_DIRECTly versus sending them through the buffer cache and then doing an fsync at the end. For larger I/Os (say 1MByte) I would be especially interested in the iostat results when you've just booted the system.Owings
Linux mostly does buffered writeback in response to memory pressure. This leads to moments when the disk is idle and moments when it's very busy. Because it's responding to memory pressure, it doesn't necessarily sort everything out into a nicely sequential writing scheme. It's as simple as cp-ing a few times your available RAM and monitoring with vmstat - go out and benchmark it (see the sketch below). Meanwhile, code that knows it wants to write everything out to disk ASAP, in big chunks, will keep the disk subsystem busy regardless of memory status. And doing huge I/Os makes it even faster.Opus
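
[Editor's note: a rough sketch of the experiment that comment describes - the size and filename are placeholders; pick something a few times larger than your RAM.]

$ dd if=/dev/zero of=bigfile bs=1M count=16384 &   # ~16 GB written through the page cache
$ vmstat 1                                          # watch the 'bo' (blocks out) column: bursts of writeback alternating with near-idle seconds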

In the oflag=direct case:

  • You are giving the kernel the ability to write data out straight away rather than filling a buffer and waiting for a threshold/timeout to be hit (which in turn means that data is less likely to be held up behind a sync of unrelated data).
  • You are saving the kernel work (no extra copies from userland to the kernel, no need to perform most buffer cache management operations).
  • In some cases, dirtying buffers faster than they can be flushed will result in the program generating the dirty buffers being made to wait until the pressure on configurable limits is relieved (see SUSE's "Low write performance on SLES 11/12 servers with large RAM"). The knobs involved are shown below.
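
For that last point, the thresholds involved are ordinary Linux sysctls you can inspect (defaults vary by distribution; this is a pointer, not a tuning recommendation):

$ sysctl vm.dirty_background_ratio   # % of RAM dirty before background writeback starts
$ sysctl vm.dirty_ratio              # % of RAM dirty before writers are made to wait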

More generally, that giant block size (1 MByte) is likely bigger than the RAID's block size, so the I/O will be split up within the kernel and those smaller pieces submitted in parallel. It is also big enough that the coalescing you get from buffered writeback of tiny I/Os won't be worth much (the exact point at which the kernel starts splitting I/Os depends on a number of factors). Further, while RAID stripe sizes can be larger than 1 MByte, the kernel isn't always aware of this for hardware RAID. In the case of software RAID the kernel can sometimes optimize for stripe size - e.g. the kernel I'm on knows the md0 device has a 4 MByte stripe size and expresses a hint that it prefers I/O of that size via /sys/block/md0/queue/optimal_io_size.
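
If you want to see what your own kernel believes about your device, these sysfs attributes are standard for any block device (md0 here is just the example device from above - substitute your own):

$ cat /sys/block/md0/queue/optimal_io_size   # preferred I/O size hint in bytes (0 = no hint)
$ cat /sys/block/md0/queue/max_sectors_kb    # largest single I/O the kernel will submit, in KiB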

Given all the above, IF you were maxing out a single CPU during the original buffered copy AND your workload doesn't benefit much from caching/coalescing BUT the disk could handle more throughput THEN doing the O_DIRECT copy should go faster, as there's more CPU time available for userspace/servicing disk I/Os due to the reduction in kernel overhead (a crude way to check the CPU condition is sketched below).
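
Using the commands from the question - if sys time is close to real time in the buffered run, a single CPU was pegged inside the kernel:

$ time dd if=/dev/zero of=ddfile2 bs=1M count=1953
$ time dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct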

So why would an extra memcpy cause a >2x slowdown? Is there really a lot more involved when using the page cache?

It's not just the extra memcpy per I/O that is involved - think about all the extra cache machinery that must be maintained. There is a nice explanation of how copying a buffer to the kernel isn't instantaneous and how page pressure can slow things down in an answer to the Linux async (io_submit) write v/s normal (buffered) write question. However, unless your program can generate data fast enough AND the CPU is so loaded that it can't feed the disk quickly, this overhead usually doesn't show up or matter.
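
One way to watch that machinery at work is to monitor the page cache's dirty/writeback counters while a buffered copy runs (both fields are standard in /proc/meminfo) - Dirty climbs while the copy outruns the disk and drains afterwards:

$ dd if=/dev/zero of=ddfile2 bs=1M count=1953 &
$ watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'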

Is this atypical?

No, your result is quite typical for the sort of workload you were using. I'd imagine it would be a very different outcome if the block size were tiny (e.g. 512 bytes), though - see the sketch below.
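
A sketch of that tiny-block comparison (writes roughly 512 MBytes; assumes a 512-byte logical block size. With O_DIRECT every 512-byte write becomes its own synchronous round trip to the device, so expect the direct version to be dramatically slower this time around):

$ dd if=/dev/zero of=ddfile2 bs=512 count=1000000 oflag=direct
$ dd if=/dev/zero of=ddfile2 bs=512 count=1000000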

Let's compare some of fio's output to help us understand this:

$ fio --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M_no_fsync
buffered_1M_no_fsync: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2511MiB/s][r=0,w=2510 IOPS][eta 00m:00s]
buffered_1M_no_fsync: (groupid=0, jobs=1): err= 0: pid=25408: Sun Aug 25 09:10:31 2019
  write: IOPS=2100, BW=2100MiB/s (2202MB/s)(20.0GiB/9752msec)
[...]
  cpu          : usr=2.08%, sys=97.72%, ctx=114, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/3, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%

So using buffering we wrote at about 2.1 GBytes/s but used up a whole CPU to do so. However, the block device (md0) says it barely saw any I/O (ios=0/3 - only three write I/Os), which likely means most of the I/O was cached in RAM! As this particular machine could easily buffer 20 GBytes in RAM, we shall do another run with end_fsync=1 to force any data that may only have been in the kernel's RAM cache at the end of the run to be pushed to disk, thus ensuring we record the time it took for all the data to actually reach non-volatile storage:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --name=buffered_1M
buffered_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [F(1)][100.0%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]      
buffered_1M: (groupid=0, jobs=1): err= 0: pid=41884: Sun Aug 25 09:13:01 2019
  write: IOPS=1928, BW=1929MiB/s (2023MB/s)(20.0GiB/10617msec)
[...]
  cpu          : usr=1.77%, sys=97.32%, ctx=132, majf=0, minf=11
[...]
Disk stats (read/write):
    md0: ios=0/40967, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/2561, aggrmerge=0/2559, aggrticks=0/132223, aggrin_queue=127862, aggrutil=21.36%

OK, now the speed has dropped to about 1.9 GBytes/s and we still use a whole CPU, but the disks in the RAID device claim they had the capacity to go faster (aggrutil=21.36%). Next up, direct I/O:

$ fio --end_fsync=1 --bs=1M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_1M 
direct_1M: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=3242MiB/s][r=0,w=3242 IOPS][eta 00m:00s]
direct_1M: (groupid=0, jobs=1): err= 0: pid=75226: Sun Aug 25 09:16:40 2019
  write: IOPS=2252, BW=2252MiB/s (2361MB/s)(20.0GiB/9094msec)
[...]
  cpu          : usr=8.71%, sys=38.14%, ctx=20621, majf=0, minf=83
[...]
Disk stats (read/write):
    md0: ios=0/40966, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/5120, aggrmerge=0/0, aggrticks=0/1283, aggrin_queue=1, aggrutil=0.09%

Going direct, we use just under 50% of a CPU to do 2.2 GBytes/s (but notice how I/Os weren't merged and how many more userspace/kernel context switches we did). If we push more I/O per syscall, things change:

$ fio --bs=4M --size=20G --rw=write --filename=zeroes --name=buffered_4M_no_fsync
buffered_4M_no_fsync: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=2390MiB/s][r=0,w=597 IOPS][eta 00m:00s]
buffered_4M_no_fsync: (groupid=0, jobs=1): err= 0: pid=8029: Sun Aug 25 09:19:39 2019
  write: IOPS=592, BW=2370MiB/s (2485MB/s)(20.0GiB/8641msec)
[...]
  cpu          : usr=3.83%, sys=96.19%, ctx=12, majf=0, minf=1048
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

$ fio --end_fsync=1 --bs=4M --size=20G --rw=write --filename=zeroes --direct=1 --name=direct_4M
direct_4M: (g=0): rw=write, bs=(R) 4096KiB-4096KiB, (W) 4096KiB-4096KiB, (T) 4096KiB-4096KiB, ioengine=psync, iodepth=1
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=5193MiB/s][r=0,w=1298 IOPS][eta 00m:00s]
direct_4M: (groupid=0, jobs=1): err= 0: pid=92097: Sun Aug 25 09:22:39 2019
  write: IOPS=866, BW=3466MiB/s (3635MB/s)(20.0GiB/5908msec)
[...]
  cpu          : usr=10.02%, sys=44.03%, ctx=5233, majf=0, minf=12
[...]
Disk stats (read/write):
    md0: ios=0/4667, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=0/292, aggrmerge=0/291, aggrticks=0/748, aggrin_queue=53, aggrutil=0.87%

With a massive block size of 4 MBytes, buffered I/O became bottlenecked at "just" 2.3 GBytes/s (even when we didn't force the cache to be flushed) because there was no CPU left. Direct I/O used around 55% of a CPU and managed to reach 3.5 GBytes/s, so it was roughly 50% faster than buffered I/O.
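
Translated back into dd terms, that last comparison looks roughly like this (5120 x 4 MBytes = 20 GBytes; conv=fsync makes dd fsync the file before exiting, playing the role of fio's end_fsync=1):

$ dd if=/dev/zero of=zeroes bs=4M count=5120 conv=fsync
$ dd if=/dev/zero of=zeroes bs=4M count=5120 oflag=direct conv=fsync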

Summary: Your I/O pattern doesn't really benefit from buffering (I/Os are huge, data is not being reused, I/O is streaming sequential) so you're in an optimal scenario for O_DIRECT being faster. See these slides by the original author of Linux's O_DIRECT (longer PDF document that contains an embedded version of most of the slides) for the original motivation behind it.
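
One caveat if you adopt O_DIRECT in your own code: it imposes alignment requirements. On most filesystems the I/O size, file offset and user buffer must be multiples of the device's logical block size, which you can check (and trip over) from the shell:

$ cat /sys/block/md0/queue/logical_block_size             # alignment granularity in bytes (often 512)
$ dd if=/dev/zero of=ddfile2 bs=1000 count=1 oflag=direct # bs not a multiple of 512 - expect EINVAL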

Owings answered 25/2, 2018 at 13:9 Comment(6)
Why would the kernel make extra copies? Once the data is in kernel memory I would just expect, maybe naively, that a pointer gets passed around. I understand that without the direct flag it needs to make exactly one copy to get the user-space data into a kernel-space buffer, but I don't know why you would need copies after that.Recognizor
@JosephGarvin sure but each transition to and from user space is another "copy" hence "copies". Theoretically the need to get data into a particular region of memory could also cause copies but this is platform (e.g. 32 bit Linux with high amounts of memory) specific.Owings
But one extra copy should only halve the bandwidth. It went from 14 GB/s to less than 1!Recognizor
@JosephGarvin you're assuming that a userspace->kernel copy is only a memcpy, whereas there's slightly more to it (see the linked PDF). Your memcpy tester may be doing a giant, gigabyte-sized memcpy from one place to the next non-stop, whereas your dd is only copying one megabyte at a time, with a userspace->kernel->userspace transition in between each one (the strace sketch below shows a way to count those). Finally, it's not just the copy you save - it's all the extra work around it. How busy was your CPU with buffering? If it was maxed out, perhaps that was your bottleneck.Owings
That HTML conversion contains errors about things not being printable; do you have the original?Recognizor
@JosephGarvin I believe that's just the one slide that has that message but I don't know of an alternative slide deck. I suppose you could try asking Andrea if he still has a copy - he works at Red Hat these days.Owings
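
[Editor's note: a rough way to count the transitions discussed above - illustrative only, since strace adds significant overhead of its own, so don't read throughput numbers out of it.]

$ strace -c dd if=/dev/zero of=ddfile2 bs=1M count=1953 oflag=direct
# expect roughly one read(2) and one write(2) per 1 MByte block in the syscall summary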
