What does O_DIRECT really mean?

If I open a file with the O_DIRECT flag, does it mean that whenever a blocking write() to that file returns, the data is on disk?

Guard answered 21/12, 2016 at 7:49 Comment(4)
No. The manual page is very explicit on that. There is a separate section talking specifically about O_DIRECT and O_SYNC with synchronous I/O. - Nominalism
Thanks for the reply :-) I read the man page, it says that "The I/O is synchronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred." And it seems that O_SYNC is used to guarantee that the metadata is also transferred. So, I wonder can O_DIRECT guarantee that the data (not metadata) is transferred by the time the write returns? - Guard
By the way, does "transferred" mean that the data is on disk? - Guard
From the manual page: "O_DIRECT ... does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred". Read about O_SYNC for related guarantees Linux makes. Also "transferred" can never guarantee the data is on disk. That's a very complicated matter which can never be fully guaranteed because of the underlying systems (I/O controllers, bus, etc.). - Nominalism

(This answer pertains to Linux - other OSes may have different caveats/semantics)

Let's start with the sub-question:

If I open a file with the O_DIRECT flag, does it mean that whenever a blocking write() to that file returns, the data is on disk?

No (as @michael-foukarakis commented). If you need a guarantee that your data made it to non-volatile storage, you must use or add something else.

What does O_DIRECT really mean?

It's a hint that you want your I/O to bypass the Linux kernel's caches. What will actually happen depends on things like:

  • Disk configuration
  • Whether you are opening a block device or a file in a filesystem
  • If using a file within a filesystem
    • The exact filesystem used and the options in use on the filesystem and the file
    • Whether you've correctly aligned your I/O
    • Whether a filesystem has to do a new block allocation to satisfy your I/O
  • If the underlying disk is local, what layers you have in your kernel storage stack before you reach the disk block device
  • Linux kernel version
  • ...

The list above is not exhaustive.
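
To make the alignment item above more concrete, here is a minimal C sketch of an O_DIRECT write. It assumes that 4096 bytes is an acceptable alignment for the buffer address, the file offset and the transfer length; the real requirement depends on the device, filesystem and kernel version. `DIO_ALIGN`, `dio_round_up` and `direct_write_block` are names made up for this illustration, not part of any API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* ASSUMPTION: 4096 suits this device/filesystem. It is not a universal value. */
#define DIO_ALIGN 4096

/* Round len up to the next multiple of DIO_ALIGN. */
size_t dio_round_up(size_t len)
{
    return (len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
}

/*
 * Copy `data` into a zero-padded, aligned buffer and write it at offset 0
 * of `path` with O_DIRECT. Returns 0 on success, -1 with errno set on failure.
 * Success only means the transfer was accepted, not that it is durable.
 */
int direct_write_block(const char *path, const void *data, size_t len)
{
    if (len == 0) { errno = EINVAL; return -1; }

    size_t padded = dio_round_up(len);
    void *buf = NULL;
    int rc = posix_memalign(&buf, DIO_ALIGN, padded);
    if (rc != 0) { errno = rc; return -1; }
    assert((uintptr_t)buf % DIO_ALIGN == 0);   /* buffer address is aligned */
    memset(buf, 0, padded);
    memcpy(buf, data, len);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        int e = errno;         /* e.g. EINVAL if O_DIRECT is unsupported here */
        free(buf);
        errno = e;
        return -1;
    }

    /* Offset 0 and length `padded` are both multiples of DIO_ALIGN. */
    ssize_t n = pwrite(fd, buf, padded, 0);
    int saved = errno;
    close(fd);
    free(buf);
    if (n != (ssize_t)padded) {
        errno = (n < 0) ? saved : EIO;   /* treat a short write as an error */
        return -1;
    }
    return 0;
}
```

A call such as `direct_write_block("some/file", "hi", 2)` would write a single padded 4096-byte block. If the filesystem does not support O_DIRECT the open() may fail with EINVAL, and as discussed in this answer a successful return still says nothing about surviving power loss.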

In the "best" case, setting O_DIRECT avoids making extra copies of the data while transferring it, and the call returns after the transfer is complete. You are more likely to be in this case when directly opening the block devices of "real" local disks. As previously stated, even this property doesn't guarantee that the data of a successful write() call will survive sudden power loss. If the data is DMA'd out of RAM to non-volatile storage (e.g. a battery-backed RAID controller), or the RAM itself is persistent storage, then you may have a guarantee that the data reached stable storage that can survive power loss. To know whether this is the case you have to qualify your hardware stack, so you can't assume it in general.

In the "worst" case, O_DIRECT can mean nothing at all, even though setting it wasn't rejected and subsequent calls "succeed". Sometimes things in the Linux storage stack (like certain filesystem setups) can choose to ignore it, either because of what they have to do or because you didn't satisfy the requirements, and silently do buffered I/O instead (i.e. write to a buffer or satisfy a read from already buffered data). Ignoring the flag in this way is legal. It is unclear whether extra effort will be made to ensure that the data of an acknowledged write was at least "with the device" (but in the "O_DIRECT and barriers" thread Christoph Hellwig posts that the O_DIRECT fallback will ensure data has at least been sent to the device). A further complication is that using O_DIRECT implies nothing about file metadata, so even if the write data is "with the device" by call completion, key file metadata (like the size of the file, because you were doing an append) may not be. Thus, after a crash, you may not actually be able to get at the data you thought had been transferred (it may appear truncated, or as all zeros, etc.).

While brief testing can make it look like using O_DIRECT alone always means the data will be on disk after a write returns, changing things (e.g. using an Ext4 filesystem instead of XFS) can drastically weaken what is actually achieved.

As you mention "guarantee that the data" (rather than metadata), perhaps you're looking for O_DSYNC/fdatasync()? If you want to guarantee that metadata was written too, you will have to look at O_SYNC/fsync().
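
As a rough sketch of the fdatasync()/fsync() route, the hypothetical helper below (`write_and_flush` is a name invented here) writes a buffer with ordinary buffered I/O and then asks the kernel to flush it before returning. It assumes the file is written from offset 0 and truncated first. Note that even a successful flush only reflects what the kernel and device report, so the hardware caveats above still apply.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write `len` bytes to `path`, then flush before returning.
 * want_metadata == 0: fdatasync(), which flushes the data plus only the
 *                     metadata needed to read it back (e.g. file size).
 * want_metadata != 0: fsync(), which also flushes the other file metadata.
 * Returns 0 on success, -1 with errno set on failure.
 * NOTE: for a newly created file, making the directory entry durable may
 * additionally need an fsync() on the parent directory (not done here).
 */
int write_and_flush(const char *path, const void *data, size_t len,
                    int want_metadata)
{
    assert(path != NULL && data != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write(2) may accept fewer bytes than requested, so loop. */
    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted, retry */
            int e = errno;
            close(fd);
            errno = e;
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    int rc = want_metadata ? fsync(fd) : fdatasync(fd);
    if (rc != 0) {
        int e = errno;
        close(fd);
        errno = e;
        return -1;
    }
    return close(fd) == 0 ? 0 : -1;
}
```

Opening the file with O_DSYNC or O_SYNC instead would request similar behaviour on every write() rather than at one explicit flush point. Which approach fits better depends on how often you need a durability point.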


Newburg answered 24/3, 2018 at 7:40 Comment(3)
Will ALL the bytes you write always be committed in a single write, unlike a write to a socket? I assume that even if not, it'll never write in batches smaller than PAGE_SIZE? - Epigone
@Epigone if we're talking about Linux in the general case the answer is no: you can't guarantee an ordinary write(2) won't be split apart by the block layer (you may be interested in reading https://mcmap.net/q/269933/-are-disk-sector-writes-atomic and https://mcmap.net/q/213310/-linux-writes-are-split-into-512k-chunks). Also, if the disk supports it you can legitimately write less than PAGE_SIZE (many recent disks may do read-modify-write (RMW) so they can support 512-byte sectors), and that's before you consider that some Linux platforms have PAGE_SIZE==64k... - Newburg
Thanks! I read everything. My NVMe device has max_hw_sectors_kb: 2048, max_sectors_kb: 1280 and logical_block_size: 512. So even on a bleeding-edge Linux kernel, a single write() to a file will never write more than 1280 KB? So is there no point in submitting a write larger than that? - Epigone
