(2018, many years after this question was first asked)
What does it take to be durable on Linux?
From reading your question I see you have a filesystem between you and the disk. So the question becomes:
What does it take to be durable using a Linux filesystem?
The best you can do (in the general filesystem and unspecified hardware case) is the "fsync dance", which goes something like this:

```
preallocate_file(tmp);
fsync(tmp);
fsync(dir);
rename(tmp, normal);
fsync(normal);
fsync(dir);
```
(shamelessly stolen from the comment Andres Freund (PostgreSQL developer) left on LWN). You must check the return code of every call before proceeding, and assume something went wrong if any call returns non-zero. If you're using `mmap`, then `msync(MS_SYNC)` is the equivalent of `fsync`.
A similar pattern to the above is mentioned in Dan Luu's "Files are hard" (which has a nice table about overwrite atomicity on various filesystems), the LWN article "Ensuring data reaches disk", and Ted Ts'o's "Don't fear the fsync!".
> For all of these [`O_DIRECT | O_DSYNC`, `O_DIRECT` + `fdatasync`, `mmap` + `msync`], is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?
Yes, you could have unnoticed corruption, because "allocating writes" (writes that grow the file past its current bounds) cause metadata operations, and you are not checking for metadata durability (only data durability).
> if my server dies mid `pwrite`, or between the beginning of `pwrite` and the end of `fdatasync`, or between the mapped memory being altered and `msync`, I'll have a mix of old and new data, [etc.]
As the state of the data is undefined in the case of an interrupted overwrite, it could be anything...
> I want my individual `pwrite` calls to be atomic and ordered. Is this the case?
Between `fsync`s, reordering could occur (e.g. if `O_DIRECT` silently fell back to buffered I/O).
> [Is this the] case if they're across multiple files?
You're in even more trouble. To cover this you would need to write your own journal, and probably use file renames too.
> if I write with `O_DIRECT | O_DSYNC` to A, then `O_DIRECT | O_DSYNC` to B, [...]
No.
> Does `fsync` even guarantee that the data's written?
Yes. It is necessary (if not sufficient) to achieve the above (with modern Linux and a truthful disk stack, assuming no bugs).
> Does the journalling of ext4 completely solve the issue of corrupt blocks [...]
No.
(ETOOMANYQUESTIONS)
Yes, the Linux software stack could be buggy (2019: see the addendum below), or the hardware could be buggy (or lie in a way it can't back up), but that doesn't stop the above being the best you can do if everything lives up to its end of the bargain on a POSIX filesystem. If you know you have a particular OS with a particular filesystem (or no filesystem) and a particular hardware setup, then you may be able to reduce the need for some of the above, but in general you should not skip any step.
Bonus answer: `O_DIRECT` alone cannot guarantee durability when used with filesystems (an initial issue being "how do you know the metadata has been persisted?"). See "Clarifying Direct IO's Semantics" in the Ext4 wiki for discussion on this point.
Addendum (March 2019)
Even with the current (at the time of writing, 5.0) Linux kernel, `fsync` doesn't always see error notifications, and kernels before 4.16 were even worse. The PostgreSQL folks found that notification of errors can be lost and unwritten pages marked as clean, leading to a case where `fsync` returns success even though there was a (swallowed) error asynchronously writing back the data. Most Linux filesystems don't reliably keep dirty data around once a failure has happened, so repeatedly "retrying" a failed `fsync` doesn't necessarily do what you might expect. See the PostgreSQL "Fsync Errors" wiki page, the LWN article "PostgreSQL's fsync() surprise", and the FOSDEM 2019 talk "How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it" for details.
So the post-credits conclusion is that it's complicated:
- The `fsync` dance is necessary (even if it's not always sufficient) to at least cover the non-buggy I/O stack case
- If you do your (write) I/O via direct I/O, you will be able to get accurate errors when a write goes wrong
- Earlier (older than 4.16) kernels were buggy when it came time to get errors via `fsync`
Also see: