TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?
I know you have to fsync() the file (and its parent directory) for durability. The question is: if the kernel loses dirty buffers that are pending write-out due to an I/O error, how can the application detect this and recover or abort?
Think database applications, etc, where order of writes and write durability can be crucial.
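For concreteness, here's the kind of write path I mean by that (a minimal sketch with error handling abbreviated; the paths and data are just placeholders):

```c
/* Minimal sketch of a "durable write" as I understand it: write the data,
 * fsync() the file, then fsync() the containing directory so the new
 * directory entry is durable too.  Paths and data are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *buf = "some record\n";

    int fd = open("/data/journal/segment.001", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); exit(1); }

    if (fsync(fd) < 0) { perror("fsync file"); exit(1); }     /* flush file data + metadata */
    close(fd);

    int dirfd = open("/data/journal", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) { perror("open dir"); exit(1); }
    if (fsync(dirfd) < 0) { perror("fsync dir"); exit(1); }   /* make the directory entry durable */
    close(dirfd);

    return 0;
}
```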
Lost writes? How?
The Linux kernel's block layer can, under some circumstances, lose buffered I/O requests that have been submitted successfully by write(), pwrite(), etc., with an error like:
Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0
(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c.)
On newer kernels the error will instead contain "lost async page write", like:
Buffer I/O error on dev dm-0, logical block 12345, lost async page write
Since the application's write()
will have already returned without error, there seems to be no way to report an error back to the application.
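To make the problem concrete: at submission time everything looks fine from userspace. Here's a sketch of an ordinary buffered write path (the short-write loop is the only error reporting available at this point; the helper name is mine):

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: a plain buffered write.  A successful return only means the bytes
 * were copied into the page cache; if writeback fails later, this call has
 * already returned and cannot report it. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted, nothing written this round */
            return -1;           /* submission-time error (ENOSPC, EBADF, ...) */
        }
        done += (size_t)n;       /* short write: keep going */
    }
    return (ssize_t)done;        /* "success" - but the data may only be in RAM */
}
```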
Detecting them?
I'm not that familiar with the kernel sources, but I think it sets AS_EIO on the mapping of the buffer that failed to be written out if it's doing an async write:
set_bit(AS_EIO, &page->mapping->flags);   /* flag the whole mapping as having seen an I/O error */
set_buffer_write_io_error(bh);            /* set the write-error bit on the buffer head */
clear_buffer_uptodate(bh);                /* buffer contents no longer considered up to date */
SetPageError(page);                       /* mark the page itself as errored */
but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.
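What I'm hoping is that the flag surfaces on the next flush, i.e. something like the following would be the detection point (a sketch, assuming the error really does propagate to fsync(); the helper name is mine):

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: if a lost page write really does propagate, fsync()'s return value
 * is where the application would see it. */
void checkpoint_or_die(int fd)
{
    if (fsync(fd) == 0)
        return;                  /* as far as we can tell, the data is on disk */

    if (errno == EIO)
        fprintf(stderr, "fsync: EIO - an earlier buffered write was lost?\n");
    else
        perror("fsync");

    /* Don't carry on as if the data were safe; crash and let recovery redo the work. */
    abort();
}
```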
It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.
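For what it's worth, that same path looks reachable directly from userspace via sync_file_range(2). A sketch, assuming the -EIO really does come back out (and noting that sync_file_range() is explicitly not a durability barrier - it flushes neither metadata nor the disk write cache):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/* Sketch: push the whole file through writeback with sync_file_range(2) and
 * see whether a failed page write surfaces as -EIO.  Only interesting here as
 * a way to poke the writeback path; it is not a substitute for fsync(). */
int flush_and_check(int fd)
{
    int rc = sync_file_range(fd, 0, 0,                 /* offset 0, nbytes 0 = to EOF */
                             SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE |
                             SYNC_FILE_RANGE_WAIT_AFTER);
    if (rc < 0 && errno == EIO)
        fprintf(stderr, "sync_file_range: EIO - lost page write?\n");
    return rc;
}
```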
If, as I'm guessing, this propagates to fsync()'s result, then an app that panics and bails out when it gets an I/O error from fsync(), and that knows how to re-do its work when restarted, should have a sufficient safeguard?
There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages so it can rewrite them if it knows how, but if the app repeats all its pending work since the last successful fsync()
of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync()
to complete - right?
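In other words, the coping strategy I'm imagining looks roughly like the sketch below. The redo-log helpers are hypothetical names of mine; the point is only the ordering: crash on an fsync() EIO, then on restart rewrite everything since the last good fsync() before trying to fsync() again.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical application-side helpers (assumed, not real APIs): */
extern long last_good_fsync(void);                    /* id of the last durable checkpoint */
extern void redo_log_replay_since(long checkpoint);   /* rewrites all work done since then */

/* Sketch of the recovery discipline described above: never trust anything
 * written after the last successful fsync(); after a crash caused by an
 * fsync() EIO, rewrite everything since the last good checkpoint, which
 * should re-dirty any lost pages even though we can't identify them. */
void recover_and_checkpoint(int fd)
{
    redo_log_replay_since(last_good_fsync());

    if (fsync(fd) < 0) {
        /* Still failing (e.g. the storage really is gone): give up rather
         * than pretend the data is durable. */
        perror("fsync after redo");
        abort();
    }
}
```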
Are there, then, any other harmless circumstances in which fsync() may return -EIO, where bailing out and redoing work would be too drastic?
Why?
Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance in which they can happen - I've also seen reports of it from thin-provisioned LVM, for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.
If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.
The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that ended up causing database corruption, because the DBMS didn't know its writes had failed. Not fun.
Comments:

- […] fsync() in glibc is backed by the sys_sync_file_range syscall, in which case I'm pretty sure I know what's wrong now - and that the fsync man page badly needs a big fat warning added to it about retries. – Taggart
- […] O_SYNC|O_DIRECT combination at opening. – Myriad
- […] fsync() and buffered I/O entirely. My question is how you can use buffered I/O and fsync() correctly and safely. – Taggart