Writing programs to cope with I/O errors causing lost writes on Linux

TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?

I know you have to fsync() the file (and its parent directory) for durability. The question is if the kernel loses dirty buffers that are pending write due to an I/O error, how can the application detect this and recover or abort?

Think database applications and the like, where write ordering and write durability can be crucial.
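
For concreteness, a minimal sketch (not the real application's code; file and directory names are made up) of the durability dance I mean - check every write(), fsync() the file, fsync() the parent directory, and check close() too:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Write buf to path and make it durable, aborting on any error. */
    static void write_durably(const char *dirpath, const char *path,
                              const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }

        /* write() only queues data in the page cache; success here says
         * nothing about whether it will ever reach the disk. */
        if (write(fd, buf, len) != (ssize_t)len) { perror("write"); exit(1); }

        /* fsync() forces write-out; deferred I/O errors are supposed to
         * surface here. */
        if (fsync(fd) != 0) { perror("fsync"); exit(1); }
        if (close(fd) != 0) { perror("close"); exit(1); }

        /* fsync() the parent directory so the new name is durable too. */
        int dfd = open(dirpath, O_RDONLY);
        if (dfd < 0) { perror("open dir"); exit(1); }
        if (fsync(dfd) != 0) { perror("fsync dir"); exit(1); }
        if (close(dfd) != 0) { perror("close dir"); exit(1); }
    }

    int main(void)
    {
        write_durably(".", "./important.dat", "payload\n", 8);
        return 0;
    }

The question is what an -EIO from any of these calls actually tells us, and what the application can safely do next.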

Lost writes? How?

The Linux kernel's block layer can, under some circumstances, lose buffered I/O writes that were submitted successfully by write(), pwrite(), etc., leaving only a kernel log message like:

Buffer I/O error on device dm-0, logical block 12345
lost page write due to I/O error on dm-0

(See end_buffer_write_sync(...) and end_buffer_async_write(...) in fs/buffer.c).

On newer kernels the error will instead contain "lost async page write", like:

Buffer I/O error on dev dm-0, logical block 12345, lost async page write

Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.

Detecting them?

I'm not that familiar with the kernel sources, but I think the kernel sets AS_EIO on the buffer that failed to be written out if it's doing an async write:

    set_bit(AS_EIO, &page->mapping->flags);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

but it's unclear to me if or how the application can find out about this when it later fsync()s the file to confirm it's on disk.

It looks like wait_on_page_writeback_range(...) in mm/filemap.c might be called by do_sync_mapping_range(...) in fs/sync.c, which is in turn called by sys_sync_file_range(...). It returns -EIO if one or more buffers could not be written.

If, as I'm guessing, this propagates to fsync()'s result, then if the app panics and bails out when it gets an I/O error from fsync(), and knows how to re-do its work when restarted, that should be a sufficient safeguard?

There's presumably no way for the app to know which byte offsets in a file correspond to the lost pages, so it can't rewrite just those even if it knows how. But if the app repeats all its pending work since the last successful fsync() of the file, and that rewrites any dirty kernel buffers corresponding to lost writes against the file, that should clear any I/O error flags on the lost pages and allow the next fsync() to complete - right?
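
Something like this sketch is what I have in mind. replay_pending_work() is a hypothetical application callback that rewrites everything since the last known-good sync; it's not a real API:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical: rewrite every range the app has written since its
     * last successful fsync(), re-dirtying any pages whose write-out
     * was lost. Stubbed out here. */
    static void replay_pending_work(int fd) { (void)fd; }

    static void fsync_or_redo(int fd)
    {
        for (;;) {
            if (fsync(fd) == 0)
                return;              /* data confirmed on disk */
            if (errno != EIO) {
                perror("fsync");     /* something else entirely */
                exit(1);
            }
            /* -EIO: some page write-out was lost. Rewriting the data
             * should dirty the pages again so the next fsync() retries
             * them - whether that's actually sufficient is the question. */
            replay_pending_work(fd);
        }
    }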

Are there then any other, harmless, circumstances where fsync() may return -EIO where bailing out and redoing work would be too drastic?

Why?

Of course such errors should not happen. In this case the error arose from an unfortunate interaction between the dm-multipath driver's defaults and the sense code used by the SAN to report failure to allocate thin-provisioned storage. But this isn't the only circumstance where they can happen - I've also seen reports of it from thin-provisioned LVM, for example, as used by libvirt, Docker, and more. A critical application like a database should try to cope with such errors, rather than blindly carrying on as if all is well.

If the kernel thinks it's OK to lose writes without dying with a kernel panic, applications have to find a way to cope.

The practical impact is that I found a case where a multipath problem with a SAN caused lost writes that ended up causing database corruption, because the DBMS didn't know its writes had failed. Not fun.

Taggart answered 24/2, 2017 at 9:19 Comment(12)
I'm afraid this would need additional fields in the SystemFileTable to store and remember these error conditions, and a way for the userspace process to receive or inspect them on subsequent calls. (Do fsync() and close() return this kind of historic information?) – Rictus
@Rictus Thanks. I just posted an answer with what I think is going on. Mind giving it a sanity-check, since you seem to know more about what's going on than the people who've posted obvious variants of "write() needs close() or fsync() for durability" without reading the question? – Taggart
BTW: I think you really should delve into the kernel sources. The journalled filesystems would probably suffer from the same kind of problems, not to mention the swap partition handling. Since these live in kernel space, the handling of these conditions will probably be a bit more rigid. writev(), which is visible from userspace, also seems like a place to look. [at Craig: yes, because I know your name, and I know you are not a complete idiot ;-] – Rictus
@Rictus Yeah, been source-spelunking for a couple of hours now. Love that a total kernel newbie can follow what's going on reasonably well - lovely clean sources. Currently trying to verify that fsync() in glibc is backed by the sys_sync_file_range syscall, in which case I'm pretty sure I know what's wrong now - and that the fsync man page badly needs a big fat warning added to it about retries. – Taggart
My guess is that retry behaviour is different for internal kernel use and userspace: system usage should retry and panic; user space should (maybe retry and) fail ASAP. Clearing the error flags in between requests does not seem right. Source quality: at least they don't cast malloc()'s return value! BTW, there are of course a lot of options for failure inside the system (such as unmounting the device, etc.). – Rictus
In the case of a critical application, you just have to use the O_SYNC|O_DIRECT combination when opening. – Myriad
@Jean-BaptisteYunès If you don't mind the utterly abysmal performance that results. Synchronous writes are awful. You can't realistically do that in something like a DBMS unless you've gone Oracle-style and re-implemented half the kernel's I/O smarts in your DBMS, with dedicated writer threads, cache management, etc. In which case you might as well use AIO. You're trying to define the problem away by avoiding fsync() and buffered I/O entirely. My question is how you can use buffered I/O and fsync() correctly and safely. – Taggart
Sure, it would be terribly bad for performance, but it is a price to pay. If you want to read something about ensuring ACID properties for a DBMS, you can read the following: quora.com/…. Interesting questions are answered very simply. – Myriad
@Jean-BaptisteYunès I agree, good explanation. But again, somewhat off topic. I'm asking whether, using buffered I/O, it is possible to detect and correctly cope with I/O errors where the kernel loses writes. I have answered this myself, below, after further research. Now, if I wanted to completely rewrite the DBMS to use AIO or a pool of threads with synchronous direct I/O or something, sure, I could follow your suggestion. But it should be possible - and (per below) is possible - to do it correctly with buffered I/O and careful use of fsync(). – Taggart
@Jean-BaptisteYunès By analogy, if I asked "how do I park a car", you've first explained "don't park the car, use a taxi so you don't have to", then "here's how parking regulations work". Both interesting, but not that relevant to the original problem. – Taggart
I agree, I wasn't being fair. Alas, your answer is not very satisfying; I mean, there's no easy solution (surprising?). – Myriad
@Jean-BaptisteYunès True. For the DBMS I'm working with, "crash and enter redo" is acceptable. For most apps that's not an option, and they might have to tolerate synchronous I/O's horrid performance or just accept poorly defined behaviour and corruption on I/O errors. – Taggart

fsync() returns -EIO if the kernel lost a write

(Note: early part references older kernels; updated below to reflect modern kernels)

It looks like async buffer write-out failures in end_buffer_async_write(...) set an -EIO flag on the failed dirty buffer page for the file:

set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);

which is then detected by wait_on_page_writeback_range(...) as called by do_sync_mapping_range(...) as called by sys_sync_file_range(...) as called by sys_sync_file_range2(...) to implement the C library call fsync().

But only once!

This comment on sys_sync_file_range:

 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.

suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state so a subsequent fsync() will report success even though the pages never got written.

Sure enough wait_on_page_writeback_range(...) clears the error bits when it tests them:

        /* Check for outstanding write errors */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                ret = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                ret = -EIO;

So if the application expects it can re-try fsync() until it succeeds and trust that the data is on-disk, it is terribly wrong.
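
To make the hazard concrete, here's a sketch of the broken pattern, assuming the clear-on-report behaviour shown above:

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Anti-pattern: retry fsync() until it "succeeds". Because the first
     * failing call test-and-clears AS_EIO/AS_ENOSPC, the retry will
     * typically return 0 even though the lost pages were never written. */
    static int fsync_retry_broken(int fd)
    {
        while (fsync(fd) != 0) {
            if (errno != EIO && errno != ENOSPC)
                return -1;                   /* unrelated failure */
            fprintf(stderr, "fsync failed, retrying...\n");
            /* BUG: nothing is rewritten here, so there's nothing left
             * for the kernel to retry; the next call reports success. */
        }
        return 0;   /* "success" that proves nothing about the data */
    }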

I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.

Is this allowed?

The POSIX/SuS docs on fsync() don't really specify this either way:

If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.

Linux's man-page for fsync() just doesn't say anything about what happens on failure.

So it seems that the meaning of fsync() errors is "I don't know what happened to your writes, might've worked or not, better try again to be sure".

Newer kernels

On 4.9 end_buffer_async_write sets -EIO on the page, just via mapping_set_error.

    buffer_io_error(bh, ", lost async page write");
    mapping_set_error(page->mapping, -EIO);
    set_buffer_write_io_error(bh);
    clear_buffer_uptodate(bh);
    SetPageError(page);

On the sync side I think it's similar, though the structure is now pretty complex to follow. Error checks seem to all go through filemap_check_errors in mm/filemap.c, which does a test-and-clear:

    if (test_bit(AS_EIO, &mapping->flags) &&
        test_and_clear_bit(AS_EIO, &mapping->flags))
            ret = -EIO;
    return ret;

which has much the same effect as the older wait_on_page_writeback_range code.

I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up a perf probe on it:

sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp

sudo perf probe filemap_check_errors
sudo perf probe end_buffer_async_write

sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync

I find the following call stack in perf report -T:

        ---__GI___libc_fsync
           entry_SYSCALL_64_fastpath
           sys_fsync
           do_fsync
           vfs_fsync_range
           ext4_sync_file
           filemap_write_and_wait_range
           filemap_check_errors

A read-through suggests that yeah, modern kernels behave the same.

This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.

Test

I've implemented a test case to demonstrate this behaviour.

Implications

A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it" and I expect a lot of apps won't cope well with this behaviour.

Bug reports

Further reading

lwn.net touched on this in the article "Improved block-layer error handling".

postgresql.org mailing list thread.

Taggart answered 24/2, 2017 at 10:17 Comment(10)
lxr.free-electrons.com/source/fs/buffer.c?v=2.6.26#L598 is a possible race, because it waits for {pending & scheduled I/O}, not for {not yet scheduled I/O}. This is obviously to avoid extra round-trips to the device. (I presume user write()s don't return until the I/O is scheduled; for mmap() this is different.) – Rictus
Is it possible that some other process's call to fsync() for some other file on the same disk gets the error return? – Pemphigus
@Pemphigus Very relevant for a multi-processing DB like PostgreSQL, so good question. Looks like probably, but I don't know the kernel code well enough to understand. Your procs had better be co-operating if they both have the same file open anyway, though. – Taggart
@DavidFoerster: The syscalls return failures using negative errno codes; errno is completely a construct of the userspace C library. It is common to ignore the return value differences between the syscalls and the C library like this (as Craig Ringer does, above), since the error return value reliably identifies which one (syscall or C library function) is being referred to: "-1 with errno==EIO" refers to a C library function, whereas "-EIO" refers to a syscall. Finally, Linux man pages online are the most up to date reference for Linux man pages. – Fencesitter
@CraigRinger: To answer your final question: "By using low-level I/O and fsync()/fdatasync() when the transaction size is a complete file; by using mmap()/msync() when the transaction size is a page-aligned record; and by using low-level I/O, fdatasync(), and multiple concurrent file descriptors (one descriptor and a thread per transaction) to the same file otherwise". The Linux-specific open file description locks (fcntl(), F_OFD_) are very useful with the last one. – Fencesitter
@NominalAnimal So normal apps must remember all write()s since the last successful fsync() or close() and be able to repeat them on -EIO? If the page is flagged AS_EIO, does the app also need to re-read the page from disk to clear the flag? If it just tries to write a partial page, I'd expect another -EIO result. I bet you'll find something approximating zero applications that get this right in the wild, especially given that the relevant man pages don't seem to say anything much about it. – Taggart
@CraigRinger: What kind of normal apps are you referring to? For my workloads, I care about file integrity, not record integrity. This means my normal apps must report write(), fsync(), and close() failures (so that I, as the end user, know about it); and they must also do the fsync(). Most, but not all, current applications do not do the fsync() or even fdatasync(), and they do ignore close() and fsync() errors. This is bothersome to me, especially since upstream is typically uninterested in adding options for those checks (as "rare errors are not worth checking for"). – Fencesitter
@CraigRinger: For non-interactive tasks, logging and flagging (again, for user/admin action) is the most important part. (I for one check raw SMART attributes on my spinning disks to get indications of hardware failures - I'm too poor for large SSDs - and I'd love to have my services properly report any I/O errors too.) – Fencesitter
@CraigRinger: For record-based stuff, I prefer to rely on remote replication instead, which reduces the problem back to "file integrity" (or table or db integrity). There are some applications that do have the necessary fsync() (often needing an option to enable it) and do check it and close() for errors, but I don't have a list for you. All utilities using gnulib close_stream() (coreutils) do check for close() errors; dd even has conv=fsync support. (So, use dd for important data!) – Fencesitter
@NominalAnimal Checking for errors is nice, but it just tells you "um, that didn't work, and now I don't know what the state is". Actually repeating the operation in a predictable manner and coping with any prior borkedness the previous failure caused... dd can be expected to do that, but I'll be surprised if many others do. – Taggart

Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.

I do not agree. write() can return without error if the write is simply queued, but the error will be reported on the next operation that requires the actual writing to disk: that means on the next fsync(), possibly on a following write() if the system decides to flush the cache, and at the latest on the final close().

That is the reason why it is essential for an application to test the return value of close() to detect possible write errors.

If you really need to be able to do clever error processing, you must assume that everything written since the last successful fsync() may have failed, and that at least something in all of it has failed.
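
A minimal sketch of that rule (illustrative, not production code): every call that can surface a deferred write error gets checked, and any failure means everything since the last successful fsync() is suspect:

    #include <fcntl.h>
    #include <unistd.h>

    /* Returns 0 only if the data is known durable; closes fd either way.
     * On any failure the caller must treat everything written since its
     * last successful sync as possibly lost. */
    static int save_and_sync(int fd, const char *data, size_t len)
    {
        if (write(fd, data, len) != (ssize_t)len) {
            close(fd);
            return -1;          /* the queued write failed outright */
        }
        if (fsync(fd) != 0) {
            close(fd);
            return -1;          /* deferred write errors surface here */
        }
        if (close(fd) != 0)
            return -1;          /* ...or, at the very last, here */
        return 0;
    }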

Shinleaf answered 24/2, 2017 at 10:33 Comment(1)
Yeah, I think that nails it. This would indeed suggest that the application should re-do all its work since the last confirmed-successful fsync() or close() of the file if it gets an -EIO from write(), fsync() or close(). Well, that's fun. – Taggart

write(2) provides less than you expect. The man page is very open about the semantics of a successful write() call:

A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.

We can conclude that a successful write() merely means that the data has reached the kernel's buffering facilities. If persisting the buffer fails, a subsequent access to the file descriptor will return the error code. As a last resort, that may be close(). The man page of the close(2) system call contains the following sentence:

It is quite possible that errors on a previous write(2) operation are first reported at the final close().

If your application needs to persist data right away, it has to use fsync()/fdatasync() on a regular basis:

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
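
For example, a periodic-sync append loop might look like this sketch (the record size and batch size are made-up values):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define RECLEN     64    /* illustrative fixed record size */
    #define SYNC_EVERY 128   /* illustrative batch size */

    /* Append records, making them durable with fdatasync() every
     * SYNC_EVERY records and once more at the end. A sync failure
     * means the records since the last good sync cannot be trusted. */
    static void append_records(int fd, const char (*recs)[RECLEN], size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (write(fd, recs[i], RECLEN) != RECLEN) {
                perror("write");
                exit(1);
            }
            if ((i + 1) % SYNC_EVERY == 0 && fdatasync(fd) != 0) {
                perror("fdatasync");
                exit(1);
            }
        }
        if (fdatasync(fd) != 0) {    /* flush the final partial batch */
            perror("fdatasync");
            exit(1);
        }
    }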

Zacheryzack answered 24/2, 2017 at 9:52 Comment(5)
Yes, I'm aware that fsync() is required. But in the specific case where the kernel loses the pages due to an I/O error, will fsync() fail? Under what circumstances can it then succeed afterwards? – Taggart
I don't know the kernel source either. Let's assume fsync() returns -EIO on I/O issues (what would it be good for otherwise?). So the database knows some previous write failed and could go into recovery mode. Is this not what you want? What is the motivation of your last question? Do you want to know which write failed, or to recover the file descriptor for further use? – Zacheryzack
Ideally a DBMS will prefer not to enter crash recovery (kicking off all users and becoming temporarily inaccessible, or at least read-only) if it can possibly avoid it. But even if the kernel could tell us "bytes 4096 to 8191 of fd X", it'd be hard to figure out what to (re)write there without pretty much doing crash recovery. So I guess the main question is whether there are any more innocent circumstances where fsync() may return -EIO where it is safe to retry, and if it's possible to tell the difference. – Taggart
Sure, crash recovery is the last resort. But as you already said, these issues are expected to be very, very rare. Therefore, I don't see an issue with going into recovery on any -EIO. If each file descriptor is only used by one thread at a time, that thread could go back to the last fsync() and redo the write() calls. But still, if those write()s only write part of a sector, the unmodified part may still be corrupt. – Zacheryzack
You're right that going into crash recovery is likely reasonable. As for partly corrupt sectors, the DBMS (PostgreSQL) stores an image of the whole page the first time it touches it after any given checkpoint for just that reason, so it should be fine :) – Taggart

Use the O_SYNC flag when you open the file. It ensures the data is written to the disk.

If that doesn't satisfy you, nothing will.
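
A sketch of that approach (file name made up): with O_SYNC, each write() returns only once the data has been handed to the storage device, so I/O errors are reported synchronously - at a severe performance cost:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_SYNC: every write() blocks until the data (and the metadata
         * needed to retrieve it) has reached the storage device. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, "payload\n", 8) != 8) {
            perror("write");    /* I/O errors show up right here */
            return 1;
        }
        return close(fd) != 0;  /* still worth checking */
    }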

Brushoff answered 24/2, 2017 at 9:34 Comment(4)
O_SYNC is a nightmare for performance. It means the application cannot do anything else while disk I/O is occurring unless it spawns off I/O threads. You might as well say that the buffered I/O interface is unsafe and everyone should use AIO. Surely silently-lost writes cannot be acceptable in buffered I/O? – Taggart
(O_DSYNC is only slightly better in that regard.) – Taggart
@CraigRinger You should use AIO if you have this need and need any sort of performance. Or just use a DBMS; it handles everything for you. – Bridgeport
@Bridgeport The application here is a DBMS (PostgreSQL). I'm sure you can imagine that rewriting the entire application to use AIO instead of buffered I/O is not practical. Nor should it be necessary. – Taggart

Check the return value of close(). close() can fail whilst buffered writes appear to succeed.

Yolandoyolane answered 24/2, 2017 at 10:05 Comment(1)
Well, we hardly want to be open()ing and close()ing the file every few seconds. That's why we have fsync()... – Taggart
