How to durably rename a file in POSIX?

W

4

19

What's the correct way to durably rename a file in a POSIX file system? Specifically wondering about fsyncs on the directories. (If this depends on the OS/FS, I'm asking about Linux and ext3/ext4).

Note: there are other questions on StackOverflow about durable renames, but AFAICT they don't address fsync-ing the directories (which is what matters to me - I'm not even modifying file data).

I currently have (in Python):

import os

dstdirfd = os.open(dstdirpath, os.O_DIRECTORY | os.O_RDONLY)
os.rename(os.path.join(srcdirpath, filename), os.path.join(dstdirpath, filename))
os.fsync(dstdirfd)  # flush the destination directory entry

Specific questions:

Does this also implicitly fsync the source directory? Or might I end up with the file showing up in both directories after a power cycle (meaning I'd have to check the hard link count and manually perform recovery), i.e. it's impossible to guarantee a durably atomic move operation?
If I fsync the source directory instead of the destination directory, will that also implicitly fsync the destination directory?
Are there any useful related testing/debugging/learning tools (fault injectors, introspection tools, mock filesystems, etc.)?
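
For the manual-recovery case raised in the first question, a minimal sketch of the post-reboot check might look like the following. `rename_state` is a hypothetical helper, not part of any library, and it assumes both candidate paths are still known after the restart. It compares device and inode numbers instead of relying on the link count alone, since nothing in this thread establishes what `st_nlink` looks like after a crash.

```python
import os

def _lstat_or_none(path):
    """Return os.lstat(path), or None if the name does not exist."""
    try:
        return os.lstat(path)
    except FileNotFoundError:
        return None

def rename_state(src_path, dst_path):
    """Classify where the file is: 'src', 'dst', 'both', 'neither' or 'conflict'."""
    src = _lstat_or_none(src_path)
    dst = _lstat_or_none(dst_path)
    if src is not None and dst is not None:
        # Same device and inode means both names refer to one file.
        if (src.st_dev, src.st_ino) == (dst.st_dev, dst.st_ino):
            return "both"
        return "conflict"  # two unrelated files share the name
    if src is not None:
        return "src"
    if dst is not None:
        return "dst"
    return "neither"
```

If the result is "both", recovery could unlink one of the two names and fsync the affected directory; which name to keep depends on the application.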

Thanks in advance.

Weatherworn answered 21/9, 2010 at 21:57 Comment(0)

T

14

POSIX defines that the rename function must be atomic.

So if you rename(A, B), under no circumstances should you ever see a state with the file in both directories or neither directory. There will always be exactly one, no matter what you do with fsync() or whether the system crashes.

But that doesn't solve the problem of making sure the rename() operation is durable. POSIX answers this question:

If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state. All I/O operations shall be completed as defined for synchronized I/O file integrity completion.

So if you fsync() a directory, pending rename operations must be transferred to disk by the time this returns. fsync() of either directory should be sufficient because atomicity of the rename() operation would require that both directories' changes be synced atomically.
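
The `_POSIX_SYNCHRONIZED_IO` condition in the quoted text can be queried at runtime. The sketch below assumes the platform's Python exposes the `SC_SYNCHRONIZED_IO` name through `os.sysconf_names` and returns `None` where it does not; `synchronized_io_advertised` is a hypothetical helper name. It only reports what the system advertises and says nothing about what the hardware does after a power loss.

```python
import os

def synchronized_io_advertised():
    """Return True/False as reported by sysconf, or None if it cannot be queried."""
    name = "SC_SYNCHRONIZED_IO"
    names = getattr(os, "sysconf_names", {})
    if not hasattr(os, "sysconf") or name not in names:
        return None  # this platform does not expose the query
    try:
        value = os.sysconf(name)
    except (OSError, ValueError):
        return None
    # sysconf returns -1 when the option is not supported.
    return value > 0

print(synchronized_io_advertised())
```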

Finally, in contrast to the claim in the blog post mentioned in another answer, the rationale for this explains the following:

The fsync() function is intended to force a physical write of data from the buffer cache, and to assure that after a system crash or other failure that all data up to the time of the fsync() call is recorded on the disk. Since the concepts of "buffer cache", "system crash", "physical write", and "non-volatile storage" are not defined here, the wording has to be more abstract.

A system that claimed to be POSIX compliant and that considered it correct behavior (i.e. not a bug or hardware failure) to complete an fsync() and not persist those changes across a system crash would have to be deliberately misrepresenting itself with respect to the spec.

(updated with additional info re: Linux-specific vs. portable behavior)

Titillate answered 27/4, 2011 at 18:56 Comment(2)

The reasoning here is very wrong. – The “atomicity” of rename(), for example, refers to the “newpath”, under which should be the old file (if there was one) or the renamed file, with no state in between (as seen by other processes). – Bannon 11/5, 2013 at 19:2

Robert pointed it out, rename is "atomic" with respect to assigning new inode to name, whether new inode points to unflushed data or not is not defined by POSIX. – Hayott 19/9, 2015 at 18:14

B

16

Unfortunately Dave’s answer is wrong.

Not every POSIX system even has durable storage. And on those that do, the storage is still “allowed” to be hosed after a system crash. For such systems a no-op fsync() makes sense, and a no-op fsync() is explicitly allowed under POSIX. It is also legal for the file to be recoverable in the old directory, the new directory, both, or any other location. POSIX makes no guarantees about system crashes or file system recovery.

The real question should be:

How to do a durable rename on systems which support that through the POSIX API?

You need to fsync() both the source and the destination directory, because the minimum those fsync()s are supposed to do is persist what the source or destination directory should look like.
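
A minimal sketch of that advice in Python, assuming a Linux-like system where a directory can be opened read-only and passed to fsync(). `durable_rename` is a hypothetical helper name, and the order of the two fsync() calls is a choice made for this sketch, not something POSIX prescribes.

```python
import os

def durable_rename(src_dir, dst_dir, filename):
    """Rename src_dir/filename to dst_dir/filename, then fsync both directories."""
    # O_DIRECTORY is not available everywhere, so fall back to plain O_RDONLY.
    dir_flags = os.O_RDONLY | getattr(os, "O_DIRECTORY", 0)
    src_fd = os.open(src_dir, dir_flags)
    try:
        dst_fd = os.open(dst_dir, dir_flags)
        try:
            os.rename(os.path.join(src_dir, filename),
                      os.path.join(dst_dir, filename))
            # Flush the directory that gained the entry, then the one that lost it.
            os.fsync(dst_fd)
            os.fsync(src_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Both descriptors are opened before the rename so that a failure to open either directory is reported before anything has changed.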

Does a fsync(destdirfd) also implicitly fsync the source directory?

POSIX in general: no, nothing implies that
ext3/4: I’m not sure whether the changes to the source and destination directories end up in the same journal transaction. If they do, they are committed together.

Or might I end up with the file showing up in both directories after a power cycle (“crash”), i.e. it's impossible to guarantee a durably atomic move operation?

POSIX in general: no guarantees, but you’re supposed to fsync() both directories, which might not be atomic-durable
ext3/4: the minimum number of fsync() calls you need depends on the mount options. E.g. if mounted with “dirsync”, you need neither of the two fsync()s. At most you need both fsync()s, but I’m almost sure one is enough (atomic-durable then).
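
Whether a mount carries “dirsync” can be checked from the mount table. The sketch below parses text in the /proc/mounts layout (device, mount point, type, comma-separated options). `mount_options` is a hypothetical helper, and the sample text is made up for illustration rather than taken from a real system.

```python
def mount_options(mounts_text, mount_point):
    """Return the set of options for mount_point, or None if it is not listed."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point overrides an earlier one.
            found = set(fields[3].split(","))
    return found

# Made-up sample in /proc/mounts layout.
sample = (
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
    "/dev/sdb1 /data ext4 rw,dirsync,relatime 0 0\n"
)
print("dirsync" in mount_options(sample, "/data"))
```

On a real Linux system the text would come from reading /proc/mounts; mount points containing spaces appear there in escaped form, which this sketch does not decode.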

If I fsync the source directory instead of the destination directory, will that also implicitly fsync the destination directory?

POSIX: no
ext3/4: I really believe both end up in the same transaction, so it doesn’t matter which of them you fsync()
older kernels ext3: (if they aren’t in the same transaction) some not-so-optimal implementations did way too much syncing on fsync(); I bet they committed every transaction that came before. And yes, a normal implementation would first link the file into the destination and then remove it from the source. So the fsync(srcdirfd) would trigger the fsync() of the destination as well.
ext4/latest ext3: if they aren’t in the same transaction, you might be able to sync them completely independently (so do both)

Are there any useful related testing/debugging/learning tools (fault injectors, introspection tools, mock filesystems, etc.)?

For a real crash, no. By the way, a real crash goes beyond the viewpoint of the kernel. The hardware might reorder writes (and fail to write everything), corrupting the filesystem. Ext4 is better prepared against this, because it enables write barriers by default (a mount option; ext3 does not) and can detect corruption with journal checksums (also a mount option).

And for learning: find out if both changes are somehow linked in the journal! :-P

Bannon answered 11/5, 2013 at 18:11 Comment(6)

This is a subtle point, but I was not arguing that fsync() of one directory implies fsync() of the other. It was that fsync() of one, combined with the atomicity of rename(), requires that that specific change to the other directory also be sync'd to disk. I could believe that's not the case, but that's my reading of the spec. Do you have a reference to back up your interpretation that atomicity is not guaranteed across a crash, even if part of the "atomic" change was fsync'd? – Titillate 30/8, 2013 at 18:5

Yes, I do: your link to fsync() on opengroup.org. “It is explicitly intended that a null implementation is permitted.” That is: no guarantees. – And that wishful linking of rename() and fsync() is your invention. And no, it is not a subtle point. – Bannon 3/9, 2013 at 17:3

Agreed that POSIX systems are not required to make any such changes durable, and that the real question is how to use the API to make a change durable on systems that support it. (This seems pedantic, though, since the question obviously presupposes that the underlying system supports it.) In asking for a reference, I was referring to your claim that even on systems that do support sync'ing changes to disk, a successful fsync() of either directory can still result in the file showing up in both places (or neither) after a crash. – Titillate 3/9, 2013 at 20:58

The POSIX documentation talks neither about any categories or levels of crash safety nor about how to reach those levels. – I can hardly point you to a specific place where they do not talk about this. – I challenge you to point me to a place where they do... – On the other side: I’m not aware of any Linux ext3/ext4 documentation claiming that it supports any kind of crash-safe rename() “the POSIX way”. There is not even a portable way to find out whether the system supports such a thing. – Bannon 6/9, 2013 at 6:39

Fsync is not even guaranteed to flush buffers to disk. POSIX explicitly allows a no-op implementation if filesystem can guarantee safety by other means. – Hayott 19/9, 2015 at 18:58

@ArekBulski, as I wrote already, POSIX allows no-op-implementation even if there is no other guarantee. – Bannon 20/9, 2015 at 16:10

E

-1

The answer to your question is going to depend a lot on the specific OS being used, the type of filesystem being used and whether the source and dest are on the same device or not.

I'd start by reading the rename(2) man page on the platform you're using.

Exsanguine answered 12/4, 2011 at 17:59 Comment(2)

Already consulted that man page - nothing relevant. You're saying there's no portable way to rename across directories? I'm willing to believe that but interested in a clearer statement and ideally supporting evidence. Also, do you know the answer for recent Linux 2.6's with ext3/4 (as this question was tagged - just updated the main text as well)? – Weatherworn 13/4, 2011 at 5:4

Ah ok, simpler if it's just linux and ext3/4 that you care about. One caveat that the linux rename(2) man page mentions is:

However, when overwriting there will probably be a window in which both oldpath and newpath refer to the file being renamed.

– Exsanguine 10/5, 2011 at 20:40

E

-4

It sounds to me like you're trying to do the job of the filesystem. If you move a file, the kernel and the filesystem are responsible for atomic operation and fault recovery, not your code.

Anyway, this article seems to address your questions regarding fsync: http://blogs.gnome.org/alexl/2009/03/16/ext4-vs-fsync-my-take/

Epicurus answered 13/4, 2011 at 5:14 Comment(1)

Read that post before, and it's one of many that brought me here. It specifically says: "In case of a system crash shortly after the write, its more likely that we get the new file than the old file (for maximum chance of this you additionally need to fsync the directory the file is in)." My question is about what happens when you're renaming across directories. – Weatherworn 15/4, 2011 at 4:7
