Atomicity of `write(2)` to a local filesystem
Apparently POSIX states that

Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply. -- POSIX.1-2008

and

If two threads each call [the write() function], each call shall either see all of the specified effects of the other call, or none of them. -- POSIX.1-2008

My understanding of this is that when the first process issues a write(handle, data1, size1) and the second process issues write(handle, data2, size2), the writes can occur in any order, but data1 and data2 must each appear intact and contiguous.

But running the following code gives me unexpected results.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/wait.h>
static void die(const char *s)
{
  perror(s);
  abort();
}

int main(void)
{
  unsigned char buffer[3];
  char *filename = "/tmp/atomic-write.log";
  int fd, i, j;
  pid_t pid;
  unlink(filename);
  /* XXX Adding O_APPEND to the flags cures it. Why? */
  fd = open(filename, O_CREAT|O_WRONLY/*|O_APPEND*/, 0644);
  if (fd < 0)
    die("open failed");
  for (i = 0; i < 10; i++) {
    pid = fork();
    if (pid < 0)
      die("fork failed");
    else if (! pid) {
      j = 3 + i % (sizeof(buffer) - 2);
      memset(buffer, i % 26 + 'A', sizeof(buffer));
      buffer[0] = '-';
      buffer[j - 1] = '\n';
      for (i = 0; i < 1000; i++)
        if (write(fd, buffer, j) != j)
          die("write failed");
      exit(0);
    }
  }
  while (wait(NULL) != -1)
    /* NOOP */;
  exit(0);
}

I tried running this on Linux and Mac OS X 10.7.4 and using grep -a '^[^-]\|^..*-' /tmp/atomic-write.log shows that some writes are not contiguous or overlap (Linux) or plain corrupted (Mac OS X).

Adding the flag O_APPEND in the open(2) call fixes this problem. Nice, but I do not understand why. POSIX says

O_APPEND If set, the file offset shall be set to the end of the file prior to each write.

but this is not the problem here. My sample program never calls lseek(2), and the processes share the same open file description and thus the same file offset.

I have already read similar questions on Stack Overflow, but they still do not fully answer my question.

Atomic write on file from two process does not specifically address the case where the processes share the same file description (as opposed to the same file).

How does one programmatically determine if “write” system call is atomic on a particular file? says that

The write call as defined in POSIX has no atomicity guarantee at all.

But as cited above it does have some. And what’s more, O_APPEND seems to trigger this atomicity guarantee although it seems to me that this guarantee should be present even without O_APPEND.

Can you explain this behaviour further?

Golanka answered 18/5, 2012 at 10:23 Comment(8)
Does OSX claim POSIX08 conformance? I don't think so. (I believe they claim '03 compliance only.)Soraya
Good point, according to images.apple.com/macosx/docs/OSX_for_UNIX_Users_TB_July2011.pdf it is “Open Brand UNIX 03”. I’ll have to check out what that means.Golanka
A lot of people will answer based on pre-'08 rules, where write was only atomic on pipes and even then only under certain conditions. A lot of platforms still don't support the '08 semantics. And a lot of platforms that claim to, still have one or more filesystems that don't.Soraya
OSX's claims of "POSIX conformance" are all lies. What they have is certification (which is basically a matter of paying a lot of money and passing some simplistic tests that don't catch anything but the most obvious cases of non-conformance), which does not guarantee, and could not possibly guarantee, conformance to the specification; the only thing that could do the latter is a formal proof, which for such a large system would be essentially impossible.Glendoraglendower
With that said, the Open Group and other standards bodies that issue conformance certifications really should adopt revocation procedures, whereby if an implementation that has been certified can be demonstrated not to conform to the specification, and refuses to remedy the situation for some extended period (say 6 months or 1 year), the certification automatically gets revoked.Glendoraglendower
You state unsigned char buffer[3] but then use j = 3 + i (because sizeof(buffer) - 2 == 3 - 2 == 1 and the % therefore being meaningless), and then do buf[j-1] = ... and write(..., buffer, j) - you corrupt parts of the stack and then write that out. The result of that is not well-specified, and the only reason your app doesn't crash is because it never returns from main() but calls exit() instead.Lugansk
No, it makes j = 3: x % 1 == 0 whatever x is. I did this to experiment with different sizes of buffer.Golanka
Since I just nominated this question as a duplicate of a different question, I feel obliged to note that the quote from Posix applies to the application and not to the operating system. The sentence immediately preceding the quote says "the application shall ensure that the actions below are performed..."Haemophiliac
Answer (score 17)

man 2 write on my system sums it up nicely:

Note that not all file systems are POSIX conforming.

Here is a quote from a recent discussion on the ext4 mailing list:

Currently concurrent reads/writes are atomic only wrt individual pages, however are not on the system call. This may cause read() to return data mixed from several different writes, which I do not think it is good approach. We might argue that application doing this is broken, but actually this is something we can easily do on filesystem level without significant performance issues, so we can be consistent. Also POSIX mentions this as well and XFS filesystem already has this feature.

This is a clear indication that ext4 -- to name just one modern filesystem -- doesn't conform to POSIX.1-2008 in this respect.

Botulinus answered 18/5, 2012 at 10:40 Comment(1)
Although it saddens me much, the more I dig into this, the more you seem to be right.Golanka
Answer (score 16)

Edit: Updated Aug 2017 with latest changes in OS behaviours.

Firstly, O_APPEND or the equivalent FILE_APPEND_DATA on Windows means that increments of the maximum file extent (file "length") are atomic under concurrent writers. This is guaranteed by POSIX, and Linux, FreeBSD, OS X and Windows all implement it correctly. Samba also implements it correctly, NFS before v5 does not as it lacks the wire format capability to append atomically. So if you open your file with append-only, concurrent writes will not tear with respect to one another on any major OS unless NFS is involved.

This says nothing about whether reads will ever see a torn write though, and on that POSIX says the following about atomicity of read() and write() to regular files:

All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links ... [many functions] ... read() ... write() ... If two threads each call one of these functions, each call shall either see all of the specified effects of the other call, or none of them. [Source]

and

Writes can be serialized with respect to other reads and writes. If a read() of file data can be proven (by any means) to occur after a write() of the data, it must reflect that write(), even if the calls are made by different processes. [Source]

but conversely:

This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control. [Source]

A safe interpretation of all three of these requirements would suggest that all writes overlapping an extent in the same file must be serialised with respect to one another and to reads such that torn writes never appear to readers.

A less safe, but still allowed interpretation could be that reads and writes only serialise with each other between threads inside the same process, and between processes writes are serialised with respect to reads only (i.e. there is sequentially consistent i/o ordering between threads in a process, but between processes i/o is only acquire-release).

So how do popular OS and filesystems perform on this? As the author of the proposed Boost.AFIO, an asynchronous filesystem and file i/o C++ library, I decided to write an empirical tester. The results follow for many threads in a single process.


No O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = 1 byte until and including 10.0.10240, from 10.0.14393 at least 1Mb, probably infinite as per the POSIX spec.

Linux 4.2.6 with ext4: update atomicity = 1 byte

FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.

O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = until and including 10.0.10240 up to 4096 bytes only if page aligned, otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in. Since 10.0.14393, at least 1Mb, probably infinite as per the POSIX spec.

Linux 4.2.6 with ext4: update atomicity = at least 1Mb, probably infinite as per the POSIX spec. Note that earlier Linuxes with ext4 definitely did not exceed 4096 bytes, XFS certainly used to have custom locking but it looks like recent Linux has finally fixed this problem in ext4.

FreeBSD 10.2 with ZFS: update atomicity = at least 1Mb, probably infinite as per the POSIX spec.


So in summary, FreeBSD with ZFS and very recent Windows with NTFS are POSIX conforming. Very recent Linux with ext4 is POSIX conforming only with O_DIRECT.

You can see the raw empirical test results at https://github.com/ned14/afio/tree/master/programs/fs-probe. Note we test for torn offsets only on 512 byte multiples, so I cannot say if a partial update of a 512 byte sector would tear during the read-modify-write cycle.

Krell answered 7/2, 2016 at 20:20 Comment(3)
Did you mean to say 512 bytes if FILE_FLAG_WRITE_THROUGH is on? If not, why would it make any sense for that flag to make things worse?Charismatic
The reason why write through makes atomicity of update smaller is likely because Microsoft have implemented a fast DMA path and a slow non-DMA path for updates, and when write through is on it uses the slow path which uses no DMA at all and issues i/o probably using register polling. Given how amazingly slow fsync is, and how that will dominate all other code in terms of cost, Microsoft felt no need to make the write through code path any faster. In the end only Microsoft know for sure, my empirical tester simply reveals atomicity, not the causes of why nor why not.Krell
It looks like after this answer was written the POSIX spec for write(2) changed the wording to "This volume of POSIX.1-2017 does not specify the behavior of concurrent writes to a regular file from multiple threads, except that each write is atomic (see Thread Interactions with Regular File Operations). Applications should use some form of concurrency control.". This does not change the conclusion of the empirical results though.Bashful
Answer (score 8)

Some misinterpretation of what the standard mandates here comes from the use of processes vs. threads, and what that means for the "handle" situation you're talking about. In particular, you missed this part:

Handles can be created or destroyed by explicit user action, without affecting the underlying open file description. Some of the ways to create them include fcntl(), dup(), fdopen(), fileno(), and fork(). They can be destroyed by at least fclose(), close(), and the exec functions. [ ... ] Note that after a fork(), two handles exist where one existed before.

from the POSIX spec section you quote above. The reference to "create [ handles using ] fork" isn't elaborated on further in this section, but the spec for fork() adds a little detail:

The child process shall have its own copy of the parent's file descriptors. Each of the child's file descriptors shall refer to the same open file description with the corresponding file descriptor of the parent.

The relevant bits here are:

  • the child has copies of the parent's file descriptors
  • the child's copies refer to the same "thing" that the parent can access via said fds
  • file descriptors and file descriptions are not the same thing; in particular, a file descriptor is a handle in the above sense.

This is what the first quote refers to when it says "fork() creates [ ... ] handles" - they're created as copies, and therefore, from that point on, detached, and no longer updated in lockstep.

In your example program, every child process gets its very own copy which starts in the same state, but after the act of copying, these file descriptors / handles have become independent instances, and therefore the writes race with each other. This is perfectly acceptable regarding the standard, because write() only guarantees:

On a regular file or other file capable of seeking, the actual writing of data shall proceed from the position in the file indicated by the file offset associated with fildes. Before successful return from write(), the file offset shall be incremented by the number of bytes actually written.

This means that while they all start the write at the same offset (because the fd copy was initialized as such), they might, even if successful, all write different amounts: there is no guarantee by the standard that a write request of N bytes will write exactly N bytes; it can succeed for anything 0 <= actual <= N. Because the ordering of the writes is also unspecified, the whole example program above has unspecified results. Even if the total requested amount is written, all the standard above says is that the file offset is incremented; it does not say the increment is atomic (once only), nor does it say that the actual writing of data will happen in an atomic fashion.

One thing is guaranteed though: you should never see anything in the file that was neither there before any of the writes nor came from the data written by one of the writes. If you do, that would be corruption, and a bug in the filesystem implementation. What you have observed above might well be that, if the final results cannot be explained by re-ordering of parts of the writes.

The use of O_APPEND fixes this, because using that, again - see write(), does:

If the O_APPEND flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.

which is the "prior to" / "no intervening" serializing behaviour that you seek.

The use of threads would change the behaviour partially, because threads, on creation, do not receive copies of the file descriptors / handles but operate on the actual (shared) ones. Threads would not (necessarily) all start writing at the same offset. But the possibility of partial-write success still means that you may see interleaving in ways you might not want to see. Yet it would possibly still be fully standards-conformant.

Moral: Do not count on a POSIX/UNIX standard being restrictive by default. The specifications are deliberately relaxed in the common case, and require you as the programmer to be explicit about your intent.

Lugansk answered 21/5, 2012 at 11:1 Comment(0)
Answer (score 6)

You're misinterpreting the first part of the spec you cited:

Either a file descriptor or a stream is called a "handle" on the open file description to which it refers; an open file description may have several handles. […] All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle. […] The handles need not be in the same process for these rules to apply.

This does not place any requirements on the implementation to handle concurrent access. Instead, it places requirements on an application not to make concurrent access, even from different processes, if you want well-defined ordering of the output and side effects.

The only time atomicity is guaranteed is for pipes when the write size fits in PIPE_BUF.

By the way, even if the call to write were atomic for ordinary files, write can always return having written fewer than the requested number of bytes (except for writes to pipes that fit in PIPE_BUF). That smaller-than-requested write would itself be atomic, but it would not help at all with atomicity of the entire operation: your application would have to call write again to finish.

Glendoraglendower answered 18/5, 2012 at 13:2 Comment(8)
The case for short writes is handled in the provided sample. I am now rereading the document with your interpretation in mind… I’ll see what it gives.Golanka
“All activity by the application affecting the file offset on the first handle shall be suspended until it again becomes the active file handle”. How can the requirement be placed on the application? If it suspends all activity affecting the file offset then, in the case where one of the handle is a stream, it can never do the [lf]seek() required to “again become the active file handle”.Golanka
All that says is that you're not allowed to call seek functions on an inactive handle. It's the active handle that you might have to call a seek function on in order to switch to a new handle, and this is perfectly legal.Glendoraglendower
“you're not allowed to call seek functions on an inactive handle” is in contradiction with “For a handle to become the active handle […] the application shall perform an lseek() or fseek() [on it]”Golanka
I suppose I should have said "when another handle is active".Glendoraglendower
@R.. Isn't that whole section about interaction of file descriptors and standard I/O streams? And isn't the original question strictly about file descriptors?Soraya
Presumably they apply to multiple file descriptors referring to the same open file description as well, though the rules seem unnecessarily ugly and ridiculous for that case. Actually I think you're right however that they don't/can't apply to concurrent access to the same file descriptor by different threads.Glendoraglendower
@R..: I don't see why they would apply to any case that involved no standard I/O streams.Soraya