I have a user-level program which opens a file using the flags `O_WRONLY|O_SYNC`. The program creates 256 threads, each of which attempts to write 256 or more bytes of data to the file. I want a total of 1,280,000 requests, which comes to about 300 MB of data. The program ends once all 1,280,000 requests have completed.
I use `pthread_spin_trylock()` to increment a variable that keeps track of the number of completed requests. To ensure that each thread writes to a unique offset, I use `pwrite()` and calculate the offset as a function of the number of requests already written. Hence, I don't use any mutex around the actual write to the file (does this approach ensure data integrity?).
When I compare the average time for which the `pwrite()` call was blocked against the corresponding numbers reported by `blktrace` (i.e., the average Q2C time, which measures the complete life cycle of a BIO), I find a significant difference: the average completion time for a given BIO is much greater than the average latency of a `pwrite()` call. What is the reason for this discrepancy? Shouldn't these numbers be similar, since `O_SYNC` ensures that the data is actually written to the physical medium before returning?
Update: the average time for which the `pwrite()` call was blocked is 1.5 milliseconds. These numbers are closer than in my earlier experiment; however, this still isn't what I want. How can I make the call to `pwrite()` completely synchronous, i.e., have the call return only after the entire buffer of data has been written to the disk? Disk benchmarking tools must have a workaround? – Wallachia