How does mpi_file_write differ from mpi_file_write_all?
That's pretty much the question. I mean, I know mpi_file_write_all is the "collective" version, but I figure mpi_file_write is going to be called by several processes all at once anyway, so what is the actual difference in their operation? Thanks.

Deibel asked 27/7, 2016 at 14:28

Functionally, there is little difference in most practical situations. If your IO works correctly with mpi_file_write_all(), then it should work correctly with mpi_file_write(), unless you're doing something very complicated. The converse isn't strictly true, but in most real situations I've seen, where all processes are doing simple, regular IO patterns at the same time, mpi_file_write_all() works if mpi_file_write() does.

The point is that if you call mpi_file_write(), the IO library has to process that request there and then, as it cannot assume that other processes are also performing IO. In anything but the simplest parallel decompositions, the data from a single process will not form a single contiguous chunk of the file. As a result, each process performs a large number of small IO transactions (write, seek, write, seek, ...), which is very inefficient on a parallel file system. Worse, the library probably locks the file while each process is doing IO to stop the others interfering with it, so IO can become effectively serialised across processes.
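For concreteness, here is a minimal sketch of the independent approach (not code from anywhere in particular; the file name, block sizes and interleaved layout are all made up for illustration). Each rank owns every size-th block of the file, so its data is scattered through the file and every block becomes its own small transaction:

```c
/* Independent IO sketch: each rank writes NBLOCKS blocks of BLOCKLEN
 * doubles, interleaved with the other ranks' blocks in a shared file.
 * Every block is a separate seek+write, made explicit here with
 * MPI_File_write_at(). All sizes are illustrative. */
#include <mpi.h>

#define NBLOCKS  1024   /* blocks per rank (illustrative) */
#define BLOCKLEN 8      /* doubles per block (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double buf[NBLOCKS * BLOCKLEN];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++)
        buf[i] = (double)rank;              /* dummy payload */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* One small transaction per block: the library sees each request in
     * isolation and cannot merge or coordinate them across ranks. */
    for (int b = 0; b < NBLOCKS; b++) {
        MPI_Offset off = (MPI_Offset)(b * size + rank)
                         * BLOCKLEN * sizeof(double);
        MPI_File_write_at(fh, off, &buf[b * BLOCKLEN],
                          BLOCKLEN, MPI_DOUBLE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```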

With write_all(), the IO library has a global view and knows what every process is doing. First, this lets it reorganise the data so that each process writes a single large chunk to the file (the "two-phase" or collective-buffering optimisation found in implementations such as ROMIO). Second, as it is in control of all the processes, it can avoid the need to lock the file because it can ensure that writes don't conflict.
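Here is the same interleaved pattern recast collectively (again just a sketch, reusing the made-up sizes from above): a derived datatype describes this rank's slice of the file, the view is set once, and a single write_all() hands the whole request to the library to optimise:

```c
/* Collective IO sketch: the same interleaved layout as above, but
 * described once as a file view (MPI_Type_vector) and written with a
 * single collective call that the library can optimise globally. */
#include <mpi.h>

#define NBLOCKS  1024
#define BLOCKLEN 8

int main(int argc, char **argv)
{
    int rank, size;
    double buf[NBLOCKS * BLOCKLEN];
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < NBLOCKS * BLOCKLEN; i++)
        buf[i] = (double)rank;

    /* NBLOCKS blocks of BLOCKLEN doubles, strided so that the other
     * ranks' blocks are skipped: this rank's slice of the file. */
    MPI_Type_vector(NBLOCKS, BLOCKLEN, size * BLOCKLEN,
                    MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's view starts at its own first block. */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCKLEN * sizeof(double),
                      MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* One collective call: the library knows everyone's pattern and can
     * aggregate it into a few large contiguous writes. */
    MPI_File_write_all(fh, buf, NBLOCKS * BLOCKLEN, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```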

For simple regular patterns, e.g. a large 3D array distributed across a 3D grid of processes, I've seen massive differences between the collective and non-collective approaches on a Cray with a Lustre filesystem. The difference can be gigabytes/second vs tens of megabytes/second.
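For that 3D case, the usual recipe is MPI_Type_create_subarray() to describe each rank's block of the global array, then one collective write. A sketch follows (global sizes are made up, and it assumes each dimension divides evenly across the process grid):

```c
/* 3D decomposition sketch: a global NX x NY x NZ array of doubles on a
 * Cartesian process grid, written to one shared file collectively.
 * Assumes each global dimension divides evenly across the grid. */
#include <mpi.h>
#include <stdlib.h>

#define NX 256   /* illustrative global sizes */
#define NY 256
#define NZ 256

int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0}, coords[3];
    MPI_Comm cart;
    MPI_Datatype filetype;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 3, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 3, coords);

    int gsizes[3] = {NX, NY, NZ};
    int lsizes[3] = {NX / dims[0], NY / dims[1], NZ / dims[2]};
    int starts[3] = {coords[0] * lsizes[0],
                     coords[1] * lsizes[1],
                     coords[2] * lsizes[2]};

    /* This rank's block of the global array, as a slice of the file. */
    MPI_Type_create_subarray(3, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    int nlocal = lsizes[0] * lsizes[1] * lsizes[2];
    double *buf = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++)
        buf[i] = (double)rank;              /* dummy payload */

    MPI_File_open(cart, "grid.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, buf, nlocal, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}
```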

PS I'm assuming here that the pattern is lots of processes writing data to a single shared file. For reading there should also be an improvement (a small number of large contiguous reads), though perhaps not as dramatic, since file locking isn't needed for reads.

Javier answered 27/7, 2016 at 17:9 Comment(3)
Yes, the pattern is lots of processes writing to a shared file. Thanks for the explanation. I am definitely seeing big gains in performance with write_all. Nice to understand why. Is this documented publicly anywhere? Couldn't find much of anything. – Deibel
My explanation is largely based on doing some simple benchmarks (see "Performance of Parallel IO on ARCHER" at archer.ac.uk/documentation/white-papers) then talking to local Cray staff to try and understand what was going on. There are a bunch of useful links at the bottom of this page: rc.colorado.edu/support/examples-and-tutorials/… – Javier
Thanks for the links! – Deibel
