MPI Alltoallv or better individual Send and Recv? (Performance)
I have a number of processes (of the order of 100 to 1000), and each of them has to send some data to some (say about 10) of the other processes. (Typically, but not necessarily always, if A sends to B, B also sends to A.) Every process knows how much data it has to receive from which process.

So I could just use MPI_Alltoallv, with many or most of the message lengths zero. However, I heard that for performance reasons it would be better to use several MPI_send and MPI_recv communications rather than the global MPI_Alltoallv. What I do not understand: if a series of send and receive calls are more efficient than one Alltoallv call, why is Alltoallv not just implemented as a series of sends and receives?
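
To make this concrete, the Alltoallv version I have in mind looks roughly like this (just a sketch; the packing scheme and variable names are mine):

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch of the one-call version: the count arrays have one entry
       per rank and are mostly zero; only ~10 entries are nonzero.
       Data is assumed packed contiguously per destination rank. */
    void sparse_exchange(double *sendbuf, int *sendcounts,
                         double *recvbuf, int *recvcounts,
                         MPI_Comm comm)
    {
        int nprocs;
        MPI_Comm_size(comm, &nprocs);

        int *sdispls = malloc(nprocs * sizeof(int));
        int *rdispls = malloc(nprocs * sizeof(int));

        /* Displacements are prefix sums of the counts. */
        sdispls[0] = rdispls[0] = 0;
        for (int i = 1; i < nprocs; ++i) {
            sdispls[i] = sdispls[i - 1] + sendcounts[i - 1];
            rdispls[i] = rdispls[i - 1] + recvcounts[i - 1];
        }

        MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                      recvbuf, recvcounts, rdispls, MPI_DOUBLE, comm);

        free(sdispls);
        free(rdispls);
    }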

It would be much more convenient for me (and others?) to use just one global call. Also, with several Send and Recv calls I would have to worry about running into a deadlock (fixable by some odd-even scheduling strategy, something more elaborate, or by using buffered sends/receives?).
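
For comparison, here is a sketch of the hand-rolled version using nonblocking calls, which as far as I can tell sidesteps the deadlock question entirely (names like peers and sbufs are illustrative):

    #include <mpi.h>

    /* Nonblocking point-to-point exchange with ~10 known peers.
       Posting all receives, then all sends, and finishing with
       MPI_Waitall means no send/receive ordering can deadlock. */
    void sparse_exchange_p2p(int npeers, const int *peers,
                             const int *scounts, const int *rcounts,
                             double **sbufs, double **rbufs,
                             MPI_Comm comm)
    {
        MPI_Request reqs[2 * npeers];  /* C99 VLA; fine for ~10 peers */
        int n = 0;

        for (int i = 0; i < npeers; ++i)
            MPI_Irecv(rbufs[i], rcounts[i], MPI_DOUBLE, peers[i],
                      0, comm, &reqs[n++]);
        for (int i = 0; i < npeers; ++i)
            MPI_Isend(sbufs[i], scounts[i], MPI_DOUBLE, peers[i],
                      0, comm, &reqs[n++]);

        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
    }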

Would you agree that MPI_Alltoallv is necessarily slower than, say, the 10 MPI_Send and MPI_Recv calls; and if yes, why and by how much?

Markland answered 22/11, 2012 at 4:14 Comment(1)
The answer to your question will depend on the alltoallv implementation, any tuning parameters you give to guide the collectives, and the scale and sparsity of your communication pattern. As with so many optimization-type questions, the only way one can possibly know which is better in your particular case is to try both ways. But first I'd just get it working with the alltoallv and see if that really even is a significant bottleneck in your code. – Salo

Usually the default advice with collectives is the opposite: use a collective operation when possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.

Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern will probably not be just a series of sends and receives. For example, using a tree to broadcast a piece of data can be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better.
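
For intuition, here is an illustrative binomial-tree broadcast built from point-to-point calls. It is a sketch of the general idea only, not how any particular library implements MPI_Bcast:

    #include <mpi.h>

    /* Binomial-tree broadcast from rank 0: in round k, every rank
       below 2^k forwards the data to rank + 2^k. This takes O(log P)
       rounds instead of P-1 sequential sends from the root. */
    void tree_bcast(double *buf, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank < mask) {
                int dest = rank + mask;
                if (dest < size)
                    MPI_Send(buf, count, MPI_DOUBLE, dest, 0, comm);
            } else if (rank < 2 * mask) {
                MPI_Recv(buf, count, MPI_DOUBLE, rank - mask, 0, comm,
                         MPI_STATUS_IGNORE);
            }
        }
    }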

Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for all irregular communication scenarios at the MPI level, so it is conceivable that some custom communication code can do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they only have to send a 0-length message. I thought that such an implementation was unlikely, but here is one in the wild.

So the real answer is "it depends". If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that path, check if the MPI-3 neighbor collectives are a good fit for your problem.
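
As a sketch of what the neighbor-collective route looks like (assuming a symmetric pattern where each rank's ~10 sources and destinations coincide; variable names are my own):

    #include <mpi.h>

    /* Build a distributed-graph communicator that encodes only the
       actual peers, so the library sees the sparse pattern directly. */
    MPI_Comm make_sparse_comm(int npeers, const int *peers)
    {
        MPI_Comm graph_comm;
        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       npeers, peers, MPI_UNWEIGHTED,
                                       npeers, peers, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0 /* no reorder */,
                                       &graph_comm);
        return graph_comm;
    }

    /* Counts/displacements are indexed by neighbor (length npeers),
       not by rank, so no zero-length padding is needed. */
    void neighbor_exchange(MPI_Comm graph_comm,
                           const double *sendbuf, const int *scounts,
                           const int *sdispls, double *recvbuf,
                           const int *rcounts, const int *rdispls)
    {
        MPI_Neighbor_alltoallv(sendbuf, scounts, sdispls, MPI_DOUBLE,
                               recvbuf, rcounts, rdispls, MPI_DOUBLE,
                               graph_comm);
    }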

Hemistich answered 22/11, 2012 at 5:43 Comment(3)
Synchronising MPI_ALLTOALLV implementations are much more common than you might think. Open MPI switched its default algorithm to the synchronising pairwise implementation in 1.6.1. – Hodgkins
@HristoIliev Interesting. What are the benefits of a synchronizing Alltoallv? I'm actually working on a related project, so it would be interesting to learn more. Any pointers to additional reading? – Hemistich
I believe that in most real-life cases MPI_ALLTOALLV is used as a replacement for MPI_ALLTOALL in cases where the number of processes does not divide the problem size. Then you don't deal with empty messages, and correctly scheduled synchronous communication can make for better utilisation of the network equipment (e.g. on fat-tree IB networks), especially when the number of processes is huge. – Hodgkins