A paper by Donzis & Aditya suggests that it is possible to use a finite difference scheme that tolerates a delay in the stencil. What does this mean? An FD scheme used to solve the heat equation (or some simplification of it) might read
u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])
meaning that the value at the next time step depends on the value at the same position and at its neighbours at the previous time step.
This problem can easily be parallelized by splitting the (in our case 1D) domain across the different processors. However, we need communication when computing the boundary nodes of a processor, since the element u[t,i±1] is only available on another processor.
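Concretely, I split the domain as follows (just a sketch of the per-rank data layout I am assuming; N and the MPI_PROC_NULL handling at the ends are placeholders, while NpP, LEFT and RIGHT reappear in the code further down):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, N = 1000;                 /* N: global number of grid points (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int NpP   = N / size;                     /* number of points owned by this rank */
    int LEFT  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int RIGHT = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    double *u = malloc((NpP + 2) * sizeof(double));
    /* u[0]      : ghost cell for the value coming from rank LEFT
       u[1..NpP] : points owned by this rank
       u[NpP+1]  : ghost cell for the value coming from rank RIGHT */

    /* ... time loop: halo exchange, then the FD update ... */

    free(u);
    MPI_Finalize();
    return 0;
}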
The problem is illustrated in the following graphic, which is taken from the cited paper.
An MPI implementation might use MPI_Send and MPI_Recv for synchronous computation.
Since the computation itself is fairly cheap, it is the communication that might become a bottleneck.
A solution to the problem is given in the paper: instead of synchronizing, just take whatever boundary value is available, even though it might stem from an earlier time step. The method then still converges (under some assumptions).
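If I understand the paper correctly, at a subdomain boundary the update then effectively uses a lagged neighbour value, something like
u[t+1,i] = u[t,i] + c (u[t-k,i-1] - u[t,i+1])
where k >= 0 is the delay of the value coming from the neighbouring processor; k = 0 recovers the synchronous scheme.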
For my work, I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv is working correctly. I extended the memory by two elements that serve as ghost cells for the neighbouring values and exchange the needed values via send and receive. The code below is basically the implementation of the figure above and is performed during each time step prior to the computation.
MPI_Send(&u[NpP],1,MPI_DOUBLE,RIGHT,rank,MPI_COMM_WORLD);                        /* send last owned point to the right neighbour */
MPI_Recv(&u[0],1,MPI_DOUBLE,LEFT,LEFT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);         /* receive left ghost cell */
MPI_Send(&u[1],1,MPI_DOUBLE,LEFT,rank,MPI_COMM_WORLD);                           /* send first owned point to the left neighbour */
MPI_Recv(&u[NpP+1],1,MPI_DOUBLE,RIGHT,RIGHT,MPI_COMM_WORLD,MPI_STATUS_IGNORE);   /* receive right ghost cell */
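(Side note: I believe the same exchange could also be written with MPI_Sendrecv, which avoids relying on MPI_Send buffering the small messages; just a sketch, assuming LEFT and RIGHT are MPI_PROC_NULL at the domain ends:)

MPI_Sendrecv(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0,   /* send last owned point to the right */
             &u[0],   1, MPI_DOUBLE, LEFT,  0,   /* receive left ghost cell            */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, LEFT,  1, /* send first owned point to the left */
             &u[NpP+1], 1, MPI_DOUBLE, RIGHT, 1, /* receive right ghost cell           */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);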
Now, I'm by no means an MPI expert. I figured out that MPI_Put might be what I need for the asynchronous case and, after reading a bit, I came up with the following implementation.
Before the time loop:
MPI_Win win;
double *boundary;
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);   /* boundary[0]: value from LEFT, boundary[1]: value from RIGHT */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"no_locks","true");                          /* asserts that MPI_Win_lock will not be used on this window */
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);
Inside the time loop:
MPI_Put(&u[1],1,MPI_DOUBLE,LEFT,1,1,MPI_DOUBLE,win);      /* write first owned point into boundary[1] of the left neighbour */
MPI_Put(&u[NpP],1,MPI_DOUBLE,RIGHT,0,1,MPI_DOUBLE,win);   /* write last owned point into boundary[0] of the right neighbour */
MPI_Win_fence(0,win);                                     /* completes the puts, but synchronizes all ranks */
u[0] = boundary[0];                                       /* left ghost cell, put there by the left neighbour */
u[NpP+1] = boundary[1];                                   /* right ghost cell, put there by the right neighbour */
which puts the needed elements into the two-element window array boundary on the neighbouring processors and then reads the values for u[0] and u[NpP+1] from the local boundary array.
This implementation works and I get the same result as with MPI_Send/Recv. However, this isn't really asynchronous, since I'm still using MPI_Win_fence, which, as far as I understand, ensures synchronization.
The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and keep their initial values. My understanding was that without MPI_Win_fence you would simply take whatever value is currently available inside boundary, which might (or might not) have been updated by a neighbouring processor.
Does anybody have an idea how to avoid the use of MPI_Win_fence while also solving the problem that the values inside boundary are never updated?
I'm also not sure whether the code I provided is enough to understand my problem or to give any hints. If it isn't, feel free to ask, and I will try to add all the parts that are missing.
Comments:

Could it be that (without MPI_WIN_FENCE) the put command only puts into a buffer? This would explain why, without the use of MPI_WIN_FENCE, the initial values inside boundary are never changed. – Formal

I tried something like the following:
MPI_Win_lock(MPI_LOCK_SHARED,LEFT,0,win);
MPI_Put(&u[1],1,MPI_DOUBLE,LEFT,1,1,MPI_DOUBLE,win);
MPI_Win_unlock(LEFT,win);
The error is shown at imgur.com/oJPrqDg. Any ideas? – Formal

[…] MPI_Put as long as you are okay with no more than byte-level atomicity guarantees. – Swain

[…] MPI_WIN_UNIFIED, which provides a single view of window memory. – Swain
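Update: for reference, this is roughly what the lock-based attempt from my comment looks like in context. It is only a sketch, and it assumes the window is created without the "no_locks" info key, since, as far as I understand, MPI_Win_lock must not be called on a window for which no_locks was asserted (which might also explain the error in the screenshot):

/* Passive-target version (sketch): requires dropping "no_locks" from the
   window creation above. */
if (LEFT != MPI_PROC_NULL) {
    MPI_Win_lock(MPI_LOCK_SHARED, LEFT, 0, win);
    MPI_Put(&u[1], 1, MPI_DOUBLE, LEFT, 1, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(LEFT, win);                /* the put is complete after the unlock */
}
if (RIGHT != MPI_PROC_NULL) {
    MPI_Win_lock(MPI_LOCK_SHARED, RIGHT, 0, win);
    MPI_Put(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(RIGHT, win);
}

/* Read whatever the neighbours have put so far, possibly a value from an
   earlier time step. Note: concurrent puts and local reads of the window
   memory are only well-defined in the unified memory model (MPI_WIN_UNIFIED)
   mentioned in the comments, and only with byte-level atomicity. */
MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
u[0]     = boundary[0];
u[NpP+1] = boundary[1];
MPI_Win_unlock(rank, win);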