Asynchronous Finite Difference Scheme using MPI_Put

A paper by Donzis & Aditya suggests that it is possible to use a finite difference scheme that might have a delay in the stencil. What does this mean? An FD scheme might be used to solve the heat equation (or some simplification of it) and reads

u[t+1,i] = u[t,i] + c (u[t,i-1]-u[t,i+1])

meaning that the value at the next time step depends on the values at the same position and at its neighbours at the previous time step.
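
Written out as a plain serial loop, one time step of this scheme looks roughly like the sketch below (just an illustrative sketch; the function and variable names are arbitrary and not from the paper):

/* one explicit time step of the simplified scheme on a grid of N points;
   u holds the current values, u_new receives the values for step t+1    */
void step_serial(const double *u, double *u_new, int N, double c)
{
    for (int i = 1; i < N - 1; i++)
        u_new[i] = u[i] + c * (u[i-1] - u[i+1]);
}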

This problem can easily be parallelized by splitting the (in our case 1D) domain across the different processors. However, we need communication when computing the boundary nodes on a processor, since the element u[t,i±1] is only available on another processor.

The problem is illustrated by a figure taken from the cited paper (not reproduced here).

An MPI implementation might use MPI_Send and MPI_Recv for the synchronous computation. Since the computation itself is fairly easy, it is the communication that might become a bottleneck.

A solution to the problem is given in the paper:

Instead of a synchronous process, just take the boundary node value that is available, even though it might be the value from an earlier time step. The method then still converges (under some assumptions).
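
In code, I understand the idea roughly as follows (my own sketch, not the paper's algorithm; the names are arbitrary): the ghost values are used as they are, whether or not a neighbour has already updated them for the current step.

/* local update on one rank with n_local interior points; ghost_left and
   ghost_right are used as-is and may still belong to an earlier time step */
void step_async(const double *u, double *u_new, int n_local, double c,
                double ghost_left, double ghost_right)
{
    u_new[1]       = u[1]       + c * (ghost_left   - u[2]);         /* left boundary node  */
    u_new[n_local] = u[n_local] + c * (u[n_local-1] - ghost_right);  /* right boundary node */
    for (int i = 2; i < n_local; i++)                                /* interior nodes      */
        u_new[i] = u[i] + c * (u[i-1] - u[i+1]);
}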

For my work, I would like to implement the asynchronous MPI case (which is not part of the paper). The synchronous part using MPI_Send and MPI_Recv is working correctly. I extended the memory by two elements as ghost cells for the neighbouring values and send/receive the needed values. The code below is basically the implementation of the figure referenced above and is executed during each time step prior to the computation.

/* pass data rightwards: send my rightmost interior value, receive my left ghost cell */
MPI_Send(&u[NpP], 1, MPI_DOUBLE, RIGHT, rank, MPI_COMM_WORLD);
MPI_Recv(&u[0],   1, MPI_DOUBLE, LEFT,  LEFT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* pass data leftwards: send my leftmost interior value, receive my right ghost cell */
MPI_Send(&u[1],     1, MPI_DOUBLE, LEFT,  rank,  MPI_COMM_WORLD);
MPI_Recv(&u[NpP+1], 1, MPI_DOUBLE, RIGHT, RIGHT, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
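
As a side note, the same exchange could also be written with MPI_Sendrecv, which pairs each send with the matching receive and so cannot deadlock on message ordering. A sketch using the same variables as above (the tags are arbitrary):

/* pass data rightwards: send my rightmost interior value, receive my left ghost cell */
MPI_Sendrecv(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0,
             &u[0],   1, MPI_DOUBLE, LEFT,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

/* pass data leftwards: send my leftmost interior value, receive my right ghost cell */
MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, LEFT,  1,
             &u[NpP+1], 1, MPI_DOUBLE, RIGHT, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);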

Now, I'm by no means an MPI expert. I figured out that MPI_Put might be what I need for the asynchronous case, and after reading a little bit I came up with the following implementation.

Before the time loop:

MPI_Win win;
double *boundary;
/* two-element buffer that will receive the ghost-cell values from the neighbours */
MPI_Alloc_mem(sizeof(double) * 2, MPI_INFO_NULL, &boundary);
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "no_locks", "true");  /* asserts that passive-target locks will never be used on this window */
MPI_Win_create(boundary, 2*sizeof(double), sizeof(double), info, MPI_COMM_WORLD, &win);

Inside the time loop:

/* write my edge values into the ghost-cell windows of the neighbours */
MPI_Put(&u[1],   1, MPI_DOUBLE, LEFT,  1, 1, MPI_DOUBLE, win);
MPI_Put(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0, 1, MPI_DOUBLE, win);
MPI_Win_fence(0, win);     /* closes the access epoch and synchronizes all ranks */
u[0]     = boundary[0];    /* left ghost cell, written by LEFT   */
u[NpP+1] = boundary[1];    /* right ghost cell, written by RIGHT */

which puts the needed elements into the window (the two-element array boundary) on the neighbouring processors and takes the values u[0] and u[NpP+1] from the boundary array itself. This implementation works and I get the same result as with MPI_Send/Recv. However, this isn't really asynchronous, since I'm still using MPI_Win_fence, which, as far as I understand, ensures synchronization.

The problem is: if I take out the MPI_Win_fence, the values inside boundary are never updated and stay at their initial values. My understanding was that without MPI_Win_fence you would take whatever value happens to be inside boundary, which might (or might not) have been updated by a neighbouring processor.

Does anybody have an idea how to avoid the use of MPI_Win_fence while also solving the problem that the values inside boundary are never updated?

I'm also not sure whether the code I provided is enough to understand my problem or to give any hints. If it is not, feel free to ask, and I will try to add the missing parts.

Formal answered 10/10, 2014 at 12:20 Comment(15)
+1 - This is nice work. I wish I was still working in this area and could comment.Kohinoor
The theory itself is really easy to understand, and I've produced some convergence results with Matlab by "simulating" processors myself; however, I thought that a performance analysis might be of interest.Formal
Yes, you're partitioning the problem by region and using MPI to synch up the common boundary points at the end of each time step. You parallelize the computations needed at each time point for the spatial points.Kohinoor
But that is what I want to avoid. I would like to take any available value, which might or might not have been updated (depending on which processor is faster). Assume that processor 1 is twice as fast as processor 2 (for whatever reason); then it is okay for processor 1 to compute two time levels with a boundary value that has not been updated yet.Formal
A lot of iterative methods have nice properties like this, where they don't have to be as rigidly synchronous - a pretty big deal at scale. In this case, you don't want to use MPI_Win_fence synchronization - it'll defeat your purpose, as it's collectively blocking. You'll want to use start/complete, or better, lock/unlock.Peculate
The portable RMA in MPI is a nice idea (though some would rather argue that it is a terrible one) but the implementation in most widespread MPI libraries is really crap and varies greatly in performance. The standard is partially to blame for the "double view" window abstraction and the semantics of the operations that bring both views in sync.Silas
@HristoIliev, I found a quote from you, taken from another question about asynchronous MPI: "An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner." Does this mean that without any of these calls (e.g. MPI_WIN_FENCE) the put command only puts into a buffer? This would explain why, without MPI_WIN_FENCE, the initial values inside boundary are never changed.Formal
Yes, that twisted language basically means that all operations might get queued for execution at any side of the communication until a proper synchronisation call is made. And most MPI implementers read it as "the operations SHOULD get queued until ..." since properly implementing asynchronous RMA even on RDMA hardware like InfiniBand is extremely complex.Silas
@HristoIliev is right about performance widely varying. That may or may not be an issue if you can find an implementation that works for you. But another approach would be to sidestep MPI - for which this stuff is a bolted-on afterthought with implementations of varying quality, when it comes right down to it - and use something like GASnet or the things built atop it (Open SHMEM, Co-array Fortran, UPC).Peculate
@JonathanDursi, at ISC this year someone said that the PGAS concept is great but nobody wants it since it only seems to perform on Cray. Do you by any chance happen to operate a Cray machine? :)Silas
@HristoIliev - Well it depends on what you want to perform, right? GASnet directly over IB seems to be reliably significantly faster than one-sided MPI (at least old MPI-2 implementations of same); whereas if what you are doing is well matched to blocking sendrecvs or whatever, the latency of MPI implementations will generally beat gasnet stuff hands down, packet-for-packet. They're tuned for different use cases.Peculate
Thank you very much everybody for your helpful answers. Since I have the MPI code, I tried to use lock/unlock as @JonathanDursi suggested. However, I have not implemented it correctly, as I get a runtime error. What I use is MPI_Win_lock(MPI_LOCK_SHARED,LEFT,0,win); MPI_Put(&u[1],1,MPI_DOUBLE,LEFT,1,1,MPI_DOUBLE,win); MPI_Win_unlock(LEFT,win); The error is shown at imgur.com/oJPrqDg Any ideas?Formal
@sonystarmap, you can't use mpi_lock_shared for a put; it has to be exclusive. Multiple tasks shouldn't be using the window if it's being updated.Peculate
@JonathanDursi You can use shared locks with MPI_Put as long as you are okay with no more than byte-level atomicity guarantees.Swain
@HristoIliev Your comments about bad implementations and MPI RMA semantics are out-dated. MPI-3, which was introduced approximately two years before you commented, has MPI_WIN_UNIFIED, which provides a single view of window memory.Swain
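
Putting the suggestions from this comment thread together, a lock/unlock (passive-target) version of the original exchange might look roughly like the sketch below; this is an adaptation of mine, not code from the question or from the answers. Note that the "no_locks" info key set at window creation asserts that MPI_Win_lock will never be called on that window, so it would have to be dropped, which by itself could explain the runtime error mentioned above.

/* sketch: per-target passive-target epochs instead of MPI_Win_fence;
   requires the window to be created WITHOUT the "no_locks" info key  */
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, LEFT, 0, win);
MPI_Put(&u[1], 1, MPI_DOUBLE, LEFT, 1, 1, MPI_DOUBLE, win);
MPI_Win_unlock(LEFT, win);                 /* put is complete at LEFT after this */

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, RIGHT, 0, win);
MPI_Put(&u[NpP], 1, MPI_DOUBLE, RIGHT, 0, 1, MPI_DOUBLE, win);
MPI_Win_unlock(RIGHT, win);

/* lock my own window to read whatever the neighbours have written so far */
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win);
u[0]     = boundary[0];
u[NpP+1] = boundary[1];
MPI_Win_unlock(rank, win);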

The following seems to work for me, in the sense of correct execution - a small 1D heat equation taken from one of our tutorials, using the code below for the RMA stuff:

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
MPI_Win_unlock( left, rightwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
MPI_Win_unlock( right, leftwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
temperature[current][0]           = *leftgc;
MPI_Win_unlock( rank, leftwin );

MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
temperature[current][locpoints+1] = *rightgc;
MPI_Win_unlock( rank, rightwin );

In the code I have even ranks wait an extra 10 ms each time step to try to make sure that things get out of sync; but looking at traces, it actually looks like things remain pretty synced up. I don't know whether that high degree of synchrony can be fixed by tweaking the code, whether it is a restriction of the implementation (IntelMPI 5.0.1), or whether it just happens because the amount of time spent in computation is too small and communication time dominates (but as to the last, cranking up the usleep interval doesn't seem to have an effect).

#define _BSD_SOURCE     /* usleep */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>


int main(int argc, char **argv) {
    /* simulation parameters */
    const int totpoints=1000;
    int locpoints;
    const float xleft = -12., xright = +12.;
    float locxleft, locxright;
    const float kappa = 1.;

    const int nsteps=100;

    /* data structures */
    float *x;
    float **temperature;

    /* parameters of the original temperature distribution */
    const float ao=1., sigmao=1.;

    float fixedlefttemp, fixedrighttemp;

    int current, new;
    int step, i;
    float time;
    float dt, dx;
    float rms;

    int rank, size;
    int start,end;
    int left, right;
    int lefttag=1, righttag=2;

    /* MPI Initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);

    locpoints = totpoints/size;
    start = rank*locpoints;
    end   = (rank+1)*locpoints - 1;
    if (rank == size-1)
        end = totpoints-1;
    locpoints = end-start+1;

    left = rank-1;
    if (left < 0) left = MPI_PROC_NULL;
    right= rank+1;
    if (right >= size) right = MPI_PROC_NULL;

    #ifdef ONESIDED
    if (rank == 0)
        printf("Onesided: Allocating windows\n");
    MPI_Win leftwin, rightwin;
    float *leftgc, *rightgc;
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc,  &leftwin);
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
    #endif
    /* set parameters */

    dx = (xright-xleft)/(totpoints-1);
    dt = dx*dx * kappa/10.;

    locxleft = xleft + start*dx;
    locxright = xleft + end*dx;

    x      = (float *)malloc((locpoints+2)*sizeof(float));
    temperature = (float **)malloc(2 * sizeof(float *));
    temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
    temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
    current = 0;
    new = 1;

    /* setup initial conditions */

    time = 0.;
    for (i=0; i<locpoints+2; i++) {
        x[i] = locxleft + (i-1)*dx;
        temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
    }
    fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
    fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
    #ifdef ONESIDED
    *leftgc  = fixedlefttemp;
    *rightgc = fixedrighttemp;
    #endif

    /* evolve */
    for (step=0; step < nsteps; step++) {
        /* boundary conditions: keep endpoint temperatures fixed. */

        #ifdef ONESIDED
            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, left, 0, rightwin );
            MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
            MPI_Win_unlock( left, rightwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, right, 0, leftwin );
            MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
            MPI_Win_unlock( right, leftwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, leftwin );
            temperature[current][0]           = *leftgc;
            MPI_Win_unlock( rank, leftwin );

            MPI_Win_lock( MPI_LOCK_EXCLUSIVE, rank, 0, rightwin );
            temperature[current][locpoints+1] = *rightgc;
            MPI_Win_unlock( rank, rightwin );
        #else
            temperature[current][0] = fixedlefttemp;
            temperature[current][locpoints+1] = fixedrighttemp;

            /* send data rightwards */
            MPI_Sendrecv(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, righttag,
                         &(temperature[current][0]), 1, MPI_FLOAT, left,  righttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* send data leftwards */
            MPI_Sendrecv(&(temperature[current][1]), 1, MPI_FLOAT, left, lefttag,
                         &(temperature[current][locpoints+1]), 1, MPI_FLOAT, right,  lefttag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #endif

        for (i=1; i<locpoints+1; i++) {
            temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
                (temperature[current][i+1] - 2.*temperature[current][i] +
                 temperature[current][i-1]) ;
        }

        time += dt;

        if ((rank % 2) == 0)
            usleep(10000u);

        current = new;
        new = 1 - current;
    }

    rms  = 0.;
    for (i=1;i<locpoints+1;i++) {
        rms += (temperature[current][i])*(temperature[current][i]);
    }
    float totrms;
    MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        totrms = sqrt(totrms/totpoints);
        printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
    }


    #ifdef ONESIDED
    MPI_Win_free(&leftwin);
    MPI_Win_free(&rightwin);
    #endif

    free(temperature[1]);
    free(temperature[0]);
    free(temperature);
    free(x);

    MPI_Finalize();
    return 0;
}
Peculate answered 13/10, 2014 at 1:19 Comment(3)
Why not use MPI-3 RMA passive-target synchronization instead? You can use MPI_Win_lock_all, MPI_Accumulate(..,MPI_REPLACE,..) (this is equivalent to an element-wise atomic put) and MPI_Win_flush to get a more efficient implementation of what is shown here.Swain
@Jeff - Because I didn't know MPI-3 RMA that well in October, and for that matter am still getting comfortable with it. Why don't you add an answer using the newer features and contrasting it with the older method? It would be beneficial for people coming to this question in the future. (And me!)Peculate
Done. See inline comments to determine if my changes will be valid.Swain

This is a clone of Jonathan Dursi's post, but with changes for MPI-3 RMA synchronization...

#define _BSD_SOURCE     /* usleep */

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>


int main(int argc, char **argv) {
    /* simulation parameters */
    const int totpoints=1000;
    int locpoints;
    const float xleft = -12., xright = +12.;
    float locxleft, locxright;
    const float kappa = 1.;

    const int nsteps=100;

    /* data structures */
    float *x;
    float **temperature;

    /* parameters of the original temperature distribution */
    const float ao=1., sigmao=1.;

    float fixedlefttemp, fixedrighttemp;

    int current, new;
    int step, i;
    float time;
    float dt, dx;
    float rms;

    int rank, size;
    int start,end;
    int left, right;
    int lefttag=1, righttag=2;

    /* MPI Initialization */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD,&size);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);

    locpoints = totpoints/size;
    start = rank*locpoints;
    end   = (rank+1)*locpoints - 1;
    if (rank == size-1)
        end = totpoints-1;
    locpoints = end-start+1;

    left = rank-1;
    if (left < 0) left = MPI_PROC_NULL;
    right= rank+1;
    if (right >= size) right = MPI_PROC_NULL;

    #ifdef ONESIDED
    if (rank == 0)
        printf("Onesided: Allocating windows\n");
    MPI_Win leftwin, rightwin;
    float *leftgc, *rightgc;
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &leftgc,  &leftwin);
    MPI_Win_allocate(sizeof(float), sizeof(float), MPI_INFO_NULL, MPI_COMM_WORLD, &rightgc, &rightwin);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, leftwin);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, rightwin);
    #endif
    /* set parameters */

    dx = (xright-xleft)/(totpoints-1);
    dt = dx*dx * kappa/10.;

    locxleft = xleft + start*dx;
    locxright = xleft + end*dx;

    x      = (float *)malloc((locpoints+2)*sizeof(float));
    temperature = (float **)malloc(2 * sizeof(float *));
    temperature[0] = (float *)malloc((locpoints+2)*sizeof(float));
    temperature[1] = (float *)malloc((locpoints+2)*sizeof(float));
    current = 0;
    new = 1;

    /* setup initial conditions */

    time = 0.;
    for (i=0; i<locpoints+2; i++) {
        x[i] = locxleft + (i-1)*dx;
        temperature[current][i] = ao*exp(-(x[i]*x[i]) / (2.*sigmao*sigmao));
    }
    fixedlefttemp = ao*exp(-(locxleft-dx)*(locxleft-dx) / (2.*sigmao*sigmao));
    fixedrighttemp= ao*exp(-(locxright+dx)*(locxright+dx)/(2.*sigmao*sigmao));
    #ifdef ONESIDED
    *leftgc  = fixedlefttemp;
    *rightgc = fixedrighttemp;
    #endif

    /* evolve */
    for (step=0; step < nsteps; step++) {
        /* boundary conditions: keep endpoint temperatures fixed. */

        /* RMA code assumes no conflicts in updates via MPI_Put.
           If that is wrong, hopefully it is fine to use MPI_Accumulate
           with MPI_SUM to accumulate the result. */
        #ifdef ONESIDED
            MPI_Put(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, rightwin);
            MPI_Win_flush( left, rightwin );

            MPI_Put(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, leftwin);
            MPI_Win_flush( right, leftwin );

            temperature[current][0]           = *leftgc;
            MPI_Win_flush( rank, leftwin );

            temperature[current][locpoints+1] = *rightgc;
            MPI_Win_flush( rank, rightwin );
        #else
        #error Define ONESIDED...
        #endif

        for (i=1; i<locpoints+1; i++) {
            temperature[new][i] = temperature[current][i] + dt*kappa/(dx*dx) *
                (temperature[current][i+1] - 2.*temperature[current][i] +
                 temperature[current][i-1]) ;
        }

        time += dt;

        if ((rank % 2) == 0)
            usleep(10000u);

        current = new;
        new = 1 - current;
    }

    rms  = 0.;
    for (i=1;i<locpoints+1;i++) {
        rms += (temperature[current][i])*(temperature[current][i]);
    }
    float totrms;
    MPI_Reduce(&rms, &totrms, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        totrms = sqrt(totrms/totpoints);
        printf("Step = %d, Time = %g, RMS value = %g\n", step, time, totrms);
    }


    #ifdef ONESIDED
    MPI_Win_unlock_all(leftwin);
    MPI_Win_unlock_all(rightwin);
    MPI_Win_free(&leftwin);
    MPI_Win_free(&rightwin);
    #endif

    free(temperature[1]);
    free(temperature[0]);
    free(temperature);
    free(x);

    MPI_Finalize();
    return 0;
}
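
As noted in the comments on the previous answer, the MPI_Put plus MPI_Win_flush pairs could also be written as MPI_Accumulate with MPI_REPLACE, which acts as an element-wise atomic put under the lock_all epoch. A sketch of that substitution, using the same variables (my addition, not part of the code above):

/* element-wise atomic alternative to MPI_Put under the lock_all epoch */
MPI_Accumulate(&(temperature[current][1]),         1, MPI_FLOAT, left,  0, 1, MPI_FLOAT, MPI_REPLACE, rightwin);
MPI_Win_flush(left, rightwin);

MPI_Accumulate(&(temperature[current][locpoints]), 1, MPI_FLOAT, right, 0, 1, MPI_FLOAT, MPI_REPLACE, leftwin);
MPI_Win_flush(right, leftwin);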
Swain answered 5/4, 2015 at 0:6 Comment(0)
