MPI with C: Passive RMA synchronization

Asked 11/9, 2013 at 9:33 Answered 5/10, 2015 at 9:3

As I have found no answer to my question so far and am on the verge of going crazy over the problem, I am just asking the question that is tormenting my mind ;-)

I'm working on a parallelization of a node-elimination algorithm I already programmed. Target environment is a cluster.

In my parallel program I distinguish between a master process (in my case rank 0) and the working slaves (every rank except 0). My idea is that the master keeps track of which slaves are available and then sends them work. For this and some other reasons I am trying to establish a workflow based on passive RMA with lock-put-unlock sequences. I use an integer array named schedule in which each position corresponds to a rank and holds either 0 for a working process or 1 for an available process (so if schedule[1]=1, rank 1 is available for work). When a process is done with its work, it puts a 1 into the array on the master, signalling its availability. The code I tried for that is as follows:

 MPI_Win_lock(MPI_LOCK_EXCLUSIVE,0,0,win); // a exclusive window is locked on process 0
 printf("Process %d:\t exclusive lock on process 0 started\n",myrank);
 MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); // the line myrank of schedule is put into process 0
 printf("Process %d:\t put operation called\n",myrank);
 MPI_Win_unlock(0,win); // the window is unlocked

It worked perfectly, especially when the master process was synchronized with a barrier to the end of the lock because then the output of master was made after the put operation.

As a next step I tried to let the master check on a regular basis whether there are available slaves. I therefore created a while loop that repeats until every process has signalled its availability (I repeat that this is a program for teaching me the principles; I know that the implementation still doesn't do what I want). In its basic variant the loop just prints my array schedule and then checks in a function fnz whether there are working processes other than the master:

while(j!=1){
printf("Process %d:\t following schedule evaluated:\n",myrank);
for(i=0;i<size;i++)printf("%d\t",schedule[i]);//print the schedule
printf("\n");
j=fnz(schedule);
}

And then the concept blew up. After inverting the process, so that the master fetches the required information from the slaves with get instead of the slaves putting it to the master with put, I found out that my main problem is acquiring the lock: the unlock command doesn't succeed because, in the put case, the lock isn't granted at all and, in the get case, the lock is only granted once the slave process is done with its work and is waiting in a barrier. In my opinion there has to be a serious error in my thinking. It can't be the idea of passive RMA that the lock can only be acquired when the target process is in a barrier synchronizing the whole communicator. In that case I could just go along with standard Send/Recv operations. What I want to achieve is that process 0 works all the time on delegating work and can identify, through RMA by the slaves, to whom it can delegate. Can someone please help me and explain how I can get a break on process 0 so that the other processes can acquire locks?

Thank you in advance!

UPDATE: I'm not sure whether you have ever worked with a lock, and I just want to stress that I'm perfectly able to get an updated copy of a remote memory window. If I get the availability from the slaves, the lock is only granted when the slaves are waiting in a barrier. What I got to work is that process 0 performs lock-get-unlock while processes 1 and 2 simulate work such that process 2 is occupied considerably longer than process 1. What I expect as a result is that process 0 prints a schedule (0,1,0), because process 0 isn't asked at all whether it's working, process 1 is done with its work and process 2 is still working. In the next step, when process 2 is ready, I expect the output (0,1,1), since both slaves are ready for new work. What I get is that the slaves only grant the lock to process 0 when they are waiting in a barrier, so that the first and only output I get at all is the last one I expect, which shows me that the lock was granted for each individual process only once it was done with its work. So if someone could please tell me when a lock can be granted by the target process, instead of trying to confuse my knowledge about passive RMA, I would be very grateful.

Hosfmann answered 11/9, 2013 at 9:33 Comment(5)

What error code do you receive from MPI_Win_lock? – Deafen 11/9, 2013 at 14:41

Also note that the values of schedule[] in the master process do not need to reflect those set by remote MPI_Put calls since you are missing calls to lock/unlock the window in rank 0. §11.7 (of MPI 2.2) defines the semantics of RMA operations and it allows for the visibility of remotely made changes to be postponed until the local process locks the window. – Deafen 11/9, 2013 at 14:58

I don't get an error code; the lock isn't granted by the corresponding process, so I create a deadlock. Furthermore, rank 0 doesn't need a call to lock/unlock because this is passive RMA and it is synchronized only by the calling process, which would be one of the slaves. That's the difference from fence synchronisation. §11.7 (and, fully explained, §11.4.3) therefore states that the "same", not the "matching", call is the point where target synchronisation is achieved. – Hosfmann 11/9, 2013 at 15:52

§11.7, p. 365, ll. 1-4 - "An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner." See also Example 11.12. – Deafen 11/9, 2013 at 16:55

Anyway, one could only speculate on the origin of your problem given that you did not provide details about how you allocate the window in rank 0, the specific MPI implementation and the network interconnect being used. – Deafen 11/9, 2013 at 16:57

First of all, the passive RMA mechanism does not somehow magically poke into the remote process' memory since not many MPI transports have real RDMA capabilities and even those that do (e.g. InfiniBand) require a great deal of not-that-passive involvement of the target in order to allow for passive RMA operations to happen. This is explained in the MPI standard but in the very abstract form of public and private copies of the memory exposed through an RMA window.

Achieving working and portable passive RMA with MPI-2 involves several steps.

Step 1: Window allocation in the target process

For portability and performance reasons the memory for the window should be allocated using MPI_ALLOC_MEM:

int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

for (int i = 0; i < size; i++)
{
   schedule[i] = 0;
}

MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
   MPI_COMM_WORLD, &win);

...

MPI_Win_free(&win);
MPI_Free_mem(schedule);

Step 2: Memory synchronisation at the target

The MPI standard forbids concurrent access to the same location in the window (§11.3 from the MPI-2.2 specification):

It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.

Therefore each access to schedule[] in the target has to be protected by a lock (shared since it only reads the memory location):

while (!ready)
{
   MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
   ready = fnz(schedule, oldschedule, size);
   MPI_Win_unlock(0, win);
}

Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even when using non-RDMA capable transports, e.g. TCP/IP or shared memory, and that requires a lot of active work (called progression) to be done at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That's why the MPI standard allows for the following relaxed semantics of the put operation (§11.7, p. 365):

6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.

If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it becomes necessary to update the public window copy, even if the window owner does not execute any related synchronization call.

This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):

Advice to users. A user can write correct programs by following the following rules:

...

lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.
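The public/private copy semantics quoted above can be pictured with a toy, single-process model. This is not MPI code: the toy_window type and the toy_* functions are hypothetical names invented for illustration, and the model shows only the latest point at which an update is allowed to become visible. A real MPI library may make the update visible earlier.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS 4

/* Toy stand-in for an RMA window with separate public and private copies.
 * All names here are hypothetical; nothing in this block is an MPI call. */
typedef struct {
    int public_copy[TOY_SLOTS];   /* what remote puts update */
    int private_copy[TOY_SLOTS];  /* what the owner's plain loads read */
} toy_window;

void toy_init(toy_window *w)
{
    memset(w, 0, sizeof *w);
}

/* Models a remote lock-put-unlock: the public copy is complete
 * once the origin has unlocked. The private copy is untouched. */
void toy_remote_put(toy_window *w, int slot, int value)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    w->public_copy[slot] = value;
}

/* Models the owner calling a synchronization routine (e.g. a lock)
 * on its own window: the latest point at which public updates
 * must become visible in the private copy. */
void toy_owner_sync(toy_window *w)
{
    memcpy(w->private_copy, w->public_copy, sizeof w->private_copy);
}

/* Models a plain load by the owner, e.g. reading schedule[slot]. */
int toy_owner_load(const toy_window *w, int slot)
{
    assert(slot >= 0 && slot < TOY_SLOTS);
    return w->private_copy[slot];
}
```

In this model, a toy_owner_load issued after toy_remote_put but before toy_owner_sync still returns the old value, which mirrors the master seeing schedule[] filled with zeros when its own lock/unlock calls are left out.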

Step 3: Providing the correct parameters to MPI_PUT at the origin

MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window. The correct invocation given that the window at the target was created with disp_unit == sizeof(int) is:

int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);

The local value of one is thus transferred to the location rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit was set to 1, the correct put would be:

MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
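The addressing rule behind both variants is that the target displacement is scaled by the window's displacement unit. A minimal sketch of that arithmetic in plain C follows; window_byte_offset and window_int_index are hypothetical helpers written for illustration, not MPI functions.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset from the start of the target window addressed by an RMA
 * call: the target displacement is multiplied by the window's disp_unit.
 * Hypothetical helper, not part of MPI. */
size_t window_byte_offset(size_t disp_unit, size_t target_disp)
{
    assert(disp_unit > 0);
    return disp_unit * target_disp;
}

/* Index of the int element that the displacement refers to, for a
 * window whose memory is an array of int. Hypothetical helper. */
size_t window_int_index(size_t disp_unit, size_t target_disp)
{
    size_t offset = window_byte_offset(disp_unit, target_disp);
    assert(offset % sizeof(int) == 0);  /* must land on an int boundary */
    return offset / sizeof(int);
}
```

With disp_unit equal to sizeof(int), a displacement of rank addresses element rank. With disp_unit equal to 1, the displacement has to be rank * sizeof(int) to reach the same element, and a displacement of 0 always addresses the first element, which is what the call from the question did.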

Step 4: Dealing with implementation specifics

The program detailed above works out of the box with Intel MPI. With Open MPI one has to take special care. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations - rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma and for some reason it does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). On the other hand, the pt2pt module progresses all operations as expected. Therefore with Open MPI one has to start the program as follows in order to specifically select the pt2pt component:

$ mpiexec --mca osc pt2pt ...

A fully working C99 sample code follows:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
       diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
       int res = 0;

       if (starttime < 0.0) starttime = MPI_Wtime();

       printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
       for (int i = 0; i < size; i++)
       {
          printf("\t%d", schedule[i]);
          res += schedule[i];
          oldschedule[i] = schedule[i];
       }
       printf("\n");

       return(res == size-1);
    }
    return 0;
}

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
       int *oldschedule = malloc(size * sizeof(int));
       // Use MPI to allocate memory for the target window
       int *schedule;
       MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

       for (int i = 0; i < size; i++)
       {
          schedule[i] = 0;
          oldschedule[i] = -1;
       }

       // Create a window. Set the displacement unit to sizeof(int) to simplify
       // the addressing at the originator processes
       MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
          MPI_COMM_WORLD, &win);

       int ready = 0;
       while (!ready)
       {
          // Without the lock/unlock schedule stays forever filled with 0s
          MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
          ready = fnz(schedule, oldschedule, size);
          MPI_Win_unlock(0, win);
       }
       printf("All workers checked in using RMA\n");

       // Release the window
       MPI_Win_free(&win);
       // Free the allocated memory
       MPI_Free_mem(schedule);
       free(oldschedule);

       printf("Master done\n");
    }
    else
    {
       int one = 1;

       // Worker processes do not expose memory in the window
       MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

       // Simulate some work based on the rank
       sleep(2*rank);

       // Register with the master
       MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
       MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
       MPI_Win_unlock(0, win);

       printf("Worker %d finished RMA\n", rank);

       // Release the window
       MPI_Win_free(&win);

       printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Sample output with 6 processes:

$ mpiexec --mca osc pt2pt -n 6 rma
[ 0.000] Schedule:      0       0       0       0       0       0
[ 1.995] Schedule:      0       1       0       0       0       0
Worker 1 finished RMA
[ 3.989] Schedule:      0       1       1       0       0       0
Worker 2 finished RMA
[ 5.988] Schedule:      0       1       1       1       0       0
Worker 3 finished RMA
[ 7.995] Schedule:      0       1       1       1       1       0
Worker 4 finished RMA
[ 9.988] Schedule:      0       1       1       1       1       1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done

Deafen answered 12/9, 2013 at 14:44 Comment(4)

Where does the pt2pt module come from? The Open-MPI github repo apparently doesn't have 'pt2pt' among the osc modules. github.com/open-mpi/ompi/tree/master/ompi/mca/osc – Terraqueous 7/1, 2015 at 13:27

@Praxeolitic, it is part of the production version from the time when the answer was written - 1.6: svn.open-mpi.org/trac/ompi/browser/tags/v1.6-series/v1.6.5/ompi/…. Also present in earlier versions. Versions since 1.8 drop pt2pt. – Deafen 7/1, 2015 at 16:19

@HristoIliev: "[..] does not progress the RMA operations at the target side when MPI_WIN_(UN)LOCK is called, which leads to deadlock-like behaviour [..]" - do you have any idea why this is the case? I observed this behavior as well, and while my code does not rely on it this behavior significantly degrades performance. – Sanjuanitasank 14/9, 2016 at 16:27

@cschwan: I don't know. It is something implementation-specific. Probably the rdma module is incomplete in 1.6. I haven't tested the recent versions. – Deafen 19/9, 2016 at 22:46

The answer by Hristo Iliev works perfectly if I use newer versions of the Open-MPI library.

However, on the cluster we are currently using, this is not possible, and with the older versions there was deadlock behavior for the final unlock calls, as described by Hristo. Adding the option --mca osc pt2pt did solve the deadlock in a sense, but the MPI_Win_unlock calls still didn't seem to complete until the process owning the accessed variable did its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.

Therefore, from a pragmatic point of view, though strictly speaking leaving the topic of passive RMA synchronization (for which I do apologize), I would like to point out a workaround that makes use of external files, for those who are stuck with old versions of the Open-MPI library, so they don't have to lose as much time as I did:

You basically create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you don't even have to have a master process only dedicated to the bookkeeping of the slaves: It can also perform a job. Anyway, every process can go look in this file which job is to be done next and possibly determine that everything is done.

The important point now is that this information file must not be accessed by multiple processes at the same time, as this might cause work to be duplicated or worse. The equivalent of locking and unlocking the window in MPI is most easily imitated here with a lock file: this file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish by checking, with a slight time delay, whether the lock file still exists.
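A minimal sketch of such a lock file in C, assuming a POSIX system, could look like the following. The function names are hypothetical and this is not the implementation from the linked write-up. It relies on open() with O_CREAT | O_EXCL, which fails with EEXIST if the file already exists, so checking for the lock and creating it happen in one step. Whether that step is truly atomic on a cluster's shared file system depends on the file system, so treat that as an assumption to verify.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Try once to create the lock file.
 * Returns 1 if this process now holds the lock, 0 if another process
 * holds it, and -1 on any other error. Hypothetical helper. */
int try_lock_file(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd >= 0) {
        close(fd);
        return 1;
    }
    return (errno == EEXIST) ? 0 : -1;
}

/* Poll with a short delay until the lock file could be created.
 * Returns 0 once the lock is held, -1 on error. */
int acquire_lock_file(const char *path, long delay_ms)
{
    assert(delay_ms >= 0 && delay_ms < 1000);
    struct timespec delay = { 0, delay_ms * 1000000L };
    for (;;) {
        int r = try_lock_file(path);
        if (r != 0)
            return (r == 1) ? 0 : -1;
        nanosleep(&delay, NULL);  /* lock is held by someone else: wait */
    }
}

/* Release the lock by deleting the lock file. Returns 0 on success. */
int release_lock_file(const char *path)
{
    return unlink(path);
}
```

A process would call acquire_lock_file, read and update the information file, and then call release_lock_file. If a process dies while holding the lock, the file stays behind and has to be removed by hand, which is one drawback of this simple scheme.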

The full information can be found here.

Terresaterrestrial answered 5/10, 2015 at 9:3 Comment(0)
