Synchronizing access to MPI-3 shared memory: is this code guaranteed to work by the MPI standard?
The MPI-3 standard introduces shared memory that can be read and written by all processes sharing it, without calls to the MPI library. While there are examples of one-sided communication using shared or non-shared memory, I did not find much information about how to use shared memory correctly with direct access.

I ended up doing something like this, which works well, but I was wondering whether the MPI standard guarantees that it will always work.

// initialization:
MPI_Comm comm_shared;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, i_mpi, MPI_INFO_NULL, &comm_shared);  // i_mpi: my rank, used as ordering key

// allocation
const int N_WIN = 10;             // number of shared windows
const int mem_size = 1000*1000;   // window size in bytes
double* mem[N_WIN];
MPI_Win win[N_WIN];
for (int i=0; i<N_WIN; i++) {   // I need several buffers.
    MPI_Win_allocate_shared( mem_size, sizeof(double), MPI_INFO_NULL, comm_shared, &mem[i], &win[i] );
    MPI_Win_lock_all(0, win[i]);  // open a passive-target epoch on each window
}

while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

// deallocation
for (int i=0; i<N_WIN; i++) {
    MPI_Win_unlock_all(win[i]);
    MPI_Win_free(&win[i]);
}

Here, I ensure synchronization by using MPI_Barrier() and assume the hardware makes the memory view consistent. Furthermore, because I have several shared windows, a single call to MPI_Barrier seems more efficient than calling MPI_Win_fence() on every shared memory window.

It seems to work well on my x86 laptops and servers. But is this a valid/correct MPI program? Is there a more efficient way of achieving the same thing?

Bullroarer answered 19/2, 2020 at 10:33 Comment(0)

There are two key issues here:

  1. MPI_Barrier is absolutely not a memory barrier and should never be used that way. It may synchronize memory as a side effect of its implementation in most cases, but users can never assume that. MPI_Barrier is only guaranteed to synchronize process execution. (If it helps, you can imagine a system where MPI_Barrier is implemented using a hardware widget that does no more than the MPI standard requires. IBM Blue Gene sort of did this in some cases.)
  2. This question is unanswerable without details on what you are actually doing with shared-memory here:
while(1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    MPI_Barrier(comm_shared);
    ... // read on shared memory written by other processes
}

It may not be written clearly, but it was assumed by the authors of the relevant text of the MPI-3 standard (I was part of this group) that one could reason about shared memory using the memory model of the underlying/host language. Thus, if you are writing this code in C11, you can reason about it according to the C11 memory model.

If you want to use MPI to synchronize shared memory, then you should use MPI_Win_sync on all the windows for load-store accesses and MPI_Win_flush for RMA operations (Put/Get/Accumulate/Get_accumulate/Fetch_and_op/Compare_and_swap).
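
For load/store accesses, a minimal sketch of that pattern (modeled on Example 11.21 of the standard; win and comm_shared are assumed to be set up as in the question, with a passive-target epoch already opened by MPI_Win_lock_all, and peer_mem a pointer into the writer's segment obtained via MPI_Win_shared_query):

// Writer (e.g. rank 0): complete the stores before the barrier.
peer_mem[0] = 42.0;          // plain store into the shared segment
MPI_Win_sync(win);           // memory barrier: make the store visible
MPI_Barrier(comm_shared);    // process synchronization only

// Readers (other ranks): refresh the local view after the barrier.
MPI_Barrier(comm_shared);    // process synchronization only
MPI_Win_sync(win);           // memory barrier: observe the writer's store
double x = peer_mem[0];      // plain load now sees 42.0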

I expect MPI_Win_sync to be implemented as a CPU memory barrier, so it is redundant to call it for every window. This is why it may be more efficient to assume the C11 or C++11 memory model and use atomic_thread_fence (https://en.cppreference.com/w/c/atomic/atomic_thread_fence or https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence, respectively).
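
A sketch of that alternative, assuming C11 and that the fences order accesses to the shared mapping across processes (the standard does not spell this out, and see the caveat in the comments below about needing C11 atomics for this to be strictly correct): one fence on each side of the barrier replaces the per-window MPI_Win_sync calls.

#include <stdatomic.h>

// Writer side: one release fence covers stores to all N_WIN windows.
... // plain stores to any of the shared windows
atomic_thread_fence(memory_order_release);
MPI_Barrier(comm_shared);

// Reader side: one acquire fence covers loads from all N_WIN windows.
MPI_Barrier(comm_shared);
atomic_thread_fence(memory_order_acquire);
... // plain loads from the shared windows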

Led answered 6/3, 2020 at 18:41 Comment(6)
Thank you very much for your answer. May I thus assume that, within a hybrid MPI-OpenMP program, something like #pragma omp barrier followed by MPI_Barrier(comm_shared); and another #pragma omp barrier might do the trick? (If I understood correctly, #pragma omp barrier is also a memory barrier.)Bullroarer
#pragma omp barrier is primarily a thread execution barrier but implies a memory barrier (i.e. #pragma omp flush). While in practice #pragma omp barrier is sufficient, technically it only applies within the context of OpenMP. I know of no such case, but one could build a system where OpenMP would not synchronize interprocess load-store operations. I'm sorry to be difficult here, but I am an "HPC language lawyer" of sorts.Led
Could you elaborate on the use of atomic_thread_fence()? Do you suggest I could use MPI_Barrier() together with atomic_thread_fence() to replace MPI_Win_flush()? If so, should I put the fence before or after the barrier? Or on both sides?Bullroarer
Flush works fine but is overkill. I doubt you’ll detect the difference in cost though. It’s relatively cheap to flush an empty RMA queue.Led
Yeah, sorry, I meant MPI_Win_sync(). So should I put atomic_thread_fence() on both sides of the MPI_Barrier() to replace MPI_Win_sync()?Bullroarer
I would not replace MPI_Win_sync with atomic_thread_fence unless you are using C(++)1z atomics.Led

I would be tempted to say this MPI program is not valid.

To explain what I base my opinion on:

  • In the description of MPI_Win_allocate_shared:

The consistency of load/store accesses from/to the shared memory as observed by the user program depends on the architecture. A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization functions (see Section 11.5) or explicitly completing outstanding store accesses (e.g., by calling MPI_WIN_FLUSH). MPI does not define semantics for accessing shared memory windows in the separate memory model.

  • Section 11.4, about the memory models, states:

In the RMA unified model, public and private copies are identical and updates via put or accumulate calls are eventually observed by load operations without additional RMA calls. A store access to a window is eventually visible to remote get or accumulate calls without additional RMA calls. These stronger semantics of the RMA unified model allow the user to omit some synchronization calls and potentially improve performance.

  • The advice to users that follows only indicates:

If accesses in the RMA unified model are not synchronized (with locks or flushes, see Section 11.5.3), load and store operations might observe changes to the memory while they are in progress.

  • Section 11.7, Semantics and Correctness, says:

MPI_BARRIER provides process synchronization, but not memory synchronization.

  • The examples in Section 11.8 explain well how to use flush and sync operations.

The only synchronization the standard ever addresses is one-sided synchronization, i.e. in your case MPI_Win_flush{,_all} or MPI_Win_unlock{,_all} (apart from the mutual exclusion of concurrent active and passive synchronization, which has to be enforced by the user, and the use of the MPI_MODE_NOCHECK assert flag).

So either you access memory directly with stores, in which case you need to call MPI_Win_sync() on each of your windows before calling MPI_Barrier (as explained in Example 11.10) to ensure synchronization; or you are doing RMA accesses, in which case you would have to call at least MPI_Win_flush_all before the second barrier to ensure the operations have been propagated. If you then read using load operations, you may have to synchronize after the second barrier as well, before reading.
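
Applied to the loop from the question, the load/store variant would look something like this sketch (N_WIN, win, and comm_shared as in the question):

while (1) {
    MPI_Barrier(comm_shared);
    ... // write anywhere on shared memory
    for (int i = 0; i < N_WIN; i++)
        MPI_Win_sync(win[i]);    // complete the stores on every window
    MPI_Barrier(comm_shared);
    for (int i = 0; i < N_WIN; i++)
        MPI_Win_sync(win[i]);    // refresh the local view before reading
    ... // read on shared memory written by other processes
}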

Another solution would be to unlock and re-lock between the barriers; alternatively, compiler- and hardware-specific notations could ensure the load occurs after the data is updated.
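
A sketch of that unlock/re-lock alternative (though see the comments below: with MPI-3, flush or sync is preferred over toggling the epoch):

... // write anywhere on shared memory
for (int i = 0; i < N_WIN; i++)
    MPI_Win_unlock_all(win[i]);   // close the epoch, completing all stores
MPI_Barrier(comm_shared);
for (int i = 0; i < N_WIN; i++)
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win[i]);   // reopen the epoch
... // read on shared memory written by other processes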

Goldston answered 19/2, 2020 at 16:0 Comment(9)
Thank you for your answer. Looking at the documentation for MPI_Win_flush_all, it seems to be useful for RMA operations, which I thought were put, get or accumulate calls. I'm not sure this applies to direct access in a shared memory window. I find the standard a bit vague about that...Bullroarer
From my understanding of what I read, you can do direct accesses, but then you would have to refer to Examples 11.7 and 11.9. You need to call MPI_Win_sync after the "before-reading" barrier, so your local view of the shared buffer is updated before reading, and to call MPI_Win_sync after all writing has been done, to update your "public copy" of the window. Or simply call MPI_Win_unlock_all before the barrier and MPI_Win_lock_all after. You may improve the performance with the right hints/asserts though (MPI_MODE_NOCHECK as an example).Brenneman
One should never unlock and relock with MPI-3. Flush is equivalent to toggling an epoch.Led
@Jeff I didn't know that. Why is that? Isn't the whole point of passive synchronization to allow the asynchronous lock-modification-unlock of remote memory? As for the flush toggling an epoch, it doesn't enforce the memory synchronization, does it? Or if so, what is the point of MPI_Win_sync?Brenneman
MPI_Win_flush is specified to be equivalent to MPI_Win_unlock; MPI_Win_lock. Flush and Unlock synchronize RMA operations, which include direct access. MPI_Win_sync synchronizes the public window (used for direct access) and the private window (used for RMA). In the unified memory model, one gets eventual consistency between these, but MPI_Win_sync makes that immediate. This is a super complicated topic and probably warrants a separate Q&A. But please read mpi-forum.org/docs/mpi-3.1/mpi31-report/node289.htm and related.Led
wgropp.cs.illinois.edu/courses/cs598-s16/lectures/lecture35.pdf may be useful. That content is aligned with the understanding of the authors of the RMA chapter of MPI 3.0.Led
My question ("Why is that?") was about the sequence MPI_Win_unlock; MPI_Win_lock being forbidden, which surprised me.Brenneman
I understand that if a call to MPI_Win_flush is strictly equivalent to MPI_Win_unlock; MPI_Win_lock, then it does the memory synchronization; but the definition of the function only defines MPI_Win_flush as executing all pending RMA operations. MPI_Win_sync, however, would be the memory synchronization, to manage direct memory access (load/store).Brenneman
However, in semantic and correctness, in the user rationale about UM it says "In the unified memory model, in the case where the window is in shared memory, SYNC can be used to order store operations and make store updates to the window visible to other processes and threads. Use of this routine is necessary […] when point-to-point, collective, or shared memory synchronization is used in place of an RMA synchronization routine. SYNC should be called by the writer before the non-RMA synchronization operation and by the reader after the non-RMA synchronization, as shown in Example 11.21."Brenneman