None of the barriers are needed!
MPI_Gather is a blocking operation, that is, the outputs are available after the call completes. That does not imply a barrier, because non-root ranks are allowed to, but not guaranteed to, complete before the root or the other ranks start their operation. However, it is perfectly safe to access global on the MASTER_ID rank and to reuse localdata on any rank after the local call completes.
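For illustration, here is a minimal self-contained sketch in C - not the asker's actual code. It reuses the names from the question, assuming MASTER_ID is the root rank, localdata is the per-rank contribution, and global is the receive buffer on the root:

```c
/* Minimal sketch: gather one int per rank onto MASTER_ID without any barriers.
 * MASTER_ID, localdata, and global mirror the names used in the question. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MASTER_ID 0

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int localdata = rank * rank;            /* per-rank contribution */
    int *global = NULL;
    if (rank == MASTER_ID)
        global = malloc(size * sizeof(int)); /* receive buffer only on the root */

    /* Blocking collective: when it returns, localdata may be reused on
     * every rank, and global is fully populated on MASTER_ID. */
    MPI_Gather(&localdata, 1, MPI_INT,
               global, 1, MPI_INT,
               MASTER_ID, MPI_COMM_WORLD);

    localdata = -1;                          /* safe on every rank, no barrier needed */
    if (rank == MASTER_ID) {
        for (int i = 0; i < size; ++i)
            printf("global[%d] = %d\n", i, global[i]);
        free(global);
    }

    MPI_Finalize();
    return 0;
}
```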
Synchronization with message-based MPI is different from shared-memory OpenMP. For blocking communication, usually no explicit synchronization is necessary - the result is guaranteed to be available after the call completes.
Synchronization of sorts is necessary for non-blocking communication, but that is done via MPI_Test/MPI_Wait on specific requests - barriers might even provide a false sense of correctness if you tried to substitute an MPI_Wait with an MPI_Barrier. With one-sided communication, it gets more complicated and barriers can play a role.
In fact, you only rarely need a barrier at all; avoid them so as not to introduce unnecessary synchronization.
Edit: Given the other, contradicting answers, here is the citation from the standard (MPI 3.1, Section 5.1), emphasis mine.
Collective operations can (but are not required to) complete as soon
as the caller’s participation in the collective communication is
finished. A blocking operation is complete as soon as the call
returns. A nonblocking (immediate) call requires a separate completion
call (cf. Section 3.7). The completion of a collective operation
indicates that the caller is free to modify locations in the
communication buffer. It does not indicate that other processes in the
group have completed or even started the operation (unless otherwise
implied by the description of the operation). Thus, a collective
communication operation may, or may not, have the effect of
synchronizing all calling processes. This statement excludes, of
course, the barrier operation.
To address the recent edit: No, data sizes have no impact on correctness in this case. Data sizes in MPI sometimes have an impact on whether an incorrect MPI program deadlocks or not, as in the sketch below.
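A classic illustration of this, not taken from the question: both ranks sending before receiving is erroneous, yet it may appear to work for small messages because the implementation buffers them eagerly, while large messages deadlock.

```c
/* Erroneous exchange whose behavior depends on message size: both ranks
 * call MPI_Send first. Small messages may be buffered ("eager" protocol)
 * and the program seems to work; large messages typically deadlock. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                  /* try 1 vs. 1 << 20 elements */
    int *sendbuf = calloc(n, sizeof(int));
    int *recvbuf = calloc(n, sizeof(int));
    int other = 1 - rank;                   /* assumes exactly 2 ranks */

    /* Incorrect regardless of size: use MPI_Sendrecv or non-blocking
     * calls to make this exchange correct for any message size. */
    MPI_Send(sendbuf, n, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```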