MPI Fortran code: how to share data on node via OpenMP?
I am working on a Fortran code that already uses MPI.

Now I am facing a situation where a set of data grows very large but is the same for every process, so I would prefer to store it in memory only once per node and have all processes on one node access the same data.

Storing it once for every process would go beyond the available RAM.

Is it somehow possible to achieve something like that with OpenMP?

Per-node data sharing is the only thing I would like to have; no other per-node parallelisation is required, because that is already done via MPI.

Therapeutic answered 17/7, 2014 at 7:24 Comment(3)
Yes, what you want to do is possible and it is often called hybridisation, a mating, if you will, of MPI and OpenMP. Use your favourite search engine to learn more. Here on SO look for some of the answers from Hristo Iliev and Jonathan Dursi, I'm sure they've both covered the topic and when those guys cover a topic it stays covered.Rudd
I guess the problem will be how to do it without also parallelizing all the computations with OpenMP.Miliaria
Thanks for the suggestions! And yes, as Vladimir F pointed out, this is exactly what I am thinking about: data sharing but NO further parallelization. I will have a closer look at hybridisation then.Therapeutic

You don't need to implement a hybrid MPI+OpenMP code if it is only for sharing a chunk of data. What you actually have to do is:

1) Split the world communicator into groups that span the same host/node. That is really easy if your MPI library implements MPI-3.0 - all you need to do is call MPI_COMM_SPLIT_TYPE with split_type set to MPI_COMM_TYPE_SHARED:

USE mpi_f08

TYPE(MPI_Comm) :: hostcomm

CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                         MPI_INFO_NULL, hostcomm)

MPI-2.2 or earlier does not provide the MPI_COMM_SPLIT_TYPE operation, so one has to get somewhat creative. You could, for example, use my simple split-by-host implementation that can be found on GitHub here.
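
For illustration, here is a minimal sketch of the split-by-host idea using nothing beyond MPI-2 (this is not the implementation linked above): every rank gathers the processor names of all ranks and uses the rank of the first process that reported the same name as its colour for MPI_COMM_SPLIT. Each rank blank-pads its own name before the gather so that whole-string comparisons work; program and variable names are arbitrary:

program split_by_host
  use mpi
  implicit none
  integer :: ierr, my_rank, nprocs, namelen, color, i, hostcomm
  character(len=MPI_MAX_PROCESSOR_NAME) :: myname
  character(len=MPI_MAX_PROCESSOR_NAME), allocatable :: allnames(:)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Get this process' host name and make sure it is blank-padded
  call MPI_GET_PROCESSOR_NAME(myname, namelen, ierr)
  myname(namelen+1:) = ' '

  ! Gather the host names of all ranks so that every rank sees the full list
  allocate(allnames(0:nprocs-1))
  call MPI_ALLGATHER(myname, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                     allnames, MPI_MAX_PROCESSOR_NAME, MPI_CHARACTER, &
                     MPI_COMM_WORLD, ierr)

  ! Colour = lowest rank that reported the same host name as this rank
  color = 0
  do i = 0, nprocs-1
     if (allnames(i) == myname) then
        color = i
        exit
     end if
  end do

  ! All ranks on the same host end up in the same hostcomm
  call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, my_rank, hostcomm, ierr)

  ! ... use hostcomm here ...

  call MPI_COMM_FREE(hostcomm, ierr)
  call MPI_FINALIZE(ierr)
end program split_by_host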

2) Now that processes that reside on the same node are part of the same communicator hostcomm, they can create a block of shared memory and use it to exchange data. Again, MPI-3.0 provides a (relatively) easy and portable way to do that:

USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER

INTEGER :: hostrank

INTEGER(KIND=MPI_ADDRESS_KIND) :: size
INTEGER :: disp_unit
TYPE(C_PTR) :: baseptr
TYPE(MPI_Win) :: win

TYPE(MY_DATA_TYPE), POINTER :: shared_data

! We only want one process per host to allocate memory
! Set size to 0 in all processes but one
CALL MPI_Comm_rank(hostcomm, hostrank)
if (hostrank == 0) then
   size = 10000000 ! Put the actual data size here
else
   size = 0
end if
disp_unit = 1
CALL MPI_Win_allocate_shared(size, disp_unit, MPI_INFO_NULL, &
                             hostcomm, baseptr, win)

! Obtain the location of the memory segment
if (hostrank /= 0) then
   CALL MPI_Win_shared_query(win, 0, size, disp_unit, baseptr)
end if

! baseptr can now be associated with a Fortran pointer
! and thus used to access the shared data
CALL C_F_POINTER(baseptr, shared_data)

! Use shared_data as if it was ALLOCATE'd
! ...

! Destroy the shared memory window
CALL MPI_Win_free(win)

The way that code works is that it uses the MPI-3.0 functionality for allocating shared memory windows. MPI_WIN_ALLOCATE_SHARED allocates a chunk of shared memory in each process. Since you want to share one block of data, it only makes sense to allocate it in a single process and not have it spread across the processes, therefore size is set to 0 in all but one rank while making the call. MPI_WIN_SHARED_QUERY is used to find out the address at which that shared memory block is mapped in the virtual address space of the calling process. Once the address is known, the C pointer can be associated with a Fortran pointer using the C_F_POINTER() subroutine and the latter can be used to access the shared memory. Once done, the shared memory has to be freed by destroying the shared memory window with MPI_WIN_FREE.

MPI-2.2 or earlier does not provide shared memory windows. In that case one has to use the OS-dependent APIs for creation of shared memory blocks, e.g. the standard POSIX sequence shm_open() / ftruncate() / mmap(). A utility C function callable from Fortran has to be written in order to perform those operations. See that code for some inspiration. The void * returned by mmap() can be passed directly to the Fortran code in a C_PTR type variable that can be then associated with a Fortran pointer.
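
As a rough illustration of that route, the sketch below calls the POSIX functions directly from Fortran through ISO_C_BINDING instead of going through a C helper. Note the assumptions: the numeric values of O_CREAT, O_RDWR, PROT_READ, PROT_WRITE and MAP_SHARED, as well as the integer kinds chosen for mode_t and off_t, are the usual Linux/glibc ones; on other systems (or for anything beyond a toy) a small C wrapper that includes the real headers is the safer choice. Error checking is omitted, and older glibc versions require linking with -lrt.

program posix_shm_sketch
  use, intrinsic :: iso_c_binding
  implicit none

  interface
     function shm_open(name, oflag, mode) bind(C, name="shm_open")
       import :: c_char, c_int
       character(kind=c_char), dimension(*), intent(in) :: name
       integer(c_int), value :: oflag
       integer(c_int), value :: mode          ! mode_t assumed to be int-sized
       integer(c_int) :: shm_open
     end function shm_open
     function ftruncate(fd, length) bind(C, name="ftruncate")
       import :: c_int, c_long
       integer(c_int), value :: fd
       integer(c_long), value :: length       ! off_t assumed to be c_long
       integer(c_int) :: ftruncate
     end function ftruncate
     function mmap(addr, length, prot, flags, fd, offset) bind(C, name="mmap")
       import :: c_ptr, c_size_t, c_int, c_long
       type(c_ptr), value :: addr
       integer(c_size_t), value :: length
       integer(c_int), value :: prot
       integer(c_int), value :: flags
       integer(c_int), value :: fd
       integer(c_long), value :: offset       ! off_t assumed to be c_long
       type(c_ptr) :: mmap
     end function mmap
  end interface

  ! Assumed Linux/glibc values of the POSIX constants
  integer(c_int), parameter :: O_RDWR = 2, O_CREAT = 64        ! O_CREAT = 0100 octal
  integer(c_int), parameter :: PROT_READ = 1, PROT_WRITE = 2
  integer(c_int), parameter :: MAP_SHARED = 1

  integer, parameter :: n = 1000              ! number of doubles to share
  integer(c_int) :: fd, ret
  type(c_ptr) :: baseptr
  real(c_double), pointer :: shared_data(:)

  ! Create the shared memory object and give it the required size
  fd = shm_open(c_char_"/my_shared_block" // c_null_char, &
                ior(O_RDWR, O_CREAT), 384_c_int)              ! mode 0600 octal
  ret = ftruncate(fd, int(n, c_long) * 8_c_long)              ! 8 bytes per double

  ! Map it and associate the returned void* with a Fortran pointer
  baseptr = mmap(c_null_ptr, int(n, c_size_t) * 8_c_size_t, &
                 ior(PROT_READ, PROT_WRITE), MAP_SHARED, fd, 0_c_long)
  call c_f_pointer(baseptr, shared_data, [n])

  shared_data(1) = 42.0_c_double
  print *, shared_data(1)
end program posix_shm_sketch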

Glaciology answered 17/7, 2014 at 9:44 Comment(5)
Wow, that sounds really promising! Thanks a lot! I have to work through this. Small additional question: are there MPI-3 implementations that are usable for production? Edit: never mind, Open MPI has full MPI-3 standard support.Therapeutic
MPICH also supports full MPI-3. I believe some of the vendor implementations also support it or will very soon as well.Brok
MPI-3.0 support in Open MPI is present in the 1.8 series, which still hasn't quite lived up to the "stable release" label.Glaciology
@HristoIliev I have a similar question to Chris's. I have a code that uses MPI, but I have a vector that is the same for every process. Right now, after each time step each process modifies its corresponding chunk of the vector and then sends that chunk to all other processes while receiving all other chunks from them. My professor told me that is slow and not very optimal. Can I do something like Chris did?Capacity
@Capacity you should look into adding OpenMP parallelisation on top of MPI. This is somewhat less involved than MPI shared memory windows, and the code will also be cleaner.Glaciology

With this answer I want to add a complete running code example (for ifort 15 and mvapich 2.1). The MPI shared memory concept is still pretty new and, in particular for Fortran, there aren't many code examples out there. This example is based on the answer from Hristo and a very useful email on the mvapich mailing list (http://mailman.cse.ohio-state.edu/pipermail/mvapich-discuss/2014-June/005003.html).

The code example is based on the problems I ran into and adds to Hristo's answer in the following ways:

  • uses mpi instead of mpi_f08 (some libraries do not yet provide a full Fortran 2008 interface)
  • adds ierr to the respective MPI calls
  • calculates the window size explicitly as elements*elementsize
  • shows how to use C_F_POINTER to map the shared memory to a multi-dimensional array
  • reminds you to use MPI_WIN_FENCE after modifying the shared memory
  • notes that Intel MPI (5.0.1.035) needs an additional MPI_BARRIER after the MPI_WIN_FENCE, since the fence only guarantees that "between two MPI_Win_fence calls, all RMA operations are completed" (https://software.intel.com/en-us/blogs/2014/08/06/one-sided-communication)

Kudos go to Hristo and Michael Rachner.

program sharedmemtest
  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(14,200)
  integer :: win,win2,hostcomm,hostrank
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit,my_rank,ierr,total
  TYPE(C_PTR) :: baseptr,baseptr2
  real(dp), POINTER :: matrix_elementsy(:,:,:,:)
  integer,allocatable :: arrayshape(:)

  call MPI_INIT( ierr )

  call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)  ! get the rank of this process
  call MPI_COMM_SIZE(MPI_COMM_WORLD, total, ierr)    ! get the total number of processes in the communicator
  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm, ierr)
  CALL MPI_Comm_rank(hostcomm, hostrank, ierr)

  ! Gratefully based on: https://mcmap.net/q/898362/-mpi-fortran-code-how-to-share-data-on-node-via-openmp
  ! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
  ! We only want one process per host to allocate memory
  ! Set size to 0 in all processes but one
  allocate(arrayshape(4))
  arrayshape = (/ 10,10,10,10 /)
  if (hostrank == 0) then
     windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND  ! *8 bytes per double; put the actual data size here
  else
     windowsize = 0_MPI_ADDRESS_KIND
  end if
  disp_unit = 1
  CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)

  ! Obtain the location of the memory segment
  if (hostrank /= 0) then
     CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
  end if

  ! baseptr can now be associated with a Fortran pointer
  ! and thus used to access the shared data
  CALL C_F_POINTER(baseptr, matrix_elementsy, arrayshape)

  !!! your code here!
  !!! sample below

  if (hostrank == 0) then
     matrix_elementsy = 0.0_dp
     matrix_elementsy(1,2,3,4) = 1.0_dp
  end if
  CALL MPI_WIN_FENCE(0, win, ierr)

  print *, "my_rank=", my_rank, matrix_elementsy(1,2,3,4), matrix_elementsy(1,2,3,5)

  !!! end sample code

  call MPI_WIN_FENCE(0, win, ierr)
  call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  call MPI_Win_free(win, ierr)
  call MPI_FINALIZE(ierr)

  end program
Gibbous answered 6/1, 2016 at 18:17 Comment(5)
Thanks for the useful post. Can I just ask if you left out the MPI_WIN_FREE call on purpose? I have both MPI_WIN_FREE and MPI_FINALIZE and get a seg fault for the latter.Residence
Thanks for pointing this out. No I did not leave them out on purpose. The correct end of the program should be: CALL MPI_WIN_FENCE(0, win, ierr); call MPI_BARRIER(MPI_COMM_WORLD,ierr); CALL MPI_Win_free(win,ierr); call MPI_FINALIZE(IERR). I corrected the post accordingly. ThanksGibbous
Did you know Stack Overflow has automatic version control? If you click on the "edited" date you will see the complete history, including the dates. There is no reason for stuff like ==== EDIT 1/7/2016 ====.Miliaria
Thanks for pointing that out Vladimir. I didn't know and have now cleaned up my answer accordingly.Gibbous
Although it doesn't matter in this example, since only hostrank 0 writes to the memory, I suggest separating the memory initialisation step matrix_elementsy=0.0_dp from the step(s) that actually use the memory, such as matrix_elementsy(1,2,3,4)=1.0_dp, with an MPI_Win_fence. This bit me when I tried to add the loop in the code below. User error, but a reference point for others.Bitner

In the spirit of adding Fortran shared memory MPI examples, I'd like to extend ftiaronsem's code to incorporate a loop so that the behavior of MPI_Win_fence and MPI_Barrier is clearer (at least it is for me now, anyway).

Specifically, try running the code with either or both of the MPI_Win_fence and MPI_Barrier calls in the loop commented out to see the effect. Alternatively, reverse their order.

Removing the MPI_Win_Fence allows the write statement to display memory that has not been updated yet.

Removing the MPI_Barrier allows other processes to run the next iteration and change memory before a process has the chance to write.

The previous answers really helped me implement the shared memory paradigm in my MPI code. Thanks.

program sharedmemtest
  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(14,200)
  integer :: win,win2,hostcomm,hostrank
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit,my_rank,ierr,total, i
  TYPE(C_PTR) :: baseptr,baseptr2
  real(dp), POINTER :: matrix_elementsy(:,:,:,:)
  integer,allocatable :: arrayshape(:)

  call MPI_INIT( ierr )

  call MPI_COMM_RANK(MPI_COMM_WORLD,my_rank, ierr)  !GET THE RANK OF ONE PROCESS
  call MPI_COMM_SIZE(MPI_COMM_WORLD,total,ierr)  !GET THE TOTAL PROCESSES OF THE COMM
  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
  CALL MPI_Comm_rank(hostcomm, hostrank,ierr)

  ! Gratefully based on: https://mcmap.net/q/898362/-mpi-fortran-code-how-to-share-data-on-node-via-openmp
  ! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
  ! We only want one process per host to allocate memory
  ! Set size to 0 in all processes but one
  allocate(arrayshape(4))
  arrayshape=(/ 10,10,10,10 /)
  if (hostrank == 0) then
     windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double ! Put the actual data size here
  else
     windowsize = 0_MPI_ADDRESS_KIND
  end if
  disp_unit = 1
  CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)

  ! Obtain the location of the memory segment
  if (hostrank /= 0) then
     CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
  end if

  ! baseptr can now be associated with a Fortran pointer
  ! and thus used to access the shared data
  CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)

  !!! your code here!
  !!! sample below
  if (hostrank == 0) then
     matrix_elementsy=0.0_dp
  endif
  call MPI_WIN_FENCE(0, win, ierr)
  do i=1, 15
     if (hostrank == 0) then
        matrix_elementsy(1,2,3,4)=i * 1.0_dp
        matrix_elementsy(1,2,2,4)=i * 2.0_dp
     elseif ((hostrank > 5) .and. (hostrank < 11)) then  ! code for non-root nodes to do something different
        matrix_elementsy(1,2,hostrank, 4) = hostrank * 1.0 * i
     endif
     call MPI_WIN_FENCE(0, win, ierr)
     write(*,'(A, I4, I4, 10F7.1)') "my_rank=",my_rank, i, matrix_elementsy(1,2,:,4)
     call MPI_BARRIER(MPI_COMM_WORLD, ierr)
  enddo
  !!! end sample code

  call MPI_WIN_FENCE(0, win, ierr)
  call MPI_BARRIER(MPI_COMM_WORLD,ierr)
  call MPI_Win_free(win,ierr)
  call MPI_FINALIZE(IERR)

  end program
Bitner answered 20/12, 2016 at 21:15 Comment(2)
This solution mostly works for me, thanks. But I find I get a segfault on Cori at NERSC from the MPI_FINALIZE line. Also, does the MPI_WIN_FREE free the shared memory as well?Ellora
I can't comment on any differences between Cori and standard MPI, so I'm not sure why the segfault occurs. It is strange that it happens at MPI_FINALIZE, since everything is done at that point. And yes, MPI_WIN_FREE does free the shared memory (more specifically, the memory attached to the window passed as its first argument).Bitner
