Segmentation Fault in Fortran program using RMA functions of MPI-2

The following short Fortran 90 program crashes as long as it contains the MPI_GET call. Rank 1 tries to read a value from rank 0 and hangs in MPI_WIN_UNLOCK. Rank 0 crashes in MPI_BARRIER with a segmentation fault.

I have repeatedly checked the syntax of the calls, but it seems to be right. Similar code in C/C++ works on the same system.

I'm using OpenMPI 1.4.3 and gfortran 4.4.5.

PROGRAM mpitest
USE mpi
IMPLICIT NONE

INTEGER :: ierr, npe, rnk, win
INTEGER (KIND=MPI_ADDRESS_KIND) lowerbound, sizeofreal
REAL :: val = 1.0, oval = 2.0

CALL MPI_INIT( ierr )
CALL MPI_COMM_RANK( MPI_COMM_WORLD, rnk, ierr )
CALL MPI_COMM_SIZE( MPI_COMM_WORLD, npe, ierr )

CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lowerbound, sizeofreal, ierr)

CALL MPI_WIN_CREATE(val, sizeofreal, sizeofreal, MPI_INFO_NULL, MPI_COMM_WORLD, win, ierr)

IF( rnk .EQ. 1 ) THEN
   CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, 0, 0, win, ierr )
   CALL MPI_GET( oval, 1, MPI_REAL, 0, 0, 1, MPI_REAL, win, ierr )
   CALL MPI_WIN_UNLOCK( 0, win, ierr )
END IF

CALL MPI_BARRIER( MPI_COMM_WORLD, ierr )
CALL MPI_WIN_FREE(win, ierr)
CALL MPI_FINALIZE(ierr)

END PROGRAM mpitest

mpif90 mpitest.f90
mpirun -n 2 ./a.out

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x34006020a0
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7f2d1c8c1420]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x13ae70) [0x7f2d1c9c5e70]
[ 2] /usr/lib/libmpi.so.0(ompi_convertor_pack+0x199) [0x7f2d1c61d629]
[ 3] /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so(+0x56b0) [0x7f2d166876b0]
[ 4] /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so(+0x3a81) [0x7f2d16685a81]
[ 5] /usr/lib/openmpi/lib/openmpi/mca_osc_pt2pt.so(+0x23ac) [0x7f2d166843ac]
[ 6] /usr/lib/libopen-pal.so.0(opal_progress+0x5b) [0x7f2d1ba700db]
[ 7] /usr/lib/libmpi.so.0(+0x35635) [0x7f2d1c60f635]
[ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x1afa) [0x7f2d1688eafa]
[ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x958f) [0x7f2d1689658f]
[10] /usr/lib/libmpi.so.0(MPI_Barrier+0x8d) [0x7f2d1c6250cd]
[11] /usr/lib/libmpi_f77.so.0(PMPI_BARRIER+0x13) [0x7f2d1cf661d3]
[12] ./a.out() [0x401003]
[13] ./a.out(main+0x34) [0x401058]
[14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f2d1c8ac30d]
[15] ./a.out() [0x400da9]
*** End of error message ***
Zoophobia answered 15/6, 2012 at 14:16

This is a tricky one, but a clue comes from the fact that the segfault occurs in an unrelated and utterly safe routine, MPI_Barrier(). That is a classic sign of stack corruption.

The underlying problem is just an argument kind mismatch (which I'd have hoped the MPI Fortran bindings would catch, but they don't). The target displacement argument to MPI_Get is an integer of kind MPI_ADDRESS_KIND, but you're passing it a default integer.
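For reference, the MPI standard's Fortran binding of MPI_GET declares only the displacement with that kind; all the other integer arguments are default integers:

MPI_GET(ORIGIN_ADDR, ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK,
        TARGET_DISP, TARGET_COUNT, TARGET_DATATYPE, WIN, IERROR)
    <type> ORIGIN_ADDR(*)
    INTEGER(KIND=MPI_ADDRESS_KIND) TARGET_DISP
    INTEGER ORIGIN_COUNT, ORIGIN_DATATYPE, TARGET_RANK, TARGET_COUNT,
            TARGET_DATATYPE, WIN, IERROR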

If you use lowerbound as your target displacement (it is zero here and already has kind MPI_ADDRESS_KIND), or explicitly promote the 0 you pass in to kind MPI_ADDRESS_KIND, your program works.
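For example, a minimal sketch of the second fix (with a variable disp introduced purely for illustration) changes the rank-1 block to:

INTEGER (KIND=MPI_ADDRESS_KIND) disp

disp = 0   ! target displacement must have kind MPI_ADDRESS_KIND

IF( rnk .EQ. 1 ) THEN
   CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, 0, 0, win, ierr )
   CALL MPI_GET( oval, 1, MPI_REAL, 0, disp, 1, MPI_REAL, win, ierr )
   CALL MPI_WIN_UNLOCK( 0, win, ierr )
END IF

Equivalently, you can write the literal with an explicit kind, 0_MPI_ADDRESS_KIND, directly in the call.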

Maulstick answered 15/6, 2012 at 14:46
Thanks for helping out. I told my coworker to post this here, but he didn't, so I decided to help him. He should know better now :-) Zoophobia
