MPI Send and Recv Hangs with Buffer Size Larger Than 64kb

I am trying to send data from process 0 to process 1. The program succeeds when the buffer is smaller than 64 KB, but hangs once the buffer gets larger. The following code should reproduce the hang as written, and should succeed if n is reduced to 8000 or less.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]){
  int world_size, world_rank,
      count;
  MPI_Status status;

  MPI_Init(&argc, &argv);

  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  if(world_size < 2){
    printf("Please add another process\n");
    MPI_Finalize();
    exit(1);
  }

  int n = 8200;  /* 8200 doubles = 65600 bytes, just over 64 KiB */
  double *d = malloc(sizeof(double)*n);  /* receive buffer */
  double *c = malloc(sizeof(double)*n);  /* send buffer; its contents do not matter here */
  printf("malloc results %p %p\n", (void *)d, (void *)c);

  if(world_rank == 0){
    /* rank 0 sends n doubles to rank 1 with tag 0 */
    printf("sending\n");
    MPI_Send(c, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    printf("sent\n");
  }
  if(world_rank == 1){
    /* rank 1 receives the matching message from rank 0 */
    printf("recv\n");
    MPI_Recv(d, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Get_count(&status, MPI_DOUBLE, &count);
    printf("recved, count:%d source:%d tag:%d error:%d\n", count, status.MPI_SOURCE, status.MPI_TAG, status.MPI_ERROR);
  }

  free(d);
  free(c);
  MPI_Finalize();
  return 0;
}

Output with n = 8200:
malloc results 0x1cb05f0 0x1cc0640
recv
malloc results 0x117d5f0 0x118d640
sending

Output with n = 8000:
malloc results 0x183c5f0 0x184c000
recv
malloc results 0x1ea75f0 0x1eb7000
sending
sent
recved, count:8000 source:0 tag:0 error:0
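
The two values of n sit on either side of the 64 KB mark from the title. The sketch below spells out that arithmetic, assuming 8-byte doubles and taking 64 KB to mean 65536 bytes; the helper names `message_bytes` and `exceeds_limit` are hypothetical and not part of MPI. Many MPI implementations change how they transfer a message once it passes some size threshold, so a misconfigured runtime may appear to work for small messages and only fail for larger ones, but the exact threshold here is an assumption based on the observed behaviour.

```c
#include <stddef.h>

/* Size in bytes of a message that carries n doubles. */
size_t message_bytes(size_t n) {
    return n * sizeof(double);
}

/* Returns 1 if a message of n doubles is larger than limit_bytes, else 0.
 * limit_bytes is whatever threshold you want to compare against; the value
 * 65536 used in the note below is an assumption taken from the question. */
int exceeds_limit(size_t n, size_t limit_bytes) {
    return message_bytes(n) > limit_bytes;
}
```

With 8-byte doubles, `message_bytes(8000)` is 64000 and `message_bytes(8200)` is 65600, so only the n = 8200 case crosses 65536 bytes.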

I found two similar questions, but I believe the issue in both was a deadlock. I would not expect that here, because each process performs only a single send or a single receive.

EDIT: Added status checking.

EDIT2: It seems the issue was that I have OpenMPI installed but also installed an implementation of MPI from Intel when I installed MKL. My code was being compiled with the OpenMPI header and libraries, but run with Intel's mpirun. All works as expected when I ensure I run with the mpirun executable from OpenMPI.

Mcdougald answered 3/4, 2016 at 17:21 Comment(5)
The code looks fine and in fact runs just fine on my OpenMPI installation. Please provide more information about your installation. Were you able to run any sufficiently complex MPI code with that installation? Please also say where exactly it hangs. The output, and an attempt to debug the hanging processes, would help. - Samphire
I agree with @Zulan, but I would also ask Ruvu to check the status. - Chandelle
Also: do check the return values of malloc! - Samphire
@Samphire I added status checking and program output in an edit. My current project is an implementation of the SUMMA algorithm for matrix multiplication, which works correctly until I need to send large sub-matrices. I am then unable to complete the receive calls. I'll try to collect more information about my installation. - Mcdougald
I am using OpenMPI 1.10.2-1 x86_64, specifically archlinux.org/packages/extra/x86_64/openmpi - Mcdougald

The issue was with having both Intel's MPI and OpenMPI installed. I saw that /usr/include/mpi.h was owned by OpenMPI, but mpicc and mpirun were from Intel's implementation:

$ which mpicc
/opt/intel/composerxe/linux/mpi/intel64/bin/mpicc
$ which mpirun
/opt/intel/composerxe/linux/mpi/intel64/bin/mpirun

I was able to solve the issue by running

/usr/bin/mpicc

and

/usr/bin/mpirun

to ensure I used OpenMPI.

Thanks to @Zulan and @gsamaras for the suggestion to check my installation.

Mcdougald answered 3/4, 2016 at 18:21 Comment(2)
You are welcome, Ruvu! Good thing you tried hard to solve this too, +2. - Chandelle
I strongly recommend you look into Environment Modules. It lets you keep multiple versions and installations of the same software and choose which one to use later, without worrying about whether you have set PATH properly. - Whisper

The code is fine! I just checked with version 3.1.3 (mpiexec --version):

linux16:/home/users/grad1459>mpicc -std=c99 -O1 -o px px.c -lm
linux16:/home/users/grad1459>mpiexec -n 2 ./px
malloc results 0x92572e8 0x9267330
sending
sent
malloc results 0x9dc92e8 0x9dd9330
recv
recved, count:8200 source:0 tag:0 error:1839744

So the problem lies with your installation. Work through the following troubleshooting steps:

  1. Check the result of malloc*
  2. Check status

I would bet that the return value of malloc() is NULL, since you mention that it fails when you request more memory. The system may be refusing to provide that much memory.


I was partly correct: the problem did lie with the installation, but, as the OP said:

It seems the issue was that I have OpenMPI installed but also installed an implementation of MPI from Intel when I installed MKL. My code was being compiled with the OpenMPI header and libraries, but run with Intel's mpirun. All works as expected when I ensure I run with the mpirun executable from OpenMPI.

*checking that `malloc` succeeded in C
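
A minimal sketch of such a check, for illustration only: the helper name `alloc_doubles` is hypothetical, and reporting the failure on stderr is just one reasonable choice.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocates room for n doubles.
 * Returns NULL (after printing a message to stderr) if the byte count
 * would overflow size_t or if malloc itself fails; the caller must
 * test the result before using it and free() it afterwards. */
double *alloc_doubles(size_t n) {
    /* Guard against n * sizeof(double) wrapping around. */
    if (n > (size_t)-1 / sizeof(double)) {
        fprintf(stderr, "alloc_doubles: %zu elements is too many\n", n);
        return NULL;
    }

    double *p = malloc(n * sizeof *p);
    if (p == NULL) {
        fprintf(stderr, "alloc_doubles: malloc failed for %zu doubles\n", n);
    }
    return p;
}
```

In the program from the question, the two `malloc` calls could be replaced by `alloc_doubles(n)`, with the program bailing out if either result is NULL before any buffer is handed to `MPI_Send` or `MPI_Recv`.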

Chandelle answered 3/4, 2016 at 17:48 Comment(0)
