Simple MPI_Send and Recv gives segmentation fault (11) and Invalid Permission (2) with CUDA

I am attempting to parallelise a CUDA code for lattice Boltzmann modelling with MPI, and have run into frustrating problems with the MPI_Send and MPI_Recv functions. I have verified that I have CUDA-aware MPI with some simple device-buffer-to-device-buffer MPI send/recv code, so I can send and receive arrays between GPU device memory fine without going through the CPU/host.
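
(For reference, the kind of check I mean looks roughly like the sketch below. This is not my exact test code, just a minimal device-buffer round trip that assumes a CUDA-aware Open MPI build and two ranks.)

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char *argv[])
    {
       MPI_Init(&argc, &argv);
       int rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       const int N = 1024;
       double *d_buf;                                             //Device buffer.
       cudaMalloc((void**)&d_buf, N*sizeof(double));

       if(rank==0)
       {
          double h_init[N];
          for(int i=0; i<N; i++) h_init[i] = 3.14;                //Known values on the host.
          cudaMemcpy(d_buf, h_init, N*sizeof(double), cudaMemcpyHostToDevice);
          MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   //Device pointer passed straight to MPI.
       }
       else if(rank==1)
       {
          MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          double h_check[N];
          cudaMemcpy(h_check, d_buf, N*sizeof(double), cudaMemcpyDeviceToHost);
          printf("Rank 1 received %f (expected 3.14)\n", h_check[0]);
       }

       cudaFree(d_buf);
       MPI_Finalize();
       return 0;
    }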

My code is for a 3D lattice, which is divided along the z direction among the various nodes, with halos passed between the nodes so that fluid can flow across these divisions. The halos live on the GPUs. The code below is a simplification of my main code and compiles with the same error. Here, a GPU halo on the rank 0 node is sent with MPI_Send() to the rank 1 node, which receives it with MPI_Recv(). My problem seems really simple at the moment: I cannot get the MPI_Send and MPI_Recv calls to function! The code never reaches the "//CODE DOES NOT REACH HERE." lines, leading me to conclude that the MPI_Send()/MPI_Recv() calls are not working.

My code is basically as follows, with much of the code deleted but still sufficient to compile and reproduce the same error:

#include <mpi.h>
#include <iostream>   //Needed for cout/endl.
using namespace std;

    //In declarations:
    const int DIM_X = 30;
    const int DIM_Y = 50;
    const int Q=19;
    const int NumberDevices = 1;
    const int NumberNodes = 2;

    __host__        int SendRecvID(int UpDown, int rank, int Cookie) {int a =(UpDown*NumberNodes*NumberDevices) + (rank*NumberDevices) + Cookie; return a;} //Use as downwards memTrnsfr==0, upwards==1

    int main(int argc, char *argv[])
    {
       //MPI functions (copied from online tutorial somewhere)
       int numprocessors, rank, namelen;
       char processor_name[MPI_MAX_PROCESSOR_NAME];

       MPI_Init(&argc, &argv);
       MPI_Comm_size(MPI_COMM_WORLD, &numprocessors);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Get_processor_name(processor_name, &namelen);

       /* ...code for splitting other arrays removed... */

       size_t size_Halo_z   = Q*DIM_X*DIM_Y*sizeof(double);  //Size variable used in cudaMalloc and cudaMemcpy.
       int NumDataPts_f_halo    = DIM_X*DIM_Y*Q;                 //Number of data points used in MPI_Send/Recv calls.
       MPI_Status status;                                        //Used in MPI_Recv.

       //Creating arrays for GPU data below, using arrays of pointers:
       double   *Device_HaloUp_Take[NumberDevices];              //Arrays on the GPU which will be the Halos.
       double   *Device_HaloDown_Take[NumberDevices];            //Arrays on the GPU which will be the Halos.
       double   *Device_HaloUp_Give[NumberDevices];              //Arrays on the GPU which will be the Halos.
       double   *Device_HaloDown_Give[NumberDevices];            //Arrays on the GPU which will be the Halos.

       for(int dev_i=0; dev_i<NumberDevices; dev_i++)   //Initialising the GPU arrays:
       {
          cudaSetDevice(dev_i);

          cudaMalloc( (void**)&Device_HaloUp_Take[dev_i],   size_Halo_z);
          cudaMalloc( (void**)&Device_HaloDown_Take[dev_i],     size_Halo_z);
          cudaMalloc( (void**)&Device_HaloUp_Give[dev_i],   size_Halo_z);
          cudaMalloc( (void**)&Device_HaloDown_Give[dev_i],     size_Halo_z);
       }

       int Cookie=0;             //Counter used to count the devices below.
       for(int n=1;n<=100;n++)   //Each loop iteration is one timestep.
       {    
       /* Run computation on GPUs */


          cudaThreadSynchronize();

          if(rank==0)   //Rank 0 node makes the first MPI_Send().
          {
             for(Cookie=0; Cookie<NumberDevices; Cookie++)
             {
                if(NumberDevices==1)            //For single GPU codes (which for now is what I am stuck on):
                {
                   cout << endl << "Testing X " << rank << endl;
                   MPI_Send(Device_HaloUp_Take[Cookie],     NumDataPts_f_halo,  MPI_DOUBLE, (rank+1), SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                   cout << endl << "Testing Y " << rank << endl;   //CODE DOES NOT REACH HERE.
                   MPI_Recv(Device_HaloUp_Give[Cookie], NumDataPts_f_halo,  MPI_DOUBLE, (rank+1), SendRecvID(0,rank+1,0), MPI_COMM_WORLD, &status);     
                   /*etc */
                }
             }

          }
          else if(rank==(NumberNodes-1))
          {
             for(Cookie=0; Cookie<NumberDevices; Cookie++)
             {
                if(NumberDevices==1)
                {
                   cout << endl << "Testing  A " << rank << endl;
                   MPI_Recv(Device_HaloDown_Give[Cookie],   NumDataPts_f_halo,  MPI_DOUBLE, (rank-1), SendRecvID(1,rank-1,NumberDevices-1), MPI_COMM_WORLD, &status);
                   cout << endl << "Testing  B " << rank << endl;    //CODE DOES NOT REACH HERE.
                   MPI_Send(Device_HaloUp_Take[Cookie],     NumDataPts_f_halo,  MPI_DOUBLE, 0, SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                   /*etc*/
                }
             }
          }
       }
       /* Then some code to carry out the rest of the lattice Boltzmann method. */

       MPI_Finalize();
    }

As I have 2 nodes (the NumberNodes==2 variable in the code), I have one as rank==0 and another as rank==1==NumberNodes-1. The rank 0 process enters the if(rank==0) branch, where it outputs "Testing X 0" but never outputs "Testing Y 0", because it fails beforehand on the MPI_Send() call. The variable Cookie at this point is 0, as there is only one GPU/device, so the SendRecvID() function takes (1,0,0). The first parameter of MPI_Send is a pointer, as Device_Halo_etc is an array of pointers, while the destination rank is (rank+1)=1.

Similarly, the rank 1 process enters the if(rank==NumberNodes-1) branch, where it outputs "Testing A 1" but not "Testing B 1", as the code stops before completing the MPI_Recv call. As far as I can tell the parameters of MPI_Recv are correct: (rank-1)=0 is the right source, the number of data points sent and received matches, and the tag is the same.

What I have tried so far: making sure both calls use exactly the same tag (although SendRecvID() takes (1,0,0) in each case, so they match anyway) by hard-coding 999 or so, but this made no difference. I have also changed the Device_Halo_etc parameter to &Device_Halo_etc in both MPI calls, in case I had messed up the pointers there, but again no difference. The only way I could get it to work so far is by changing the Device_Halo_etc parameters in the MPI_Send/Recv() calls to some arbitrary arrays on the host, to test whether they transfer. Doing so lets the code get past the first MPI call (and of course get stuck on the next), but even that only works when I change the send/receive count to 1 (instead of NumDataPts_f_halo==14250). And of course, moving host arrays around is of no interest.
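
(To be clear, by "moving host arrays around" I mean staging each halo through the host, roughly as sketched below for the rank 0 side, reusing the variables from the code above; h_halo_send/h_halo_recv are hypothetical host buffers, and this double copy is exactly what I am hoping to avoid.)

    //Host-staged fallback: copy the halo to a host buffer, exchange it with MPI, copy back.
    double *h_halo_send = new double[NumDataPts_f_halo];
    double *h_halo_recv = new double[NumDataPts_f_halo];

    cudaMemcpy(h_halo_send, Device_HaloUp_Take[Cookie], size_Halo_z, cudaMemcpyDeviceToHost);   //Device -> host.
    MPI_Send(h_halo_send, NumDataPts_f_halo, MPI_DOUBLE, rank+1, SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);

    MPI_Recv(h_halo_recv, NumDataPts_f_halo, MPI_DOUBLE, rank+1, SendRecvID(0,rank+1,0), MPI_COMM_WORLD, &status);
    cudaMemcpy(Device_HaloUp_Give[Cookie], h_halo_recv, size_Halo_z, cudaMemcpyHostToDevice);   //Host -> device.

    delete[] h_halo_send;
    delete[] h_halo_recv;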

I compile the code with nvcc plus additional linker flags for Open MPI (I am not too sure how these work, having copied the method from somewhere online, but given that simpler device-to-device MPI calls have worked I see no problem with this):

nvcc TestingMPI.cu -o run_Test -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl

and run it with:

mpirun -np 2 run_Test

Doing so gives me an error that typically looks like this:

Testing  A 1

Testing X 0
[Anastasia:16671] *** Process received signal ***
[Anastasia:16671] Signal: Segmentation fault (11)
[Anastasia:16671] Signal code: Invalid permissions (2)
[Anastasia:16671] Failing at address: 0x700140000
[Anastasia:16671] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f20327774a0]
[Anastasia:16671] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x147fe5) [0x7f2032888fe5]
[Anastasia:16671] [ 2] /usr/lib/libmpi.so.1(opal_convertor_pack+0x14d) [0x7f20331303bd]
[Anastasia:16671] [ 3] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x20c8) [0x7f202cad20c8]
[Anastasia:16671] [ 4] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x100f0) [0x7f202d9430f0]
[Anastasia:16671] [ 5] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x772b) [0x7f202d93a72b]
[Anastasia:16671] [ 6] /usr/lib/libmpi.so.1(MPI_Send+0x17b) [0x7f20330bc57b]
[Anastasia:16671] [ 7] run_Test() [0x400ff7]
[Anastasia:16671] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f203276276d]
[Anastasia:16671] [ 9] run_Test() [0x400ce9]
[Anastasia:16671] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 16671 on node Anastasia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I am running the code on my laptop (Anastasia), a Lenovo Y500 with dual GT650m NVIDIA graphics cards running on Linux Ubuntu 12.04LTS, if that makes a difference. nvcc --version gives "release 5.0, V0.2.1221", and mpirun --version gives "mpirun (Open MPI) 1.5.4".

Clouet asked 6/8, 2013 at 1:30. Comments (10):
That's a lot of code to scan. Run your program through valgrind first. – Deangelis
I don't see how you can expect an answer to this question. Without a short, complete example which reproduces the problem, it is basically impossible to say what is going wrong. Just based on the output, it appears that the failure is occurring in code you haven't shown. Vote to close as off topic because no SCCE provided. – Fenderson
@talonmies: The above code should be a complete example; it compiles for me using the nvcc/mpirun commands, but of course it isn't short! Sorry for that; I have tried to shorten it in an edit, and there is no code missing. By the way, what is SCCE? I tried to google it but figured asking is easier. – Clouet
@Anycorn: I have run it through valgrind, and only the command "valgrind --tool=helgrind -v mpirun -np 2 run_Test" seems to give errors: "ERROR SUMMARY: 20 errors from 10 contexts (suppressed: 0 from 0)", with the line "pthread_cond_destroy: destruction of unknown cond var" being common. If you are familiar with valgrind, any help is much appreciated! I have no idea how to read the valgrind outputs... – Clouet
helgrind is for threaded code (race conditions, etc.). Run valgrind as is. Run in serial first, without MPI. Get rid of all errors valgrind may report. Then start debugging your parallel runs. – Deangelis
And you do realize you are giving MPI_Send a buffer that is allocated on the CUDA device? That memory is not mapped into the host address space. – Deangelis
I'm not sure I can run the above code in serial, as it would freeze on the MPI_Send() call, given that in serial it would never reach an MPI_Recv(). Taking out the MPI_Send/Recv() calls allows it to run in serial and in parallel, with no errors from valgrind. But regardless, even before that, valgrind returns no errors for me when I run "valgrind mpirun -np 2 run_Test" on the above code. Also, the buffers should be on the GPU, and I have verified that I can access them using the CUDA-aware MPI version I have and CUDA 4.0's UVA. I will double check this though, to really make sure! – Clouet
@SonkeHee I see. Here are things to try: don't rely on MPI magic if you have issues like that; that's another layer where things can fail. Exchange a packet of 0 bytes: does that work? If so, exchange a buffer of one double in host memory. Still works? Replace the host buffer with a CUDA buffer, and so on. – Deangelis
@SonkeHee With regard to valgrind, run it as mpirun valgrind ./binary. You may also need to build the valgrind MPI wrappers; see the valgrind documentation. It is very hard to track an error by just reading a code sample this large, and since you rely on CUDA-aware MPI it is nearly impossible. If you are still stuck, hit me up on gchat, perhaps I can help further. Good luck. – Deangelis
@Anycorn: Thanks for the help! You were correct about the CUDA-aware MPI not being trustworthy: I found that my previous check of whether I could MPI_Send/Recv() from GPU buffers was insufficient, and that in truth I was not able to access them, hence the "invalid permissions" error! Oddly, the reason I thought it had passed the test the first time around was a mistake with pointers. Really, thank you for the vigilance and ideas! I have started on my workaround now... not too successfully yet, but hopefully it will work soon! MPI is tough... – Clouet

Answer:

Thanks to Anycorn for the assistance with the code!

If it interests anyone with a similar problem, my error turned out to be in determining whether I was able to access CUDA memory through MPI calls. I was not able to MPI_Send/Recv() GPU memory, hence the "invalid permissions" errors. If you have a similar problem, I suggest you test a simple code that sends device memory around using the MPI_Send/Recv() functions, as Anycorn suggested in the comments under the question above.
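
(As an aside, Open MPI releases much newer than the 1.5.4 I was using expose a compile-time and run-time query for CUDA-aware support through mpi-ext.h; a sketch assuming such a release is below.)

    #include <mpi.h>
    #include <mpi-ext.h>   //Open MPI extensions; only present in newer releases.
    #include <cstdio>

    int main(int argc, char *argv[])
    {
       MPI_Init(&argc, &argv);

    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
       printf("Compile time: this Open MPI build advertises CUDA-aware support.\n");
    #elif defined(MPIX_CUDA_AWARE_SUPPORT)
       printf("Compile time: this Open MPI build does NOT have CUDA-aware support.\n");
    #else
       printf("Compile time: CUDA-aware support cannot be determined.\n");
    #endif

    #if defined(MPIX_CUDA_AWARE_SUPPORT)
       printf("Run time: %s\n", MPIX_Query_cuda_support() ? "CUDA-aware" : "not CUDA-aware");
    #endif

       MPI_Finalize();
       return 0;
    }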

Keep an eye out for accidentally sending a pointer to the pointer-to-device-memory instead of the device pointer itself (MPI_Send/Recv() take a pointer as their first argument). I had sent that pointer-to-a-pointer between nodes, and as the pointer variable lives in host/CPU memory, the calls worked fine. The result was that node 1 handed node 0 a pointer to a pointer; when I output the data I thought I had collected from node 1, I was actually reading the data pointed to on node 0 by the newly received pointer, which happened to be the same array I had initialised on both nodes through sloppy coding (an "if(rank==1) initialise array" line would have saved me there). Hence I saw the expected output and thought all was well.
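
(In code terms, the mistake looked roughly like this; simplified, with a hypothetical d_halo pointer.)

    double *d_halo;                                   //The pointer variable itself lives in host memory.
    cudaMalloc((void**)&d_halo, NumDataPts_f_halo*sizeof(double));

    //What I accidentally did: &d_halo is the address of the pointer variable (host memory),
    //so MPI reads host bytes, the call does not fault, but no device data is transferred.
    //MPI_Send(&d_halo, 1, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD);

    //What I meant: pass the device pointer itself, so MPI reads the halo data on the GPU,
    //which is only possible with a genuinely CUDA-aware MPI library.
    MPI_Send(d_halo, NumDataPts_f_halo, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD);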

Thanks again Anycorn!

Clouet answered 8/8, 2013 at 18:48
