I am attempting to parallelise a CUDA lattice Boltzmann code with MPI, and have run into frustrating problems with the MPI_Send and MPI_Recv functions. I have verified that I have CUDA-aware MPI with some simple device-buffer-to-device-buffer MPI send/recv code, so I can send and receive arrays between GPU device memory without going through the CPU/host.
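For reference, the device-to-device check I used was essentially of the following form (this is a stripped-down sketch rather than the exact test code, and the buffer size is arbitrary); run with two processes, it sends a device buffer on rank 0 straight into a device buffer on rank 1:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 100;                              //Arbitrary test size.
    double *d_buf;
    cudaMalloc((void**)&d_buf, N*sizeof(double));   //Buffer lives in GPU device memory.

    if(rank==0)
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);                    //Device pointer passed directly to MPI.
    else if(rank==1)
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); //Received straight into device memory.

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}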
My code is for a 3D lattice, divided along the z direction among the various nodes, with halos passed between the nodes so that fluid can flow across these divisions. The halos live on the GPUs. The code below is a simplification of my main code and compiles with the same error. Here, a GPU halo on the rank 0 node is sent with MPI_Send() to the rank 1 node, which receives it with MPI_Recv(). My problem seems really simple at the moment: I cannot get the MPI_Send and MPI_Recv calls to function! The code does not progress to the "//CODE DOES NOT REACH HERE." lines, leading me to conclude that the MPI_Send()/MPI_Recv() calls themselves are failing.
My code is basically as follows, with much of the code deleted but still sufficient to compile and reproduce the same error:
#include <mpi.h>
#include <iostream>       //Needed for the cout/endl calls below.
#include <cuda_runtime.h> //Needed for cudaMalloc/cudaSetDevice (included automatically when nvcc compiles a .cu file).
using namespace std;
//In declarations:
const int DIM_X = 30;
const int DIM_Y = 50;
const int Q=19;
const int NumberDevices = 1;
const int NumberNodes = 2;
__host__ int SendRecvID(int UpDown, int rank, int Cookie) {int a =(UpDown*NumberNodes*NumberDevices) + (rank*NumberDevices) + Cookie; return a;} //Use as downwards memTrnsfr==0, upwards==1
int main(int argc, char *argv[])
{
    //MPI functions (copied from online tutorial somewhere)
    int numprocessors, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocessors);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* ...code for splitting other arrays removed... */

    size_t size_Halo_z = Q*DIM_X*DIM_Y*sizeof(double); //Size variable used in cudaMalloc and cudaMemcpy.
    int NumDataPts_f_halo = DIM_X*DIM_Y*Q;             //Number of data points used in MPI_Send/Recv calls.
    MPI_Status status;                                 //Used in MPI_Recv.

    //Creating arrays for GPU data below, using arrays of pointers:
    double *Device_HaloUp_Take[NumberDevices];   //Arrays on the GPU which will be the Halos.
    double *Device_HaloDown_Take[NumberDevices]; //Arrays on the GPU which will be the Halos.
    double *Device_HaloUp_Give[NumberDevices];   //Arrays on the GPU which will be the Halos.
    double *Device_HaloDown_Give[NumberDevices]; //Arrays on the GPU which will be the Halos.

    for(int dev_i=0; dev_i<NumberDevices; dev_i++) //Initialising the GPU arrays:
    {
        cudaSetDevice(dev_i);
        cudaMalloc( (void**)&Device_HaloUp_Take[dev_i], size_Halo_z);
        cudaMalloc( (void**)&Device_HaloDown_Take[dev_i], size_Halo_z);
        cudaMalloc( (void**)&Device_HaloUp_Give[dev_i], size_Halo_z);
        cudaMalloc( (void**)&Device_HaloDown_Give[dev_i], size_Halo_z);
    }

    int Cookie=0; //Counter used to count the devices below.
    for(int n=1;n<=100;n++) //Each loop iteration is one timestep.
    {
        /* Run computation on GPUs */
        cudaThreadSynchronize();

        if(rank==0) //Rank 0 node makes the first MPI_Send().
        {
            for(Cookie=0; Cookie<NumberDevices; Cookie++)
            {
                if(NumberDevices==1) //For single GPU codes (which for now is what I am stuck on):
                {
                    cout << endl << "Testing X " << rank << endl;
                    MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                    cout << endl << "Testing Y " << rank << endl; //CODE DOES NOT REACH HERE.
                    MPI_Recv(Device_HaloUp_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(0,rank+1,0), MPI_COMM_WORLD, &status);
                    /*etc */
                }
            }
        }
        else if(rank==(NumberNodes-1))
        {
            for(Cookie=0; Cookie<NumberDevices; Cookie++)
            {
                if(NumberDevices==1)
                {
                    cout << endl << "Testing A " << rank << endl;
                    MPI_Recv(Device_HaloDown_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank-1), SendRecvID(1,rank-1,NumberDevices-1), MPI_COMM_WORLD, &status);
                    cout << endl << "Testing B " << rank << endl; //CODE DOES NOT REACH HERE.
                    MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, 0, SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                    /*etc*/
                }
            }
        }
    }

    /* Then some code to carry out rest of lattice boltzmann method. */

    MPI_Finalize();
}
As I have 2 nodes (NumberNodes==2 in the code), one is rank==0 and the other is rank==1==NumberNodes-1. The rank 0 process enters the if(rank==0) branch, where it outputs "Testing X 0" but never outputs "Testing Y 0", because it crashes beforehand in the MPI_Send() call. The variable Cookie at this point is 0, as there is only one GPU/device, so SendRecvID() is called with (1,0,0). The first parameter of MPI_Send is a pointer, since Device_Halo_etc is an array of pointers, and the destination rank is (rank+1)=1.
Similarly, the rank 1 process enters the else if(rank==NumberNodes-1) branch, where it outputs "Testing A 1" but not "Testing B 1", as the code stops before completing the MPI_Recv call. As far as I can tell the parameters of MPI_Recv are correct: the source rank (rank-1)=0 is right, the number of data points sent and received matches, and the tag is the same.
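To spell out the tag arithmetic with NumberNodes==2 and NumberDevices==1: the sender's tag is SendRecvID(1,0,0) = (1*2*1) + (0*1) + 0 = 2, and the receiver's MPI_Recv uses SendRecvID(1,rank-1,NumberDevices-1) = SendRecvID(1,0,0) = 2, so the two tags do match.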
What I have tried so far: I made sure both calls use the exact same tag (although SendRecvID() evaluates to the same value on both sides anyway) by hard-coding a tag such as 999, but this made no difference. I have also changed the Device_Halo_etc parameter to &Device_Halo_etc in both MPI calls, in case I had messed up the pointers there, but again no difference. The only way I could get it to work so far was by replacing the Device_Halo_etc parameters in the MPI_Send/Recv() calls with some arbitrary arrays on the host to test whether they transfer. Doing so allows it to get past the first MPI call (and of course get stuck on the next), but even that only works when I change the count to send/receive to 1 (instead of NumDataPts_f_halo==14250). And of course, moving host arrays around is of no interest.
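For comparison, this is roughly what the host-staged exchange I am trying to avoid would look like (the helper and buffer names here are illustrative, not from my real code, and it assumes <cstdlib> and <cuda_runtime.h> are included; MPI_Sendrecv is used just to keep the sketch short):

//Sketch of a host-staged halo exchange: device -> host, MPI over host memory, host -> device.
void ExchangeHaloViaHost(double *d_send, double *d_recv, int count,
                         int peer, int sendtag, int recvtag, MPI_Comm comm)
{
    size_t bytes = count*sizeof(double);
    double *h_send = (double*)malloc(bytes);
    double *h_recv = (double*)malloc(bytes);
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);   //Stage the outgoing halo on the host.
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, peer, sendtag,
                 h_recv, count, MPI_DOUBLE, peer, recvtag,
                 comm, MPI_STATUS_IGNORE);                       //Ordinary host-memory MPI exchange.
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);   //Copy the received halo back to the GPU.
    free(h_send);
    free(h_recv);
}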
I compile the code with nvcc and some additional linking flags (I am not too sure how these work, having copied the method from somewhere online, but given that simpler device-to-device MPI calls have worked I see no problem with them), using:
nvcc TestingMPI.cu -o run_Test -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl
and run it with:
mpirun -np 2 run_Test
Doing so gives me an error that typically looks like this:
Testing A 1
Testing X 0
[Anastasia:16671] *** Process received signal ***
[Anastasia:16671] Signal: Segmentation fault (11)
[Anastasia:16671] Signal code: Invalid permissions (2)
[Anastasia:16671] Failing at address: 0x700140000
[Anastasia:16671] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f20327774a0]
[Anastasia:16671] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x147fe5) [0x7f2032888fe5]
[Anastasia:16671] [ 2] /usr/lib/libmpi.so.1(opal_convertor_pack+0x14d) [0x7f20331303bd]
[Anastasia:16671] [ 3] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x20c8) [0x7f202cad20c8]
[Anastasia:16671] [ 4] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x100f0) [0x7f202d9430f0]
[Anastasia:16671] [ 5] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x772b) [0x7f202d93a72b]
[Anastasia:16671] [ 6] /usr/lib/libmpi.so.1(MPI_Send+0x17b) [0x7f20330bc57b]
[Anastasia:16671] [ 7] run_Test() [0x400ff7]
[Anastasia:16671] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f203276276d]
[Anastasia:16671] [ 9] run_Test() [0x400ce9]
[Anastasia:16671] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 16671 on node Anastasia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I am running the code on my laptop (Anastasia), a Lenovo Y500 with dual NVIDIA GT 650M graphics cards, on Linux Ubuntu 12.04 LTS, if that makes a difference. nvcc --version gives "release 5.0, V0.2.1221", and mpirun --version gives "mpirun (Open MPI) 1.5.4".
Try running it under valgrind (mpirun valgrind ./binary); you may also need to build the valgrind wrappers for MPI, see the valgrind documentation. It is very hard to track down an error in a code sample this large just by reading it, and given that you rely on CUDA-aware MPI, nearly impossible. If you are still stuck, hit me up on gchat, perhaps I could help further. Good luck. – Deangelis