I have the following situation: I have written some short MPI test codes in order to find out which combination of send and receive operations works best in my code.
The code works perfectly fine on my own computer (tested with 8 processes), but as soon as I run it on the cluster I'm working on, I get a huge error output about a corrupted or doubly freed pointer. This is the output: http://pastebin.com/pXTRSf89
What my code does is the following: I call my communication function 100K times and measure the time. This function is shown below. What I found out is that the error always happens in the same iteration (somewhere around 6K). The reported processor ID does change, however. The iteration is the same even if I use 64 processes instead of 8. The problem is: I have absolutely no idea what could be wrong, especially since no pointers are freed or assigned.
void communicateGrid(int level, real* grid, const Subdomain& subdomain, std::vector<TimeMap>& tm_) {
    tm_[level]["CommGrid"].start();

    MPI_Status status[2];
    MPI_Request request[2];

    // x
    MPI_Isend(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0] - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 0, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndexInner(level, 1, 1, 1)], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 1, MPI_COMM_WORLD, &request[1]);
    MPI_Recv(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 1, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(&grid[getIndexInner(level, 1, 1, 1) - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 0, MPI_COMM_WORLD, &status[1]);

    // y
    MPI_Isend(&grid[getIndex(level, 0, innerGridpoints_[level][1], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.upperNeighbors[1], 2, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndex(level, 0, numOuterGridpoints_[level], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.lowerNeighbors[1], 3, MPI_COMM_WORLD, &request[1]);
    MPI_Recv(&grid[getIndex(level, 0, innerGridpoints_[level][1] + numOuterGridpoints_[level], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.upperNeighbors[1], 3, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(grid, 1, mpiTypes_[level * 4 + 2], subdomain.lowerNeighbors[1], 2, MPI_COMM_WORLD, &status[1]);

    // z
    MPI_Isend(&grid[getIndex(level, 0, 0, innerGridpoints_[level][2])], 1, mpiTypes_[level * 4 + 3], subdomain.upperNeighbors[2], 4, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndex(level, 0, 0, numOuterGridpoints_[level])], 1, mpiTypes_[level * 4 + 3], subdomain.lowerNeighbors[2], 5, MPI_COMM_WORLD, &request[1]);
    MPI_Recv(&grid[getIndex(level, 0, 0, numOuterGridpoints_[level] + innerGridpoints_[level][2])], 1, mpiTypes_[level * 4 + 3], subdomain.upperNeighbors[2], 5, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(grid, 1, mpiTypes_[level * 4 + 3], subdomain.lowerNeighbors[2], 4, MPI_COMM_WORLD, &status[1]);

    tm_[level]["CommGrid"].stop();
}
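For reference, this function is driven by a plain timing loop along these lines (a minimal sketch; the 100K call count comes from the description above, while the loop variable and the setup of level, grid, subdomain and tm_ are assumed to exist elsewhere):

// hypothetical driver loop: call the communication routine 100K times
const int numIterations = 100000;
for (int it = 0; it < numIterations; ++it) {
    communicateGrid(level, grid, subdomain, tm_);  // the crash occurs around iteration 6K on the cluster
}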
mpiTypes_ is a global variable of type MPI_Datatype*; innerGridpoints_ and numOuterGridpoints_ are global as well (I know this is not good coding style, but I did it only for the timing tests). I'm pretty sure that my datatypes are correct, as they work in another setup of communication functions (e.g. Irecv followed by Send).
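For comparison, a minimal sketch of the variant that does work, for the x direction only: the receives are posted non-blocking first, then blocking sends follow, and the receive requests are completed before the arrays are reused. The indices, tags, and members are taken from the function above; the request/status array names are mine.

// working variant (x direction): non-blocking receives first, then blocking sends,
// then complete the receive requests before reusing them
MPI_Request recvRequest[2];
MPI_Status recvStatus[2];
MPI_Irecv(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 1, MPI_COMM_WORLD, &recvRequest[0]);
MPI_Irecv(&grid[getIndexInner(level, 1, 1, 1) - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 0, MPI_COMM_WORLD, &recvRequest[1]);
MPI_Send(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0] - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 0, MPI_COMM_WORLD);
MPI_Send(&grid[getIndexInner(level, 1, 1, 1)], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 1, MPI_COMM_WORLD);
MPI_Waitall(2, recvRequest, recvStatus);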
Final note: I just tried to run this with just one process. Then the following error occurred:
Rank 0 [Mon Apr 22 02:11:23 2013] [c0-0c1s3n0] Fatal error in PMPI_Isend: Internal MPI error!, error stack:
PMPI_Isend(148): MPI_Isend(buf=0x2aaaab7b531c, count=1, dtype=USER, dest=0, tag=1, MPI_COMM_WORLD, request=0x7fffffffb4d4) failed
(unknown)(): Internal MPI error!
_pmiu_daemon(SIGCHLD): [NID 00070] [c0-0c1s3n0] [Mon Apr 22 02:11:23 2013] PE RANK 0 exit signal Aborted
Again, this happened only on the cluster; on my own machine it worked. I would be grateful for any hints on what I could check or where the error might be. Thanks!
Regarding &grid[getIndex(level, 0, 0, numOuterGridpoints_[level] + innerGridpoints_[level][2])]: is there any chance that it points to a block that isn't yours? Or could there be a memory leak in a function you are calling? Is it the exact same version of the compiler / libraries on both machines? – Zayas