C++: Weird pointer corruption error

I have the following situation: I have written some short MPI test codes in order to find out which combination of send and receive operations works best in my code.

The code works perfectly fine on my own computer (tested with 8 processes), but as soon as I run it on the cluster I'm working on, I get a huge error output about a corrupted or doubly freed pointer. This is the output: http://pastebin.com/pXTRSf89

What I am doing in my code is the following: I call my communication function 100K times and measure the time. The function is shown below. What I found out is that the error always happens in the same iteration (somewhere around 6K), although the reported processor ID changes. The iteration is the same even if I use 64 processes instead of 8. The problem is: I have absolutely no idea what could be wrong, especially since no pointers are freed or assigned.

void communicateGrid(int level, real* grid, const Subdomain& subdomain, std::vector<TimeMap>& tm_) {
    tm_[level]["CommGrid"].start();

    MPI_Status status[2];
    MPI_Request request[2];

    // x 
    MPI_Isend(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0] - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 0, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndexInner(level, 1, 1, 1)], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 1, MPI_COMM_WORLD, &request[1]);

    MPI_Recv(&grid[getIndexInner(level, 1,1,1) + innerGridpoints_[level][0]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 1, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(&grid[getIndexInner(level, 1,1,1) - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 0, MPI_COMM_WORLD, &status[1]);

    // y
    MPI_Isend(&grid[getIndex(level, 0, innerGridpoints_[level][1], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.upperNeighbors[1], 2, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndex(level, 0, numOuterGridpoints_[level], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.lowerNeighbors[1], 3, MPI_COMM_WORLD, &request[1]);

    MPI_Recv(&grid[getIndex(level, 0, innerGridpoints_[level][1] + numOuterGridpoints_[level], 0)], 1, mpiTypes_[level * 4 + 2], subdomain.upperNeighbors[1], 3, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(grid, 1, mpiTypes_[level * 4 + 2], subdomain.lowerNeighbors[1], 2, MPI_COMM_WORLD, &status[1]);

    // z
    MPI_Isend(&grid[getIndex(level, 0, 0, innerGridpoints_[level][2])], 1, mpiTypes_[level * 4 + 3], subdomain.upperNeighbors[2], 4, MPI_COMM_WORLD, &request[0]);
    MPI_Isend(&grid[getIndex(level, 0, 0, numOuterGridpoints_[level])], 1, mpiTypes_[level * 4 + 3], subdomain.lowerNeighbors[2], 5, MPI_COMM_WORLD, &request[1]);

    MPI_Recv(&grid[getIndex(level, 0, 0, numOuterGridpoints_[level] + innerGridpoints_[level][2])], 1, mpiTypes_[level * 4 + 3], subdomain.upperNeighbors[2], 5, MPI_COMM_WORLD, &status[0]);
    MPI_Recv(grid, 1, mpiTypes_[level * 4 + 3], subdomain.lowerNeighbors[2], 4, MPI_COMM_WORLD, &status[1]);

    tm_[level]["CommGrid"].stop();
}

mpiTypes_ is a global variable of type MPI_Datatype*; innerGridpoints_ and numOuterGridpoints_ are global as well (I know that this is not good coding style, but I used it only for timing). I'm pretty sure that my datatypes are correct, as they work in another setup of communication functions (e.g. Irecv followed by Send).
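
The timing driver itself is not shown here; as described above, it just calls communicateGrid 100K times and measures the time. A minimal sketch of such a loop, assuming hypothetical names (nIter, tStart, elapsed) since only communicateGrid and its arguments appear in the original:

// Hypothetical benchmark driver: call the communication routine repeatedly
// and time it with MPI_Wtime. level, grid, subdomain and tm_ are assumed to
// be set up elsewhere exactly as in the function above.
const int nIter = 100000;
double tStart = MPI_Wtime();
for (int i = 0; i < nIter; ++i) {
    communicateGrid(level, grid, subdomain, tm_);
}
double elapsed = MPI_Wtime() - tStart;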

Final note: I just tried to run this with just one process. Then the following error occurred:

Rank 0 [Mon Apr 22 02:11:23 2013] [c0-0c1s3n0] Fatal error in PMPI_Isend: Internal MPI error!, error stack: PMPI_Isend(148): MPI_Isend(buf=0x2aaaab7b531c, count=1, dtype=USER, dest=0, tag=1, MPI_COMM_WORLD, request=0x7fffffffb4d4) failed (unknown)(): Internal MPI error! _pmiu_daemon(SIGCHLD): [NID 00070] [c0-0c1s3n0] [Mon Apr 22 02:11:23 2013] PE RANK 0 exit signal Aborted

Again, this happened only on the cluster, but worked on my machine.

I would be happy about anything I could check or any hints as to where the error might be! Thanks

Imbecilic answered 22/4, 2013 at 0:26 Comment(5)
What brand of CPU does it work on and what brand of CPU does it fail on? – Morgan
I wonder whether your pointers ever end up pointing to blocks of data that they shouldn't. When you do things like &grid[getIndex(level, 0, 0, numOuterGridpoints_[level] + innerGridpoints_[level][2])], is there any chance that that points to a block that isn't yours? Or could there be a memory leak in the function you are calling... Is it the exact same version of compiler / libraries on both machines? – Zayas
getIndex and getIndexInner are just inlined index functions; as grid is a 3D array, they should be fine. My PC is an Intel i5, compiled with gcc 4.6.0 (Mac system); the cluster is a Cray machine with Opteron CPUs. I tried both the standard PGI compiler and gcc (version 4.6.3). – Imbecilic
What happens if you run this inside gdb? Maybe you will get more helpful information about the state of things than with the stack trace. As a desperate debug trick, I would explicitly compute the value of all the parameters (pointers etc.) that you pass to the functions, assign them to a separate set of variables, and print them out for each iteration (to a file). Then call the function with the "short" expressions. Inspecting these values may give you a clue. Could threads somehow be clashing? At what level is the multi-threading happening? Any shared resources not getting protected? – Zayas
Well, that is the big problem: I don't know how to run this in gdb on the cluster. The Cray docs do have some information about it, but that doesn't seem to work on the machine I'm working on, and the people I could ask are gone until Wednesday. The weird thing is: if there were an error with the indices, it should crash right in the first iteration, shouldn't it? They do not change over time. However, it only crashes in the 6000th or so iteration. Anyway, thanks for your ideas so far. – Imbecilic

You have to Wait or Test (or similar) on the MPI requests created by those MPI_Isend() calls, or you will leak internal resources and eventually crash, which is what's happening.

Jeff Squyres puts it very well in his blog post at Cisco.

You know that those Isends are completing, but the MPI library has no way of knowing this and cleaning up the resources allocated and pointed to by those MPI_Requests. How many and what kind of resources are required depends on a lot of things, including the underlying network connection (they can take up scarce InfiniBand resources, for instance), so it's not necessarily surprising that it worked on your own machine but not on the cluster.

You can fix this by adding

MPI_Waitall(2, request, status);

after each stage of MPI_Isend/MPI_Recv()s.

This is not just necessary for cleaning up resources; it's actually required for the correctness of a program with nonblocking requests.
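
For illustration, here is roughly how the x-direction stage from the question would look with the wait added (a sketch only; the indices, datatypes, and neighbor arrays are copied from the question unchanged, and MPI_STATUSES_IGNORE could be passed instead of reusing the status array):

// x direction: post both nonblocking sends, do the blocking receives,
// then complete the two send requests before request[] is reused in the next stage.
MPI_Isend(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0] - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 0, MPI_COMM_WORLD, &request[0]);
MPI_Isend(&grid[getIndexInner(level, 1, 1, 1)], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 1, MPI_COMM_WORLD, &request[1]);

MPI_Recv(&grid[getIndexInner(level, 1, 1, 1) + innerGridpoints_[level][0]], 1, mpiTypes_[level * 4 + 1], subdomain.upperNeighbors[0], 1, MPI_COMM_WORLD, &status[0]);
MPI_Recv(&grid[getIndexInner(level, 1, 1, 1) - numOuterGridpoints_[level]], 1, mpiTypes_[level * 4 + 1], subdomain.lowerNeighbors[0], 0, MPI_COMM_WORLD, &status[1]);

MPI_Waitall(2, request, status);  // lets MPI release the Isend requests; overwrites the Recv statuses, which are never read here

The same MPI_Waitall call then goes after the y- and z-direction stages as well.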

Budwig answered 22/4, 2013 at 12:20 Comment(1)
Indeed, now it works, thanks. I never knew that this was necessary; I always thought you could avoid the Waits if you knew that everything gets received correctly. – Imbecilic