I'm writing a multi-threaded OpenMPI application, using MPI_Isend and MPI_Irecv from several threads to exchange hundreds of messages per second between ranks over InfiniBand RDMA.
Transfers are on the order of 400-800 KB each, generating about 9 Gbps in and out per rank, well within the capacity of FDR InfiniBand. Simple MPI benchmarks also show good performance.
A dedicated thread checks for completion by polling all active transfers with MPI_Testsome.
The transfer rates I achieve depend on the message rate but, more importantly, also on how often I call MPI_Testsome. That is, if I poll, say, every 10 ms, the requests finish later than if I poll every 1 ms.
I'd expect that if I poll every 10 ms instead of every 1 ms, I'd at most be informed of finished requests 9 ms later. I would not expect fewer calls to MPI_Testsome to delay the completion of the transfers themselves and thus reduce the total transfer rate. I'd expect MPI_Testsome to be entirely passive.
Anyone here have a clue why this behaviour could occur?