Consider the following program:
#include <iostream>
#include <mpi.h>

int main() {
  int provided = -1;
  MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
  if (provided != MPI_THREAD_MULTIPLE) {
    return -1;
  }

  int this_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &this_rank);

  double aze[36864]{};
  MPI_Request req = MPI_REQUEST_NULL;
  std::cout << this_rank << " starting bcast" << std::endl;
  MPI_Ibcast(aze, 36864, MPI_DOUBLE, 1, MPI_COMM_WORLD, &req);
  std::cout << this_rank << " req0 " << req << std::endl;

#pragma omp parallel
  {
    MPI_Status stat{};
    // do {
    MPI_Wait(&req, &stat);
    // } while (req != MPI_REQUEST_NULL);
    if (req != MPI_REQUEST_NULL) {
      std::cout << this_rank << " wait returned non null request: " << req
                << " vs " << MPI_REQUEST_NULL << std::endl;
      std::cout << this_rank << " MPI_SOURCE: " << stat.MPI_SOURCE << std::endl;
      std::cout << this_rank << " MPI_TAG: " << stat.MPI_TAG << std::endl;
      std::cout << this_rank << " MPI_ERROR: " << stat.MPI_ERROR << std::endl;
    }
  }

  {
    // Crude sleep: busy-wait for a while before the barrier.
    volatile int dummy = 0;
    while (dummy != 1'000'000'000) {
      dummy++;
    }
    std::cout << this_rank << " sleep done" << std::endl;
  }

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
I'm using Open MPI 5.0.2, with near-latest releases of clang and gcc. I build and run the reproducer above like so:
$ g++ -fopenmp ~/Downloads/trash/repro.mpi.cc -isystem /usr/include/openmpi-x86_64 -L /usr/lib64/openmpi/lib/ -lmpi
$ export OMP_NUM_THREADS=2
$ mpirun -n 4 ./a.out
The expected stdout is (sorted by rank identifier):
0 starting bcast
0 req0 0x3e573f28
0 sleep done
1 starting bcast
1 req0 0x79d9298
1 sleep done
2 starting bcast
2 req0 0xc4841b8
2 sleep done
3 starting bcast
3 req0 0x2fdf2f18
3 sleep done
Note that the printed addresses will of course vary between runs.
The observed behavior is as follows (again, sorted by rank identifier):
0 starting bcast
0 req0 0x25aa6f28
0 wait returned non null request: 0x25aa6f28 vs 0x4045e0
0 MPI_SOURCE: 0
0 MPI_TAG: 0
0 MPI_ERROR: 0
0 sleep done
1 starting bcast
1 req0 0xb169298
1 sleep done
2 starting bcast
2 req0 0xc4f81b8
2 wait returned non null request: 0xc4f81b8 vs 0x4045e0
2 MPI_SOURCE: 0
2 MPI_TAG: 0
2 MPI_ERROR: 0
2 sleep done
3 starting bcast
3 req0 0x10ccbf18
3 wait returned non null request: 0x10ccbf18 vs 0x4045e0
3 MPI_SOURCE: 0
3 MPI_TAG: 0
3 MPI_ERROR: 0
3 sleep done
What we observe is that MPI_Wait returns without reporting any error, either in the MPI_Status or in the logs, yet the MPI_Request is neither freed nor set to MPI_REQUEST_NULL.
According to the Open MPI documentation and the MPI standard:
A call to MPI_Wait returns when the operation identified by request is complete. If the communication object associated with this request was created by a nonblocking send or receive call, then the object is deallocated by the call to MPI_Wait and the request handle is set to MPI_REQUEST_NULL.
(https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_Wait.3.html#description).
Is the code snippet above unsound? Note that if the do/while loop around MPI_Wait is uncommented, the code produces the output I expect, but that effectively turns MPI_Wait into MPI_Test-style polling. The OpenMP part is key to triggering the issue.
The relevant text is (copied from MPI 4.0, but similarly present in other versions):
Multiple threads completing the same request. A program in which two threads block, waiting on the same request, is erroneous. Similarly, the same request cannot appear in the array of requests of two concurrent MPI_{WAIT|TEST}{ANY|SOME|ALL} calls. In MPI, a request can only be completed once. Any combination of wait or test that violates this rule is erroneous.
The only completion function that you can call concurrently with a pointer to the same request handle is MPI_Test. The important point here is that all threads really refer to the same storage for the common request handle.