Tags: mpi, openmp, openmpi

MPI_Wait does not free the MPI_Ibcast request


Consider the following program:

#include <iostream>
#include <mpi.h>

int main() {
  int provided = -1;
  MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);

  if (provided != MPI_THREAD_MULTIPLE) {
    return -1;
  }

  int this_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &this_rank);

  double aze[36864]{};
  MPI_Request req = MPI_REQUEST_NULL;

  std::cout << this_rank << " starting bcast" << std::endl;
  MPI_Ibcast(aze, 36864, MPI_DOUBLE, 1, MPI_COMM_WORLD, &req);
  std::cout << this_rank << " req0 " << req << std::endl;

#pragma omp parallel
  {
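    // Note: every thread in this parallel region calls MPI_Wait on the same
    // shared request handle 'req'.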
    MPI_Status stat{};
    // do {
    MPI_Wait(&req, &stat);
    // } while(req != MPI_REQUEST_NULL);

    if (req != MPI_REQUEST_NULL) {
      std::cout << this_rank << " wait returned non null request: " << req
                << " vs " << MPI_REQUEST_NULL << std::endl;

      std::cout << this_rank << " MPI_SOURCE: " << stat.MPI_SOURCE << std::endl;
      std::cout << this_rank << " MPI_TAG: " << stat.MPI_TAG << std::endl;
      std::cout << this_rank << " MPI_ERROR: " << stat.MPI_ERROR << std::endl;
    }
  }

  {
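    // Crude busy-wait standing in for a sleep, to keep the rank alive a while.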
    volatile int dummy = 0;
    while (dummy != 1'000'000'000) {
      dummy++;
    }
    std::cout << this_rank << " sleep done" << std::endl;
  }
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

I'm using OpenMPI 5.0.2, with near-latest releases of both clang and gcc. I build and run the reproducer above like so:

$ g++ -fopenmp ~/Downloads/trash/repro.mpi.cc -isystem /usr/include/openmpi-x86_64 -L /usr/lib64/openmpi/lib/ -lmpi
$ export OMP_NUM_THREADS=2
$ mpirun -n 4 ./a.out

The expected stdout is (sorted by rank identifier):

0 starting bcast
0 req0 0x3e573f28
0 sleep done
1 starting bcast
1 req0 0x79d9298
1 sleep done
2 starting bcast
2 req0 0xc4841b8
2 sleep done
3 starting bcast
3 req0 0x2fdf2f18
3 sleep done

Note that the addresses will of course vary between runs.

The observed behavior is as follows (again, sorted by rank identifier):

0 starting bcast
0 req0 0x25aa6f28
0 wait returned non null request: 0x25aa6f28 vs 0x4045e0
0 MPI_SOURCE: 0
0 MPI_TAG: 0
0 MPI_ERROR: 0
0 sleep done
1 starting bcast
1 req0 0xb169298
1 sleep done
2 starting bcast
2 req0 0xc4f81b8
2 wait returned non null request: 0xc4f81b8 vs 0x4045e0
2 MPI_SOURCE: 0
2 MPI_TAG: 0
2 MPI_ERROR: 0
2 sleep done
3 starting bcast
3 req0 0x10ccbf18
3 wait returned non null request: 0x10ccbf18 vs 0x4045e0
3 MPI_SOURCE: 0
3 MPI_TAG: 0
3 MPI_ERROR: 0
3 sleep done

What we observe is that, although MPI_Wait returns and reports no error in the MPI_Status or in the logs, the MPI_Request is neither freed nor set to MPI_REQUEST_NULL.

According to the Open MPI docs and the MPI standard: "A call to MPI_Wait returns when the operation identified by request is complete. If the communication object associated with this request was created by a nonblocking send or receive call, then the object is deallocated by the call to MPI_Wait and the request handle is set to MPI_REQUEST_NULL." (https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_Wait.3.html#description)

Is the code snippet above unsound? Note that if the do/while loop around the MPI_Wait is uncommented, the code produces the output I expect. But then it effectively takes on MPI_Test's semantics (polling). The OpenMP bit is key to triggering the issue.
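For reference, a minimal sketch of what the equivalent explicit polling with MPI_Test would look like (the flag variable is a name introduced here, not part of the reproducer above):

int flag = 0;
MPI_Status stat{};
do {
  // MPI_Test returns immediately and raises flag once the operation completes.
  // After some thread completes the request, req becomes MPI_REQUEST_NULL, and
  // MPI_Test on MPI_REQUEST_NULL returns with flag set and an empty status.
  MPI_Test(&req, &flag, &stat);
} while (!flag);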


Solution

  • The relevant text is (copied from MPI 4.0, but similarly present in other versions):

    Multiple threads completing the same request. A program in which two threads block, waiting on the same request, is erroneous. Similarly, the same request cannot appear in the array of requests of two concurrent MPI_{WAIT|TEST}{ANY|SOME|ALL} calls. In MPI, a request can only be completed once. Any combination of wait or test that violates this rule is erroneous.

    The only completion function that you can call concurrently with a pointer to the same request handle is MPI_Test. The important point here is that in the reproducer all threads really refer to the same storage for the common request handle, so every thread's MPI_Wait races to complete the same request, which the standard declares erroneous.
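
A minimal sketch of one way to make the reproducer conform, assuming the intent is simply that every thread proceeds only once the broadcast has completed: have a single thread complete the request, and rely on the implicit barrier at the end of the OpenMP single construct.

#pragma omp parallel
  {
#pragma omp single
    {
      // Exactly one thread completes the request; MPI_Wait deallocates it
      // and sets req to MPI_REQUEST_NULL.
      MPI_Status stat{};
      MPI_Wait(&req, &stat);
    }
    // The implicit barrier at the end of the single construct ensures no
    // thread gets past this point before the broadcast is complete.
  }

This keeps exactly one completion call per request, which is what the quoted rule requires.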