mpi, supercomputers

Performance of MPI_Reduce vs (MPI_Gather + Reduction on Root)


I am working on a Cray supercomputer using the MPICH2 library. Each node has 32 CPUs.

I have a single float on N different MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation on this group of floats. I would like to know whether an MPI_Reduce is faster than MPI_Gather with the reduction calculated on the root, for any value of N. Please assume that the reduction done on the root rank will be done using a good parallel reduction algorithm that can utilize N threads.

If it isn't faster for every value of N, would it tend to be faster for smaller N, such as 16, or for larger N?

If it is faster, why? (For example, does MPI_Reduce use a tree communication pattern that tends to hide the reduction operation's time in the communication with the next level of the tree?)


Solution

  • Assume that MPI_Reduce is always faster than MPI_Gather + local reduce.

    Even if there were some N for which a reduction is slower than a gather, an MPI implementation could simply implement the reduction for that case as gather + local reduce.

    MPI_Reduce has only advantages over MPI_Gather + local reduce:

    1. MPI_Reduce is the higher-level operation, which gives the implementation more opportunity to optimize.
    2. MPI_Reduce needs to allocate much less memory: the root holds a single result rather than a receive buffer of N elements.
    3. MPI_Reduce needs to communicate less data (if using a tree) or less data over the same link (if using direct all-to-one).
    4. MPI_Reduce can distribute the computation across more resources (e.g. using a tree communication pattern).

    That said: Never assume anything about performance. Measure.
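
To make points 3 and 4 above concrete, here is a toy model in plain Python (not MPI code) that counts how many messages the root has to receive under each approach. It assumes a binomial-tree reduction and a direct all-to-one gather; a real MPI library may choose a different algorithm depending on message size and communicator size. The function names `tree_reduce` and `gather_then_reduce` are made up for this sketch, and the model says nothing about actual timings, so it does not replace measuring on the target machine.

```python
import operator


def tree_reduce(values, op):
    """Simulate a binomial-tree reduction onto rank 0.

    Returns (result, rounds, messages_received_by_root).
    """
    vals = list(values)
    n = len(vals)
    rounds = 0
    root_msgs = 0
    step = 1
    while step < n:
        # In this round, every rank r that is a multiple of 2*step
        # receives one partial result from rank r + step and combines it.
        for r in range(0, n, 2 * step):
            partner = r + step
            if partner < n:
                vals[r] = op(vals[r], vals[partner])
                if r == 0:
                    root_msgs += 1
        step *= 2
        rounds += 1
    return vals[0], rounds, root_msgs


def gather_then_reduce(values, op):
    """Simulate a direct all-to-one gather followed by a reduction on the root.

    Returns (result, messages_received_by_root).
    The local reduction is written serially here for simplicity.
    """
    vals = list(values)
    root_msgs = len(vals) - 1  # every other rank sends one message to the root
    result = vals[0]
    for v in vals[1:]:
        result = op(result, v)
    return result, root_msgs


if __name__ == "__main__":
    # Integers are used so both orderings give exactly the same sum;
    # with floats the two orderings may differ in rounding.
    data = list(range(16))
    print(tree_reduce(data, operator.add))         # (120, 4, 4)
    print(gather_then_reduce(data, operator.add))  # (120, 15)
```

In this model, with N = 16 the root receives 4 messages in the tree case and 15 in the gather case, and the combining work in each tree round is spread over several ranks instead of being concentrated on the root.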