Environment: a Cray supercomputer using the MPICH2 library. Each node has 32 CPUs.
I have a single float on N different MPI ranks, where each of these ranks is on a different node. I need to perform a reduction operation on this group of floats. I would like to know whether an MPI_Reduce is faster than MPI_Gather with the reduction calculated on the root, for any value of N. Please assume that the reduction done on the root rank will be done using a good parallel reduction algorithm that can utilize N threads.
If it isn't faster for every value of N, would it tend to be faster for smaller N, such as 16, or for larger N?
If it is faster, why? (For example, does MPI_Reduce use a tree communication pattern that tends to hide the time of the reduction operation in the way it communicates with the next level of the tree?)
Assume that MPI_Reduce is always faster than MPI_Gather + local reduce.
Even if there were a value of N for which the reduction is slower than gather + local reduce, an MPI implementation could easily implement the reduction in that case in terms of gather + local reduce.
MPI_Reduce has only advantages over MPI_Gather + local reduce:
- MPI_Reduce is the more high-level operation, giving the implementation more opportunity to optimize.
- MPI_Reduce needs to allocate much less memory.
- MPI_Reduce needs to communicate less data (if using a tree) or less data over the same link (if using direct all-to-one).
- MPI_Reduce can distribute the computation across more resources (e.g. using a tree communication pattern).

That said: Never assume anything about performance. Measure.