I use a 12-node Windows HPC cluster (24 cores per node) to run a C++ MPI program built on Boost.MPI. I timed two runs: one with the MPI reduce and one with the reduce commented out (for the speed test only). The run times were 01:17:23 and 01:03:49 respectively, so the MPI reduce seems to take a large portion of the time. I think it might be worth reducing at the node level first and then reducing to the head node to improve performance.
Below is a simple example for testing. Suppose there are 4 compute nodes with 2 cores each. I want to first use MPI to reduce within each node and then reduce to the head node. I am not very familiar with MPI, and the program below crashes.
#include <iostream>
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
using namespace std;
int main()
{
    mpi::environment env;
    mpi::communicator world;
    int i = world.rank();
    boost::mpi::communicator local = world.split(world.rank()/2); // total 8 cores, divide in 4 groups
    boost::mpi::communicator heads = world.split(world.rank()%4);
    int res = 0;
    boost::mpi::reduce(local, i, res, std::plus<int>(), 0);
    if(world.rank()%2==0)
        cout<<res<<endl;
    boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);
    if(world.rank()==0)
        cout<<res<<endl;
    return 0;
}
The output is illegible, something like this
Z
h
h
h
h
a
a
a
a
n
n
n
n
g
g
g
g
\
\
\
\
b
b
b
b
o
o
o
o
o
o
o
o
s
...
...
...
The error message is
Test.exe ended prematurely and may have crashed. exit code 3
I suspect I did something wrong with the group split or the reduce, but I could not figure it out after several attempts. How do I change this to make it work? Thanks.
The reason for the crash is that you pass the same variable twice to MPI in the following line:
boost::mpi::reduce(heads, res, res, std::plus<int>(), 0);
This is not well documented in Boost.MPI, but Boost takes these arguments by reference and passes the respective pointers to MPI. MPI in general forbids passing the same buffer twice to the same call. To be precise, an output buffer passed to an MPI function must not alias (overlap) any other buffer passed in that call.
You can easily fix this by creating a copy of res.
I also think you probably want to restrict calling the second reduce to the processes with local.rank() == 0.
Also, reiterating the comment: I doubt you will get any benefit from re-implementing a reduction. Trying to optimize a performance issue whose bottleneck you do not fully understand is generally a bad idea.