I am trying to write my own version of MPI_Allreduce in C, but only for sizes of power two, i.e. size = 2,4,8,16,... and only for MPI_INT datatype. My code thus far is:
72 int tree_allreduce(const int *sendbuf, int *recvbuf, int count, MPI_Op op, MPI_Comm comm){
73
74 // Create variables for rank and size
75 int rank, size;
76 MPI_Comm_rank(comm, &rank);
77 MPI_Comm_size(comm, &size);
78
79 // While size is greater than 1 there is 2 or more ranks to operate on
80 while(size > 1){ // While loop active until size=1 when only process remaining is rank 0
81 if(rank < size){ // Filter out odd ranks which are always bigger than size after sending their data to their left even recvbuffer
82 if( (rank % 2) != 0 ){ // If rank is odd
83 MPI_Send(sendbuf, count, MPI_INT, rank-1, rank, comm); // Send contents of the sendbuf to the recvbuf, using rank of odd process as tag
84 rank *= size; // multiplying odd ranks by sizes ensures they are always > or = size when the if(rank < size) comes from next while iteration
85 }
86 else{ // If rank is even
87 // For an even rank, the values for the even number is stored in sendbuf, and the values of the odd rank is stored in recvbuf.
88 MPI_Recv(recvbuf, count, MPI_INT, rank+1, rank+1, comm, MPI_STATUS_IGNORE); // Receive contents of sendbuf from rank+1 into recvbuf
89 rank /= 2; // Half the rank so for next iteration of while loop rank 0 --> rank 0, rank 2 --> rank 1, rank 4 --> rank 2, etc...
90 MPI_Reduce_local(sendbuf, recvbuf, count, MPI_INT, op); // Use MPI_Reduce_local to do SUM/PROD/MIN/MAX operations and return result into recvbuf
91 }
92 }
93 size /= 2; // Half the size to reflect the processes contracting pairwise
94 }
95
96 // Broadcast result back to all processes
97 MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);
98
99 return 0;
100 }
Which works fine for size 2, however for any greater size the code hangs indefinitely and I cannot seem to figure out why. I think I am making some kind of newbie MPI mistake so please let me know where I have gone wrong here.
Let's say you have 8 processors (rank var
is stored in your variable rank
, rank actual
is an actual worker rank).
rank var |01234567
rank actual|01234567
First iteration works fine, data is sent according the to scheme
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
rcv(1) | snd(0) | rcv(3) | snd(2) | rcv(5) | snd(4) | rcv(7) | snd(6) |
After that you remove odd workers via line rank *= size
, and update rank vars rank /= 2
rank var |0_1_2_3_
rank actual|01234567
At the next iteration data is sent according to the scheme
0 | - | 2 | - | 4 | - | 6 | - |
---|---|---|---|---|---|---|---|
rcv(1) | - | snd(0) | - | rcv(3) | - | snd(2) | - |
As you can see, it's messed up. Workers wait for data that is not being send to them.