
Writing my own version of MPI_Allreduce in C - why does my code hang indefinitely?


I am trying to write my own version of MPI_Allreduce in C, but only for communicator sizes that are powers of two, i.e. size = 2, 4, 8, 16, ..., and only for the MPI_INT datatype. My code so far is:

    int tree_allreduce(const int *sendbuf, int *recvbuf, int count, MPI_Op op, MPI_Comm comm){

      // Create variables for rank and size
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      // While size is greater than 1 there are 2 or more ranks to operate on
      while(size > 1){  // Loop stays active until size=1, when the only process remaining is rank 0
        if(rank < size){  // Filter out odd ranks, which are always >= size after sending their data to their left even neighbour
          if( (rank % 2) != 0 ){ // If rank is odd
            MPI_Send(sendbuf, count, MPI_INT, rank-1, rank, comm);  // Send contents of sendbuf to the even neighbour, using the odd process's rank as tag
            rank *= size;  // Multiplying odd ranks by size ensures they are always >= size when if(rank < size) is checked on the next while iteration
          }
          else{  // If rank is even
            // For an even rank, its own values stay in sendbuf and the odd rank's values arrive in recvbuf.
            MPI_Recv(recvbuf, count, MPI_INT, rank+1, rank+1, comm, MPI_STATUS_IGNORE);  // Receive contents of sendbuf from rank+1 into recvbuf
            rank /= 2;  // Halve the rank so on the next iteration rank 0 --> rank 0, rank 2 --> rank 1, rank 4 --> rank 2, etc.
            MPI_Reduce_local(sendbuf, recvbuf, count, MPI_INT, op);  // Use MPI_Reduce_local to do the SUM/PROD/MIN/MAX operation and put the result in recvbuf
          }
        }
        size /= 2;  // Halve the size to reflect the processes contracting pairwise
      }

      // Broadcast result back to all processes
      MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);

      return 0;
    }

This works fine for size 2, but for any greater size the code hangs indefinitely and I cannot figure out why. I think I am making some kind of newbie MPI mistake, so please let me know where I have gone wrong.
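
For reference, a minimal driver along these lines reproduces the behaviour (the test values are just an illustration; each rank contributes rank + 1):

    #include <mpi.h>
    #include <stdio.h>

    int tree_allreduce(const int *sendbuf, int *recvbuf, int count, MPI_Op op, MPI_Comm comm);

    int main(int argc, char **argv){
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int send = rank + 1;  // each rank contributes rank + 1
      int recv = 0;

      tree_allreduce(&send, &recv, 1, MPI_SUM, MPI_COMM_WORLD);
      printf("rank %d: result = %d\n", rank, recv);

      MPI_Finalize();
      return 0;
    }

With mpirun -np 2 this prints the expected sum of 3 on both ranks, but with -np 4 or any larger power of two it never returns.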


Solution

  • Let's say you have 8 processes (rank var is the value stored in your variable rank, rank actual is the process's real rank in the communicator).

    rank var   |01234567
    rank actual|01234567
    

    The first iteration works fine; data is sent according to the scheme

    rank:   0       1       2       3       4       5       6       7
    action: rcv(1)  snd(0)  rcv(3)  snd(2)  rcv(5)  snd(4)  rcv(7)  snd(6)

    After that you retire the odd workers via the line rank *= size, and halve the remaining rank vars via rank /= 2

    rank var   |0_1_2_3_
    rank actual|01234567
    

    At the next iteration data is sent according to the scheme

    rank:   0       1   2       3   4       5   6       7
    action: rcv(1)  -   snd(0)  -   rcv(3)  -   snd(2)  -

    As you can see, the pairing is broken: MPI_Send and MPI_Recv address actual ranks in the communicator, but after rank /= 2 your rank variable no longer matches the actual rank, so workers wait for messages that are never sent to them.
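
  • One possible fix, sketched below under the same power-of-two / MPI_INT assumptions (the scratch buffer and the step variable d are additions of this sketch, not part of the original code), is to leave rank untouched and instead double a step distance every iteration, so that every MPI_Send/MPI_Recv names a real rank in the communicator:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int tree_allreduce(const int *sendbuf, int *recvbuf, int count, MPI_Op op, MPI_Comm comm){
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      // Keep the running partial result in recvbuf, starting from a copy of sendbuf
      memcpy(recvbuf, sendbuf, count * sizeof(int));
      int *tmp = malloc(count * sizeof(int));

      // Binomial-tree reduction towards rank 0: at step d, every rank that is an
      // odd multiple of d sends its partial result to rank - d and drops out.
      for(int d = 1; d < size; d *= 2){
        if( (rank / d) % 2 != 0 ){
          MPI_Send(recvbuf, count, MPI_INT, rank - d, 0, comm);
          break;  // this rank is finished until the broadcast
        }
        else{
          MPI_Recv(tmp, count, MPI_INT, rank + d, 0, comm, MPI_STATUS_IGNORE);
          MPI_Reduce_local(tmp, recvbuf, count, MPI_INT, op);  // recvbuf = recvbuf op tmp
        }
      }
      free(tmp);

      // Rank 0 now holds the full result; share it with everyone
      MPI_Bcast(recvbuf, count, MPI_INT, 0, comm);
      return 0;
    }

    The key difference is that rank is never modified, so the arguments passed to MPI_Send/MPI_Recv always name real processes, and the partial results travel in recvbuf rather than the untouched sendbuf.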