parallel-processing cuda computer-science gpu gpu-warp

Control divergence with a simple matrix multiplication kernel


Given the following simple matrix multiplication kernel:

```cuda
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    // Map each thread to one element of the output matrix P
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard against threads that fall outside the Width x Width matrix;
    // this is the branch that can cause control divergence
    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}
```

If we launch the kernel with a block size of 16x16 on a 1000x1000 matrix, how many warps will have control divergence?

Answer: 500

Explanation: There will be 63 blocks in the horizontal direction (1000/16 = 62.5, rounded up), supplying 63x16 = 1008 threads per row, so the last 8 threads in the x dimension of each row are in the invalid range. Every two consecutive rows of a block form one 32-thread warp, so each warp in the right-edge column of blocks that covers valid rows contains both valid and invalid threads. There are 1000 valid rows, and therefore 1000/2 = 500 warps straddle the valid and invalid ranges in the horizontal direction. As for the warps in the bottom row of blocks, each block has 4 warps entirely in the valid range (its top 8 rows) and 4 warps entirely in the invalid range (its bottom 8 rows); threads in these warps are either totally in the valid range or totally in the invalid range, so they do not diverge.
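
To sanity-check this count, here is a small host-only sketch (the names and structure are mine, not from the question) that enumerates every warp, assuming the standard CUDA rule that a warp is 32 consecutive threads of a block in row-major order (threadIdx.x varies fastest), so each warp of a 16x16 block spans 2 rows by 16 columns:

```cuda
#include <cstdio>

int main()
{
    const int Width = 1000, B = 16;
    const int nBlocks = (Width + B - 1) / B;        // 63 blocks per direction
    int divergent = 0;
    for (int by = 0; by < nBlocks; ++by)
        for (int bx = 0; bx < nBlocks; ++bx)
            for (int w = 0; w < 8; ++w) {           // 8 warps per 16x16 block
                int row0 = by * B + 2 * w;          // warp covers rows row0, row0+1
                int col0 = bx * B;                  // ... and columns col0 .. col0+15
                bool valid = false, invalid = false;
                for (int r = row0; r < row0 + 2; ++r)
                    for (int c = col0; c < col0 + B; ++c) {
                        if (r < Width && c < Width) valid = true;
                        else invalid = true;
                    }
                if (valid && invalid) ++divergent;  // mixed warp -> divergence
            }
    printf("divergent warps: %d\n", divergent);     // prints 500
    return 0;
}
```

Under these assumptions it counts 500 mixed warps, matching the answer above: 62 right-edge blocks with all 8 warps divergent, plus 4 divergent warps in the bottom-right corner block.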

Question: I am trying to understand why, in this scenario, 8 threads in the x dimension will be in the invalid range.


Solution

  • Each block covers a 16x16 array of elements. To cover a matrix of 1000x1000 elements, I need a square threadblock array that has dimensions of 1000/16 = 62.5 blocks in the horizontal direction and 62.5 blocks in the vertical direction.

    But I can't launch 62.5x62.5 blocks, so in order to have full coverage I must launch 63x63 blocks, acknowledging that this will create extra threads in the "invalid range" (i.e. threads that would map to an element location outside the 1000x1000 matrix).

    When I launch 63 blocks in the horizontal direction, I get 63x16 = 1008 threads in the horizontal direction. But I only need 1000, so 8 threads (in each row) are in the "invalid range". A sketch of this launch configuration follows below.
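
For concreteness, here is a minimal host-side sketch of that launch. The question does not show the host code, so the function name and the device pointers (d_M, d_N, d_P, assumed allocated and filled elsewhere) are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

// Kernel as defined in the question (declaration repeated for context)
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width);

void launchMatrixMul(float* d_M, float* d_N, float* d_P)
{
    int Width = 1000;
    dim3 dimBlock(16, 16);                                // 256 threads = 8 warps per block
    dim3 dimGrid((Width + dimBlock.x - 1) / dimBlock.x,   // ceil(1000/16) = 63
                 (Width + dimBlock.y - 1) / dimBlock.y);  // ceil(1000/16) = 63
    // 63 x 16 = 1008 threads per direction; the last 8 in each row and
    // column fail the (Row < Width) && (Col < Width) test in the kernel
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
}
```

The rounded-up integer division (Width + blockDim - 1) / blockDim is the usual idiom for guaranteeing full coverage when the matrix size is not a multiple of the block size.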