Given the following simple matrix multiplication kernel
`__global__ void MatrixMulKernel(float* M, float* N, float* P, int
Width)
{
int Row = blockIdx.y*blockDim.y+threadIdx.y;
int Col = blockIdx.x*blockDim.x+threadIdx.x;
if ((Row < Width) && (Col < Width)) {
float Pvalue = 0;
for (int k = 0; k < Width; ++k)
{
Pvalue += M[Row*Width+k]*N[k*Width+Col];
}
P[Row*Width+Col] = Pvalue;
}
}`
If we launch the kernel with a block size of 16X16 on a 1000X1000 matrix, how many warps will have control divergence?
Answer: 500
Explanation: There will be 63 blocks in the horizontal direction. 8 threads in the x dimension in each row will be in the invalid range. Every two rows form a warp. Therefore, there are 1000/2=500 warps that will straddle the valid and invalid ranges in the horizontal direction. As for the warps in the bottom blocks, there are 8 warps in the valid range and 8 warps in the invalid range. Threads in these warps are either totally in the valid range or invalid range.
Question: I am trying to understand why in this scenario 8 threads in the x dimension will be in the invalid range?
Each block covers a 16x16 array of elements. To cover a matrix of 1000x1000 elements, I need a square threadblock array that has dimensions of 1000/16 = 62.5 blocks in the horizontal direction and 62.5 blocks in the vertical direction.
But I can't launch 62.5x62.5 blocks, so in order to have full coverage I must launch 63x63 blocks, acknowledging that this will create extra threads in the "invalid range" (i.e. that would map to an element location outside the 1000x1000 matrix).
When I launch 63 blocks in the horizontal direction, I get 63x16 = 1008 threads in the horizontal direction. But I only need 1000 so 8 threads (in each row) are in the "invalid range".