cuda, warp-scheduler

CUDA Warps and Thread Divergence


I am trying to understand CUDA warps and thread divergence. Suppose I have a naive matrix multiplication kernel to multiply n x n matrices.

__global__ void matrix_multiply(float* a, float* b, float* c, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if(row < n && col < n) {
        float tmp = 0.0f;
        for(int i = 0; i < n; ++i)
            tmp += a[row * n + i] * b[i * n + col];
        c[row * n + col] = tmp;
    }
}

If I launch a kernel with grid size 32 by 32 and block size 16 by 16 and the matrices are 500 by 500, how many warps have threads which will encounter thread divergence?

Since each thread block on the right edge of the matrix will have thread divergence, shouldn't the number of warps with thread divergence be 256?


Solution

  • There are two potential divergence points in your code: the first is created by the if statement, the second by the loop condition of the for loop. The second is harmless from a warp-divergence perspective, since n is uniform across all threads, so every lane of a warp iterates the same number of times.

    For the first one, threads that fail the condition exit immediately. With n = 500, as it is here, the number of quickly exiting threads is (16*16)*(32*32) - (500*500) = 12144. Bearing in mind that a warp consists of 32 consecutive threads of a block in row-major order (threadIdx.x varying fastest), each warp of a 16*16 block covers two adjacent rows. There are 250 divergent warps: 248 from the blocks along the right edge (blockIdx.x = 31, blockIdx.y = 0..30), whose columns straddle the matrix boundary, plus 2 from the bottom-right corner block, whose two top-most warps still cover valid rows. In each divergent warp, the lanes with IDs 0, 1, 2, 3, 16, 17, 18, and 19 (columns 496..499 in two consecutive rows) satisfy the condition and enter the if block, while the remaining lanes are disabled. There are also 6*(512/16) = 192 warps, covering rows 500..511 of the bottom block row, for which the if condition is false for every lane; they exit uniformly and face no divergence. This is why the answer is not 256: the warps of the bottom-edge blocks are either fully active or fully inactive, and the corner block contributes only 2 divergent warps.

    (Figure: what happens in the bottom-right-most tiles.)
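    These counts can be double-checked with a small host-side simulation (plain C++, no GPU needed) that replays the guard for every lane of every warp, assuming the standard row-major warp linearization described above; all names here are illustrative:

    ```cpp
    #include <cstdio>

    int main() {
        const int n = 500, B = 16, G = 32;   // matrix size, block dim, grid dim
        int divergent = 0, all_inactive = 0, all_active = 0;
        // A warp is 32 consecutive threads of a 16x16 block in row-major
        // (threadIdx.x fastest) order, i.e. two adjacent thread rows.
        for (int by = 0; by < G; ++by)
            for (int bx = 0; bx < G; ++bx)
                for (int w = 0; w < (B * B) / 32; ++w) {
                    int active = 0;
                    for (int lane = 0; lane < 32; ++lane) {
                        int tid = w * 32 + lane;          // linear thread id in block
                        int ty = tid / B, tx = tid % B;
                        int row = by * B + ty, col = bx * B + tx;
                        if (row < n && col < n) ++active; // the kernel's guard
                    }
                    if (active == 0)       ++all_inactive; // uniformly skips the if
                    else if (active == 32) ++all_active;   // uniformly enters the if
                    else                   ++divergent;    // mixed: true divergence
                }
        printf("divergent=%d all_inactive=%d all_active=%d\n",
               divergent, all_inactive, all_active);
        // prints divergent=250 all_inactive=192 all_active=7750
        return 0;
    }
    ```

    Out of the 32*32*8 = 8192 warps launched, only the 250 mixed warps pay a divergence penalty; the 192 fully inactive warps retire quickly but still occupy scheduling slots.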