
GPU shared memory practical example


I have an array like this:

int data[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};

I want to compute the reduction of this array using shared memory on a G80 GPU.

The kernel, as cited in the NVIDIA document, looks like this:

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // here the reduction:
    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    // write the result for this block to global memory
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
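
For reference, here is a minimal host-side driver I use for this 16-element case (just a sketch: one block of 16 threads and 16 ints of dynamic shared memory, with error checking omitted):

#include <stdio.h>

int main()
{
    int h_data[16] = {10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};
    int h_sum = 0;
    int *d_idata, *d_odata;

    cudaMalloc((void**)&d_idata, 16 * sizeof(int));
    cudaMalloc((void**)&d_odata, sizeof(int));
    cudaMemcpy(d_idata, h_data, 16 * sizeof(int), cudaMemcpyHostToDevice);

    // one block of 16 threads, 16 ints of dynamic shared memory
    reduce1<<<1, 16, 16 * sizeof(int)>>>(d_idata, d_odata);

    cudaMemcpy(&h_sum, d_odata, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_sum);   // expected: 41

    cudaFree(d_idata);
    cudaFree(d_odata);
    return 0;
}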

The author of the document says there is a bank-conflict problem in this method. I tried to understand why, but I couldn't figure it out. I know the definitions of bank conflicts and broadcast accesses, but I still can't see the problem here.

[Image: Bank Conflicts diagram]


Solution

  • The G80 processor is a very old CUDA-capable GPU, from the first generation of CUDA GPUs, with compute capability 1.0. These devices are no longer supported by recent CUDA versions (anything after CUDA 6.5), so the current online documentation no longer contains the information needed to understand the bank structure in these devices.

    Therefore I will excerpt the necessary info for cc 1.x devices from the CUDA 6.5 C programming guide here:

    G.3.3. Shared Memory

    Shared memory has 16 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles.

    A shared memory request for a warp is split into two memory requests, one for each half-warp, that are issued independently. As a consequence, there can be no bank conflict between a thread belonging to the first half of a warp and a thread belonging to the second half of the same warp.

    In these devices, shared memory has a 16-bank structure, such that each bank has a "width" of 32 bits, or 4 bytes. Each bank therefore has the same width as an int or float quantity, for example. So let's envision the first 32 4-byte quantities that might be stored in this kind of shared memory, and their corresponding banks (using f instead of sdata as the name of the array):

    extern __shared__ int f[];
    
    index: f[0] f[1] f[2] f[3] ... f[15] f[16] f[17] f[18] f[19] ... f[31]
    bank:    0    1    2    3  ...   15     0     1     2     3  ...   15
    

    The first 16 int quantities in shared memory belong to banks 0 to 15, and the next 16 int quantities in shared memory also belong to banks 0 to 15 (and so on, if we had more data in our int array).
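
    Put another way, for 4-byte elements on a cc 1.x device the bank is simply the word index modulo 16. A trivial helper (mine, just for reasoning about this example; it is not a CUDA API call) makes the mapping explicit:

    // bank of a 4-byte shared memory word on a cc 1.x (G80-class) device
    // (illustration only, not a CUDA API)
    __host__ __device__ int bank_of(unsigned int word_index)
    {
        return word_index % 16;   // 16 banks, each 32 bits wide
    }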

    Now let's look at the lines of code that will trigger a bank conflict:

    for (unsigned int s=1; s < blockDim.x; s *= 2) {
        int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
    }

    Let's consider the first pass through the above loop, where s is 1. That means index is 2*1*tid, so for each thread, index is just double the value of threadIdx.x:

    threadIdx.x: 0 1 2 3 4  5  6  7  8  9 10 11 ...
     index:      0 2 4 6 8 10 12 14 16 18 20 22 ...
     bank:       0 2 4 6 8 10 12 14  0  2  4  6 ...
    

    so for this read operation:

    += sdata[index + s]
    

    we have:

    threadIdx.x: 0 1 2 3 4  5  6  7  8  9 10 11 ...
     index:      0 2 4 6 8 10 12 14 16 18 20 22 ...
     index + s:  1 3 5 7 9 11 13 15 17 19 21 23 ...
     bank:       1 3 5 7 9 11 13 15  1  3  5  7 ...
    

    So, within the first 16 threads, we have two threads that want to read from bank 1, two that want to read from bank 3, two that want to read from bank 5, etc. This read cycle therefore encounters 2-way bank conflicts across the first 16-thread group. Note that the other read and write operations on the same line of code are similarly bank-conflicted:

    sdata[index] +=
    

    as this will read, and then write, to banks 0, 2, 4, etc. twice per group of 16 threads.
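
    If you want to verify this without building bank tables by hand, you can replay the addressing pattern on the host and count how many threads of the first half-warp hit each bank at every step of the loop. The little check below is mine, not from the NVIDIA material; I assume a block of 128 threads so that the index < blockDim.x guard leaves the whole first half-warp active, which is the situation the tables above describe:

    #include <stdio.h>

    int main(void)
    {
        const int blockDimX = 128;  // assumed block size (keeps all 16 lanes active at small s)
        const int NUM_BANKS = 16;   // cc 1.x: 16 banks, 4 bytes each
        const int HALF_WARP = 16;   // shared memory requests are resolved per half-warp

        for (int s = 1; s < blockDimX; s *= 2) {
            int hits[16] = {0};
            for (int tid = 0; tid < HALF_WARP; tid++) {    // first half-warp only
                int index = 2 * s * tid;
                if (index < blockDimX)                     // same guard as the kernel
                    hits[(index + s) % NUM_BANKS]++;       // bank of the read sdata[index + s]
            }
            int worst = 0;
            for (int b = 0; b < NUM_BANKS; b++)
                if (hits[b] > worst) worst = hits[b];
            printf("s = %3d: worst conflict on the read = %d-way\n", s, worst);
        }
        return 0;
    }

    For s = 1 this reports the 2-way conflict described above; the later iterations are worse (up to 8-way here) until the guard starts deactivating threads.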

    Note to others who may be reading this example: as written, it pertains to cc 1.x devices only. The methodology to demonstrate bank conflicts on cc 2.x and newer devices may be similar, but the specifics are different, due to warp execution differences and the fact that these newer devices have a 32-way bank structure, not a 16-way bank structure.
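
    For example, on a device with 32 banks (in the default 4-byte bank mode) the mapping becomes bank = word index % 32, and a request is resolved across the full 32-thread warp rather than per half-warp. Repeating the s = 1 analysis over 32 lanes (my own back-of-the-envelope check, assuming a block large enough that all 32 lanes pass the index < blockDim.x test) still yields a 2-way conflict on the read:

    #include <stdio.h>

    int main(void)
    {
        // 32 banks of 4 bytes, resolved per full 32-thread warp
        int hits[32] = {0};
        for (int tid = 0; tid < 32; tid++) {
            int index = 2 * 1 * tid;       // s == 1, first loop iteration
            hits[(index + 1) % 32]++;      // bank of the read sdata[index + 1]
        }
        int worst = 0;
        for (int b = 0; b < 32; b++)
            if (hits[b] > worst) worst = hits[b];
        printf("32-bank device, s = 1: %d-way conflict\n", worst);   // prints 2
        return 0;
    }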