I'm running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The code itself is fairly complex, but I reproduced it in the simple example attached below.
In this case it is reduced to a simple copy from global -> shared -> global memory of a 2D array of size 16x16, using a shared-memory array which might be padded at the right side (variable ng).

If I compile the code with ng=0 and examine the shared-memory access pattern with NVVP, it tells me that there are "no issues". With e.g. ng=2 I get "Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1" at the lines marked with "NVVP warning". I don't understand why (or, more specifically, why the padding causes the warnings).
EDIT: as mentioned by Greg Smith below, on Kepler there are 32 banks, each 8 bytes wide (http://gpgpu.org/wp/wp-content/uploads/2013/09/08-opti-smem-instr.pdf, slide 18). But I don't see how that changes the problem.
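For reference, the bank width that actually applies can be queried from the host. A minimal sketch, using the runtime call cudaDeviceGetSharedMemConfig (the rest of the program is just a throwaway check):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the shared-memory bank width the device is currently configured for.
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);
    printf("bank size: %s\n",
           config == cudaSharedMemBankSizeEightByte ? "8 byte" :
           config == cudaSharedMemBankSizeFourByte  ? "4 byte" : "default");
    return 0;
}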
If I understand things correctly, with 32 banks (B0, B1, ..., B31) of 4 bytes each, every double spans two banks, so doubles (D00, D01, ...) are stored as:

bank:    B0  B1  B2  B3  B4  B5  ..  B30 B31
---------------------------------------------
index:   D00     D01     D02     ..      D15
         D16     D17     D18     ..      D31
         D32     D33     D34     ..      D47
Without the padding, half warps write (as[ijs] = in[ij]) to shared memory D00 .. D15, D16 .. D31, etc. With padding (of size 2) the first half warp writes to D00 .. D15 and the second, after the padding, to D18 .. D33, which still shouldn't cause a bank conflict?
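To make that arithmetic explicit, here is a small host-side sketch (not part of the kernel, just a check of the 4-byte-bank assumption above, under which each double spans two banks). With ng=2 the second half warp touches banks 4..31 and then 0..3, so each bank is still touched only once per half warp:

// Which 4-byte banks does each thread of a half warp touch,
// assuming 32 banks of 4 bytes (each double spans two banks)?
#include <cstdio>

int main()
{
    const int ng = 2; // padding in doubles
    for (int halfwarp = 0; halfwarp < 2; ++halfwarp)
        for (int tx = 0; tx < 16; ++tx) // 16 threads per half warp
        {
            const int ijs = tx + halfwarp * (16 + ng);  // shared index, in doubles
            const int b0  = (2 * ijs)     % 32;         // first 4-byte bank touched
            const int b1  = (2 * ijs + 1) % 32;         // second 4-byte bank touched
            printf("half warp %d, thread %2d -> banks %2d, %2d\n", halfwarp, tx, b0, b1);
        }
    return 0;
}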
Any idea what might be going wrong here?
Simplified example (tested with cuda 6.5.14):
// Compiled with nvcc -O3 -arch=sm_35 -lineinfo
#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;      // index in the unpadded global array
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng); // index in the padded shared array
    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}
int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng   = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];

    for(int n=0; n<ncells; ++n)
        in[n] = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));

    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1, 1 , 1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}
It turns out that I didn't understand the Kepler architecture correctly. As pointed out in one of the comments above by Greg Smith, Kepler can be configured to have 32 shared-memory banks of 8 bytes. In that case, selected with cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte), the shared-memory layout looks like:
bank:   B0  B1  B2  B3  B4  ..  B31
------------------------------------
index: D00 D01 D02 D03 D04  ..  D31
       D32 D33 D34 D35 D36  ..  D63
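For reference, a minimal sketch of what that host-side change looks like in the example above (only the launch-related lines are shown; error checking omitted):

// Select 8-byte shared-memory banks (Kepler) before launching the kernel.
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

dim3 gridGPU (1, 1, 1);
dim3 blockGPU(16, 16, 1);
copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);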
Now, for my simple example (using itot=16), the writing/reading to/from shared memory of e.g. the first two rows (threadIdx.y=0 and threadIdx.y=1) is handled within one warp. This means that for threadIdx.y=0 the values D00..D15 are stored in B00..B15, then there is a padding of two doubles, after which, within the same warp, the values D18..D33 are stored in B18..B31+B00..B01. This causes a bank conflict on B00 and B01: within one warp D00 and D32 both map to B00, and D01 and D33 both map to B01. Without the padding (ng=0) the first row is written to D00..D15 in B00..B15 and the second row to D16..D31 in B16..B31, so no bank conflict occurs.
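A small host-side sketch of that bookkeeping (my own throwaway check, not part of the kernel): it walks over the first warp (threadIdx.y = 0 and 1 for blockDim.x = 16), computes the shared index ijs and its 8-byte bank, and counts how often each bank is hit:

// Count how often each 8-byte bank is accessed by the first warp
// (threadIdx.y = 0 and 1) for blockDim.x = 16 and a padding of ng doubles.
#include <cstdio>

int main()
{
    const int blockDimX = 16;
    const int ng        = 2;   // set to 0 and no conflict is reported
    int hits[32]        = {0};

    for (int ty = 0; ty < 2; ++ty)             // two rows form one warp
        for (int tx = 0; tx < blockDimX; ++tx)
        {
            const int ijs  = tx + ty * (blockDimX + ng); // shared index, in doubles
            const int bank = ijs % 32;                   // 8-byte banks: one double per bank
            ++hits[bank];
        }

    for (int b = 0; b < 32; ++b)
        if (hits[b] > 1)
            printf("bank B%02d is accessed %d times -> %d-way conflict\n", b, hits[b], hits[b]);
    return 0;
}

With ng=2 this prints a 2-way conflict for B00 and B01, matching the NVVP numbers (2 transactions where 1 would be ideal); with ng=0 it prints nothing.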
For a thread block with blockDim.x>=32 the problem shouldn't occur. For example, for itot=32, blockDim.x=32 and ng=2, the first row is stored in banks B00..B31, then come two cells of padding, then the second row lands in B02..B31+B00..B01, etc.
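Running the same counting sketch with blockDimX = 32 and a single row per warp (tx = 0..31 for one fixed ty) should report every bank being hit exactly once, whatever the value of ng: each warp now only covers one row, and 32 consecutive shared indices always map to 32 distinct banks.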