I'm running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The code itself is fairly complex, but I reproduced it in the simple example attached below.
In this case it is reduced to a simple copy from global -> shared -> global memory of a 2D array of size 16x16, using a shared-memory array which might be padded at the right side (variable ng).

If I compile the code with ng=0 and examine the shared-memory access pattern with NVVP, it tells me that there are "no issues". With e.g. ng=2 I get "Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1" at the lines marked with "NVVP warning". I don't understand why (or, more specifically, why the padding causes the warnings).
EDIT: as mentioned by Greg Smith below, on Kepler there are 32 banks, each 8 bytes wide (http://gpgpu.org/wp/wp-content/uploads/2013/09/08-opti-smem-instr.pdf, slide 18). But I don't see how that changes the problem.
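For reference, the bank width that actually applies can be queried from the host. A minimal sketch, using the runtime call cudaDeviceGetSharedMemConfig (the rest of the program is just a throwaway check):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the shared-memory bank width the device is currently configured for.
    cudaSharedMemConfig config;
    cudaDeviceGetSharedMemConfig(&config);
    printf("bank size: %s\n",
           config == cudaSharedMemBankSizeEightByte ? "8 byte" :
           config == cudaSharedMemBankSizeFourByte  ? "4 byte" : "default");
    return 0;
}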
If I understand things correctly, with 32 banks (B0, B1, ..., B31) of 4 bytes each, every double spans two banks, so doubles (D00, D01, ...) are stored as:

bank:    B0  B1  B2  B3  B4  B5  ..  B30 B31
---------------------------------------------
index:   D00     D01     D02     ..      D15
         D16     D17     D18     ..      D31
         D32     D33     D34     ..      D47
Without the padding, half warps write (as[ijs] = in[ij]) to shared memory D00 .. D15, D16 .. D31, etc. With padding (of size 2) the first half warp writes to D00 .. D15 and the second, after the padding, to D18 .. D33, which still shouldn't cause a bank conflict?
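To make that arithmetic explicit, here is a small host-side sketch (not part of the kernel, just a check of the 4-byte-bank assumption above, under which each double spans two banks). With ng=2 the second half warp touches banks 4..31 and then 0..3, so each bank is still touched only once per half warp:

// Which 4-byte banks does each thread of a half warp touch,
// assuming 32 banks of 4 bytes (each double spans two banks)?
#include <cstdio>

int main()
{
    const int ng = 2; // padding in doubles
    for (int halfwarp = 0; halfwarp < 2; ++halfwarp)
        for (int tx = 0; tx < 16; ++tx) // 16 threads per half warp
        {
            const int ijs = tx + halfwarp * (16 + ng);  // shared index, in doubles
            const int b0  = (2 * ijs)     % 32;         // first 4-byte bank touched
            const int b1  = (2 * ijs + 1) % 32;         // second 4-byte bank touched
            printf("half warp %d, thread %2d -> banks %2d, %2d\n", halfwarp, tx, b0, b1);
        }
    return 0;
}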
Any idea what might be going wrong here?
Simplified example (tested with cuda 6.5.14):
// Compiled with nvcc -O3 -arch=sm_35 -lineinfo
#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij  = threadIdx.x + threadIdx.y*blockDim.x;      // index in the unpadded global array
    const int ijs = threadIdx.x + threadIdx.y*(blockDim.x+ng); // index in the padded shared array
    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}
int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng   = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];

    for(int n=0; n<ncells; ++n)
        in[n] = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));

    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1, 1 , 1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}
It turns out that I didn't understand the Kepler architecture correctly. As pointed out in one of the comments above by Greg Smith, Kepler can be configured to have 32 shared-memory banks of 8 bytes. In that case, selected with cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte), the shared-memory layout looks like:
bank:   B0  B1  B2  B3  B4  ..  B31
------------------------------------
index: D00 D01 D02 D03 D04  ..  D31
       D32 D33 D34 D35 D36  ..  D63
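For reference, a minimal sketch of what that host-side change looks like in the example above (only the launch-related lines are shown; error checking omitted):

// Select 8-byte shared-memory banks (Kepler) before launching the kernel.
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

dim3 gridGPU (1, 1, 1);
dim3 blockGPU(16, 16, 1);
copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);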
Now, for my simple example (using itot=16), the writing/reading to/from shared memory of e.g. the first two rows (threadIdx.y=0 and threadIdx.y=1) is handled within one warp. This means that for threadIdx.y=0 the values D00..D15 are stored in B00..B15, then there is a padding of two doubles, after which, within the same warp, the values D18..D33 are stored in B18..B31+B00..B01. This causes a bank conflict on B00 and B01: within one warp D00 and D32 both map to B00, and D01 and D33 both map to B01. Without the padding (ng=0) the first row is written to D00..D15 in B00..B15 and the second row to D16..D31 in B16..B31, so no bank conflict occurs.
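A small host-side sketch of that bookkeeping (my own throwaway check, not part of the kernel): it walks over the first warp (threadIdx.y = 0 and 1 for blockDim.x = 16), computes the shared index ijs and its 8-byte bank, and counts how often each bank is hit:

// Count how often each 8-byte bank is accessed by the first warp
// (threadIdx.y = 0 and 1) for blockDim.x = 16 and a padding of ng doubles.
#include <cstdio>

int main()
{
    const int blockDimX = 16;
    const int ng        = 2;   // set to 0 and no conflict is reported
    int hits[32]        = {0};

    for (int ty = 0; ty < 2; ++ty)             // two rows form one warp
        for (int tx = 0; tx < blockDimX; ++tx)
        {
            const int ijs  = tx + ty * (blockDimX + ng); // shared index, in doubles
            const int bank = ijs % 32;                   // 8-byte banks: one double per bank
            ++hits[bank];
        }

    for (int b = 0; b < 32; ++b)
        if (hits[b] > 1)
            printf("bank B%02d is accessed %d times -> %d-way conflict\n", b, hits[b], hits[b]);
    return 0;
}

With ng=2 this prints a 2-way conflict for B00 and B01, matching the NVVP numbers (2 transactions where 1 would be ideal); with ng=0 it prints nothing.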
For a thread block with blockDim.x>=32 the problem shouldn't occur. For example, for itot=32, blockDim.x=32 and ng=2, the first row is stored in banks B00..B31, then come two cells of padding, then the second row lands in B02..B31+B00..B01, etc.
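Running the same counting sketch with blockDimX = 32 and a single row per warp (tx = 0..31 for one fixed ty) should report every bank being hit exactly once, whatever the value of ng: each warp now only covers one row, and 32 consecutive shared indices always map to 32 distinct banks.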