I am working with a CUDA kernel that must operate on pointers-to-pointers. The kernel essentially performs a large number of very small reductions, which are best done serially since each reduction is only of size Nptrs=3-4. Here are two implementations of the kernel:
__global__
void kernel_RaiseIndexSLOW(double*__restrict__*__restrict__ A0gpu,
                           const double*__restrict__*__restrict__ B0gpu,
                           const double*__restrict__*__restrict__ C0gpu,
                           const int Nptrs, const int Nx){
    const int i = blockIdx.y;
    const int j = blockIdx.z;
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<Nptrs) {
        if(j<Nptrs) {
            // grid-stride loop over the Nx elements
            for (int x = idx; x < Nx; x += blockDim.x*gridDim.x){
                A0gpu[i+3*j][x] = B0gpu[i][x]*C0gpu[3*j][x]
                                 +B0gpu[i+3][x]*C0gpu[1+3*j][x]
                                 +B0gpu[i+6][x]*C0gpu[2+3*j][x];
            }
        }
    }
}
__global__
void kernel_RaiseIndexsepderef(double*__restrict__*__restrict__ A0gpu,
                               const double*__restrict__*__restrict__ B0gpu,
                               const double*__restrict__*__restrict__ C0gpu,
                               const int Nptrs, const int Nx){
    const int i = blockIdx.y;
    const int j = blockIdx.z;
    const int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if(i<Nptrs) {
        if(j<Nptrs){
            // dereference the pointer-to-pointer arguments once, before the loop
            double*__restrict__ A0ptr = A0gpu[i+3*j];
            const double*__restrict__ B0ptr0 = B0gpu[i];
            const double*__restrict__ C0ptr0 = C0gpu[3*j];
            const double*__restrict__ B0ptr1 = B0ptr0+3;
            const double*__restrict__ B0ptr2 = B0ptr0+6;
            const double*__restrict__ C0ptr1 = C0ptr0+1;
            const double*__restrict__ C0ptr2 = C0ptr0+2;
            for (int x = idx; x < Nx; x += blockDim.x*gridDim.x){
                double d2 = C0ptr0[x];
                double d4 = C0ptr1[x]; //FLAGGED
                double d6 = C0ptr2[x]; //FLAGGED
                double d1 = B0ptr0[x];
                double d3 = B0ptr1[x]; //FLAGGED
                double d5 = B0ptr2[x]; //FLAGGED
                A0ptr[x] = d1*d2 + d3*d4 + d5*d6;
            }
        }
    }
}
As the names indicate, the "sepderef" kernel runs about 40% faster than its counterpart, achieving, once launch overhead is accounted for, about 85 GB/s effective bandwidth at Nptrs=3, Nx=60000 on an M2090 with ECC on (~160 GB/s would be optimal).
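For context, the kernels are launched along these lines (the block size and grid.x below are placeholders rather than the exact values I used, and the device pointer arrays A0gpu/B0gpu/C0gpu are assumed to be set up already):

const int Nptrs = 3, Nx = 60000;
dim3 block(128, 1, 1);                                  // placeholder block size
dim3 grid((Nx + block.x - 1)/block.x, Nptrs, Nptrs);    // blockIdx.y -> i, blockIdx.z -> j
kernel_RaiseIndexsepderef<<<grid, block>>>(A0gpu, B0gpu, C0gpu, Nptrs, Nx);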
Running these through nvvp shows that the kernel is bandwidth bound. Strangely, however, the lines I have marked //FLAGGED are highlighted by the profiler as areas of sub-optimal memory access. I don't understand why this is, as the access here looks coalesced to me. Why would it not be?
Edit: I forgot to point this out, but notice that the //FLAGGED lines access pointers on which I have done arithmetic (e.g. B0ptr0+3), whereas the others index the base pointers obtained directly from the pointer arrays.
To understand this behaviour one needs to be aware that all CUDA GPUs so far execute instructions in-order. After an instruction to load an operand from memory has been issued, other independent instructions continue to execute. However, once an instruction is encountered that depends on the operand from memory, all further operations on this instruction stream are stalled until the operand becomes available.
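As a minimal illustration (a made-up kernel, not yours) of where the stall actually happens:

__global__ void stall_sketch(const double*__restrict__ a,
                             const double*__restrict__ b,
                             double*__restrict__ out, const int n){
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if(i < n){
        double va = a[i];   // load issued; execution continues past it
        double vb = b[i];   // second, independent load issued; latencies overlap
        out[i] = va*vb;     // first instruction that needs the loaded values:
                            // the warp stalls here until both operands arrive
    }
}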
In your "sepderef" example, you are loading all operands from memory before summing them, which means that potentially the global memory latency is incurred only once per loop iteration (there are six loads per loop iteration, but they can all overlap. Only the first addition of the loop will stall, until it's operands are available. After the stall, all other additions will have their operands readily or very soon available).
In the "SLOW" example, loading from memory and addition are intermixed, so global memory latency is incurred multiple times per loop operation.
You may wonder why the compiler doesn't automatically reorder the load instructions ahead of the computation. The CUDA compilers used to do this very aggressively, expending extra registers in which the operands wait until they are used. CUDA 8.0, however, appears to be far less aggressive in this respect, sticking much more closely to the order of instructions in the source code. This gives the programmer a better opportunity to structure the code optimally for performance where the compiler's instruction scheduling was suboptimal, but it also puts more of a burden on the programmer to schedule instructions explicitly even where previous compiler versions got it right.