I compiled a kernel with NVRTC:
__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;  // integer division
    unsigned char lane_id = idx % 32;  // integer modulo
    /* ... */
}
I know integer division and modulo are very costly on CUDA GPUs. However, I thought this kind of division by a power of 2 would be optimized into bit operations, until I found that it isn't:
__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;  // bit shift instead of division
    unsigned char lane_id = idx & 31;  // bit mask instead of modulo
    /* ... */
}
It seems kernel_B simply runs faster. With all other code in the kernels omitted, launching with 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2 µs on average, while kernel_B runs for 7.4 µs on average. I speculate that NVRTC did not optimize out the integer division and modulo.
The results were obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The only compiler option given to nvrtcCompileProgram() is -arch compute_50.
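One way to verify the speculation is to dump the PTX that NVRTC emits and look for div/rem versus shr/and instructions. A minimal sketch of that check (error handling omitted; the kernel source string is a stand-in for my real code, with a store added so the compiler cannot eliminate the results):

#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    // Stand-in source; paste the real kernel_A body here.
    const char* src =
        "__global__ void kernel_A(unsigned char* out) {\n"
        "    unsigned short idx = threadIdx.x;\n"
        "    unsigned char warp_id = idx / 32;\n"
        "    unsigned char lane_id = idx % 32;\n"
        "    out[idx] = warp_id + lane_id;  // keep the results live\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kernel_A.cu", 0, nullptr, nullptr);

    // Long form of the question's -arch compute_50.
    const char* opts[] = { "--gpu-architecture=compute_50" };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // With optimizations enabled, expect shifts/masks here, not div/rem.
    printf("%s", ptx.data());
    return 0;
}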
Is this expected?
Did a thorough bug sweep in the codebase. Turns out my app was built in DEBUG mode, which causes the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram(). From the nvcc man page:
--device-debug (-G)
    Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.
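So the fix is to keep -G out of any build you profile. A sketch of how the option list might be assembled (buildNvrtcOptions and the DEBUG macro are illustrative, not my actual code):

#include <vector>

// Hypothetical helper: only add the debug flags in DEBUG builds.
std::vector<const char*> buildNvrtcOptions() {
    std::vector<const char*> opts = { "--gpu-architecture=compute_50" };
#ifdef DEBUG
    opts.push_back("-G");          // turns off ALL device optimizations
    opts.push_back("-lineinfo");
#endif
    return opts;
}

// Usage:
//   auto opts = buildNvrtcOptions();
//   nvrtcCompileProgram(prog, (int)opts.size(), opts.data());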