Tags: cuda, gpu, gpgpu, cuda-gdb

Is it possible to change the order in which CUDA thread blocks are scheduled when compiled with `--device-debug`?


Short Version

I have a kernel that launches a lot of blocks, and I know that illegal memory reads are happening for blockIdx.y = 312. Running it under cuda-gdb executes the blocks sequentially, 16 at a time, so it takes very long for execution to reach that block index, even with a conditional breakpoint.

Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb? If not, is there any other debugging strategy that I might have missed?

Longer Version

I have a baseline convolution CUDA kernel that scales with problem size by launching more blocks. There is a bug for input images with dimensions on the order of 10,000 x 10,000. Running it under cuda-memcheck, I see the following.

...
========= Invalid __global__ read of size 4
=========     at 0x00000150 in convolution_kernel_sharedmem(float*, float*, float*)
=========     by thread (30,31,0) in block (0,312,0)
...

All illegal accesses appear to happen for blockIdx.y = 312. When running under cuda-gdb, blocks are executed 16 at a time starting from (0, 0, 0). I have set a conditional breakpoint at the kernel to stop at the desired block index (roughly as in the sketch below), but it is taking a very long time to get there.
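For reference, the debug build and conditional breakpoint look something like this (the source and binary names are placeholders; the kernel name is taken from the cuda-memcheck output above):

    $ nvcc -G -g -o conv conv.cu            # device-debug build (-G is --device-debug)
    $ cuda-gdb ./conv
    (cuda-gdb) break convolution_kernel_sharedmem if blockIdx.y == 312
    (cuda-gdb) run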

Is there any way to change the order in which thread blocks are scheduled on the device? If not, is there any alternative debugging strategy that I might have missed?

P.S.: I know that I could use a grid-stride loop instead of launching this many blocks, but I would like to know what is wrong with this particular implementation.
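(For completeness, by a grid-stride loop I mean something like the sketch below; the kernel name and body are placeholders, not my actual convolution.)

    __global__ void kernel_grid_stride(const float *in, float *out, int n)
    {
        // A fixed, small grid covers any problem size: each thread strides
        // through the data by the total number of threads in the grid.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            out[i] = in[i];   // placeholder body
        }
    }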


Solution

  • Is there any way to change the order in which thread blocks are scheduled when running under cuda-gdb?

    There is no way to change the threadblock scheduling order unless you rewrite the code and take control of threadblock scheduling yourself. Note that the linked example does not show exactly how to redefine the threadblock scheduling order, but it has all the necessary ingredients (a rough sketch of the idea follows below). In practice I don't see many people wanting to do this level of refactoring, but I mention it for completeness.
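    A minimal sketch of that kind of manual scheduling, assuming a remapping of block indices through a global atomic counter (the names and the particular remapping are mine, not taken from the linked example):

        // Counter used to hand out logical block numbers in an order you choose.
        // Reset it (e.g. via cudaMemcpyToSymbol) before every launch.
        __device__ unsigned int block_counter = 0;

        __global__ void convolution_manual_schedule(const float *in, float *out,
                                                    unsigned int blocks_x,
                                                    unsigned int blocks_y)
        {
            __shared__ unsigned int logical;

            // One thread per threadblock claims the next logical block number.
            if (threadIdx.x == 0 && threadIdx.y == 0)
                logical = atomicAdd(&block_counter, 1u);
            __syncthreads();

            // Remap so that logical row 312 is handed out first, regardless of
            // which physical block the hardware happens to schedule first.
            unsigned int reordered = (logical + 312u * blocks_x) % (blocks_x * blocks_y);
            unsigned int by = reordered / blocks_x;   // use instead of blockIdx.y
            unsigned int bx = reordered % blocks_x;   // use instead of blockIdx.x

            // ... derive all thread/pixel indices from bx/by instead of blockIdx
            //     and run the usual kernel body ...
            (void)in; (void)out; (void)bx; (void)by;
        }

    The kernel is still launched with the full grid; only the mapping from physical to logical block index changes, so the block you care about is processed by whichever physical block happens to run first.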

    If not, is there any other debugging strategy that I might have missed?

    The method described here can localize your error to a specific line of kernel code. From there you can use e.g. a conditional printf to identify the illegal index calculation, as sketched below. Note that for that method there is no need to compile your code with debug switches, but you do need to compile with -lineinfo.
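    As a sketch of that second step (the parameter list and index math here are assumptions for illustration; the real signature only shows three float pointers), guard the suspect load and print only the out-of-range cases:

        #include <cstdio>

        // Hypothetical reconstruction of the kernel's indexing, for illustration only.
        __global__ void convolution_kernel_sharedmem(const float *in, float *out,
                                                     const float *mask,
                                                     int width, int height)
        {
            int col = blockIdx.x * blockDim.x + threadIdx.x;
            int row = blockIdx.y * blockDim.y + threadIdx.y;
            int idx = row * width + col;          // the index calculation under suspicion

            // Report only the bad cases instead of printing from every thread.
            if (idx < 0 || idx >= width * height) {
                printf("bad idx %d from block (%d,%d) thread (%d,%d)\n",
                       idx, blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y);
                return;
            }

            float v = in[idx];
            // ... rest of the convolution using mask ...
            out[idx] = v;
            (void)mask;
        }

    Compiled with `nvcc -lineinfo` (optimizations stay on) and run under cuda-memcheck or its successor compute-sanitizer, the report then points at the source line of the bad read, and the printf narrows down which index calculation produced it.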

    This training topic provides a longer treatment of CUDA debugging.