I have a CUDA program. If I compile it with the -g -G options, I get the correct output. If I compile it without -g -G, I don't always get the correct output. My guess is that without -g -G the compiler optimizes the program and causes a race condition. Can anyone confirm this, and also let me know what I should do so the program produces the correct output even when it is compiled without -g -G? Thanks
Operating system: x86_64 GNU/Linux. CUDA version: 4.0. Device: GeForce 200; it is one of several GPUs attached to the machine, and I don't think it is a display device.
-G enables debugging of device code (by generating debug symbols) and disables optimizations for device code. -g is a host compiler option that generates debug symbols for host code (and may disable host code optimizations).
This likely points to a race condition in your code that only disappears when optimizations are disabled. You may, for example, be missing a __syncthreads() in your device code that is required for correctness. Without it, the compiler is free to move loads and stores around in ways that produce incorrect results; such code motion is a perfectly valid optimization as long as it does not cross a barrier (__syncthreads()). When you specify -G, such optimizations are probably disabled, and therefore the race condition does not manifest.
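As a hypothetical illustration (this kernel is not from the question), the sketch below shows the classic shared-memory dependency where a missing __syncthreads() produces exactly this symptom: each thread writes one slot of a shared tile and then reads a neighbour's slot. Without the barrier, a thread may read before its neighbour has written; the unoptimized code generated by -G often happens to hide the race, while optimized code exposes it.

```cuda
// Hypothetical example kernel: shift each element left by one within a block.
__global__ void shift_left(const float *in, float *out, int n)
{
    __shared__ float tile[256];          // assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];       // each thread writes its own slot

    __syncthreads();                     // required: without this barrier,
                                         // reading tile[threadIdx.x + 1] below
                                         // races with the neighbour's write

    if (i < n)
        out[i] = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1]
                                                : tile[threadIdx.x];
}
```

Note that __syncthreads() must be reached by all threads of the block, which is why it sits outside the `if (i < n)` guards here.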
First, ensure that it is the GPU code that is failing and not the CPU code, by disabling GPU debugging (remove -G) while keeping CPU debugging enabled (leave -g).
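Concretely, assuming a single source file (app.cu is just a placeholder name), these are the two builds to compare:

```shell
# Host debugging only: device code is still optimized,
# so a device-side race should still reproduce.
nvcc -g -o app app.cu

# Host + device debugging: device optimizations are disabled,
# which is what hides the race.
nvcc -g -G -o app app.cu
```

If the first build fails and the second succeeds, the problem is almost certainly in device code and sensitive to device optimization.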
Then, narrow it down to the specific kernel that is failing (the process for narrowing this down is program-dependent). Inspect that kernel for places where you need synchronization (dependencies through shared memory are the common situation) but don't have it.
If you can't find the bug, but you can narrow it down to the specific kernel that is producing wrong results, try posting that kernel's code here if you can, so others can help you spot the problem.