
Invoke kernel failure through cuda-gdb?


Is there a way to force a kernel failure from within cuda-gdb? I've tried stepping through the kernel code and setting invalid index positions and odd variable values, but after continuing from the bad setting I never get a "kernel execution failed" error.

Does anyone know of a proper way to do this through cuda-gdb? I've read through the cuda-gdb documentation twice but may have missed how to achieve this, if it is possible at all. Pointers to any tools or techniques would be most appreciated, thanks.

I'm on CentOS 7 and my device's compute capability is 2.1. See below for the output of the uname -a command.

Linux john 3.10.0-327.10.1.el7.x86_64 #1 SMP Tue Feb 16 17:03:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Solution

  • Is there a way to invoke kernel failure using cuda-gdb?

    Yes, it's possible. Here is a fully worked example:

    $ cat t678.cu
    #include <stdio.h>
    __global__ void kernel(int *data){
    
      int idx = 0;  // line 4
      idx += data[0];
      int tval = data[idx];
      data[1] =  tval;
    }
    
    int main(){
    
      int *d_data;
      cudaMalloc(&d_data, 32*sizeof(int));
      cudaMemset(d_data, 0, 32*sizeof(int));
      kernel<<<1,1>>>(d_data);
      cudaDeviceSynchronize();
      cudaError_t err = cudaGetLastError();
      if (err != cudaSuccess) printf("kernel fail %s\n", cudaGetErrorString(err));
    }
    $ nvcc -g -G -o t678 t678.cu
    $ cuda-gdb ./t678
    NVIDIA (R) CUDA Debugger
    7.5 release
    Portions Copyright (C) 2007-2015 NVIDIA Corporation
    GNU gdb (GDB) 7.6.2
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /home/user2/misc/t678...done.
    (cuda-gdb) break t678.cu:4
    Breakpoint 1 at 0x4026d5: file t678.cu, line 4.
    (cuda-gdb) run
    Starting program: /home/user2/misc/./t678
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib64/libthread_db.so.1".
    [New Thread 0x7ffff700a700 (LWP 8693)]
    [Switching focus to CUDA kernel 0, grid 2, block (0,0,0), thread (0,0,0), device 0, sm 14, warp 2, lane 0]
    
    Breakpoint 1, kernel<<<(1,1,1),(1,1,1)>>> (data=0x13047a0000) at t678.cu:4
    4         int idx = 0;  // line 4
    (cuda-gdb) step
    5         idx += data[0];
    (cuda-gdb) print idx
    $1 = 0
    (cuda-gdb) set idx=1000000
    (cuda-gdb) step
    6         int tval = data[idx];
    (cuda-gdb) print idx
    $2 = 1000000
    (cuda-gdb) step
    
    CUDA Exception: Device Illegal Address
    The exception was triggered in device 0.
    
    Program received signal CUDA_EXCEPTION_10, Device Illegal Address.
    kernel<<<(1,1,1),(1,1,1)>>> (data=0x13047a0000) at t678.cu:7
    7         data[1] =  tval;
    (cuda-gdb)
    

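    In the cuda-gdb session above, idx is overwritten with a large value (1000000) while the kernel is stopped at line 5. Because data[0] is 0 (the buffer was cleared with cudaMemset), idx is still 1000000 when line 6 runs, so the load indexes far beyond the 32-element allocation and raises an illegal address exception. The debugger reports the stop at line 7, but the access at fault is the one on the preceding line: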

      int tval = data[idx];
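
    If the goal is only to exercise the host-side error path, the same failure can be provoked without the debugger by passing the bad index in as a kernel argument. The following is a minimal sketch, not part of the original example: the names read_at and launch_with_index and the file name oob.cu are made up for illustration, and whether a given out-of-range index actually faults depends on what happens to be mapped at that address.

    ```cuda
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Same access pattern as the kernel above, but the index arrives as a
    // kernel argument, so no debugger is needed to corrupt it.
    __global__ void read_at(int *data, int idx) {
      int tval = data[idx];  // an out-of-range idx makes this load illegal
      data[1] = tval;        // the store keeps the load from being optimized away
    }

    // Hypothetical helper: launches the kernel with the given index and
    // returns the first CUDA error reported (cudaSuccess if none).
    cudaError_t launch_with_index(int idx) {
      int *d_data = NULL;
      cudaError_t err = cudaMalloc(&d_data, 32 * sizeof(int));
      if (err != cudaSuccess) return err;
      cudaMemset(d_data, 0, 32 * sizeof(int));

      read_at<<<1, 1>>>(d_data, idx);
      err = cudaGetLastError();        // catches launch errors
      if (err == cudaSuccess)
        err = cudaDeviceSynchronize(); // catches errors raised while the kernel ran

      cudaFree(d_data);  // may itself fail after a kernel fault; ignored here
      return err;
    }

    int main(int argc, char **argv) {
      // Default to the same index used in the debugger session above.
      int idx = (argc > 1) ? atoi(argv[1]) : 1000000;
      cudaError_t err = launch_with_index(idx);
      if (err != cudaSuccess)
        printf("kernel fail %s\n", cudaGetErrorString(err));
      else
        printf("kernel ok\n");
      return err == cudaSuccess ? 0 : 1;
    }
    ```

    Built with nvcc -o oob oob.cu, running ./oob 0 uses a valid index, while running ./oob with no argument uses 1000000 and is expected to take the failure branch on a setup like the one above. Running the binary under cuda-memcheck is another way to see which access was at fault.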