Tags: cuda, cuda-gdb

cuda-gdb exits with "[1]+ Stopped" when it hits a kernel call


I'm pretty new to CUDA and flying a bit by the seat of my pants here...

I'm trying to debug my CUDA program on a remote machine where I don't have admin rights. I compile the program with nvcc -g -G and then run it under cuda-gdb. However, as soon as execution reaches a kernel call (it doesn't even have to enter the kernel, and it never happens in host code), I get:

(cuda-gdb) run
Starting program: /path/to/my/binary/cuda_clustered_tree 
[Thread debugging using libthread_db enabled]

[1]+  Stopped                 cuda-gdb cuda_clustered_tree

cuda-gdb then dumps me back to my terminal. If I try to run cuda-gdb again, I get:

An instance of cuda-gdb (pid 4065) is already using device 0. If you believe
you are seeing this message in error, try deleting /tmp/cuda-dbg/cuda-gdb.lock.

The only way to recover is to kill -9 both cuda-gdb and a process listed as cuda_clustered_ (I assume the latter is my binary, with its name truncated in the process list).
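The recovery step above can be sketched as a small shell function. This is only a rough sketch: the function name cleanup_cuda_gdb is made up for illustration, the PIDs have to be looked up by hand (for example with ps), and removing the lock file is included only because the error message itself suggests it.

```shell
# Sketch: force-kill leftover debugger processes, then remove the stale lock.
# cleanup_cuda_gdb is a hypothetical helper name, not part of the CUDA toolkit.
# Usage: cleanup_cuda_gdb LOCKFILE [PID...]
cleanup_cuda_gdb() {
    lock=$1
    shift
    for pid in "$@"; do
        # kill -0 sends no signal; it only checks that the process exists
        # and that we are allowed to signal it.
        if kill -0 "$pid" 2>/dev/null; then
            kill -9 "$pid"
        fi
    done
    # -f: do not complain if the lock file is already gone.
    rm -f -- "$lock"
}

# Example (PID taken from the error message quoted above):
#   cleanup_cuda_gdb /tmp/cuda-dbg/cuda-gdb.lock 4065
```

Passing the PIDs explicitly, rather than matching on process names, avoids killing someone else's debugging session on a shared machine.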

This machine has two GPUs and is running CUDA 4.1 (I believe; several versions are installed, but that's the one my PATH and LD_LIBRARY_PATH point to). It compiles and runs deviceQuery and bandwidthTest fine.

I can provide more info if need be. I've searched everywhere I could think of online and found nothing that helps.


Solution

  • Figured it out! Turns out, cuda-gdb hates csh.

    If you are running csh, cuda-gdb will exhibit the behavior described above. Even when I ran bash from within csh and then ran cuda-gdb, I still saw it. Your shell needs to be started as bash, and only bash.

    On this machine, the default shell was csh, but I use bash. I wasn't allowed to change the default directly, so I had added 'exec /bin/bash --login' to my .login script.

    So even though I was running bash, cuda-gdb still misbehaved because bash had been started by csh. Removing the 'exec' command, so that I was running csh directly with nothing on top, still showed the behavior.

    In the end, I had to get IT to change my shell to bash directly (after much patient troubleshooting on their part). Now it works as intended.
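To check whether you might be in the same situation, a sketch like the following can help. It assumes that $SHELL still holds the login shell set at login, which is usually the case even after an 'exec /bin/bash --login' from .login, since bash normally leaves an already-set SHELL alone. The helper name is_csh_family is made up for illustration.

```shell
# Sketch: warn if the login shell looks like csh or tcsh.
# is_csh_family is a hypothetical helper name.
is_csh_family() {
    name=${1##*/}   # strip any directory part, e.g. /bin/tcsh -> tcsh
    name=${name#-}  # login shells can show up as "-csh" in process lists
    case "$name" in
        csh|tcsh) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumption: SHELL reflects the login shell, not the shell running now.
login_shell=${SHELL:-unknown}
if is_csh_family "$login_shell"; then
    echo "login shell is $login_shell; cuda-gdb may stop at kernel launches"
else
    echo "login shell is $login_shell"
fi
```

If this reports csh or tcsh while you are typing into bash, the bash session was most likely started on top of csh, which is the setup that triggered the problem here.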