Tracing custom CUDA kernels with Nsight Systems

I work on a library that is implemented in C++20 and CUDA 11. The library is called from Python via ctypes through a C API that just exchanges JSON strings. We compile it with Clang 11.

In order to profile the code I have added a lot of NVTX ranges to the C++ code. This works well with Nsight Systems: I can see the stack of ranges with their manually chosen names when I use nsys profile -t nvtx … to gather data. This doesn't tell me anything about the GPU, though, so I specify -t nvtx,cuda,cublas,cudnn in order to get more information.

But only one of our many kernels shows up in the resulting timeline.

In the screenshot one can see the NVTX ranges and the calls to the CUDA API (memcpy and the like), but only a single kernel, which I have marked with a red arrow.

We have a bunch of different kernels and launch them with the <<<>>> syntax right from the .cu files.

It feels like I am missing either a tracing flag for nsys, some compilation option for the CUDA code or some code annotations like NVTX for the kernel code. What do I have to do such that my custom kernels show up in the profile?

Solution

The issue was most likely that I had not stopped the data collection properly. Our program is an interactive server that one stops with SIGINT, and the data was perhaps not stored properly after the interrupt.

I have added calls to the profiler API so that cudaProfilerStop() is called explicitly once our main loop is done. I've done it with a small RAII wrapper so that it also works with SIGINT, as long as the signal is handled in a way that lets the main loop return normally and the destructor run.

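#include <cuda_profiler_api.h>

// RAII wrapper: profiling starts on construction and stops on destruction.
class ProfilingRange {
 public:
  ProfilingRange() {
    cudaProfilerStart();
  }

  ~ProfilingRange() {
    // Runs on normal scope exit, which closes the capture range.
    cudaProfilerStop();
  }

  // Non-copyable, so that start and stop stay paired.
  ProfilingRange(const ProfilingRange&) = delete;
  ProfilingRange& operator=(const ProfilingRange&) = delete;
};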

On the nsys profile command line I specify --capture-range=cudaProfilerApi and it seems to work fine. Now a lot of kernels show up, and I can learn a lot more about the system.
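
The RAII wrapper only helps with SIGINT if the signal does not kill the process outright, because a destructor runs only when its scope is left normally. Below is a minimal sketch of one way to arrange that, not our actual server code. The names profiler_start, profiler_stop, handle_sigint and serve_until_interrupted are made up for illustration, and the two profiler functions are stand-ins for cudaProfilerStart() and cudaProfilerStop() so that the sketch builds without the CUDA toolkit.

```cpp
#include <csignal>

// Stand-ins for cudaProfilerStart()/cudaProfilerStop() (assumption: in real
// code these would be the calls from <cuda_profiler_api.h>).
static int g_start_calls = 0;
static int g_stop_calls = 0;
void profiler_start() { ++g_start_calls; }
void profiler_stop() { ++g_stop_calls; }

// Same idea as the wrapper above, using the stand-in calls.
class ProfilingRange {
 public:
  ProfilingRange() { profiler_start(); }
  ~ProfilingRange() { profiler_stop(); }
  ProfilingRange(const ProfilingRange&) = delete;
  ProfilingRange& operator=(const ProfilingRange&) = delete;
};

// Set by the signal handler, polled by the main loop.
static volatile std::sig_atomic_t g_stop_requested = 0;

// The handler only records the request; it does not terminate the process.
extern "C" void handle_sigint(int) { g_stop_requested = 1; }

// Hypothetical main loop: runs until SIGINT arrives or max_iterations is
// reached, then returns normally so that ~ProfilingRange() is executed.
int serve_until_interrupted(int max_iterations) {
  std::signal(SIGINT, handle_sigint);
  ProfilingRange range;
  int iterations = 0;
  while (!g_stop_requested && iterations < max_iterations) {
    // Placeholder for handling one request and launching kernels.
    ++iterations;
  }
  return iterations;
}
```

The point of the design is that the handler does nothing except set a flag. The loop notices the flag, the function returns, and the destructor stops the profiler before the process exits, which gives nsys a clean end of the capture range.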