I work on a library implemented in C++20 and CUDA 11. This library is called from Python via ctypes
through a C API that just exchanges JSON strings. We compile it using Clang 11.
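For context, the calling pattern looks roughly like this — a sketch, not our actual API: the library path and the `handle_request` entry point are assumed names, but the shape (one C function taking and returning JSON as C strings) is as described above.

```python
import ctypes
import json

def call_api(lib, request):
    """Send a dict to the C API as a JSON string and parse the JSON reply."""
    raw = lib.handle_request(json.dumps(request).encode("utf-8"))
    return json.loads(raw.decode("utf-8") if isinstance(raw, bytes) else raw)

# Real usage would look something like (names are placeholders):
# lib = ctypes.CDLL("libmylib.so")
# lib.handle_request.argtypes = [ctypes.c_char_p]
# lib.handle_request.restype = ctypes.c_char_p
# reply = call_api(lib, {"command": "status"})
```

The nice thing about this convention is that the ABI surface is a single function with `char *` in and out, so the Python side never has to mirror C++ struct layouts.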
In order to profile the code I have added a lot of NVTX ranges to the C++ code. This works well with Nsight Systems: I can see the stack of ranges with their manually chosen names when I use `nsys profile -t nvtx …`
to gather data. That alone tells me nothing about the GPU, though, so I specify `-t nvtx,cuda,cublas,cudnn`
to get more information.
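For reference, the gathering step then looks roughly like this (the output name and the server binary are placeholders, not our real names):

```
nsys profile -t nvtx,cuda,cublas,cudnn -o report ./my_server
```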
But out of our many kernels, only a single one shows up. The output looks like this:

One can see the nice NVTX contexts, and one can see the calls to the CUDA API (memcpy and the like). But there is only one kernel showing up; I have marked it with a red arrow.
We have a bunch of different kernels and launch them with the `<<<>>>` syntax right from the `.cu` files.
It feels like I am missing either a tracing flag for `nsys`, some compilation option for the CUDA code, or some annotation in the kernel code akin to NVTX. What do I have to do such that my custom kernels show up in the profile?
The issue could have been that I had not properly stopped the data gathering: our program is an interactive server which one stops with a SIGINT, so perhaps the data was not properly flushed after the interrupt.
I have added calls to the profiler API such that I explicitly call `cudaProfilerStop()`
after our main loop is done, wrapped in a small RAII class so that it works even with a SIGINT.
#include <cuda_profiler_api.h>

class ProfilingRange {
  public:
    ProfilingRange() { cudaProfilerStart(); }
    ~ProfilingRange() { cudaProfilerStop(); }
};
On the `nsys profile` command line I specify `--capture-range=cudaProfilerApi`, and it seems to work fine. Now a lot of kernels show up, and I can learn a lot more about the system.
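For completeness, the full invocation now looks roughly like this (binary and output names are placeholders):

```
nsys profile -t nvtx,cuda,cublas,cudnn --capture-range=cudaProfilerApi -o report ./my_server
```

With this option, collection only starts when `cudaProfilerStart()` is reached and ends at `cudaProfilerStop()`, so the RAII wrapper above bounds exactly what ends up in the report.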