I am trying to use Nsight Compute to profile kernels in my CUDA code, but how do I profile functions inside a kernel? For example, say I have two __device__ functions that are called from a __global__ kernel. Nsight Compute only profiles the kernel; it reports nothing about the functions called inside it.
Nsight Compute doesn't provide the ability to profile a __device__ function directly or individually, nor will it present results organized that way.
As indicated in the comments, a __device__ function defined in the same compilation unit as the kernel you are profiling may not even exist as a separate or identifiable entity; the compiler will usually inline it and then optimize it further together with the surrounding code.
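To make the inlining point concrete, here is a minimal sketch. The names (scale, offset, transform, run_transform) are hypothetical and not from the question. It assumes a single .cu file, so both device functions are visible to the compiler when it builds the kernel. The __noinline__ qualifier is only a request to keep a function as a real call; even when it is honored, Nsight Compute still reports results for the kernel as a whole, not for each function.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ordinary device function (hypothetical). It lives in the same compilation
// unit as the kernel, so the compiler is free to inline it into the kernel.
__device__ float scale(float x) { return 2.0f * x; }

// __noinline__ asks the compiler to keep this function as a separate call
// if possible. It is a hint, not a guarantee.
__device__ __noinline__ float offset(float x) { return x + 1.0f; }

// The kernel is the unit that the profiler reports on.
__global__ void transform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = offset(scale(in[i]));
}

// Host helper (hypothetical): runs the kernel on in[i] = i and returns
// out[idx]. Returns -1.0f if any CUDA call fails.
float run_transform(int n, int idx) {
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);
    transform<<<(n + 255) / 256, 256>>>(in, out, n);
    float result = (cudaDeviceSynchronize() == cudaSuccess) ? out[idx] : -1.0f;
    cudaFree(in);
    cudaFree(out);
    return result;
}

int main() {
    // For idx = 3 the kernel computes 2 * 3 + 1.
    printf("out[3] = %f\n", run_transform(1024, 3));
    return 0;
}
```

If you inspect the generated machine code for a build like this, you would typically expect scale to have disappeared into transform, while offset may or may not remain as a separate call depending on the compiler version and optimization flags.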
You can, however, associate some profiler information with specific lines of code on the "Source" page of the Nsight Compute UI. This blog points that out with an example.
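As a rough illustration of the line-level approach, the sketch below puts the two device-function calls on separate source lines, so that per-line metrics on the Source page can be read against each call site. All names (heavy_math, gather, pipeline, run_pipeline) and the file name in the comments are hypothetical stand-ins for the two functions in the question. It assumes the code is built with nvcc's -lineinfo option, which embeds the source line information needed for this correlation.

```cuda
// Assumed build and profile steps (file and report names are placeholders):
//   nvcc -lineinfo -o pipeline pipeline.cu
//   ncu --set full -o report ./pipeline
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-heavy device function.
__device__ float heavy_math(float x) {
    for (int k = 0; k < 64; ++k) x = x * 1.0001f + 0.5f;
    return x;
}

// Hypothetical memory-heavy device function: a strided read.
__device__ float gather(const float* data, int i, int n) {
    return data[(i * 33) % n];
}

__global__ void pipeline(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Each call sits on its own line so per-line metrics can be compared.
    float a = heavy_math(in[i]);
    float b = gather(in, i, n);
    out[i] = a + b;
}

// Host helper (hypothetical): fills the input with 1.0f, runs the kernel,
// and returns out[0]. Returns -1.0f if any CUDA call fails.
float run_pipeline(int n) {
    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    pipeline<<<(n + 255) / 256, 256>>>(in, out, n);
    float result = (cudaDeviceSynchronize() == cudaSuccess) ? out[0] : -1.0f;
    cudaFree(in);
    cudaFree(out);
    return result;
}

int main() {
    printf("out[0] = %f\n", run_pipeline(1 << 20));
    return 0;
}
```

Because the device functions are likely to be inlined, the metrics attributed to a line will cover whatever instructions the compiler generated for it, and optimization can blur the boundaries between lines. Treat the per-line numbers as a guide to where time goes rather than an exact per-function profile.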