cuda nsight nsight-systems cuda-profiling

How to get the average execution time of a CUDA kernel using Nsight Systems or Nsight Compute


Suppose I have a simple CLI test app named "Foo". This app executes a kernel "Bar" 100 times in a loop. How can I obtain the average kernel execution time for Bar using Nsight Systems or Nsight Compute, with either the GUI or the CLI version of these tools?

The NVIDIA Visual Profiler provides this information for each kernel in the Properties dialog, as "Duration (kernel)" and "Invocations".

I would like to obtain the same information with Nsight Systems or Nsight Compute, because Visual Profiler is to be deprecated.

Following the example in this post, I ran:

nv-nsight-cu-cli -k Bar Foo

I get 100 printouts, one for each kernel execution. I want only summary information for kernel Bar.
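
If only the per-launch printouts are available, the average could also be computed by post-processing them. The following is a minimal sketch, not a feature of the profiler: `avg_duration` is a hypothetical helper, the sample rows are made up for illustration, and it assumes that each launch prints a row starting with `Duration`, that the value is the last field, and that every row uses the same unit.

```shell
#!/bin/sh
# avg_duration: read per-launch profiler text on stdin and print the mean
# of every "Duration" row (hypothetical helper, not part of Nsight Compute).
# Assumes the value is the last field and all rows share one unit; the
# profiler may scale units per launch, so check this on real output.
avg_duration() {
    awk '$1 == "Duration" {
             v = $NF
             gsub(/,/, "", v)      # drop thousands separators such as 1,234.50
             sum += v; n++
         }
         END { if (n) printf "%.2f\n", sum / n }'
}

# Made-up sample standing in for three per-launch printouts.
avg_duration <<'EOF'
    Duration                usecond         111.20
    SM Frequency      cycle/nsecond           1.49
    Duration                usecond         112.90
    Duration                usecond         112.44
EOF
```

With real data, the profiler output would be piped through the function instead of the here-document, for example `nv-nsight-cu-cli -k Bar Foo | avg_duration`, assuming the report is written to standard output.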


Solution

  • You can achieve this with the Nsight Compute CLI using the option --print-summary per-gpu, which reports the minimum, maximum, and average of each collected metric, including the kernel duration. Example below:

    $ ncu -k matrixMul --print-summary per-gpu ./test | grep -C8 Duration
          ----------------------- ------------- ---------- ---------- ----------
          Metric Name               Metric Unit    Minimum    Maximum    Average
          ----------------------- ------------- ---------- ---------- ----------
          DRAM Frequency          cycle/nsecond       6.72       6.90       6.79
          SM Frequency            cycle/nsecond       1.48       1.51       1.49
          Elapsed Cycles                  cycle 166,647.00 168,469.00 167,522.43
          Memory Throughput                   %      73.43      74.10      73.76
          DRAM Throughput                     %       2.50       2.57       2.53
          Duration                      usecond     111.20     112.90     112.18
          L1/TEX Cache Throughput             %      84.50      85.35      84.99
          L2 Cache Throughput                 %      10.40      10.64      10.54
          SM Active Cycles                cycle 144,432.91 145,882.70 145,043.22
          Compute (SM) Throughput             %      73.43      74.10      73.76
          ----------------------- ------------- ---------- ---------- ----------
    
          Section: Launch Statistics
          -------------------------------- --------------- ---------- ---------- ----------