Suppose I have a simple CLI test app named "Foo". This app executes a kernel "Bar" 100 times in a loop. How can I obtain the average kernel execution time for Bar using Nsight Systems or Nsight Compute, with either the GUI or CLI versions of these apps?
The NVIDIA Visual Profiler provides this information in the Properties dialog for each kernel, as "Duration (kernel)" and "Invocations".
I would like to obtain the same information with Nsight Systems or Nsight Compute, because Visual Profiler is to be deprecated.
Following the example in this post:
nv-nsight-cu-cli -k Bar Foo
I get 100 printouts, one for each kernel execution. I want only summary information for kernel Bar.
You can achieve this with the Nsight Compute CLI using the option --print-summary per-gpu: it reports the minimum, maximum, and average of each metric across all profiled launches, including the kernel execution time (Duration). Example below:
$ ncu -k matrixMul --print-summary per-gpu ./test | grep -C8 Duration
----------------------- ------------- ---------- ---------- ----------
Metric Name Metric Unit Minimum Maximum Average
----------------------- ------------- ---------- ---------- ----------
DRAM Frequency cycle/nsecond 6.72 6.90 6.79
SM Frequency cycle/nsecond 1.48 1.51 1.49
Elapsed Cycles cycle 166,647.00 168,469.00 167,522.43
Memory Throughput % 73.43 74.10 73.76
DRAM Throughput % 2.50 2.57 2.53
Duration usecond 111.20 112.90 112.18
L1/TEX Cache Throughput % 84.50 85.35 84.99
L2 Cache Throughput % 10.40 10.64 10.54
SM Active Cycles cycle 144,432.91 145,882.70 145,043.22
Compute (SM) Throughput % 73.43 74.10 73.76
----------------------- ------------- ---------- ---------- ----------
Section: Launch Statistics
-------------------------------- --------------- ---------- ---------- ----------
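
If you would rather compute the statistics yourself (for example, to get only the Duration figures without the other metrics), one option is to export per-launch values as CSV and aggregate them in a small script. The following is a minimal sketch, not an official tool. It assumes the CSV contains the columns "Kernel Name", "Metric Name" and "Metric Value", and that the duration metric is named gpu__time_duration.sum; column names and metric names can differ between Nsight Compute versions, so check the header of your own output. The function name summarize_durations is hypothetical.

```python
import csv
import io
from statistics import mean


def summarize_durations(csv_text, kernel="Bar", metric="gpu__time_duration.sum"):
    """Return (invocations, minimum, maximum, average) of `metric` for `kernel`.

    Assumes (not guaranteed for every ncu version) that the CSV has the
    columns "Kernel Name", "Metric Name" and "Metric Value".
    """
    # Drop blank lines and profiler log lines such as "==PROF== ..." that
    # may be printed before the CSV header.
    lines = [ln for ln in csv_text.splitlines() if ln and not ln.startswith("==")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))

    values = []
    for row in reader:
        # Substring match, because kernel names may include a signature,
        # e.g. "Bar(float*, int)".
        if kernel in row["Kernel Name"] and row["Metric Name"] == metric:
            # Values may contain thousands separators, e.g. "166,647.00".
            values.append(float(row["Metric Value"].replace(",", "")))

    if not values:
        raise ValueError(f"no rows found for kernel {kernel!r} and metric {metric!r}")
    return len(values), min(values), max(values), mean(values)


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py bar.csv
    with open(sys.argv[1], newline="") as f:
        count, lo, hi, avg = summarize_durations(f.read())
    print(f"invocations={count} min={lo} max={hi} avg={avg}")
```

A CSV file for this script could be produced with something like ncu --csv --metrics gpu__time_duration.sum -k Bar ./Foo > bar.csv. The results are in whatever unit the "Metric Unit" column reports, so verify that before comparing against Visual Profiler numbers.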