Suppose I have an executable myapp
which needs no command-line argument, and launches a CUDA kernel mykernel
. I can invoke:
nv-nsight-cu-cli -k mykernel myapp
and get output looking like this:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 1234
[1234] myapp@127.0.0.1
mykernel(), 2020-Oct-25 01:23:45, Context 1, Stream 7
Section: GPU Speed Of Light
--------------------------------------------------------------------
Memory Frequency cycle/nsecond 1.62
SOL FB % 1.58
Elapsed Cycles cycle 4,421,067
SM Frequency cycle/nsecond 1.43
Memory [%] % 61.76
Duration msecond 3.07
SOL L2 % 0.79
SM Active Cycles cycle 4,390,420.69
(etc. etc.)
--------------------------------------------------------------------
(etc. etc. - other sections here)
so far - so good. But now, I just want the overall kernel duration of mykernel
- and no other output. Looking at nv-nsight-cu-cli --query-metrics
, I see, among others:
gpu__time_duration incremental duration in nanoseconds; isolated measurement is same as gpu__time_active
gpu__time_active total duration in nanoseconds
So, it must be one of these, right? But when I run
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_duration,gpu__time_active
I get:
==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== Disconnected from process 12345
[12345] myapp@127.0.0.1
mykernel(), 2020-Oct-25 12:34:56, Context 1, Stream 7
Section: GPU Speed Of Light
Section: Command line profiler metrics
---------------------------------------------------------------
gpu__time_active (!) n/a
gpu__time_duration (!) n/a
---------------------------------------------------------------
My questions:
Notes: :
--csv
and taking the last field.tl;dr: You need to specify the appropriate 'submetric':
nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg
(Based on @RobertCrovella's comments)
CUDA's profiling mechanism collects 'base metrics', which are indeed listed with --list-metrics
. For each of these, multiple samples are taken. In version 2019.5 of NSight Compute you can't just get the raw samples; you can only get 'submetric' values.
'Submetrics' are essentially some aggregation of the sequence of samples into a scalar value. Different metrics have different kinds of submetrics (see this listing); for gpu__time_active
, these are: .min
, .max
, .sum
, .avg
. Yes, if you're wondering - they're missing second-moment metrics like the variance or the sample standard deviation.
So, you must either specify one or more submetrics (see example above), or alternatively, upgrade to a newer version of NSight Compute, with which you actually can just get all the samples apparently.