I want to build a roofline model for my kernels, so I launch ncu with the command:
ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile
The roofline set collects enough data to build the roofline model, but I can't figure out the exact meaning of each metric. The Compute (SM) Throughput is collected by the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which reports 0.64%, and I assumed it is the percentage of Peak Performance. But when I divide the Performance (6855693348.37) by the Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.
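Purely as arithmetic, the mismatch looks like this (a minimal sketch; the numbers are the ones from my report, the variable names are my own, not ncu's):

```python
# Values copied from my ncu roofline report.
performance = 6855693348.37   # achieved FLOP/s ("Performance")
peak_work = 5080428410372.0   # peak FLOP/s ("Peak Work")
sm_throughput_pct = 0.64      # sm__throughput.avg.pct_of_peak_sustained_elapsed

# Percentage of peak actually achieved, computed by hand.
achieved_pct = performance / peak_work * 100
print(f"Performance / Peak Work = {achieved_pct:.2f}%")  # ~0.13%, not 0.64%
```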
Besides, I want to collect the actual FLOPs and memory usage of my kernel, not their throughputs.
So my questions are: What is the real meaning of SM Throughput and Memory Throughput? Are they the percentages of Peak Work and Peak Traffic? And, by the way, are Peak Work and Peak Traffic the peak compute performance and the peak DRAM bandwidth, respectively?
To get the real FLOPs and memory usage of my kernel, I want to multiply the Compute (SM) Throughput by the Peak Work to get the achieved performance (FLOP/s), then multiply the achieved performance by the elapsed time to get the FLOP count. The same goes for memory usage. Is my method correct?
I have been searching for an answer for two days but still can't find a clear one.
I found the answer in this question: Terminology used in Nsight Compute
In short, SM Throughput and Memory Throughput are each the maximum over a set of underlying metrics, so trying to understand them from their names alone was completely wrong.
By the way, the correct way to collect the FLOPs and memory usage of your kernel is described in this lab: Roofline Model on NVIDIA GPUs. The methodology this lab uses is:
Time:
sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second
FLOPs:
DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum
SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum
HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum
Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum
Bytes:
DRAM: dram__bytes.sum
L2: lts__t_bytes.sum
L1: l1tex__t_bytes.sum
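A sketch of how these metrics combine into roofline coordinates for a double-precision kernel. The metric values below are placeholders for illustration; in practice they come from the ncu CSV output for your kernel:

```python
# Placeholder metric values; in practice, parse these from `ncu --csv` output.
m = {
    "sm__cycles_elapsed.avg": 1.2e7,
    "sm__cycles_elapsed.avg.per_second": 1.4e9,
    "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e8,
    "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 4.0e8,
    "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 5.0e7,
    "dram__bytes.sum": 3.2e9,
}

# Time: elapsed cycles divided by the clock rate.
time_s = m["sm__cycles_elapsed.avg"] / m["sm__cycles_elapsed.avg.per_second"]

# DP FLOPs: adds + 2 x FMAs + muls, per the lab's formula.
dp_flops = (m["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
            + 2 * m["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"]
            + m["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"])

dram_bytes = m["dram__bytes.sum"]

perf_flops_per_s = dp_flops / time_s           # roofline y-axis (FLOP/s)
arithmetic_intensity = dp_flops / dram_bytes   # roofline x-axis (FLOPs/byte)
```

The same pattern applies at the L2 and L1 levels by swapping dram__bytes.sum for lts__t_bytes.sum or l1tex__t_bytes.sum.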