Tags: cuda, gpu, profiling, nvidia, nsight-compute

Why is the Compute Throughput’s value different from the actual Performance / Peak Performance?


I want to build a roofline model for my kernels. So I launch the ncu with the command

ncu --csv --target-processes all --set roofline mpirun -n 1 ./run_pselinv_linux_release_v2.0 -H H3600.csc -file ./tmpfile

The roofline set collects enough data to build the roofline model, but I can't figure out the exact meaning of each metric.

The Compute (SM) Throughput is collected from the metric sm__throughput.avg.pct_of_peak_sustained_elapsed, which is 0.64%, and I thought it was a percentage of Peak Performance. But when I divide the Performance (6855693348.37) by the Peak Work (5080428410372), I get 0.13%, which is much lower than 0.64%.

In addition, I want to collect the actual FLOPs and memory usage of my kernel, not just their throughput.

So my questions are:

  1. What is the real meaning of SM Throughput and Memory Throughput? Are they percentages of Peak Work and Peak Traffic? And am I right that Peak Work and Peak Traffic are the peak performance and the peak DRAM bandwidth, respectively?

  2. To get the real FLOPs and memory usage of my kernel, I want to multiply the Compute (SM) Throughput by the Peak Work to get the achieved performance, and then multiply that achieved performance by the elapsed time to get the total FLOPs; the same would apply to memory usage. Is this method correct?

I have been searching for an answer to this for two days but still can't find a clear one.


Solution

  • I found the answer in this question: Terminology used in Nsight Compute. In short, SM Throughput and Memory Throughput are each the maximum of a set of underlying metrics. I had just tried to guess their meaning from their names, which was completely wrong.

    By the way, the correct way to collect the FLOPs and memory usage of your kernel is described in this lab: Roofline Model on NVIDIA GPUs. The methodology from that lab is as follows (a short post-processing sketch comes after the list):

    Time:

    sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second

    FLOPs:

    DP (double precision): sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum

    SP (single precision): sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum

    HP (half precision): sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum

    Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum

    Bytes:

    DRAM: dram__bytes.sum

    L2: lts__t_bytes.sum

    L1: l1tex__t_bytes.sum
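
    For example, something like the following minimal Python sketch can turn those counters into roofline coordinates (achieved FLOP/s and arithmetic intensity). All numeric values are placeholders; plug in the numbers that ncu reports for your kernel.

    # Minimal post-processing sketch: convert the counters above into
    # roofline coordinates. The values below are placeholders; replace them
    # with the numbers from your ncu report (collected e.g. with
    # "ncu --metrics <comma-separated list of the metrics above> ./app").
    metrics = {
        "sm__cycles_elapsed.avg": 1.0e6,             # placeholder
        "sm__cycles_elapsed.avg.per_second": 1.3e9,  # placeholder (cycles/s)
        "sm__sass_thread_inst_executed_op_dadd_pred_on.sum": 1.0e8,  # placeholder
        "sm__sass_thread_inst_executed_op_dfma_pred_on.sum": 2.0e8,  # placeholder
        "sm__sass_thread_inst_executed_op_dmul_pred_on.sum": 5.0e7,  # placeholder
        "dram__bytes.sum": 4.0e9,                    # placeholder
    }

    # Time = elapsed cycles / (cycles per second)
    time_s = metrics["sm__cycles_elapsed.avg"] / metrics["sm__cycles_elapsed.avg.per_second"]

    # Double-precision FLOPs: adds + 2 * FMAs + muls (an FMA counts as 2 FLOPs)
    dp_flops = (metrics["sm__sass_thread_inst_executed_op_dadd_pred_on.sum"]
                + 2 * metrics["sm__sass_thread_inst_executed_op_dfma_pred_on.sum"]
                + metrics["sm__sass_thread_inst_executed_op_dmul_pred_on.sum"])

    dram_bytes = metrics["dram__bytes.sum"]

    # Roofline coordinates for the DRAM-level roofline
    achieved_flop_per_s = dp_flops / time_s        # y-axis: performance
    arithmetic_intensity = dp_flops / dram_bytes   # x-axis: FLOPs per DRAM byte

    print(f"Time:                 {time_s:.6e} s")
    print(f"DP FLOPs:             {dp_flops:.3e}")
    print(f"Achieved performance: {achieved_flop_per_s / 1e9:.2f} GFLOP/s")
    print(f"Arithmetic intensity: {arithmetic_intensity:.3f} FLOP/byte")

    The same recipe works for SP/HP/Tensor Core FLOPs and for L2/L1 bytes; just swap in the corresponding metrics from the list above.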