In my CUDA program, I use many tensor core operations like m8n8k4 and even use cusparseSpMV. However, when checking the ncu report, it indicates that there are no active tensor operations in my program. The roofline also looks weird. I don't know what happened or how to solve it. My GPU is an RTX 4070 Laptop. The NCU version is:
NVIDIA (R) Nsight Compute Command Line Profiler Copyright (c) 2018-2023 NVIDIA Corporation Version 2023.3.0.0 (build 33266684) (public-release)
My command line is ncu -f --launch-count 50 --set full --export profile_rep ./my_program
I am really confused by this problem. Thank you for taking the time to help me!
I also tried using the command line to check only the tensor-core-related metrics:
ncu -f --launch-count 50 --metrics sm__inst_executed_pipe_tensor_op_hmma.sum,sm__inst_executed_pipe_tensor_op_dmma.sum,sm__inst_executed_pipe_tensor_op.sum,sm__inst_executed_pipe_tensor_op_hmma.avg.pct_of_peak_sustained_active --export profile_rep ./my_program
However, the metrics show either N/A or 0. I want to know why ncu cannot display the utilization of the tensor cores and how to fix that. If it's due to a mistake of mine, please don't hesitate to point it out and give me your suggestions.
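As a sanity check, I could profile a program that is known to use tensor cores, e.g. a cuBLAS FP16 GEMM, with the same metrics; if the hmma counters are nonzero there, at least the profiler and metric names are fine. A minimal sketch (assuming cublasHgemm dispatches to a tensor-core kernel on my GPU; error checking omitted):

    // Minimal cuBLAS FP16 GEMM to use as an ncu sanity check (a sketch,
    // not my real program). If cublasHgemm picks a tensor-core kernel,
    // sm__inst_executed_pipe_tensor_op_hmma.sum should be nonzero when
    // profiling this. Build with: nvcc -lcublas sanity.cu
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    int main()
    {
        const int n = 1024;
        half *A, *B, *C;
        cudaMalloc(&A, n * n * sizeof(half));
        cudaMalloc(&B, n * n * sizeof(half));
        cudaMalloc(&C, n * n * sizeof(half));

        cublasHandle_t handle;
        cublasCreate(&handle);
        const __half alpha = __float2half(1.0f);
        const __half beta  = __float2half(0.0f);
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
        cudaDeviceSynchronize();

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }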
In my kernel, I use the following tensor core operation:
// Issues a single mma.sync m8n8k4 FP16 instruction via inline PTX.
// Per thread: A holds 2 packed-half registers, frag_b supplies 2, and
// acc holds 4, serving as both the C input and the D output.
__device__ __forceinline__ void mma_m8n8k4_fp16_v2(half *acc, uint32_t *A, half *frag_b)
{
    uint32_t const *B = reinterpret_cast<uint32_t const *>(&frag_b[0]);
    uint32_t *C = reinterpret_cast<uint32_t *>(&acc[0]);
    asm volatile(
        "mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16"
        " { %0, %1, %2, %3 }, "
        " { %4, %5 }, "
        " { %6, %7 }, "
        " { %0, %1, %2, %3 };"
        : "+r"(C[0]), "+r"(C[1]), "+r"(C[2]), "+r"(C[3])
        : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]));
}
In this case, nsight compute is not lying to you (I don't think). Based on the tensor core (TC) op you have shown, I would not expect any actual TC instructions in the SASS code for that op when running on a cc8.9 GPU such as the RTX 40-series GPUs. That is the proximal reason for the ncu report.
To confirm this, we can use godbolt to study the generated SASS in each case (cc8.9 and cc7.0). We'll modify your code slightly to remove the __forceinline__ directive and add the necessary includes. In the cc8.9 case, the generated SASS shows no evidence of a TC op. In the cc7.0 case, we see the expected TC op. (You cannot change the behavior here by trying to run code compiled for cc7.0 on your cc8.9 device; that won't work. At a minimum, depending on compilation settings, it would be JIT-recompiled on the fly.)
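For reference, here is a self-contained version along those lines that can be pasted into godbolt (the includes and the wrapper kernel are my additions, so the device function is actually instantiated rather than eliminated as dead code). Compiled with -arch=sm_70, the SASS contains HMMA (TC) instructions; compiled with -arch=sm_89, it does not:

    #include <cuda_fp16.h>
    #include <cstdint>

    __device__ void mma_m8n8k4_fp16_v2(half *acc, uint32_t *A, half *frag_b)
    {
        uint32_t const *B = reinterpret_cast<uint32_t const *>(&frag_b[0]);
        uint32_t *C = reinterpret_cast<uint32_t *>(&acc[0]);
        asm volatile(
            "mma.sync.aligned.m8n8k4.row.col.f16.f16.f16.f16"
            " { %0, %1, %2, %3 }, "
            " { %4, %5 }, "
            " { %6, %7 }, "
            " { %0, %1, %2, %3 };"
            : "+r"(C[0]), "+r"(C[1]), "+r"(C[2]), "+r"(C[3])
            : "r"(A[0]), "r"(A[1]), "r"(B[0]), "r"(B[1]));
    }

    // Wrapper so the device function appears in the compiled output.
    __global__ void k(half *acc, uint32_t *A, half *frag_b)
    {
        mma_m8n8k4_fp16_v2(acc, A, frag_b);
    }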
I believe the "reason" for this can be found in a note in the PTX guide: "mma.sync.m8n8k4 is optimized for target architecture sm_70 and may have substantially reduced performance on other target architectures."
That doesn't say outright that it will result in a non-TC op, but that is my conjecture. If I had to guess, I would guess that, for whatever reason, the chip designers decided this particular mma op wasn't worth carrying forward in future TC hardware. Therefore, in (at least some) subsequent architectures there is no TC path, and an alternate non-TC path is provided for PTX-level compatibility. (This is not an example of binary compatibility; the CUDA model does not require binary compatibility from one compute capability major version to the next.)
So I believe you are using a good method (the nsight compute SOL compute throughput breakdown is a good starting point to check TC usage, IMO), and the reason you are getting no indication of TC usage is that there is no TC usage in the example you have shown.
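For comparison, here is a sketch (mine, not taken from the question) of an mma shape that is carried forward on newer architectures: the m16n8k16 FP16 variant with FP32 accumulate, which requires sm_80 or newer. Assuming I have the fragment register counts right, this should compile to actual HMMA (TC) instructions on a cc8.9 device:

    #include <cuda_fp16.h>
    #include <cstdint>

    // Sketch: mma.sync m16n8k16, FP16 inputs with FP32 accumulate (sm_80+).
    // Per thread: A holds 4 packed-half registers, B holds 2, and the
    // accumulator is 4 FP32 registers used as both the C input and D output.
    __device__ void mma_m16n8k16_fp16_fp32(float *acc, uint32_t const *A,
                                           uint32_t const *B)
    {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32"
            " { %0, %1, %2, %3 }, "
            " { %4, %5, %6, %7 }, "
            " { %8, %9 }, "
            " { %0, %1, %2, %3 };"
            : "+f"(acc[0]), "+f"(acc[1]), "+f"(acc[2]), "+f"(acc[3])
            : "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
              "r"(B[0]), "r"(B[1]));
    }

A kernel built around a shape like this should register in the sm__inst_executed_pipe_tensor_op_hmma metrics and in the SOL breakdown, unlike the m8n8k4 variant.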