I'm trying to multiply 8x8 blocks using Tensor Cores on a GPU with the Turing architecture. For that I'm using the WMMA API with fragments of size 16x16. My assumption was that shared memory bandwidth would be wasted, since most of the data loaded into the fragments doesn't represent useful information. While trying to quantify that, I came across the following problem: shared memory loads issued by wmma::load_matrix_sync are not reported in Nsight Compute at all. To test this, I'm using the following kernel:
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

__global__
void test() {
    extern __shared__ half shmem[];
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Initialize the accumulator before using it as an input to mma_sync.
    wmma::fill_fragment(c_frag, 0.0f);

    // Both fragments read the same 16x16 tile from shared memory.
    wmma::load_matrix_sync(a_frag, shmem, 16);
    wmma::load_matrix_sync(b_frag, shmem, 16);

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync((float*)shmem, c_frag, 16, wmma::mem_row_major);
}
Nsight Compute reports the shared memory stores, but no loads. What is happening here? I've tried several variations, but it still shows 0 shared loads.
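A minimal host launcher for profiling this kernel could look like the following. This is only a sketch: the one-warp launch configuration, the dynamic shared memory size, and the file/binary names in the comment are assumptions, chosen so a single wmma tile operation runs.

// Sketch of a host launcher, assuming the test() kernel above is in the same file
// and the build targets Turing, e.g. nvcc -arch=sm_75 wmma_test.cu -o wmma_test.
#include <cuda_runtime.h>

int main() {
    // The store writes a 16x16 tile of floats, the largest shared memory footprint here.
    size_t shmem_bytes = 16 * 16 * sizeof(float);
    // A single warp is enough to execute one wmma tile operation.
    test<<<1, 32, shmem_bytes>>>();
    return cudaDeviceSynchronize() != cudaSuccess;
}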
Answer added from information in the comments:
The new LDSM instruction was not counted by the SM hardware counter used for shared memory accesses. A fix was made in Nsight Compute 2020.3.1; see the release notes.
As was hypothesised in the comments, at the time the question was posted, Nsight Compute did not count the shared memory load instructions generated by wmma::load_matrix_sync. This has since been rectified.
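One way to check this on a given Nsight Compute version is to query the shared load/store counters directly from the CLI. This is a sketch: the binary name ./wmma_test is an assumption, and which metrics include LDSM traffic may vary between Nsight Compute versions.

ncu --metrics smsp__inst_executed_op_shared_ld.sum,smsp__inst_executed_op_shared_st.sum,l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum ./wmma_test

With a version containing the fix, the shared load traffic from wmma::load_matrix_sync should no longer show up as 0.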