I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4, namely HMMA.16816.F16: a warp-wide matrix-multiply-and-add of half-precision data, with shape M=16, N=8, K=16 (IIANM).
The CUDA PTX ISA guide for CUDA 12.4 indicates in Section 9.7.13.3 that at FP16 precision, we only have PTX WMMA instructions with shape (M,N,K) being one of (16, 16, 16), (32, 8, 16) or (8, 32, 16) - nothing smaller. But Section 9.7.13.1 says that smaller matrix shapes - (16, 8, 16), (16, 8, 8) and (8, 8, 4) - are supported.
Trying to use the intrinsics corresponding to these smaller shapes, e.g.:
__hmma_m16n8k16_ld_a
results in an error:
mma-smaller.hpp(86): error: identifier "__hmma_m16n8k16_ld_a" is undefined
__hmma_m16n8k16_ld_a((int*)&a, (const int*)p, ldm, 0);
^
So are these shapes supported in PTX, or are they not?
Note: I'm using an Ampere GPU.
As Robert indicates, the answer is that PTX covers these shapes, but the CUDA C++ libraries don't expose them.
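To illustrate what "PTX covers it" can look like in practice, here is a minimal sketch that reaches the m16n8k16 FP16 shape through inline PTX, using the `mma.sync` instruction rather than `wmma`. The wrapper name `mma_m16n8k16_f16` and the raw 32-bit register arrays are my own choices for illustration, not part of any CUDA header. I'd expect this to lower to HMMA.16816.F16 on Ampere, but I haven't checked the SASS for every toolkit version, and the sketch deliberately leaves out loading the fragments in the per-thread layout the PTX ISA guide prescribes.

```cuda
// Hypothetical wrapper (name and signature are assumptions for illustration).
// Each 32-bit register holds two packed half-precision values (f16x2).
// Per thread: A occupies 4 registers, B occupies 2, C and D occupy 2 each.
// All 32 threads of the warp must execute this call together.
__device__ inline void mma_m16n8k16_f16(
    unsigned       (&d)[2],
    const unsigned (&a)[4],
    const unsigned (&b)[2],
    const unsigned (&c)[2])
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // D = A * B + C, with A row-major (16x16), B column-major (16x8),
    // and FP16 accumulators.
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
#else
    // This shape needs compute capability 8.0 or later; no fallback here.
    (void) d; (void) a; (void) b; (void) c;
#endif
}
```

Compile for the right target (e.g. `nvcc -arch=sm_80`); on earlier architectures the body above is compiled out.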
I've been working on some C++/CUDA headers to enable this support. It's still a work-in-progress, and not extensively tested, but here they are:
mma-smaller-intrinsics.hpp
fragment template with various arguments: mma-smaller.cuh
mma-smaller.hpp
https://github.com/eyalroz/gpu-kernel-runner/tree/main/kernels/include