Tags: cuda, nvidia, ptx, cuda-wmma

Does PTX (8.4) not cover smaller-shape WMMA instructions?


I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4 - namely, HMMA.16816.F16: a warp-wide matrix-multiply-and-add, of half-precision data, with shape M=16, N=8, K=16 (IIANM).

The CUDA PTX ISA guide for CUDA 12.4 indicates in Section 9.7.13.3 that at FP16 precision, the only available PTX WMMA instruction shapes (M, N, K) are (16, 16, 16), (32, 8, 16), and (8, 32, 16) - nothing smaller. But Section 9.7.13.1 says that smaller matrix shapes - (16, 8, 16), (16, 8, 8), and (8, 8, 4) - are supported.

Trying to use an intrinsic corresponding to one of these smaller shapes, e.g.:

__hmma_m16n8k16_ld_a

results in an error:

mma-smaller.hpp(86): error: identifier "__hmma_m16n8k16_ld_a" is undefined
      __hmma_m16n8k16_ld_a((int*)&a, (const int*)p, ldm, 0);
      ^

So are these shapes supported in PTX, or are they not?

Note: I'm using an Ampere GPU.


Solution

  • As Robert indicates, the answer is that PTX covers these shapes, but the CUDA C++ libraries don't expose them.

    I've been working on some C++/CUDA headers to enable this support. It's still a work-in-progress, and not extensively tested, but here they are:

    https://github.com/eyalroz/gpu-kernel-runner/tree/main/kernels/include
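Until such headers (or official intrinsics) are available, one can reach the m16n8k16 half-precision mma PTX instruction directly from CUDA C++ via inline PTX assembly. The sketch below is an illustrative assumption, not taken from the headers linked above: the wrapper name and argument layout are my own, while the instruction mnemonic and fragment register counts (4 x .b32 for A, 2 for B, 2 for C and D, each holding a pair of packed halves) follow the PTX ISA's description of mma.sync for this shape. It requires sm_80 (Ampere) or later.

```cuda
// Hedged sketch: invoke the PTX instruction
//   mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
// directly, since CUDA C++ offers no intrinsic for this shape.
// Each 32-bit register holds two packed __half values; the
// per-lane fragment layout is as specified in the PTX ISA guide.
__device__ void mma_m16n8k16_f16(unsigned const (&a)[4],  // A fragment
                                 unsigned const (&b)[2],  // B fragment
                                 unsigned const (&c)[2],  // C (accumulator in)
                                 unsigned       (&d)[2])  // D (accumulator out)
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}
```

Compile with `nvcc -arch=sm_80` (or later); the entire warp must execute the instruction, and each lane supplies its own slice of the fragments per the PTX ISA's fragment-distribution tables.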