Hi, I'd like to check my understanding of the mma instruction in PTX (please tell me if I'm wrong).
I have compiled PTX for a kernel and picked out one mma instruction:
//
// Generated by LLVM NVPTX Back-End
//
.version 8.4
.target sm_89
.address_size 64
// .globl matmul_kernel
.extern .shared .align 16 .b8 global_smem[];
.visible .entry matmul_kernel(
.param .u64 matmul_kernel_param_0,
.param .u64 matmul_kernel_param_1,
.param .u64 matmul_kernel_param_2,
.param .u32 matmul_kernel_param_3,
.param .u32 matmul_kernel_param_4,
.param .u32 matmul_kernel_param_5,
.param .u32 matmul_kernel_param_6,
.param .u32 matmul_kernel_param_7,
.param .u32 matmul_kernel_param_8
)
.maxntid 128, 1, 1
{
...
ldmatrix.sync.aligned.m8n8.x4.shared.b16 { %r3100, %r3101, %r3102, %r3103 }, [ %r561 + 0 ];
...
ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 { %r3084, %r3085, %r3086, %r3087 }, [ %r581 + 0 ];
...
mov.f32 %f2306, 0f00000000;
mov.b32 %r3107, 2;
mov.b32 %r3106, 0;
shl.b32 %r2885, %r100, 1;
shl.b32 %r2894, %r101, 1;
shl.b32 %r2895, %r102, 1;
shl.b32 %r2896, %r103, 1;
shl.b32 %r2897, %r104, 1;
shl.b32 %r2898, %r105, 1;
shl.b32 %r2899, %r106, 1;
mov.u32 %r3104, %r765;
mov.u32 %r3105, %r758;
mov.f32 %f2307, %f2306;
mov.f32 %f2308, %f2306;
mov.f32 %f2309, %f2306;
...
mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 { %f2306, %f2307, %f2308, %f2309 }, { %r3100, %r3101, %r3102, %r3103 }, { %r3084, %r3085 }, { %f2306, %f2307, %f2308, %f2309 };
}
Now my questions are:
1. Does sync mean that the target warp will not continue until the tensor core finishes the matrix multiplication?

2. The documentation lists .m16n8k64 for sparse and .m16n8k32 for dense, so is there no support for m16n8k16? Or does m16n8k64 mean the instruction supports any shape with m<16, n<8 and k<64?

Yes: for operands which are 32 bits wide (e.g. the f32 operands, such as C and D), each register holds one operand. For operands which are 16 bits wide (e.g. the f16 operands, such as A and B), each register holds two operands. For each of C and D in this instruction we need 16x8 = 128 operands in total, and these are distributed four per thread across the warp (4x32 = 128). For A we need 16x16 = 256 operands, distributed eight per thread across the warp (8x32 = 256); since these f16 operands are stored two per 32-bit register, that is only 128 registers warp-wide, i.e. four registers per thread, just like C and D. For B we need 16x8 = 128 16-bit operands, which works out to two 32-bit registers per thread.
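To make that warp-wide accounting concrete, here is a minimal, hypothetical CUDA sketch (it is not the asker's kernel; the helper name is invented for illustration) that issues the same instruction through inline PTX. The asm operand lists show the per-thread split described above: four 32-bit registers for A, two for B, and four f32 registers each for C and D.

// Per thread in the warp:
//   a[4] : 8 f16 elements of the 16x16 A tile, packed two per 32-bit register
//   b[2] : 4 f16 elements of the 16x8  B tile, packed two per 32-bit register
//   c[4] : 4 f32 elements of the 16x8  accumulator tile
//   d[4] : 4 f32 elements of the 16x8  result tile
__device__ void mma_m16n8k16_f16_f32(float d[4],
                                     const unsigned a[4],
                                     const unsigned b[2],
                                     const float c[4])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}

Warp-wide that is 32x4 = 128 A registers (256 f16 values), 32x2 = 64 B registers (128 f16 values), and 32x4 = 128 f32 registers each for C and D, matching the counts above.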
The sync tag means that if the warp was previously divergent, this instruction waits until all threads in the warp have reached the instruction before dispatching the operation to the tensor core unit. If we imagined a non-sync variant of the instruction (it does not exist; it is only for illustration), then mma.sync would behave as if we executed a __syncwarp() followed by that non-sync variant.
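As a hypothetical sketch of that behaviour (the branch and the addressing are invented for illustration and do not reproduce the real fragment layout), a warp can be divergent right before the instruction, and mma.sync acts as the reconvergence point; this reuses the mma_m16n8k16_f16_f32 helper sketched above:

__global__ void divergent_then_mma(const unsigned* A, const unsigned* B, float* D)
{
    int lane = threadIdx.x & 31;

    unsigned a[4], b[2];
    float c[4] = {0.f, 0.f, 0.f, 0.f};
    float d[4];

    // The warp diverges here: the two half-warps take different paths to load their A fragments.
    if (lane < 16) {
        for (int i = 0; i < 4; ++i) a[i] = A[lane * 4 + i];
    } else {
        for (int i = 0; i < 4; ++i) a[i] = __ldg(&A[lane * 4 + i]);  // same data, different code path
    }
    for (int i = 0; i < 2; ++i) b[i] = B[lane * 2 + i];

    // Conceptually:  __syncwarp();    // wait for every lane to arrive (reconverge)
    //                mma.<non-sync>   // then hand the warp-wide operation to the tensor core
    // The real mma.sync does both: no lane's fragments are consumed before all 32 lanes are here.
    mma_m16n8k16_f16_f32(d, a, b, c);

    for (int i = 0; i < 4; ++i) D[lane * 4 + i] = d[i];
}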
No, the indicated instructions support matrix multiplication of exactly the sizes listed, no more and no less. For example, an m16n8k16 instruction requires exactly M=16 (and likewise N=8 and K=16), not M<16. This is true whether we are talking about dense or sparse variants.
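So a larger product has to be decomposed into these fixed-size tiles rather than handled by one "flexible-shape" instruction. As a hypothetical sketch (the tile counts are chosen only for illustration, and the fragment loading is omitted), a warp computing a 32x16 output tile with K = 32 would issue 2 x 2 x 2 = 8 separate m16n8k16 operations, reusing the helper above:

// d_frag[mi][ni]: 2x2 output sub-tiles (16x8 each), four f32 registers per thread
// a_frag[mi][ki]: 2 M-tiles x 2 K-tiles of A (16x16 each), four b32 registers per thread
// b_frag[ni][ki]: 2 N-tiles x 2 K-tiles of B (8x16 each), two b32 registers per thread
__device__ void warp_tile_32x16x32(float    d_frag[2][2][4],
                                   unsigned a_frag[2][2][4],
                                   unsigned b_frag[2][2][2])
{
    for (int mi = 0; mi < 2; ++mi)          // 32 output rows   = 2 tiles of M = 16
        for (int ni = 0; ni < 2; ++ni)      // 16 output cols   = 2 tiles of N = 8
            for (int ki = 0; ki < 2; ++ki)  // 32 reduction dim = 2 tiles of K = 16
                mma_m16n8k16_f16_f32(d_frag[mi][ni], a_frag[mi][ki],
                                     b_frag[ni][ki], d_frag[mi][ni]);
}

This is the usual pattern: libraries build arbitrary GEMM shapes by iterating fixed-shape mma instructions, not by varying the instruction shape itself.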
The terminology here, dense vs. sparse, is similar to its usage in linear algebra. A dense tensor core calculation means that all matrix values are assumed to be possibly non-zero, and therefore a "full calculation" is done. A sparse tensor core calculation means that some of the input matrix values are (i.e. are assumed to be) zero. The benefit of the sparse calculation is that, by assuming those values are zero, it can optimize its use of the calculation hardware and deliver higher throughput. The exact definition of which input matrix values are assumed to be zero is spelled out in the PTX manual.
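For a concrete flavour of that specification (a detail not stated in the answer above; the authoritative rules are in the PTX manual's sparse mma section): the sparse variants assume 2:4 structured sparsity in the A operand, i.e. in every group of four consecutive A elements along K at most two are non-zero, and only those two values plus small index metadata are supplied to the instruction. A hypothetical host-side sketch of compressing one such group:

#include <cstdio>

// Compress one group of four A elements (at most two assumed non-zero) into the
// two-values-plus-indices form consumed by structured-sparse tensor core math.
// Illustration only; the real metadata packing is defined in the PTX manual.
static void compress_2_of_4(const float in[4], float vals[2], int idx[2])
{
    int n = 0;
    for (int k = 0; k < 4; ++k)
        if (in[k] != 0.0f && n < 2) { vals[n] = in[k]; idx[n] = k; ++n; }
    while (n < 2) { vals[n] = 0.0f; idx[n] = n; ++n; }  // pad with zeros; the index does not matter for a zero value
}

int main()
{
    const float group[4] = {1.5f, 0.0f, -2.0f, 0.0f};   // dense group: [1.5, 0, -2, 0]
    float vals[2]; int idx[2];
    compress_2_of_4(group, vals, idx);
    std::printf("values: %g %g   indices: %d %d\n", vals[0], vals[1], idx[0], idx[1]);
    return 0;
}

This is also why the sparse shapes double K relative to their dense counterparts (e.g. m16n8k64 sparse vs. m16n8k32 dense): only half of the A elements along K are actually stored and multiplied.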