I have this very minimal code to multiply two matrices with CUDA Tensor Cores:
#include <mma.h>
using namespace nvcuda;

constexpr int M = 16;
constexpr int N = 16;
constexpr int K = 16;
/*
* Matrix A = M x K, B = K x N => OUT = M x N
*/
__global__ void wmma_matrix_mult(half *a, half *b, float *out) {
// Declare the fragments
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
wmma::fragment<wmma::accumulator, M, N, K, float> c_frag;
// Initialize the output to zero
wmma::fill_fragment(c_frag, 0.0f);
// Load the inputs (leading dimension: row count M for col-major A, column count N for row-major B;
// with M = N = K = 16 they happen to coincide)
wmma::load_matrix_sync(a_frag, a, M);
wmma::load_matrix_sync(b_frag, b, N);
// Perform the matrix multiplication
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
// Store the output
wmma::store_matrix_sync(out, c_frag, N, wmma::mem_row_major);
}
As soon as M, N and K are set to anything other than 16, compilation fails with
error: incomplete type is not allowed
error: no instance of function template "nvcuda::wmma::fill_fragment" matches the argument list
argument types are: (<error-type>, float)
Does this mean that A and B always have to be 16x16 in size? I thought 4x4 or 8x8 would be allowed as well?
I compile like this:
nvcc -arch=sm_75 -c ./src/main.cu -o ./src/build/main.o
so architecture should be fine.
I thought 4x4 or 8x8 would be allowed as well?
Unfortunately not. Let's read some documentation.
For half precision inputs with a single precision accumulator, as in your use case, only the following sizes are supported:
Matrix A   Matrix B   Accumulator   Matrix Size (m-n-k)
__half     __half     float         16x16x16
__half     __half     float         32x8x16
__half     __half     float         8x32x16
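For illustration, here is a sketch of what your kernel would look like with the 32x8x16 shape from the table (same single-warp setup as your code; the kernel name and constants are my own, and I have not run this on hardware):

```cuda
#include <mma.h>
using namespace nvcuda;

// Supported non-square shape: A is 32x16, B is 16x8, OUT is 32x8.
constexpr int WMMA_M = 32;
constexpr int WMMA_N = 8;
constexpr int WMMA_K = 16;

__global__ void wmma_matrix_mult_32x8x16(half *a, half *b, float *out) {
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    // Leading dimensions now differ: col-major A uses its row count (32),
    // row-major B and the row-major output use their column count (8).
    wmma::load_matrix_sync(a_frag, a, WMMA_M);
    wmma::load_matrix_sync(b_frag, b, WMMA_N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(out, c_frag, WMMA_N, wmma::mem_row_major);
}
```

Note that once the tile is no longer square, the leading-dimension arguments to load_matrix_sync/store_matrix_sync stop being interchangeable, which is easy to miss when starting from a 16x16x16 example.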