I am a bit confused about the impact of cublasComputeType_t on computation when using the cublasGemmEx API. For example, my A, B, and C matrices are all of type float. When cublasComputeType_t = CUBLAS_COMPUTE_32F_FAST_16F, does this mean that within the kernel the data of matrices A and B will first be converted to half and stored in registers, and then the tensor core calculations will be performed?

Although tensor cores support both half->float and half->half accumulation, in this case (CUBLAS_COMPUTE_32F_FAST_16F) I would expect the kernel to select the half->half tensor core path for the computation and only convert the result to float when writing the final output. Is my understanding above correct?
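For reference, the kind of call I am describing looks roughly like this (a minimal sketch only; handle creation, allocations, and error checking are omitted, and m, n, k are placeholder sizes):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = alpha * A * B + beta * C with FP32 storage, but a compute type
// that allows cuBLAS to downconvert inputs to FP16 for tensor cores.
// d_A, d_B, d_C are device pointers to column-major float matrices.
void gemm_fast_16f(cublasHandle_t handle,
                   const float* d_A, const float* d_B, float* d_C,
                   int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;

    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_32F, m,              // A stored as float
                 d_B, CUDA_R_32F, k,              // B stored as float
                 &beta,
                 d_C, CUDA_R_32F, m,              // C stored as float
                 CUBLAS_COMPUTE_32F_FAST_16F,     // permit FP16 tensor core math
                 CUBLAS_GEMM_DEFAULT);
}
```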
This is only a hint that cuBLAS may perform the downcast; it is not mandatory. The best suggestion would be to try both _TF32 and _16F and pick whichever improves performance while keeping the result precision acceptable for your use case.
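If it helps, a rough timing harness along these lines lets you compare the compute types directly on your own sizes (a sketch only, reusing the hypothetical float device buffers and dimensions from the snippet in the question):

```cpp
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Average time (ms) of a cublasGemmEx call for a given compute type.
float time_gemm(cublasHandle_t handle, cublasComputeType_t ct,
                const float* d_A, const float* d_B, float* d_C,
                int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so one-time setup costs are not included in the timing.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                 d_A, CUDA_R_32F, m, d_B, CUDA_R_32F, k, &beta,
                 d_C, CUDA_R_32F, m, ct, CUBLAS_GEMM_DEFAULT);

    cudaEventRecord(start);
    for (int i = 0; i < 10; ++i) {
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha,
                     d_A, CUDA_R_32F, m, d_B, CUDA_R_32F, k, &beta,
                     d_C, CUDA_R_32F, m, ct, CUBLAS_GEMM_DEFAULT);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 10.0f;
}

// Usage: compare plain FP32 against the two "fast" compute types, e.g.
//   time_gemm(handle, CUBLAS_COMPUTE_32F,           d_A, d_B, d_C, m, n, k);
//   time_gemm(handle, CUBLAS_COMPUTE_32F_FAST_TF32, d_A, d_B, d_C, m, n, k);
//   time_gemm(handle, CUBLAS_COMPUTE_32F_FAST_16F,  d_A, d_B, d_C, m, n, k);
```

Then check the results against a plain FP32 run to see whether the precision loss is acceptable for your workload.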
On the question of the tensor cores using half->half or half->float: accumulating into half register outputs results in even more precision loss in exchange for some additional speedup. I cannot say what cuBLAS does in this specific case, but if supported, it would be wise for it to pick the half->float tensor cores.
cuBLAS uses a lot of different algorithms under the hood, so you really need to benchmark with your specific hardware, precisions, matrix sizes, leading dimensions, transpositions, etc. You can run your kernel and inspect which instructions are actually being executed using Nsight Compute.