I came across some strange behaviour with cuBLAS.
Consider the following code:
#include <cstdio>
#include <limits>
#include <cuda_runtime.h>
#include <cublas_v2.h>

void test(float matrixValue) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Host buffers: A and B hold matrixValue (a NaN below), C a sentinel.
    float *A, *B, *C;
    cudaMallocHost(&A, sizeof(float) * 4);
    cudaMallocHost(&B, sizeof(float) * 4);
    cudaMallocHost(&C, sizeof(float) * 4);
    for (int i = 0; i < 4; i++) {
        A[i] = matrixValue;
        B[i] = matrixValue;
        C[i] = 123.0f;
    }

    // Device buffers.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * 4);
    cudaMalloc(&dB, sizeof(float) * 4);
    cudaMalloc(&dC, sizeof(float) * 4);
    cudaMemcpy(dA, A, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * 4, cudaMemcpyHostToDevice);

    float alpha = 0.0f;  // with alpha = 1.0f the output is all NaN; with 0.0f it is all zero
    float beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    cudaMemcpy(C, dC, sizeof(float) * 4, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    cublasDestroy(handle);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    test(std::numeric_limits<float>::signaling_NaN());
    test(std::numeric_limits<float>::quiet_NaN());
}
We first allocate three buffers on the CPU (A, B, C), fill A and B with NaNs, fill C with an arbitrary sentinel value (123), upload them to the GPU, perform a gemm on them using cuBLAS, download the result into the C buffer, and print the values in C. So far so good.
If we use alpha=1.0 when performing the gemm, the C buffer will be full of NaNs, which is to be expected since any calculation involving a NaN results in a NaN.
But if we use alpha=0, the C buffer will be full of zeros, even though 0 * NaN should result in NaN. The dC matrix is still overwritten, which means the gemm (at least the accumulation part) is not skipped. I could not find anywhere in the NVIDIA documentation where it states that the multiplication part of the gemm operation can be skipped when alpha is equal to zero. And in that case, does this make cuBLAS non-IEEE754-compliant when either A or B contains a NaN and alpha is set to zero?
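For comparison, here is a minimal CPU-side sketch (my own illustration, not cuBLAS code) of the IEEE 754 behavior I expected, where scaling the accumulated product by alpha = 0 still propagates the NaN:

#include <cstdio>
#include <limits>

int main() {
    float nan = std::numeric_limits<float>::quiet_NaN();
    float alpha = 0.0f;
    // Naive dot product for one element of C: NaN propagates through the sum.
    float acc = 0.0f;
    for (int k = 0; k < 2; k++) {
        acc += nan * nan;
    }
    // IEEE 754: 0 * NaN is NaN, so scaling by alpha = 0 does not clear it.
    printf("%f\n", alpha * acc);  // prints nan
}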
As far as I can tell, this is not a bug but expected behavior.
Note that the docs for cublasSgemm point to the reference BLAS routines. Those links lead to the BLAS reference Fortran code. If you look at that, you will find lines such as
*
*     Quick return if possible.
*
      IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
     +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
*
*     And if alpha.eq.zero.
*
      IF (ALPHA.EQ.ZERO) THEN
          IF (BETA.EQ.ZERO) THEN
              DO 20 J = 1,N
                  DO 10 I = 1,M
                      C(I,J) = ZERO
   10             CONTINUE
   20         CONTINUE
          ELSE
              DO 40 J = 1,N
                  DO 30 I = 1,M
                      C(I,J) = BETA*C(I,J)
   30             CONTINUE
   40         CONTINUE
          END IF
          RETURN
      END IF
I'm not good at reading Fortran, but that very much looks like the observed behavior. So cuBLAS just follows the de facto standard.
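For those who don't read Fortran either, the quick-return block above translates roughly to the following C-style code (my own translation, not actual cuBLAS source); the key point is that when alpha is zero, A and B are never read at all:

void sgemm_quick_return(int m, int n, int k, float alpha, float beta,
                        float *c, int ldc) {
    // Nothing to do at all.
    if (m == 0 || n == 0 || ((alpha == 0.0f || k == 0) && beta == 1.0f))
        return;
    // alpha == 0: skip the multiplication entirely, only zero or scale C.
    if (alpha == 0.0f) {
        for (int j = 0; j < n; j++) {
            for (int i = 0; i < m; i++) {
                c[i + j * ldc] = (beta == 0.0f) ? 0.0f : beta * c[i + j * ldc];
            }
        }
        return;
    }
    // ... the actual A*B multiplication would follow here ...
}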
Personally, I also find this behavior appropriate, since it is the same behavior you expect for beta. For beta you definitely need this special case: without it you could never reuse memory without explicitly zeroing it first, because some random previous data could be interpreted as NaN or Inf.
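As a sketch of why that matters in practice (my own example, mirroring the code above): here dC comes straight from cudaMalloc and is never initialized, yet with beta = 0.0f the result is well-defined because the old contents of C are ignored rather than scaled:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    cublasHandle_t handle;
    cublasCreate(&handle);
    float hA[4] = {1, 2, 3, 4}, hB[4] = {1, 0, 0, 1}, hC[4];
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));          // deliberately left uninitialized
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    float alpha = 1.0f, beta = 0.0f;      // beta = 0: old C is never read
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2,
                &alpha, dA, 2, dB, 2, &beta, dC, 2);
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 4; i++) printf("%f ", hC[i]);  // finite values, no NaN
    printf("\n");
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}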