cuda, matrix-multiplication, ieee-754, cublas

cuBLAS GEMMs not respecting NaN inputs


I came across some strange behaviour with cuBLAS.

Consider the following code:

#include <cstdio>
#include <limits>
#include <cuda_runtime.h>
#include <cublas_v2.h>

void test(float matrixValue) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *A, *B, *C;
    cudaMallocHost(&A, sizeof(float) * 4);
    cudaMallocHost(&B, sizeof(float) * 4);
    cudaMallocHost(&C, sizeof(float) * 4);

    for (int i = 0; i < 4; i++) {
        A[i] = matrixValue;
        B[i] = matrixValue;
        C[i] = 123.0f;
    }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * 4);
    cudaMalloc(&dB, sizeof(float) * 4);
    cudaMalloc(&dC, sizeof(float) * 4);
    
    cudaMemcpy(dA, A, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * 4, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeof(float) * 4, cudaMemcpyHostToDevice);

    float alpha = 0.0f;
    float beta = 0.0f;

    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, 2, 2, 2, &alpha, dA, 2, dB, 2, &beta, dC, 2);

    cudaMemcpy(C, dC, sizeof(float) * 4, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; i++) {
        printf("%f ", C[i]);
    }
    printf("\n");

    cublasDestroy(handle);
    cudaFreeHost(A);
    cudaFreeHost(B);
    cudaFreeHost(C);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

int main() {
    test(std::numeric_limits<float>::signaling_NaN());
    test(std::numeric_limits<float>::quiet_NaN());
}

We first allocate three buffers on the CPU (A, B, C), fill A and B with NaNs, fill C with an arbitrary sentinel value (123), upload them to the GPU, perform a GEMM on them using cuBLAS, download the result into the C buffer, and print the values in C. So far so good.

If we use alpha=1.0 when performing the GEMM, the C buffer will be full of NaNs, which is to be expected, as any calculation involving a NaN results in a NaN.

But if we use alpha=0, the C buffer will be full of zeros, even though 0 * NaN should result in NaN. The dC matrix is still overwritten, so the GEMM (at least the accumulation part) is not skipped. I could not find anywhere in the NVIDIA documentation stating that the multiplication part of the GEMM operation can be skipped when alpha is equal to zero. Does this make cuBLAS non-IEEE-754-compliant when either A or B contains a NaN and alpha is set to zero?


Solution

  • This is not a bug as far as I can tell but expected behavior.

    Note that the docs read

    For references please refer to: sgemm, dgemm, cgemm, zgemm

    Those links lead to the BLAS reference Fortran code. If you look at that, you will find lines such as

    *
    *     Quick return if possible.
    *
          IF ((M.EQ.0) .OR. (N.EQ.0) .OR.
         +    (((ALPHA.EQ.ZERO).OR. (K.EQ.0)).AND. (BETA.EQ.ONE))) RETURN
    *
    *     And if  alpha.eq.zero.
    *
          IF (ALPHA.EQ.ZERO) THEN
              IF (BETA.EQ.ZERO) THEN
                  DO 20 J = 1,N
                      DO 10 I = 1,M
                          C(I,J) = ZERO
       10             CONTINUE
       20         CONTINUE
              ELSE
                  DO 40 J = 1,N
                      DO 30 I = 1,M
                          C(I,J) = BETA*C(I,J)
       30             CONTINUE
       40         CONTINUE
              END IF
              RETURN
          END IF
    

    I'm not good at reading Fortran, but that very much matches the observed behavior: when alpha is zero, C is scaled by beta (or set to zero when beta is also zero) without A or B ever being read. So cuBLAS just follows the de facto standard.

    Personally, I also find this behavior appropriate, since it mirrors what you expect for beta. For beta you definitely need this special case: without it you could never reuse memory without explicitly zeroing it first, because stale data in C could be interpreted as NaN or Inf.
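    The quick-return logic in the Fortran reference above can be sketched as a CPU SGEMM in plain C (a simplified sketch: column-major, no transpose support; the function name and structure are mine, not cuBLAS code):

    ```c
    #include <math.h>
    #include <stdio.h>

    /* Reference column-major SGEMM sketch: C = alpha*A*B + beta*C.
       Mirrors the BLAS special cases: when alpha == 0, A and B are
       never read, so NaNs in them cannot reach C; when beta == 0,
       C is overwritten rather than scaled. */
    static void sgemm_ref(int m, int n, int k, float alpha,
                          const float *A, int lda,
                          const float *B, int ldb,
                          float beta, float *C, int ldc) {
        /* Quick return if possible. */
        if (m == 0 || n == 0 || ((alpha == 0.0f || k == 0) && beta == 1.0f))
            return;
        /* And if alpha == 0: skip the A*B product entirely. */
        if (alpha == 0.0f) {
            for (int j = 0; j < n; j++)
                for (int i = 0; i < m; i++)
                    C[i + j * ldc] = (beta == 0.0f) ? 0.0f : beta * C[i + j * ldc];
            return;
        }
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                float acc = 0.0f;
                for (int p = 0; p < k; p++)
                    acc += A[i + p * lda] * B[p + j * ldb];
                C[i + j * ldc] = alpha * acc
                               + (beta == 0.0f ? 0.0f : beta * C[i + j * ldc]);
            }
    }

    int main(void) {
        float A[4], B[4], C[4];
        for (int i = 0; i < 4; i++) {
            A[i] = B[i] = nanf("");  /* NaN inputs, as in the question */
            C[i] = 123.0f;
        }
        sgemm_ref(2, 2, 2, 0.0f, A, 2, B, 2, 0.0f, C, 2);
        for (int i = 0; i < 4; i++)
            printf("%f ", C[i]);     /* prints zeros, not NaN */
        printf("\n");
        return 0;
    }
    ```

    Running this with the same alpha=0, beta=0 arguments as the question reproduces the cuBLAS result: C is filled with zeros because the NaN-filled inputs are never touched.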