floating-point, cuda, precision, nvidia, tesla

Is there a benefit (in terms of computation time) of using torch.float32 instead of torch.float64 on Nvidia Tesla K20c?


Somewhere I read that "unless you have a Tesla card, float64 is 32 times slower than float32 on GeForce, Quadro and Titan cards on any recent cards (Maxwell and Pascal so since 2014)."

So I am wondering whether computation would be faster with float32 than with float64 on a Tesla GPU, or whether the performance stays the same. I am especially interested in the time taken to multiply two vectors.

Of course, float32 takes less memory than float64, but memory is not an issue for my application.


Solution

  • "So I am wondering whether computation would be faster with float32 than with float64 on a Tesla GPU, or whether the performance stays the same."

    32-bit floating point has a higher theoretical maximum throughput on all NVIDIA GPUs. The K20c is a compute capability 3.5 GPU, where the maximum FMAD instruction throughput per SM per clock is three times higher for float32 than for float64. Other instructions may have even wider performance differences.
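
To put a number on that ratio, here is a rough sketch of the theoretical peak for each type. The hardware figures (13 SMs, a 706 MHz clock, and 192 float32 versus 64 float64 FMA operations per SM per clock) are assumptions taken from published K20-class specifications, not measurements, so treat the output as an upper bound only.

```python
# Rough peak-throughput estimate for a Tesla K20c (compute capability 3.5).
# All hardware figures below are assumptions, not measured values.
SM_COUNT = 13                 # streaming multiprocessors (assumed)
CLOCK_HZ = 706e6              # core clock in Hz (assumed)
FMA_PER_SM_PER_CLOCK = {"float32": 192, "float64": 64}


def peak_flops(dtype: str) -> float:
    """Theoretical peak FLOP/s, counting each FMA as two operations."""
    return SM_COUNT * CLOCK_HZ * FMA_PER_SM_PER_CLOCK[dtype] * 2


if __name__ == "__main__":
    for dtype in FMA_PER_SM_PER_CLOCK:
        print(f"{dtype}: {peak_flops(dtype) / 1e12:.2f} TFLOP/s")
    ratio = peak_flops("float32") / peak_flops("float64")
    print(f"float32 / float64 peak ratio: {ratio:.1f}x")
```

Real kernels rarely reach these peaks, so the ratio is more informative than the absolute numbers.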

    "I am especially interested in the time taken to multiply two vectors."

    That is implementation-specific and probably depends on how PyTorch works internally. It isn't directly related to CUDA.
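
Since the answer depends on the implementation, the most reliable approach is probably to measure it on the card in question. The following is a minimal timing sketch, not a rigorous benchmark. The helper name `time_multiply`, the vector length, and the repeat count are all hypothetical choices. It falls back to the CPU when no CUDA device is available.

```python
import time

import torch


def time_multiply(dtype, n=10_000_000, repeats=20):
    """Average seconds per elementwise multiply of two length-n vectors."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.rand(n, dtype=dtype, device=device)
    b = torch.rand(n, dtype=dtype, device=device)
    out = torch.empty_like(a)

    def sync():
        # CUDA kernels launch asynchronously, so wait before reading the clock.
        if device == "cuda":
            torch.cuda.synchronize()

    torch.mul(a, b, out=out)  # warm-up run, excluded from the timing
    sync()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mul(a, b, out=out)
    sync()
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    for dtype in (torch.float32, torch.float64):
        print(dtype, f"{time_multiply(dtype) * 1e3:.3f} ms per multiply")
```

Writing into a preallocated `out` tensor keeps allocation out of the timed loop, and the explicit synchronization matters because otherwise only the kernel launch would be timed.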

    "Of course, float32 takes less memory than float64, but memory is not an issue for my application."

    But memory bandwidth might be. A float64 element is twice the size of a float32 element, so peak memory throughput, counted in elements per second, is half that of float32. A 64-bit type can also introduce a two-way shared memory bank conflict where 32-bit types have none.
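
The bandwidth argument can be sketched with simple arithmetic. The estimate below assumes an elementwise multiply that reads two vectors and writes one, and a peak bandwidth of 208 GB/s, which is an assumed specification figure for the K20c rather than a measured one. The function name is hypothetical.

```python
# Lower bound on the time for c = a * b when memory bandwidth is the limit.
PEAK_BANDWIDTH = 208e9                    # bytes per second (assumed)
ITEM_SIZE = {"float32": 4, "float64": 8}  # bytes per element


def min_multiply_seconds(n: int, dtype: str) -> float:
    """Two vectors are read and one is written: 3 * n elements are moved."""
    bytes_moved = 3 * n * ITEM_SIZE[dtype]
    return bytes_moved / PEAK_BANDWIDTH


if __name__ == "__main__":
    n = 10_000_000
    for dtype in ITEM_SIZE:
        print(f"{dtype}: at least {min_multiply_seconds(n, dtype) * 1e3:.3f} ms")
```

An elementwise multiply performs only one arithmetic operation per 12 or 24 bytes moved, so it is likely to be limited by bandwidth rather than by arithmetic throughput. Under that assumption the expected slowdown for float64 is closer to two times than to three.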