cudafloating-pointnvprof

Why nvprof does not have metrics on floating point division operations?


Using nvprof to measure floating point operations of my sample kernels, it seems that there is no metrics for flop_count_dp_div, and the actual double-precision division operations is measured in terms of add/mul/fma of double-precision and even some fma of single-precision operations.

I am wondering why is the case, and how to deduce the dynamic number of division operations of a kernel from nvprof report if I don't have the source code?

My simple test kernel:

#include <iostream>

__global__ void mul(double a, double* x, double* y) {
  y[threadIdx.x] = a * x[threadIdx.x];
}

__global__ void div(double a, double* x, double* y) {
  y[threadIdx.x] = a / x[threadIdx.x];
}

int main(int argc, char* argv[]) {
  const int kDataLen = 4;

  double a = 2.0f;
  double host_x[kDataLen] = {1.0f, 2.0f, 3.0f, 4.0f};
  double host_y[kDataLen];

  // Copy input data to device.
  double* device_x;
  double* device_y;
  cudaMalloc(&device_x, kDataLen * sizeof(double));
  cudaMalloc(&device_y, kDataLen * sizeof(double));
  cudaMemcpy(device_x, host_x, kDataLen * sizeof(double),
             cudaMemcpyHostToDevice);

  // Launch the kernel.
  mul<<<1, kDataLen>>>(a, device_x, device_y);
  div<<<1, kDataLen>>>(a, device_x, device_y);

  // Copy output data to host.
  cudaDeviceSynchronize();
  cudaMemcpy(host_y, device_y, kDataLen * sizeof(double),
             cudaMemcpyDeviceToHost);

  // Print the results.
  for (int i = 0; i < kDataLen; ++i) {
    std::cout << "y[" << i << "] = " << host_y[i] << "\n";
  }

  cudaDeviceReset();
  return 0;
}

And nvprof output of the two kernels:

nvprof --metrics flop_count_sp          \
       --metrics flop_count_sp_add      \
       --metrics flop_count_sp_mul      \
       --metrics flop_count_sp_fma      \
       --metrics flop_count_sp_special  \
       --metrics flop_count_dp          \
       --metrics flop_count_dp_add      \
       --metrics flop_count_dp_mul      \
       --metrics flop_count_dp_fma      \
       ./a.out
==14380== NVPROF is profiling process 14380, command: ./a.out
==14380== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "mul(double, double*, double*)" (done)
Replaying kernel "div(double, double*, double*)" (done)
y[0] = 24 internal events
y[1] = 1
y[2] = 0.666667
y[3] = 0.5
==14380== Profiling application: ./a.out
==14380== Profiling result:
==14380== Metric result:
Invocations                               Metric Name                                    Metric Description         Min         Max         Avg
Device "GeForce GTX 1080 Ti (0)"
    Kernel: mul(double, double*, double*)
          1                             flop_count_sp           Floating Point Operations(Single Precision)           0           0           0
          1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
          1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           0           0           0
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           0           0           0
          1                             flop_count_dp           Floating Point Operations(Double Precision)           4           4           4
          1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           4           4           4
          1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)           0           0           0
    Kernel: div(double, double*, double*)
          1                             flop_count_sp           Floating Point Operations(Single Precision)           8           8           8
          1                         flop_count_sp_add       Floating Point Operations(Single Precision Add)           0           0           0
          1                         flop_count_sp_mul        Floating Point Operation(Single Precision Mul)           0           0           0
          1                         flop_count_sp_fma       Floating Point Operations(Single Precision FMA)           4           4           4
          1                     flop_count_sp_special   Floating Point Operations(Single Precision Special)           4           4           4
          1                             flop_count_dp           Floating Point Operations(Double Precision)          44          44          44
          1                         flop_count_dp_add       Floating Point Operations(Double Precision Add)           0           0           0
          1                         flop_count_dp_mul       Floating Point Operations(Double Precision Mul)           4           4           4
          1                         flop_count_dp_fma       Floating Point Operations(Double Precision FMA)          20          20          20


Solution

  • it seems that there is no metrics for flop_count_dp_div, t

    Because there are no floating point division instructions in CUDA hardware.

    and the actual double-precision division operations is measured in terms of add/mul/fma of double-precision and even some fma of single-precision operations.

    Because floating point division is implemented using a Newton Raphson iterative method using multiply-add and multiply operations. Possibly even in mixed precision (thus the single precision operations)

    how to deduce the dynamic number of division operations of a kernel from nvprof report if I don't have the source code?

    You really can't.