How to compute the achieved FLOPS of an MPI program that calls cuBLAS functions


I am accelerating an MPI program using cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage, and other GPU metrics after the program has run, especially the FLOPS.

I have read the related question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:

  1. The model operation count divided by the execution time of the operation
  2. Using NVIDIA's profiling tools

The first solution doesn't depend on any tools. But I'm not sure what "model count" means. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the execution time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?

The second solution uses nvprof, which is now deprecated and has been replaced by Nsight Systems and Nsight Compute. But those two tools only work for CUDA programs, not for MPI programs that launch CUDA functions. So I am wondering whether there is a tool that can profile an MPI program that launches CUDA functions.

I have been searching for an answer for two days but still can't find an acceptable solution.


Solution

  • But I'm not sure what "model count" means. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply two matrices of 4 x 5 and 5 x 6 and the execution time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS is 120 / 0.5 = 240?

    The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C. For A (m by k), B (k by n) and C (m by n), each inner product of a row of A and a column of B, multiplied by alpha, costs 2 * k + 1 flop. There are m * n inner products in A dot B, and adding beta * C to that product costs another 2 * m * n flop. So the total model flop count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.

    For your example, assuming alpha = 1 and beta = 0 and that the implementation is smart enough to skip the extra operations (and most are), the GEMM flop count is (2 * 5) * (4 * 6) = 240, and if the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.
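
The counting rule above can be written as a small helper. This is only a minimal Python sketch, not part of cuBLAS or any other library: the names `gemm_model_flops` and `model_throughput` are hypothetical, and the helper assumes an implementation that skips the scaling when alpha is 1 and skips the beta * C update when beta is 0.

```python
def gemm_model_flops(m, n, k, alpha=1.0, beta=0.0):
    """Model flop count for C <- alpha * (A dot B) + beta * C.

    A is m by k, B is k by n, C is m by n. Hypothetical helper for
    illustration; assumes alpha == 1 and beta == 0 cost nothing.
    """
    per_element = 2 * k          # one inner product, 2 * k flop by the usual convention
    if alpha != 1.0:
        per_element += 1         # scale the inner product by alpha
    if beta != 0.0:
        per_element += 2         # multiply C by beta, then add it to the product
    return per_element * m * n


def model_throughput(flops, seconds):
    """Model arithmetic throughput in flop/s."""
    return flops / seconds


flops = gemm_model_flops(4, 6, 5)            # 4 x 5 times 5 x 6
print(flops, model_throughput(flops, 0.5))   # 240 480.0
```

With alpha = 2 and beta = 0.5 the same call would count (2 * 5 + 3) * (4 * 6) = 312 flop, matching the general formula.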

    I would recommend using that approach if you really need to calculate the performance of GEMM (or other BLAS/LAPACK operations). This is how most of the computer linear algebra literature and benchmarking has worked since the 1970s, and how most reported results you will find are calculated, including the HPC LINPACK benchmark.
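
To apply the model-count approach, the measured time has to cover the whole operation. The following is a hedged Python sketch of the measurement pattern only: `measured_flops_per_second` and `run_gemm` are hypothetical names, and a naive pure-Python matrix product stands in for the GEMM call so that the example runs without a GPU. In a real MPI + cuBLAS program, the timed call would have to wait for the GPU to finish before the timer stops, because cuBLAS calls can return to the host before the work completes.

```python
import time


def naive_matmul(a, b):
    """Stand-in for a GEMM call: plain-Python product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def measured_flops_per_second(run_gemm, m, n, k, repeats=5):
    """Time run_gemm and convert the model count 2 * k * m * n into flop/s.

    run_gemm must not return until the work is finished; a GPU version
    would need to synchronize before returning (assumption, not shown).
    """
    run_gemm()                                   # warm-up run, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        run_gemm()
    seconds = (time.perf_counter() - start) / repeats   # mean time per call
    return (2 * k * m * n) / seconds


m, k, n = 40, 50, 60
a = [[1.0] * k for _ in range(m)]
b = [[1.0] * n for _ in range(k)]
rate = measured_flops_per_second(lambda: naive_matmul(a, b), m, n, k)
print(f"{rate:.3e} flop/s (model count, alpha = 1, beta = 0)")
```

Averaging over several repeats after a warm-up run is a design choice to reduce the effect of one-time setup costs on the reported rate.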