I ran the following code with cuFFT (CUDA 9) on an NVIDIA GTX 1080. The code is identical for every execution, yet the execution time (shown below the code) varies a lot. Can anyone explain this behavior and how to get the lowest time consistently?
int NX = 2048;
int BATCH = 96;
cufftHandle plan;
cufftHandle rev_plan;
cufftDoubleReal *idata;
cufftDoubleComplex *odata;
int BLOCKSIZE = 1024;
int gridSize = (NX * BATCH)/BLOCKSIZE;
cufftPlan1d(&plan, NX, CUFFT_D2Z, BATCH);
cufftPlan1d(&rev_plan, NX, CUFFT_Z2D, BATCH);
cudaMalloc((void **) &idata, sizeof(cufftDoubleReal) * NX * BATCH);
cudaMalloc((void **) &odata, sizeof(cufftDoubleComplex) * (NX / 2 + 1) * BATCH);
inputData<<<gridSize, BLOCKSIZE>>>(idata, NX * BATCH);
double sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
printf("Time taken: %f\n", omp_get_wtime() - sT);
cudaFree(idata);
cudaFree(odata);
Time taken: 0.004334
Time taken: 0.022906
Time taken: 0.027820
Time taken: 0.027786
Calls to cuFFT routines can be asynchronous: the call may return before the GPU work is done.
This only holds up to a certain limit, however. There is an asynchronous launch queue; once you fill the queue, a new slot opens up only when a queue item is dispatched, and at that point the launch process is no longer asynchronous.
This is what is skewing your timing results.
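To see the queue-filling behavior in isolation, here is a minimal sketch (not from the original answer; the spin kernel, the cycle count, and the thresholds are invented for illustration). It launches a busy-wait kernel repeatedly and reports the first launch whose host-side cost jumps, i.e. the first launch that had to wait for a queue slot:

#include <cstdio>
#include <cuda_runtime_api.h>
#include <omp.h>

// Busy-wait on the GPU for roughly the given number of clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    spin<<<1, 1>>>(1000000);          // warm-up launch
    cudaDeviceSynchronize();
    for (int i = 0; i < 4096; ++i) {
        double sT = omp_get_wtime();
        spin<<<1, 1>>>(1000000);      // work piles up faster than it drains
        double dt = omp_get_wtime() - sT;
        // Early launches return in microseconds; once the queue is full,
        // a launch blocks until a slot frees up (~ one kernel duration).
        if (dt > 1e-4) {
            printf("launch %d blocked for %f s\n", i, dt);
            break;
        }
    }
    cudaDeviceSynchronize();
    return 0;
}

The exact queue depth is a driver and hardware detail and varies; the point is only that launches are cheap until the queue fills.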
To "fix" this, add a cudaDeviceSynchronize() call before the end of each timing region (i.e. immediately before each printf statement). This forces all GPU work to complete before the timing measurement finishes, which evens out the results considerably.
$ cat t37.cu
#include <cufft.h>
#include <omp.h>
#include <cuda_runtime_api.h>
#include <cstdio>
int main(){
const int NX = 2048;
const int BATCH = 96;
cufftHandle plan;
cufftHandle rev_plan;
cufftDoubleReal *idata;
cufftDoubleComplex *odata;
//int BLOCKSIZE = 1024;
//int gridSize = (NX * BATCH)/BLOCKSIZE;
cufftPlan1d(&plan, NX, CUFFT_D2Z, BATCH);
cufftPlan1d(&rev_plan, NX, CUFFT_Z2D, BATCH);
cudaMalloc((void **) &idata, sizeof(cufftDoubleReal) * NX * BATCH);
cudaMalloc((void **) &odata, sizeof(cufftDoubleComplex) * (NX / 2 + 1) * BATCH);
//inputData << < gridSize, BLOCKSIZE >> > (idata, NX * BATCH);
double sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
#ifdef FIX
cudaDeviceSynchronize();
#endif
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
#ifdef FIX
cudaDeviceSynchronize();
#endif
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
#ifdef FIX
cudaDeviceSynchronize();
#endif
printf("Time taken: %f\n", omp_get_wtime() - sT);
sT = omp_get_wtime();
for (int i = 0; i < 500; ++i) {
cufftExecD2Z(plan, idata, odata);
cufftExecZ2D(rev_plan, odata, idata);
}
#ifdef FIX
cudaDeviceSynchronize();
#endif
printf("Time taken: %f\n", omp_get_wtime() - sT);
cudaFree(idata);
cudaFree(odata);
}
$ nvcc -o t37 t37.cu -lcufft -lgomp
$ ./t37
Time taken: 0.007373
Time taken: 0.185308
Time taken: 0.196998
Time taken: 0.196857
$ nvcc -o t37 t37.cu -lcufft -lgomp -DFIX
$ ./t37
Time taken: 0.197076
Time taken: 0.196994
Time taken: 0.196937
Time taken: 0.196916
$
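As an aside (not part of the original answer): CUDA events are another way to time this kind of asynchronous work, since the events are enqueued in the same stream as the transforms. Here is a sketch of a drop-in replacement for one of the timing regions in t37.cu, assuming the same plan, rev_plan, idata, and odata setup as above:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);                  // enqueued ahead of the FFT work
for (int i = 0; i < 500; ++i) {
    cufftExecD2Z(plan, idata, odata);
    cufftExecZ2D(rev_plan, odata, idata);
}
cudaEventRecord(stop);                   // enqueued behind the FFT work
cudaEventSynchronize(stop);              // wait only for this region's work
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
printf("Time taken: %f\n", ms / 1000.0f);
cudaEventDestroy(start);
cudaEventDestroy(stop);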
One might ask: why is the total time without the cudaDeviceSynchronize() calls apparently lower than the total time with them? This is due to the same mechanism. The asynchronous launch queue is still full of pending work when the program terminates, and without a final cudaDeviceSynchronize() that work is never waited on, so it never shows up in any timing region. That is the source of the apparent discrepancy between the summed execution times in the two cases. The effect can be observed by adding only a single cudaDeviceSynchronize() call at the very end of the program.
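For example, a sketch (not part of the original program) of what that would look like: keep the four unsynchronized timing regions exactly as in the question, but wrap them in one overall timer with a single sync at the end:

double totT = omp_get_wtime();
// ... the four unsynchronized timing regions from the question ...
cudaDeviceSynchronize();  // the only sync: drains everything still queued
printf("Total: %f\n", omp_get_wtime() - totT);
// The per-region numbers still look skewed, but the total now covers all
// of the GPU work and should roughly match the sum of the -DFIX timings.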