
Simulation loop GPU utilization


I am struggling to get good GPU utilization in a simulation loop. Three kernels are launched in every cycle, and the size of the next time step is computed by the second kernel.

while (time < end)
{
  kernel_Flux<<<gs, bs>>>(...);
  kernel_Timestep<<<gs, bs>>>(d_timestep);
  // blocking copy of the newly computed time step back to the host
  cudaMemcpy(&h_timestep, d_timestep, sizeof(float), cudaMemcpyDeviceToHost);
  kernel_Integrate<<<gs, bs>>>(d_timestep);
  time += h_timestep;
}

I only need to copy back a single float. What would be the most efficient way to avoid unnecessary synchronization?

Thank you in advance. :-)


Solution

  • The ideal solution would be to move everything to the GPU. However, I cannot do so, because I need to launch CUDPP compact after every few iterations, and it supports neither CUDA streams nor dynamic parallelism. I know that the Thrust 1.8 library has a copy_if method, which does the same thing and works with dynamic parallelism, but it does not compile with separate compilation turned on.

    To sum up, this is the code I use now:

    while (time < end)
    {
      kernel_Flux<<<gs, bs, 0, stream1>>>(...);
      kernel_Timestep<<<gs, bs, 0, stream1>>>(d_timestep);
      // record an event once d_timestep is ready and make stream2 wait on it
      cudaEventRecord(event, stream1);
      cudaStreamWaitEvent(stream2, event, 0);
      // copy the single float back on stream2 while stream1 moves on to the integration
      cudaMemcpyAsync(&h_timestep, d_timestep, sizeof(float), cudaMemcpyDeviceToHost, stream2);
      kernel_Integrate<<<gs, bs, 0, stream1>>>(d_timestep);
      // wait only for the copy on stream2, not for kernel_Integrate on stream1
      cudaStreamSynchronize(stream2);
      time += h_timestep;
    }
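
    One detail worth noting for the copy to actually overlap: cudaMemcpyAsync is only guaranteed to behave asynchronously when the host buffer is page-locked (pinned); with an ordinary pageable float the transfer can become synchronous with respect to the host. A minimal sketch of the setup, assuming the buffers are allocated once before the loop (h_timestep then becomes a pointer, so the loop reads *h_timestep):

    float *h_timestep = nullptr;   // pinned host copy of the time step
    float *d_timestep = nullptr;   // device value written by kernel_Timestep
    cudaMallocHost(&h_timestep, sizeof(float));   // page-locked host allocation
    cudaMalloc(&d_timestep, sizeof(float));

    // inside the loop the copy and the accumulation then read:
    //   cudaMemcpyAsync(h_timestep, d_timestep, sizeof(float),
    //                   cudaMemcpyDeviceToHost, stream2);
    //   ...
    //   cudaStreamSynchronize(stream2);
    //   time += *h_timestep;

    cudaFreeHost(h_timestep);
    cudaFree(d_timestep);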