I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding: while data is being exchanged, the GPU computes the part of the mesh that is not involved in the communication, and computes the rest of the mesh once the communication is done.
I am trying to solve this by having one stream (computeStream) for the long-running kernel (fluxKernel) and one (communicationStream) for the different phases of communication. The computeStream has a very low priority, in order to allow kernels on the communicationStream to interleave with the fluxKernel, even though it uses all resources.
These are the streams I am using:
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high ) ;
cudaStreamCreateWithPriority (&communicationStream, cudaStreamNonBlocking, priority_high );
cudaStreamCreateWithPriority (&computeStream , cudaStreamNonBlocking, priority_low );
The desired concurrency pattern looks like this:
I need to synchronize the communicationStream before I send the data via MPI, to ensure that the data is completely downloaded before I send it on.
In the following listing I show the structure of what I am currently doing. First I start the long-running fluxKernel for the main part of the mesh on the computeStream. Then I start a sendKernel that collects the data that should be sent to the second GPU and subsequently download it to the host (I cannot use CUDA-aware MPI due to hardware limitations). The data is then sent non-blocking via MPI_Isend, and a blocking receive (MPI_Recv) is used subsequently. When the data is received, the procedure is done in reverse: first the data is uploaded to the device and then spread into the main data structure by the recvKernel. Finally the fluxKernel is called for the remaining part of the mesh on the communicationStream.
Note that before and after the shown code, kernels are run on the default stream.
{ ... } // Preparations
// Start main part of computatation on first stream
fluxKernel<<< ..., ..., 0, computeStream >>>( /* main Part */ );
// Prepare send data
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
// MPI Communication
MPI_Isend( ... );
MPI_Recv ( ... );
// Use received data
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
fluxKernel<<< ..., ..., 0, communicationStream >>>( /* remaining Part */ );
{ ... } // Rest of the Computations
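As a side note: instead of synchronizing the whole communicationStream before the MPI_Isend, one could also wait only for the download itself by recording an event after the copy. The following is just a minimal sketch of that idea (hostBuffer, deviceBuffer and numBytes are placeholder names; this is not the code I am actually running):

cudaEvent_t downloadDone;
cudaEventCreateWithFlags( &downloadDone, cudaEventDisableTiming );

// Prepare send data
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( hostBuffer, deviceBuffer, numBytes, cudaMemcpyDeviceToHost, communicationStream );
cudaEventRecord ( downloadDone, communicationStream ); // marks completion of the download

// MPI Communication
cudaEventSynchronize( downloadDone ); // host waits only until the download has finished
MPI_Isend( ... );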
I used nvprof and the Visual Profiler to see whether the streams actually execute concurrently. This is the result:
I observe that the sendKernel (purple), the upload, the MPI communication and the download run concurrently with the fluxKernel. The recvKernel (red), however, only starts after the other stream is finished. Turning off the synchronization does not solve the problem:
For my real application I have not just one communication, but several. I tested this with two communications as well. The procedure is:
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );
MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
The result is similar to the case with one communication (above), in the sense that the second kernel invocation (this time it is a sendKernel) is delayed until the kernel on the computeStream is finished.
Hence the overall observation is that the second kernel invocation is delayed, independent of which kernel it is.
Can you explain why the GPU is synchronizing in this way, or how I can get the second kernel on the communicationStream to also run concurrently with the computeStream?
Thank you very much.
Edit 1: complete rework of the question
I built a minimal reproducible example. In the end the code prints the int data to the terminal. The correct last value would be 32778 (= (32*1024 - 1) + 1 + 10). At the beginning I added an option integer to test 3 different options: option 0 does the copies on streamSmall and synchronizes that stream before the CPU modification, option 1 does the copies on streamSmall without synchronization, and option 2 does the copies on a separate streamCopy without synchronization.
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
const int option = 0;
const int numberOfEntities = 2 * 1024 * 1024;
const int smallNumberOfEntities = 32 * 1024;
__global__ void longKernel(float* dataDeviceIn, float* dataDeviceOut, int numberOfEntities)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;
    float tmp = dataDeviceIn[index];
#pragma unroll
    for( int i = 0; i < 2000; i++ ) tmp += 1.0;
    dataDeviceOut[index] = tmp;
}
__global__ void smallKernel_1( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;
    smallDeviceData[index] = index;
}
__global__ void smallKernel_2( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if(index >= numberOfEntities) return;
    int value = smallDeviceData[index];
    value += 10;
    smallDeviceData[index] = value;
}
int main(int argc, char **argv)
{
    cudaSetDevice(0);
    float* dataDeviceIn;
    float* dataDeviceOut;
    cudaMalloc( &dataDeviceIn , sizeof(float) * numberOfEntities );
    cudaMalloc( &dataDeviceOut, sizeof(float) * numberOfEntities );
    int* smallDataDevice;
    int* smallDataHost;
    cudaMalloc    ( &smallDataDevice, sizeof(int) * smallNumberOfEntities );
    cudaMallocHost( &smallDataHost  , sizeof(int) * smallNumberOfEntities );
    cudaStream_t streamLong;
    cudaStream_t streamSmall;
    cudaStream_t streamCopy;
    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange(&priority_low , &priority_high );
    cudaStreamCreateWithPriority (&streamLong , cudaStreamNonBlocking, priority_low );
    cudaStreamCreateWithPriority (&streamSmall, cudaStreamNonBlocking, priority_high );
    cudaStreamCreateWithPriority (&streamCopy , cudaStreamNonBlocking, priority_high );
    //////////////////////////////////////////////////////////////////////////
    longKernel <<< numberOfEntities / 32, 32, 0, streamLong >>> (dataDeviceIn, dataDeviceOut, numberOfEntities);
    //////////////////////////////////////////////////////////////////////////
    smallKernel_1 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);
    if( option <= 1 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamCopy );
    if( option == 0 ) cudaStreamSynchronize( streamSmall );
    // some CPU modification of data
    for( int i = 0; i < smallNumberOfEntities; i++ ) smallDataHost[i] += 1;
    if( option <= 1 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamCopy );
    smallKernel_2 <<< smallNumberOfEntities / 32, 32, 0 , streamSmall >>> (smallDataDevice, smallNumberOfEntities);
    //////////////////////////////////////////////////////////////////////////
    cudaDeviceSynchronize();
    cudaMemcpy( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost );
    for( int i = 0; i < smallNumberOfEntities; i++ ) std::cout << smallDataHost[i] << "\n";
    return 0;
}
With this code I see the same behavior as described above:
Option 1 (wrong result, the +1 from the CPU is missing):
Option 2 (completely wrong result, all 10, the download happens before smallKernel_1)
Running option 0 under Linux (following the suggestion in Robert's answer) brings the expected behavior!
Here's how I would try to accomplish this. I use cudaDeviceSynchronize() once per loop/frame, to break (separate) the processing of one frame from the next. The arrangement of activities within a frame is handled entirely with CUDA stream semantics, to enforce serialization for activities which depend on each other, but to allow concurrency for activities that don't. Here's a sample code that implements these ideas:
#include <iostream>
#include <unistd.h>
#include <cstdio>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
typedef double mt;
const int nTPB = 512;
const size_t ds = 100ULL*1048576;
const size_t bs = 1048576ULL;
const int my_intensity = 1;
const int loops = 4;
const size_t host_func_delay_us = 100;
const int max_blocks = 320; // chosen based on GPU, could use runtime calls to set this via cudaGetDeviceProperties
template <typename T>
__global__ void fluxKernel(T * __restrict__ d, const size_t n, const int intensity){
  size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x;
  if (idx < n){
    T temp = d[idx];
    for (int i = 0; i < intensity; i++)
      temp = sin(temp); // just some dummy code to simulate "real work"
    d[idx] = temp;
  }
}

template <typename T>
__global__ void sendKernel(const T * __restrict__ d, const size_t n, T * __restrict__ b){
  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    b[idx] = d[idx];
}

template <typename T>
__global__ void recvKernel(const T * __restrict__ b, const size_t n, T * __restrict__ d){
  for (size_t idx = ((size_t)blockDim.x) * blockIdx.x + threadIdx.x; idx < n; idx += ((size_t)blockDim.x)*gridDim.x)
    d[idx] = b[idx];
}

void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t status, void *data){
  printf("Loop %lu callback\n", (size_t)data);
  usleep(host_func_delay_us); // simulate: this is where non-cuda-aware MPI calls would go, operating on h_buf
}
int main(){
  // get the range of stream priorities for this device
  int priority_high, priority_low;
  cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
  // create streams with highest and lowest available priorities
  cudaStream_t st_high, st_low;
  cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
  cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);
  // allocations
  mt *h_buf, *d_buf, *d_data;
  cudaMalloc(&d_data, ds*sizeof(d_data[0]));
  cudaMalloc(&d_buf, bs*sizeof(d_buf[0]));
  cudaHostAlloc(&h_buf, bs*sizeof(h_buf[0]), cudaHostAllocDefault);
  cudaCheckErrors("setup error");
  // main processing loop
  for (unsigned long i = 0; i < loops; i++){
    // issue low-priority
    fluxKernel<<<((ds-bs)+nTPB)/nTPB, nTPB,0,st_low>>>(d_data+bs, ds-bs, my_intensity);
    // issue high-priority
    sendKernel<<<max_blocks,nTPB,0,st_high>>>(d_data, bs, d_buf);
    cudaMemcpyAsync(h_buf, d_buf, bs*sizeof(h_buf[0]), cudaMemcpyDeviceToHost, st_high);
    cudaStreamAddCallback(st_high, MyCallback, (void*)i, 0);
    cudaMemcpyAsync(d_buf, h_buf, bs*sizeof(h_buf[0]), cudaMemcpyHostToDevice, st_high);
    recvKernel<<<max_blocks,nTPB,0,st_high>>>(d_buf, bs, d_data);
    fluxKernel<<<((bs)+nTPB)/nTPB, nTPB,0,st_high>>>(d_data, bs, my_intensity);
    cudaDeviceSynchronize();
    cudaCheckErrors("loop error");
  }
  return 0;
}
Here is the Visual Profiler timeline output (on Linux, Tesla V100):
Note that arranging complex concurrency scenarios can be quite challenging on Windows WDDM. I would recommend avoiding that; this answer does not intend to discuss all the challenges there. I suggest using Linux, or GPUs in TCC mode on Windows, to do this.
If you try this code on your machine, you may need to adjust some of the various constants to get things to look like this.
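As a footnote on the max_blocks constant: as the comment in the code indicates, it could also be set via runtime calls instead of being hard-coded for a particular GPU. A rough sketch of one way to do that with the occupancy API (this is just an illustration, not part of the tested code above):

int dev = 0, num_sms = 0, blocks_per_sm = 0;
cudaGetDevice(&dev);
// number of multiprocessors on the current device
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
// how many blocks of sendKernel<mt> with nTPB threads fit on one multiprocessor
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, sendKernel<mt>, nTPB, 0);
int max_blocks_rt = blocks_per_sm * num_sms; // enough resident blocks to fill the device once; use in place of the hard-coded max_blocks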