Given this code:
void foo(cv::gpu::GpuMat const &src, cv::gpu::GpuMat *dst[], cv::Size const dst_size[], size_t numImages)
{
    cudaStream_t streams[numImages];
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamCreateWithFlags(&streams[image], cudaStreamNonBlocking);
        dim3 Threads(32, 16);
        dim3 Blocks((dst_size[image].width + Threads.x - 1)/Threads.x,
                    (dst_size[image].height + Threads.y - 1)/Threads.y);
        myKernel<<<Blocks, Threads, 0, streams[image]>>>(src, dst[image], dst_size[image]);
    }
    for (size_t image = 0; image < numImages; ++image)
    {
        cudaStreamSynchronize(streams[image]);
        cudaStreamDestroy(streams[image]);
    }
}
Looking at the output of nvvp, I see almost perfectly serial execution, even though the first stream is a lengthy process that the others should be able to overlap with.
Note that my kernel uses 30 registers, and all launches report an "Achieved Occupancy" of around 0.87. For the smallest image, Grid Size is [10,15,1] and Block Size is [32,16,1].
The conditions that limit concurrent kernel execution are described in the CUDA programming guide, but the gist is that your GPU can run kernels from different streams concurrently only if it has sufficient free resources to do so.
In your use case, you have said that you are running multiple launches of a kernel with 150 blocks of 512 threads each. Your GPU has 12 SMM (I think), and you could have up to 4 blocks per SMM running concurrently (4 * 512 = 2048 threads, which is the SMM limit). So your GPU can only run a maximum of 4 * 12 = 48 blocks concurrently. Even the smallest launch therefore needs roughly three times as many block slots as the GPU has. With multiple launches of 150 or more blocks sitting in the command pipeline, it would seem that there is little (perhaps even no) opportunity for concurrent kernel execution.
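To make the arithmetic above easy to repeat for other block sizes, here is a minimal host-side sketch. It is only an upper bound: it looks at the thread limit and the block-slot limit per SM and ignores register and shared memory limits. The struct and function names are made up for illustration, the 12 SM figure is the guess from above, and the 32 blocks per SM slot limit is an assumption about your device. Query the real values with cudaGetDeviceProperties, or let cudaOccupancyMaxActiveBlocksPerMultiprocessor do the full calculation for you.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the limits used in the estimate.
// The values are assumptions; read the real ones from cudaGetDeviceProperties.
struct DeviceLimits {
    int numSMs;           // 12 is the guess made in the answer
    int maxThreadsPerSM;  // 2048 in the answer's arithmetic
    int maxBlocksPerSM;   // block-slot limit per SM (assumed 32 here)
};

// Upper bound on blocks resident on one SM, considering only the
// thread limit and the block-slot limit (registers and shared memory ignored).
int residentBlocksPerSM(const DeviceLimits &dev, int threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return std::min(dev.maxThreadsPerSM / threadsPerBlock, dev.maxBlocksPerSM);
}

// Upper bound on blocks resident on the whole device at once.
int residentBlocksOnDevice(const DeviceLimits &dev, int threadsPerBlock)
{
    return dev.numSMs * residentBlocksPerSM(dev, threadsPerBlock);
}

// How many full "waves" of resident blocks one launch needs.
// More than one wave means the launch alone can fill the GPU.
int wavesNeeded(const DeviceLimits &dev, int threadsPerBlock, int blocksInGrid)
{
    int resident = residentBlocksOnDevice(dev, threadsPerBlock);
    assert(resident > 0);
    return (blocksInGrid + resident - 1) / resident;
}
```

With DeviceLimits{12, 2048, 32} and 512 threads per block, this gives 4 blocks per SM and 48 blocks on the device, so a 150-block launch needs 4 waves. Blocks from another stream can realistically only slip in while the last wave is draining.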
You might be able to encourage kernel execution overlap if you make the scheduling granularity of your kernel finer by reducing the block size. Smaller blocks are more likely to find available resources and scheduling slots than larger blocks. Similarly, reducing the total block count per kernel launch (probably by increasing the parallel work per thread) might also help increase the potential for overlap or concurrent execution of multiple kernels.
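As a rough illustration of both suggestions, the sketch below computes the grid you would launch for a given block shape and a given tile of pixels per thread. Everything in it is hypothetical: Dim2 is a plain stand-in for dim3 so the snippet compiles without the CUDA headers, gridFor is a made-up helper, and the 320 x 240 image used in the note afterwards is just one size consistent with your [10,15,1] grid. Your kernel would also have to be changed to loop over its tile, which is not shown here.

```cpp
#include <cassert>

// Plain stand-in for dim3 so this compiles as ordinary C++ (hypothetical type).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer division rounding up, as in the original launch code.
unsigned divUp(unsigned a, unsigned b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Grid needed to cover a width x height image when every thread
// processes a tile of pixelsPerThread.x by pixelsPerThread.y pixels.
Dim2 gridFor(unsigned width, unsigned height, Dim2 block, Dim2 pixelsPerThread)
{
    Dim2 grid;
    grid.x = divUp(width, block.x * pixelsPerThread.x);
    grid.y = divUp(height, block.y * pixelsPerThread.y);
    return grid;
}

// Total number of blocks in a 2D grid.
unsigned blockCount(Dim2 grid)
{
    return grid.x * grid.y;
}
```

For an assumed 320 x 240 image, a 32 x 16 block with one pixel per thread reproduces your 10 x 15 = 150 blocks of 512 threads. A 32 x 4 block where each thread handles 4 rows still gives 150 blocks, but of only 128 threads each, so each block needs a quarter of the thread slots. Letting each thread also handle 2 columns halves that to 75 blocks. Whether this actually produces overlap on your GPU is something only the profiler can tell you.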