cuda dynamic-parallelism

Can a CUDA parent kernel launch a child kernel with more threads than the parent?


I'm trying to learn how to use CUDA Dynamic Parallelism.

I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with only 1 block of 1 thread, like so:

int nItems = 100;
parentKernel<<<1,1>>>(nItems);

Now, inside my parent kernel, I create the work, then launch a sub-kernel, like this:

__global__ void parentKernel(int nItems)
{
    // create some work
    
    // invoke child kernel
    childKernel<<<2, 256>>>();
}

Notice that the child kernel is launched with more blocks and threads (2x256) than the parent kernel was given (1x1).

Will the child kernel actually run 512 threads in parallel? Or does the parent kernel have to divide its own threads among its children?


Solution

  • A child kernel launch behaves just like a kernel launch from host code. The parent kernel's launch configuration places no restriction on the grid size of a child kernel.

    Your child kernel will run with 2 blocks of 256 threads each, exactly as if you had launched it from host code. 512 threads fit easily on any GPU that supports dynamic parallelism (compute capability 3.5 or higher), so they would all run "in parallel" on any GPU I can think of.
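
If you want to check this yourself, here is a minimal sketch rather than a definitive implementation. It assumes a child kernel that takes a counter pointer (the `childKernel` in the question takes no arguments), and the names `counter` and `countChildThreads` are made up for illustration. Each child thread increments the counter once, so the value read back on the host should equal the number of child threads that actually ran.

```cuda
#include <cuda_runtime.h>

// Assumed child kernel: every thread adds 1 to a global counter,
// so the final value is the number of child threads that executed.
__global__ void childKernel(int *counter)
{
    atomicAdd(counter, 1);
}

// The parent runs as a single thread (1 block of 1 thread) but
// launches a child grid of 2 blocks x 256 threads.
__global__ void parentKernel(int *counter)
{
    childKernel<<<2, 256>>>(counter);
}

// Hypothetical host helper: returns the number of child threads that
// ran, or -1 if any CUDA call reported an error.
int countChildThreads()
{
    int *dCounter = nullptr;
    int hCounter = 0;

    if (cudaMalloc(&dCounter, sizeof(int)) != cudaSuccess) return -1;

    cudaError_t err = cudaMemset(dCounter, 0, sizeof(int));
    if (err == cudaSuccess) {
        parentKernel<<<1, 1>>>(dCounter);
        // A parent grid is not complete until its child grids are,
        // so synchronizing on the host waits for the children too.
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(&hCounter, dCounter, sizeof(int),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dCounter);
    return (err == cudaSuccess) ? hCounter : -1;
}
```

Call `countChildThreads()` from a host `main` and print the result. Dynamic parallelism needs relocatable device code and the device runtime library, so build with something like `nvcc -arch=sm_35 -rdc=true file.cu -lcudadevrt` (adjust `-arch` to your GPU). If the count matches 2 x 256, the child grid was not limited by the 1x1 parent launch.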