cuda, nvidia, dynamic-parallelism

What is the actual maximum nesting depth of dynamic parallelism in CUDA?


Without getting into too much detail, the project I'm working on needs three different phases, each corresponding to a different kernel. I only know the number of threads needed in the second phase once the previous phase has ended. Also, the second phase may have to be executed several times, each time potentially with a different number of threads, and I only know whether it needs to be called again after its previous execution has ended.

Right now the host is responsible for launching all the phases, but I'm moving the launch of the phases to the device, hoping to remove some overhead.

The idea is that the first thread of the last active warp launches the grid executing the next phase. Since many repetitions of the second phase may be needed, I need to ensure that the maximum nesting depth for dynamic parallelism is not exceeded. If it is, control returns to the host, which launches a new grid.
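Roughly, the device-side orchestration I have in mind looks like the sketch below. Everything here is a placeholder to illustrate the idea: the kernel name phase2, the depth counter, the relaunch condition, and the launch configuration. For simplicity, thread 0 of block 0 does the launch rather than the first thread of the last active warp, and the code assumes compilation with -rdc=true on a device that supports dynamic parallelism.

```cuda
// Compile with something like: nvcc -rdc=true -arch=sm_70 ...
__global__ void phase2(int depth, int maxDepth)
{
    // ... phase-two work ...

    // One thread decides whether another repetition is needed
    // (simplified to thread 0 of block 0 here).
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        bool runAgain   = false; // placeholder: decided from this pass's results
        int  numThreads = 128;   // placeholder: computed during this pass

        if (runAgain) {
            if (depth + 1 < maxDepth) {
                // Still within the nesting budget: relaunch from the device.
                phase2<<<1, numThreads>>>(depth + 1, maxDepth);
            } else {
                // Budget exhausted: set a flag in global memory so the host
                // takes over and launches the next repetition itself.
            }
        }
    }
}
```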

This was just the introduction to my problem; here comes my question. What is the value of the maximum nesting depth for dynamic parallelism? I read here the following:

There is also a hardware limit on maximum nesting depth, and thus synchronization depth; as of Compute Capability 3.5, the hardware limit on depth is 24 levels.

It's not clear whether the first grid, the one launched by the host, is included in this count. In other words, it is not clear whether the chain of grid launches can be 24 or 25 grids long in total. The following line from this source seems to suggest that the first grid is excluded from the count, so that a chain of 24 grids can be launched from inside the device, for a total of 25 grids including the one launched by the host.

Kernels are launched at depth 0 from the host -> recursive launches only work up to the given hardware limit

Wanting to make sure I understood this correctly, I wrote a simple program that "recursively" launches a given number of grids. To my surprise, the program seemed to work fine with a nesting depth of 500. Adding a very long for loop after the launch of the child grid made the program end incorrectly at a depth of 127 (still much more than the hypothetical maximum depth). Now, it is also possible that errors were happening under the hood at small depths in both cases, but that these errors did not manifest in any obvious way. (The failure at depth 127 turned out to be caused by an issue in the for loop.)
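A stripped-down sketch of the kind of test I mean (the kernel name and the error reporting are just illustrative; it assumes -rdc=true and a device supporting dynamic parallelism):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each grid launches a single child grid until the requested depth is reached.
__global__ void nested(int depth, int maxDepth)
{
    if (threadIdx.x == 0) {
        printf("reached depth %d\n", depth);
        if (depth + 1 < maxDepth) {
            nested<<<1, 1>>>(depth + 1, maxDepth);
            // Check for an immediate launch error; some failures may only
            // surface later, if at all.
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess)
                printf("launch failed at depth %d: %s\n",
                       depth, cudaGetErrorString(err));
        }
    }
}

int main()
{
    nested<<<1, 1>>>(0, 500); // ask for a chain of 500 grids
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("host-side error: %s\n", cudaGetErrorString(err));
    return 0;
}
```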

Does anyone know more about this maximum depth? Is it a minimum guarantee, so that the program may or may not still work if it is exceeded? Does it have something to do with synchronization between the parent and child grid (a feature that has been removed in recent compute capabilities)? Does it only affect active grids, so that an inactive grid with a child does not count towards the limit, even though it is only effectively retired once all its children have terminated as well? (This last possibility seems highly unlikely.)

Any insight is welcome.


Solution

  • Given that even for CDP1

    Each subordinate launch is considered a new nesting level, and the total number of levels is the nesting depth of the program. The synchronization depth is defined as the deepest level at which the program will explicitly synchronize on a child launch. Typically this is one less than the nesting depth of the program, but if the program does not need to call cudaDeviceSynchronize() at all levels then the synchronization depth might be substantially different to the nesting depth.
    CUDA C++ Programming Guide 12.9.1

    Your results do not seem surprising. With CDP2 the concept of synchronization depth does not really make sense, as one cannot synchronize from inside a kernel.
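For completeness: under CDP1, the synchronization depth was the resource the host had to reserve explicitly if a kernel wanted to call cudaDeviceSynchronize() deeper than the default of two levels. A minimal host-side sketch (the value 8 is arbitrary; under CDP2 this limit is moot, and recent toolkits may reject or ignore the call):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // CDP1 only: reserve backing store so that device-side
    // cudaDeviceSynchronize() is legal down to 8 levels of nesting
    // (the default synchronization depth is 2).
    cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));

    size_t depth = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    printf("device runtime sync depth limit: %zu\n", depth);
    return 0;
}
```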