Tags: cuda, synchronization, gpgpu, cuda-streams, cuda-graphs

Is it possible to execute the host execution nodes of multiple CUDA graphs concurrently in different streams?


Investigating possible solutions for this problem, I thought about using CUDA graphs' host execution nodes (cudaGraphAddHostNode). I was hoping to block and unblock streams on the host side, instead of on the device side with a wait kernel, while still using graphs.

I made a test program with two graphs. One graph does a host-to-device copy, calls a host function (via a host execution node) that spins in a loop until the "event" variable becomes non-zero (i.e. waits for the "event"), then does a device-to-host copy. The other graph does a memset on the device memory, then calls a host function that sets the "event" variable to 1 (i.e. signals the "event"). I launch the first graph on one stream, the second on another, then synchronize on the first stream.
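A minimal sketch of this test program follows. It is not my verbatim code: identifiers such as g_event, waitFn, and signalFn are illustrative, error checking is omitted, and cudaGraphInstantiateWithFlags assumes CUDA 11.4 or newer.

```cpp
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative stand-in for the "event" variable described above.
static std::atomic<int> g_event{0};

// Host node callback: spin until the flag becomes non-zero ("wait").
static void waitFn(void*) {
    while (g_event.load() == 0) { /* spin */ }
}

// Host node callback: set the flag ("signal").
static void signalFn(void*) {
    g_event.store(1);
}

int main() {
    const size_t N = 256;
    char hostBuf[N] = {};
    char* devBuf = nullptr;
    cudaMalloc(&devBuf, N);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Graph 1: H2D copy -> "wait" host node -> D2H copy.
    cudaGraph_t g1;
    cudaGraphCreate(&g1, 0);
    cudaGraphNode_t h2d, waitNode, d2h;
    cudaGraphAddMemcpyNode1D(&h2d, g1, nullptr, 0,
                             devBuf, hostBuf, N, cudaMemcpyHostToDevice);
    cudaHostNodeParams waitParams = {};
    waitParams.fn = waitFn;
    cudaGraphAddHostNode(&waitNode, g1, &h2d, 1, &waitParams);
    cudaGraphAddMemcpyNode1D(&d2h, g1, &waitNode, 1,
                             hostBuf, devBuf, N, cudaMemcpyDeviceToHost);

    // Graph 2: memset -> "signal" host node.
    cudaGraph_t g2;
    cudaGraphCreate(&g2, 0);
    cudaGraphNode_t msNode, signalNode;
    cudaMemsetParams mp = {};
    mp.dst = devBuf;
    mp.value = 0;
    mp.elementSize = 1;
    mp.width = N;
    mp.height = 1;
    cudaGraphAddMemsetNode(&msNode, g2, nullptr, 0, &mp);
    cudaHostNodeParams sigParams = {};
    sigParams.fn = signalFn;
    cudaGraphAddHostNode(&signalNode, g2, &msNode, 1, &sigParams);

    cudaGraphExec_t e1, e2;
    cudaGraphInstantiateWithFlags(&e1, g1, 0);
    cudaGraphInstantiateWithFlags(&e2, g2, 0);

    cudaGraphLaunch(e1, s1);  // runs waitFn, which spins on g_event
    cudaGraphLaunch(e2, s2);  // signalFn never runs; see below
    cudaStreamSynchronize(s1);  // in practice this hangs, as described below

    std::printf("done\n");
    return 0;
}
```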

The result was that both graphs were launched as expected, the "wait" host function was executed, and the first stream was blocked successfully. However, even though the second graph was launched, the "signal" host function was never executed.

I realized that CUDA's implementation is likely serializing all host execution nodes in the context, so the "signal" node waits forever for the "wait" node to finish before executing. The documentation even says that "host functions without a mandated order (such as in independent streams) execute in undefined order and may be serialized".

I also tried launching the graphs from separate host threads, but that didn't work.

Is there some kind of way to make host execution nodes on different streams concurrent that I'm missing?


Solution

  • No, this isn't a reliable method. Evidently the CUDA runtime spins up additional thread(s) to handle host callbacks, but their behavior is not detailed or specified anywhere.

    In order for such a thing to work, you would need the two synchronizing agents to each have their own thread, running concurrently. That way, if the waiting agent spun up first, the signaling agent would still be able to execute and deliver a signal (see the plain-threads sketch at the end of this answer).

    But for cudaLaunchHostFunc (and we can surmise the same applies to graph host nodes) it is explicitly stated:

    Host functions without a mandated order (such as in independent streams) execute in undefined order and *may be serialized*.

    (emphasis added)

    Serialization of host functions would make such a scheme unworkable.

    Is there some kind of way to make host execution nodes on different streams concurrent that I'm missing?

    There aren't any additional controls or specifications for this that I am aware of.
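
To make the threading requirement concrete, here is a plain C++ sketch (no CUDA, illustrative names) in which each agent owns a dedicated thread. Because both threads are guaranteed to run concurrently, the waiter spinning first cannot prevent the signaler from running; this is exactly the guarantee that CUDA's host-callback mechanism does not provide.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Plain host-side illustration: each synchronizing agent owns a thread,
// so the waiter spinning first cannot starve the signaler.
std::atomic<int> g_flag{0};

int main() {
    std::thread waiter([] {
        while (g_flag.load() == 0) { /* spin */ }
        std::printf("waiter: signal received\n");
    });
    std::thread signaler([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        g_flag.store(1);  // runs concurrently with the waiter's spin
    });
    waiter.join();
    signaler.join();
    return 0;
}
```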