In the CUDA (driver API) documentation, it says:
The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" prior to the callback.
Does this mean that if I have a pipeline with a callback after each critical point to signal the host, I don't need any cuStreamSynchronize calls at those points in order to safely access the output arrays?
Very simple code like:

    cuda memcpy host to device
    cuda launch kernel
    cuda memcpy device to host
    add callback

    callback()
    {
        // here, is it safe to access the host "results" array?
        // (assuming no further CUDA commands operate on these arrays)
    }
CUDA streams have some fairly simple semantics. One of those is that all activity issued into a stream will execute in-order. Item B, issued into a particular stream, will not begin to execute until item A, issued into that stream prior to B, has completed.
So, yes, the callback, issued into a particular stream, will not begin to execute until all prior activity in that stream has completed.
If you wanted this guarantee in "ordinary" host code (i.e. code that is not wrapped in a CUDA callback), it would require some sort of explicit synchronization, such as cuStreamSynchronize, cuEventSynchronize, a blocking cuMemcpy, or similar.
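As a sketch, the explicit non-callback alternative looks like this (the stream and the results-array name are placeholders, and error checking is omitted):

```c
#include <cuda.h>

/* Non-callback version: the host thread blocks until every prior
 * operation in the stream (copies, kernel) has finished; only then
 * is it safe to read the host results array. */
static float readFirstResult(CUstream stream, const float *h_results)
{
    cuStreamSynchronize(stream);  /* explicit blocking synchronization */
    return h_results[0];          /* now valid to read */
}
```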
For the purposes of this discussion, I'm ignoring CUDA managed memory, and assuming you are doing an explicit copy of data from device memory to host memory, as you have laid out.
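To make the pipeline from the question concrete, here is a minimal driver-API sketch. The kernel launch and module loading are omitted, array sizes and names are placeholders, and error checking is abbreviated; the point is that the callback body can read the host results array without any cuStreamSynchronize.

```c
#include <cuda.h>
#include <stdio.h>

#define N 1024

/* Invoked by the CUDA driver once all work queued into the stream
 * before cuStreamAddCallback (the H2D copy, the kernel, the D2H
 * copy) has completed. Note: a stream callback must not make CUDA
 * API calls of its own. */
static void CUDA_CB onPipelineDone(CUstream stream, CUresult status,
                                   void *userData)
{
    float *h_results = (float *)userData;
    if (status == CUDA_SUCCESS) {
        /* Safe without cuStreamSynchronize: the D2H copy into
         * h_results has already finished when we get here. */
        printf("first result: %f\n", h_results[0]);
    }
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUstream stream;
    CUdeviceptr d_in, d_out;
    float *h_in, *h_results;

    /* Error checking omitted for brevity. */
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuStreamCreate(&stream, CU_STREAM_DEFAULT);

    cuMemAllocHost((void **)&h_in, N * sizeof(float));
    cuMemAllocHost((void **)&h_results, N * sizeof(float));
    cuMemAlloc(&d_in, N * sizeof(float));
    cuMemAlloc(&d_out, N * sizeof(float));

    cuMemcpyHtoDAsync(d_in, h_in, N * sizeof(float), stream);
    /* cuLaunchKernel(...) would go here; the module and kernel
     * are omitted from this sketch. */
    cuMemcpyDtoHAsync(h_results, d_out, N * sizeof(float), stream);

    /* Queue the callback behind everything above in the stream. */
    cuStreamAddCallback(stream, onPipelineDone, h_results, 0);

    /* The host thread can do unrelated work here. This final sync
     * only keeps the process alive until the stream drains; it is
     * not needed for the callback's access to h_results to be safe. */
    cuStreamSynchronize(stream);

    cuMemFree(d_in);
    cuMemFree(d_out);
    cuMemFreeHost(h_in);
    cuMemFreeHost(h_results);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```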