In the CUDA (driver API) documentation, it says:
The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" prior to the callback.
Does this mean that if I have a pipeline with a callback after each critical point to signal the host, I don't need any cuStreamSynchronize calls at those points in order to safely access the output arrays?
Very simple code like:

    cuda memcpy host to device
    cuda launch kernel
    cuda memcpy device to host
    add callback

    callback()
    {
        // here, is it safe to access the host "results" array?
        // (assuming no further CUDA commands operate on these arrays)
    }
CUDA streams have some fairly simple semantics. One of those is that all activity issued into a stream will execute in-order. Item B, issued into a particular stream, will not begin to execute until item A, issued into that stream prior to B, has completed.
So, yes, the callback, issued into a particular stream, will not begin to execute until all prior activity in that stream has completed.
If you wanted this guarantee in "ordinary" host code (i.e. code that is not wrapped in a CUDA callback), it would require some sort of explicit synchronization, such as cuStreamSynchronize, cuEventSynchronize, a blocking cuMemcpy, or similar.
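As a sketch, the explicit non-callback alternative looks like this (the stream and the results-array name are placeholders, and error checking is omitted):

```c
#include <cuda.h>

/* Non-callback version: the host thread blocks until every prior
 * operation in the stream (copies, kernel) has finished; only then
 * is it safe to read the host results array. */
static float readFirstResult(CUstream stream, const float *h_results)
{
    cuStreamSynchronize(stream);  /* explicit blocking synchronization */
    return h_results[0];          /* now valid to read */
}
```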
For the purposes of this discussion, I'm ignoring CUDA managed memory, and assuming you are doing an explicit copy of data from device memory to host memory, as you have laid out.
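To make the pipeline from the question concrete, here is a minimal driver-API sketch. The kernel launch and module loading are omitted, array sizes and names are placeholders, and error checking is abbreviated; the point is that the callback body can read the host results array without any cuStreamSynchronize.

```c
#include <cuda.h>
#include <stdio.h>

#define N 1024

/* Invoked by the CUDA driver once all work queued into the stream
 * before cuStreamAddCallback (the H2D copy, the kernel, the D2H
 * copy) has completed. Note: a stream callback must not make CUDA
 * API calls of its own. */
static void CUDA_CB onPipelineDone(CUstream stream, CUresult status,
                                   void *userData)
{
    float *h_results = (float *)userData;
    if (status == CUDA_SUCCESS) {
        /* Safe without cuStreamSynchronize: the D2H copy into
         * h_results has already finished when we get here. */
        printf("first result: %f\n", h_results[0]);
    }
}

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUstream stream;
    CUdeviceptr d_in, d_out;
    float *h_in, *h_results;

    /* Error checking omitted for brevity. */
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuStreamCreate(&stream, CU_STREAM_DEFAULT);

    cuMemAllocHost((void **)&h_in, N * sizeof(float));
    cuMemAllocHost((void **)&h_results, N * sizeof(float));
    cuMemAlloc(&d_in, N * sizeof(float));
    cuMemAlloc(&d_out, N * sizeof(float));

    cuMemcpyHtoDAsync(d_in, h_in, N * sizeof(float), stream);
    /* cuLaunchKernel(...) would go here; the module and kernel
     * are omitted from this sketch. */
    cuMemcpyDtoHAsync(h_results, d_out, N * sizeof(float), stream);

    /* Queue the callback behind everything above in the stream. */
    cuStreamAddCallback(stream, onPipelineDone, h_results, 0);

    /* The host thread can do unrelated work here. This final sync
     * only keeps the process alive until the stream drains; it is
     * not needed for the callback's access to h_results to be safe. */
    cuStreamSynchronize(stream);

    cuMemFree(d_in);
    cuMemFree(d_out);
    cuMemFreeHost(h_in);
    cuMemFreeHost(h_results);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```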