Context
I'm porting a complex CUDA application to SYCL. It uses multiple CUDA streams (cudaStream_t) to launch its kernels. In some cases it also uses the default stream, which forces a device-wide synchronization.
Problem
CUDA streams map quite easily to in-order SYCL queues. However, at a device-wide synchronization point (i.e. cudaDeviceSynchronize()), I must explicitly wait on all the queues, because queue::wait() waits only on the commands submitted to that queue.
Question
Is there a way to wait on all the commands for a specific device, without having to explicitly call wait() on every queue?
In general, there are two ways you might be able to mimic this behavior in SYCL:
1. Call wait() on each queue.
2. Wait on the events returned by the submissions.
The former is precisely what you suggest, but of course then you are waiting for the whole queue to empty. The second option lets you wait just for specific events to complete, without waiting for the whole queue.
In either case, though, you do have to do some bookkeeping to ensure that you are waiting on each item you expect to complete before proceeding with your algorithm.
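For the event-based option, the bookkeeping could look something like the sketch below. EventTracker is a hypothetical name, not part of SYCL. It is written as a template over any type with a wait() member, so it does not depend on a particular SYCL header; the assumption is that you would instantiate it with sycl::event, which is what queue::submit() returns.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical bookkeeping helper (not a SYCL API): remembers the events
// returned by submissions and later waits only on those events.
// Event can be any type with a wait() member, e.g. sycl::event.
template <typename Event>
class EventTracker {
 public:
  // Record an event, e.g. tracker.track(q.submit(...)).
  void track(Event e) { pending_.push_back(std::move(e)); }

  // Block until every recorded event has completed, then forget them.
  void wait_all() {
    for (auto& e : pending_) {
      e.wait();
    }
    pending_.clear();
  }

  // Number of events recorded since the last wait_all().
  std::size_t pending() const { return pending_.size(); }

 private:
  std::vector<Event> pending_;
};
```

With SYCL 2020 you could also skip the wrapper, keep a plain std::vector<sycl::event>, and pass it to the static sycl::event::wait() overload that takes an event list.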
As Sri mentioned, you can use SYCLomatic. The way SYCLomatic translates this code is to create a function that loops over all the queues and performs the waits as in 1 above.
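If you roll this by hand, the loop over all queues can be sketched roughly as follows. This is not SYCLomatic's actual generated code; wait_all_queues is a hypothetical name, and the function is templated over any type with a wait() member so it compiles without SYCL headers. The assumption is that you would call it with a std::vector<sycl::queue> holding every queue you created for the device.

```cpp
#include <vector>

// Hypothetical stand-in for cudaDeviceSynchronize(): blocks until every
// queue in the list has drained. Queue can be any type with a wait()
// member, e.g. sycl::queue.
template <typename Queue>
void wait_all_queues(std::vector<Queue>& queues) {
  for (auto& q : queues) {
    q.wait();  // returns once all commands submitted to q have finished
  }
}
```

Because sycl::queue has reference semantics (a copy refers to the same underlying queue), keeping copies of your queues in such a vector should be fine. The helper only covers queues you registered yourself, so any queue created elsewhere has to be added to the list as well.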
Hopefully this helps. I wish it were a one-liner as well, but the abstractions are slightly different :)