[SOLVED] Device-wide synchronization in SYCL on NVIDIA GPUs

Context
I'm porting a complex CUDA application to SYCL. The application launches its kernels on multiple CUDA streams and, in some cases, also uses the default stream, which forces a device-wide synchronization.

Problem
CUDA streams map quite easily to in-order SYCL queues. However, at a device-wide synchronization point (i.e. cudaDeviceSynchronize()), I must explicitly wait on all the queues, because queue::wait() waits only on the commands submitted to that queue.

Question
Is there a way to wait on all the commands for a specific device, without having to explicitly call wait() on every queue?

Solution

In general, there are two ways you might mimic this behavior in SYCL.

1. You can wait on every queue, as you suggest.
2. You can wait on all the events that comprise your CUDA stream using event::wait(const std::vector<event> &) or event::wait_and_throw(const std::vector<event> &).

The former is precisely what you suggest, but of course then you are waiting on the whole queue to empty. The second option allows you to wait just for the events to complete without waiting on the whole queue.
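The event-based option could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: `EventTracker`, `record`, `wait_all`, and `pending` are hypothetical names introduced here for illustration. The class is templated on the event type so that it compiles without a SYCL toolchain; in a SYCL program you would presumably instantiate it with `sycl::event`, whose static `wait_and_throw(const std::vector<event> &)` is the call mentioned above.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of SYCL): remembers the events returned by
// command submissions so a sync point can wait on exactly those commands.
// Event is assumed to be sycl::event, or any type that provides a static
// wait_and_throw(const std::vector<Event>&) member function.
template <typename Event>
class EventTracker {
public:
    // Remember one event, e.g. tracker.record(q.submit(...)).
    void record(Event e) { pending_.push_back(std::move(e)); }

    // Block until every recorded event has completed, then forget them.
    void wait_all() {
        Event::wait_and_throw(pending_);
        pending_.clear();
    }

    // Number of events recorded since the last wait_all().
    std::size_t pending() const { return pending_.size(); }

private:
    std::vector<Event> pending_;
};
```

With SYCL, usage might look like `EventTracker<sycl::event> tracker; tracker.record(q.submit(...)); tracker.wait_all();`. The sketch is not thread-safe; if several host threads submit work, the tracker would need its own locking.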

In either case, though, you do have to do some bookkeeping to ensure that you are waiting on each item you expect to complete before proceeding with your algorithm.

As Sri mentioned, you can use SYCLomatic. The way SYCLomatic translates this code is to create a function that loops over all the queues and waits on each one, as in the first option above.
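A hand-written version of that "loop over all the queues" idea might look like the sketch below. It is not SYCLomatic's actual generated code: `QueueRegistry`, `add`, `wait_all`, and `size` are hypothetical names, and the class is templated on the queue type so that it compiles without a SYCL toolchain. The assumption is that it would be instantiated with `sycl::queue`, which provides `wait_and_throw()` and whose copies refer to the same underlying queue.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of SYCL or SYCLomatic): keeps track of every
// queue created for a device so they can all be drained at one sync point.
// Queue is assumed to be sycl::queue, or any copyable type that provides a
// wait_and_throw() member function.
template <typename Queue>
class QueueRegistry {
public:
    // Register a queue; a copy is stored (sycl::queue copies share state).
    void add(const Queue& q) { queues_.push_back(q); }

    // Rough stand-in for cudaDeviceSynchronize(): wait on every registered
    // queue in turn, surfacing any asynchronous errors.
    void wait_all() {
        for (auto& q : queues_) {
            q.wait_and_throw();
        }
    }

    // Number of registered queues.
    std::size_t size() const { return queues_.size(); }

private:
    std::vector<Queue> queues_;
};
```

The bookkeeping cost is that every queue standing in for a CUDA stream must be registered when it is created; a queue that is never added is silently skipped by `wait_all()`.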

Hopefully this helps. I wish it were a one-liner as well, but the abstractions are slightly different :)