I have recently ported my legacy CUDA code to SYCL using OneAPI for NVIDIA GPUs. The code runs fine but is two times slower than the native CUDA code. After profiling, I found the following thing. Every time when a kernel is launched, the CUDA API runs are
cuEventCreate
cuLaunchKernel
cuEventRecord
cuEventDestroy
I suppose this is only for measuring performance and shouldn't happen normally. For native CUDA code, only cuLaunchKernel is evoked as I would expect.
My ipcx flags are
-fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64 -Xsycl-target-backend=nvptx64-nvidia-cuda --offload-arch=sm_8
And my kernel launch is something like this
dpct::get_in_order_queue().submit([&](sycl::handler &cgh) {
cgh.parallel_for(
sycl::nd_range<3>(sycl::range<3>(1, 1, BLOCKS_PER_GRID) *
sycl::range<3>(1, 1, THREADS_PER_BLOCK),
sycl::range<3>(1, 1, THREADS_PER_BLOCK)),
[=](sycl::nd_item<3> item_ct1) {MyKernel(item_ct1);});
}
Any idea about why this happens and is there anything to do, such as compiler options to prevent this?
I did more testing. On my desktop with RTX A2000 within WSL2, the SYCL code is 2 times slower than the CUDA code. But on an HPC with A100 GPU, the SYCL code is slightly faster than the CUDA code (about 3%). So I guess the overhead of creating those events is generally not an issue for high-end GPUs, or maybe the WSL2 is slowing down the SYCL code.
This is related to the fact that the queue::submit()
API needs to return an event regardless of whether it is needed or not.
Are code changes acceptable?
If not: I would suggest to use AdaptiveCpp. It has an event pool so you wouldn't see new events created for every kernel launch. However, it still has to record the events it draws from the pool, so the call to cudaEventRecord
is still there.
If code changes are allowed, both implementations have extensions to handle this problem:
For AdaptiveCpp, the remaining event record you can get rid of using the coarse-grained events extension, which asserts to the runtime that the event is not important: https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/extensions.md#hipsycl_ext_coarse_grained_events
For ICPX/DPC++, there is a similar extension called discard events for the same purpose: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_discard_queue_events.asciidoc
However, while an AdaptiveCpp coarse-grained event is still usable for waiting etc, the DPC++ discard event will cause the returned events to throw an exception if they are actually used.