Tags: performance, opencl, ati, codexl

Slow GPU performance on OpenCL kernel


I'm somewhat at a loss over the performance I'm getting with OpenCL on an AMD GPU (Hawaii core, i.e. a Radeon R9 390).

The operation is as follows:

The dependency chain is:

Memory transfers and kernel executions are performed in two separate command queues. Command dependencies are expressed with OpenCL events.
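The enqueue pattern is roughly the following (a minimal sketch; buffer, kernel, and size names are placeholders, and error checking is omitted):

    cl_event write_done, kernel_done;

    /* Queue A: copy the input to the device, hand back an event. */
    clEnqueueWriteBuffer(transfer_queue, input_buf, CL_FALSE, 0,
                         input_bytes, host_input, 0, NULL, &write_done);
    clFlush(transfer_queue); /* make sure the copy is actually submitted */

    /* Queue B: kernel #1 must not start before the transfer finishes. */
    clEnqueueNDRangeKernel(kernel_queue, kernel1, 1, NULL,
                           &global_size, NULL, 1, &write_done, &kernel_done);

    /* The host blocks here until the GPU signals completion. */
    clWaitForEvents(1, &kernel_done);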

For performance analysis, the whole operation is simply looped with the same input data.

[Screenshot: CodeXL timeline]

As you can see in the timeline, the host waits a very long time in clWaitForEvents() for the GPU to finish, while the GPU idles most of the time. You can also see the repeated operations. For convenience, I also provide the list of all issued OpenCL commands.

[Screenshot: CodeXL command list]

My questions now are:

  1. Why is the GPU idling so much? In my head I could easily push all the "blue" items together and start the operation right away. The memory transfer runs at 6 GB/s, which is the expected rate.

  2. Why are the kernels executed so late? Why is there a gap between kernel #2 and kernel #3 execution?

  3. Why are the memory transfers and kernels not executed in parallel? I use two command queues; with only one queue, performance is even worse.

Just by pushing all commands together in my head (keeping the dependencies, of course, so the 1st green must still start after the 1st blue), I could triple the performance. I don't know why the GPU is so sluggish. Does anyone have some insight?


Some number crunching

As Kernel #1 is faster than Memory Transfer #2 and Kernel #2 is faster than Memory Transfer #3, the kernels should be hidden completely behind the transfers, so the overall time should come out at roughly 600 µs,

but clWaitForEvents reports 1758 µs.

Yes, there are some losses, and I'm fine with something like 10% (60 µs), but 1758 µs is almost 300% of the expected time, which is too much.


Solution

  • As @DarkZeros said, you need to hide the kernel-enqueue overhead by using multiple command queues so that they overlap in the timeline.

    Why is the GPU idling so much?

    Because you are using two command queues that are (probably) running serially, with events that make them wait on each other even longer.

    You should use a single queue if everything is serial anyway. Let two queues overlap work only if you can add double buffering or similar techniques to keep the computation moving ahead.
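    For example, a double-buffering sketch (assuming the input chunks are independent; buffer, kernel, and size names are placeholders, and error checking and clReleaseEvent calls are omitted):

        cl_mem buf[2];                        /* ping-pong device buffers */
        cl_event upload[2] = {0}, done[2] = {0};

        for (size_t i = 0; i < num_chunks; ++i) {
            int cur = (int)(i & 1);

            /* Re-uploading into this buffer must wait for the kernel
               that read it two iterations ago. */
            clEnqueueWriteBuffer(transfer_queue, buf[cur], CL_FALSE, 0,
                                 chunk_bytes,
                                 (const char *)host_data + i * chunk_bytes,
                                 done[cur] ? 1 : 0,
                                 done[cur] ? &done[cur] : NULL,
                                 &upload[cur]);
            clFlush(transfer_queue);          /* ensure submission */

            /* Kernel i depends only on its own upload, so it can overlap
               the upload of chunk i+1 from the next iteration. */
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
            clEnqueueNDRangeKernel(kernel_queue, kernel, 1, NULL,
                                   &global_size, NULL,
                                   1, &upload[cur], &done[cur]);
            clFlush(kernel_queue);
        }
        clFinish(kernel_queue);               /* one host wait at the end */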

    Why are the kernels executed so late?

    The wide gaps consist of host-side latencies: enqueueing commands, flushing them to the device, host-side algorithm work, and device-side event-control logic. Event overhead may get as low as 20-30 microseconds, but host-device interactions cost more than that.

    If you get rid of the events and use a single in-order queue, the driver may even apply early-compute techniques to fill those gaps before you enqueue the next commands (maybe), much like CPUs do early branching (prediction).
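    For illustration, the whole serial chain could then collapse to something like this (a minimal sketch with hypothetical buffer and kernel names; an in-order queue already guarantees the copy finishes before the kernel starts):

        clEnqueueWriteBuffer(queue, input_buf, CL_FALSE, 0, input_bytes,
                             host_input, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &global_size,
                               NULL, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &global_size,
                               NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, output_buf, CL_FALSE, 0, output_bytes,
                            host_output, 0, NULL, NULL);
        clFinish(queue);   /* single synchronization point for the chain */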

    Why are memory transfer and kernel not executed in parallel?

    There is no guarantee of overlap. Drivers can also check dependencies between kernels and copies, and to keep the data intact they may halt some operations until others finish (maybe).

    Are you sure kernels and buffer copies are completely independent?

    Another reason could be that the two queues don't have much to choose from when overlapping. If both queues carried both types of operations, they would have more overlap options, such as kernel + kernel and copy + copy, instead of just kernel + copy.


    If the program has too many small kernels, you may try OpenCL 2.0 device-side enqueue (dynamic parallelism), which lets the device launch kernels itself; that is faster than host-side enqueueing.
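    A minimal OpenCL C 2.0 sketch of device-side enqueue (kernel and argument names are made up; the host side also needs a device queue created with CL_QUEUE_ON_DEVICE and a program built with -cl-std=CL2.0):

        // The parent launches a child grid itself, skipping the host round-trip.
        kernel void parent(global float *data, int n) {
            if (get_global_id(0) == 0) {
                enqueue_kernel(get_default_queue(),
                               CLK_ENQUEUE_FLAGS_NO_WAIT,
                               ndrange_1D((size_t)n),
                               ^{ data[get_global_id(0)] *= 2.0f; });
            }
        }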

    Maybe you can add a higher level of parallelism, such as image-level parallelism (if it is image processing you are doing), to keep the GPU busy. Work on 5-10 images at the same time, which should ensure independent kernel/buffer executions unless all images live in the same buffer. If that doesn't work, you can launch 5-10 processes of the same program (process-level parallelism). But having too many contexts can run into driver limitations, so image-level parallelism should be better.

    An R9 390 should be able to keep 8-16 command queues busy.

    but clWaitForEvents reports 1758 µs

    Sometimes even empty kernels make it wait for 100-500 µs. Most probably you should enqueue 1000 cycles and wait just once at the end. If each cycle runs after a user button click, the user wouldn't even notice the 1.7 ms latency.
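    A sketch of that batching pattern, under the same assumptions as above (single queue, placeholder names, error checking omitted):

        /* Enqueue many cycles back-to-back; pay the host sync cost once. */
        for (int cycle = 0; cycle < 1000; ++cycle) {
            clEnqueueWriteBuffer(queue, input_buf, CL_FALSE, 0, input_bytes,
                                 host_input, 0, NULL, NULL);
            clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &global_size,
                                   NULL, 0, NULL, NULL);
            clEnqueueNDRangeKernel(queue, kernel2, 1, NULL, &global_size,
                                   NULL, 0, NULL, NULL);
        }
        clFinish(queue);   /* one wait for all 1000 cycles */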