
OpenCL: async command submission is too slow


My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior -- a lot of time is spent in the opencl_enq_job_* procedures (opencl_code.c), which contain only asynchronous OpenCL commands:

clEnqueueWriteBuffer(...,CL_FALSE,...,&event1);            /* async host-to-device copy */
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);             /* kernel waits on the write */
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);   /* async device-to-host copy, after the kernel */
clSetEventCallback(event3,...);                            /* completion notification for the job */
clFlush(...);                                              /* push the submitted commands to the device */
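
For reference, the elided clSetEventCallback arguments look roughly like the sketch below. The callback name job_done and the user-data pointer job_ctx are assumptions for illustration, not names from the generated code:

/* Hypothetical completion callback; fires when the final read reaches CL_COMPLETE. */
void CL_CALLBACK job_done(cl_event ev, cl_int status, void *user_data)
{
    /* e.g., mark this job as finished so the next one can be submitted */
}

clSetEventCallback(event3, CL_COMPLETE, job_done, job_ctx);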

In the program output, the time spent in opencl_enq_job_* is shown as:

OCL waste: 0.60456248727985751

That means 60% of the time is wasted in those procedures.

Most of that time (92%) is spent in the clEnqueueReadBuffer call, and ~5% in clSetEventCallback.

Why so much? What's wrong with this code?

My configuration:

Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090

Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux

Update: Nvidia accepted this as a bug.

Update1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program shows similar behavior, but it waits mostly in clFlush (>99%). With the CUDA SDK the program works without clFlush; with Apple's OpenCL, the program hangs without clFlush (submitted tasks never finish).


Solution

  • I have tried memory pinning and it significantly improved the situation!

    The problem was solved.

    I think this is not really a bug; I just missed something in the documentation. My investigation led me to the conclusion that the driver simply cannot perform an async load/store of a non-pinned buffer -- even if non-blocking calls are used. The driver just waits for an opportunity to store/load the data, which can happen only after the task finishes, and this breaks parallelism. A minimal sketch of the pinning approach is shown below.
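
    The sketch shows the usual OpenCL pinning pattern: allocate a host buffer with CL_MEM_ALLOC_HOST_PTR, map it once, and use the mapped pointer for non-blocking transfers. It assumes an existing context, queue, device buffer dev_buf, size nbytes, and input src_data; those names are illustrative, not taken from my program.

    /* Host buffer backed by pinned (page-locked) memory. */
    cl_int err;
    cl_mem pinned_buf = clCreateBuffer(context,
                                       CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                       nbytes, NULL, &err);

    /* Map it once to get a host pointer into the pinned allocation. */
    void *host_ptr = clEnqueueMapBuffer(queue, pinned_buf, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, nbytes, 0, NULL, NULL, &err);

    /* Transfers from/to the pinned pointer can now run truly asynchronously. */
    memcpy(host_ptr, src_data, nbytes);
    err = clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, nbytes,
                               host_ptr, 0, NULL, &event1);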