My OpenCL program (don't be scared, this is auto-generated code for 3D CFD) shows strange behavior: a lot of time is spent in the opencl_enq_job_* procedures (opencl_code.c), which contain only asynchronous OpenCL commands:
clEnqueueWriteBuffer(..,CL_FALSE,...,&event1);
clSetKernelArg(...);
...
clEnqueueNDRangeKernel(...,1,&event1,&event2);
clEnqueueReadBuffer(...,CL_FALSE,...,1,&event2,&event3);
clSetEventCallback(event3,...);
clFlush(...);
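For context, here is roughly what each opencl_enq_job_* procedure looks like when written out by hand. This is only an illustration, not the generated code itself; the queue, kernel, buffer, and size names are made up:

#include <CL/cl.h>

static void CL_CALLBACK job_done(cl_event ev, cl_int status, void *user_data)
{
    /* ... mark this job as finished, schedule the next one, etc. ... */
    (void)ev; (void)status; (void)user_data;
}

static void enqueue_job(cl_command_queue queue, cl_kernel kernel,
                        cl_mem dev_in, cl_mem dev_out,
                        const float *host_in, float *host_out,
                        size_t n, size_t global_size)
{
    cl_event ev_write, ev_kernel, ev_read;

    /* non-blocking upload of the input block */
    clEnqueueWriteBuffer(queue, dev_in, CL_FALSE, 0, n * sizeof(float),
                         host_in, 0, NULL, &ev_write);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &dev_out);

    /* kernel waits on the upload */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           1, &ev_write, &ev_kernel);

    /* non-blocking download waits on the kernel */
    clEnqueueReadBuffer(queue, dev_out, CL_FALSE, 0, n * sizeof(float),
                        host_out, 1, &ev_kernel, &ev_read);

    /* host is notified when the download completes */
    clSetEventCallback(ev_read, CL_COMPLETE, job_done, NULL);

    clFlush(queue);
}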
In the program output, the time spent in opencl_enq_job_* is shown as:
OCL waste: 0.60456248727985751
This means 60% of the time is wasted in those procedures.
Most of that time (92%) is spent in the clEnqueueReadBuffer call and ~5% in clSetEventCallback.
Why so much? What's wrong with this code?
My configuration:
Platform: NVIDIA CUDA
Device 0: Tesla M2090
Device 1: Tesla M2090
Nvidia cuda_6.0.37 SDK and drivers.
Linux localhost 3.12.0 #6 SMP Thu Apr 17 20:21:10 MSK 2014 x86_64 x86_64 x86_64 GNU/Linux
Update: Nvidia accepted this as a bug.
Update 1: On my laptop (MBP15, AMD GPU, Apple OpenCL) the program shows similar behavior, but it waits mostly in clFlush (>99%). With the CUDA SDK the program works without clFlush; with the Apple implementation, the program hangs without clFlush (submitted tasks never finish).
I have tried memory pinning, and it significantly improved the situation!
The problem is solved.
I think this is not really a bug; I just missed something in the documentation. My investigation led me to the conclusion that the driver simply cannot perform asynchronous loads/stores of non-pinned buffers, even when non-blocking calls are used. The driver just waits for an opportunity to load/store the data, which can only happen after the task finishes, and this breaks the parallelism.
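For anyone hitting the same thing, here is a minimal sketch (not my actual generated code; names and sizes are made up) of the usual way to get pinned host memory through the OpenCL API: allocate a buffer with CL_MEM_ALLOC_HOST_PTR, map it with clEnqueueMapBuffer, and use the mapped pointer as the source/destination of the non-blocking transfers.

#include <CL/cl.h>

#define N (1024 * 1024)

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* device buffer the kernel works on */
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    N * sizeof(float), NULL, &err);

    /* host-side staging buffer allocated by the driver; mapping it gives
       a pointer to pinned (page-locked) host memory */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                   N * sizeof(float), NULL, &err);
    float *host_ptr = (float *)clEnqueueMapBuffer(queue, pinned, CL_TRUE,
                                    CL_MAP_READ | CL_MAP_WRITE,
                                    0, N * sizeof(float), 0, NULL, NULL, &err);

    /* ... fill host_ptr with input data ... */

    /* transfers to/from the pinned pointer can now be DMA'd
       asynchronously instead of being serialized by the driver */
    cl_event ev_write;
    clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, N * sizeof(float),
                         host_ptr, 0, NULL, &ev_write);

    /* ... enqueue the kernel with &ev_write in its wait list, read the
       result back into host_ptr with CL_FALSE, set the callback and
       clFlush(queue), as in the question ... */

    clFinish(queue);

    clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);
    clReleaseEvent(ev_write);
    clReleaseMemObject(pinned);
    clReleaseMemObject(dev_buf);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}

As far as I can tell, this is the route NVIDIA's OpenCL best-practices guide recommends for pinned transfers; with it, the non-blocking clEnqueueWriteBuffer/clEnqueueReadBuffer calls return immediately and can overlap with kernel execution, which matches the improvement I saw after pinning.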