If I use a barrier (no matter if CLK_LOCAL_MEM_FENCE
or CLK_GLOBAL_MEM_FENCE
) in my kernel, it causes a CL_INVALID_WORK_GROUP_SIZE
error. The global work size is 512, the local work size is 128, 65536 items have to be computed, the max work group size of my device is 1024, I am using only one dimension. For Java bindings I use JOCL.
The kernel is very simple:
kernel void sum(global float *input, global float *output, const int numElements, local float *localCopy
{
localCopy[get_local_id(0)] = grid[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE); // or barrier(CLK_GLOBAL_MEM_FENCE)
}
I run the kernel on the Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
and can use OpenCL 1.2. The calling method looks like
kernel.putArg(aCLBuffer).putArg(bCLBuffer).putArg(elementCount).putNullArg(localWorkSize);
queue.put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize);
But the error is always the same:
[...]can not enqueue 1DRange CLKernel [...] with gwo: null gws: {512} lws: {128}
cond.: null events: null [error: CL_INVALID_WORK_GROUP_SIZE]
What I am doing wrong?
This is expected behaviour on some OpenCL platforms. For example, on my Apple system, the CPU device has a maximum work-group size of 1024. However, if a kernel has a barrier inside, then the maximum work-group size for that specific kernel is reduced to 1.
You can query the maximum work-group size for a specific kernel by using the clGetKernelWorkGroupInfo
function with the CL_KERNEL_WORK_GROUP_SIZE
parameter. The value returned will be no more than the value returned by clGetDeviceInfo
and CL_DEVICE_MAX_WORK_GROUP_SIZE
, but is allowed to be less (as it is in this case).