I am having trouble understanding what the work item constraints mean. I am using pyopencl, and looking at max_work_item_sizes, it gives what I assumed was the maximum number of global work items for each dimension.
import pyopencl as cl
import numpy as np
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
queue.device.max_work_item_sizes # [1024, 1024, 64]
I could simulate the np.arange function with the following:
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = gid;
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange(queue, [4096], None, res_g)
# transfer back to cpu
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all() # this is true
How is it possible to specify more than 1024 work items for the first dimension? What does max_work_item_sizes mean?
Another related question is whether it is beneficial to use as many work dimensions as possible. As I understand it, at most 3 dimensions can be used. A way of simulating np.arange using 2 work item dimensions could be the following:
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
    int gid = get_global_id(0) * get_global_id(1);
    barrier(CLK_GLOBAL_MEM_FENCE);
    res_g[gid] = gid;
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange(queue, [64, 64], [1,1], res_g)
# transfer back to cpu
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all()
For some reason, the assertion is not always true.
But my question is: when processing a large array, is it better to make use of all 3 work item dimensions, or is it better to treat the array as a 1D contiguous array and only use get_global_id(0)?
How is it possible to specify more than 1024 work items for the first dimension? What does max_work_item_sizes mean?
max_work_item_sizes returns the maximum number of work items per work group in each dimension, not a limit on the global size.
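The total number of items in a single work group is additionally capped by the device's max_work_group_size. A minimal sketch for inspecting both limits, using the same pyopencl API as above:
import pyopencl as cl

ctx = cl.create_some_context()
dev = ctx.devices[0]
print(dev.max_work_item_sizes)  # per-dimension cap on the work group shape, e.g. [1024, 1024, 64]
print(dev.max_work_group_size)  # cap on the total number of items in one group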
By passing None as the third argument:
prg.arange(queue, [4096], None, res_g)
                          ^^^^
the implementation is asked to select the best work group size itself. One way to check the work group size it chose is, for example:
res_g[gid] = get_local_size(0);
On my system, max_work_item_sizes is [4096, 4096, 4096], and the value returned by get_local_size(0) is 1024, which means the implementation decided on a work group size of 1024 items, so 4096 / 1024 = 4 work groups were scheduled.
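For completeness, here is a self-contained version of that check, along the lines of your original code (the kernel name local_size_probe is just an illustrative choice):
prg = cl.Program(ctx, """
__kernel void local_size_probe(__global int *res_g)
{
    // every work item records the size of its work group
    res_g[get_global_id(0)] = get_local_size(0);
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.local_size_probe(queue, [4096], None, res_g)
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
print(res_np[0])  # the group size the implementation picked, e.g. 1024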
Specifying the work group size explicitly, for example as 256 work items:
prg.arange(queue, [4096], [256], res_g)
will schedule 4 times as many work groups (16 instead of 4).
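You can verify this with get_num_groups(0), which every work item can query (a sketch with an illustrative kernel name, reusing the 4096-item global size from above):
prg = cl.Program(ctx, """
__kernel void group_count_probe(__global int *res_g)
{
    // every work item records how many groups were scheduled
    res_g[get_global_id(0)] = get_num_groups(0);
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.group_count_probe(queue, [4096], [256], res_g)
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
print(res_np[0])  # 4096 / 256 = 16 work groups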
Another related question is whether it is beneficial to use as many work dimensions as possible. As I understand it, at most 3 dimensions can be used.
and
But my question is: when processing a large array, is it better to make use of all 3 work item dimensions, or is it better to treat the array as a 1D contiguous array and only use get_global_id(0)?
In my experience it makes no difference whether one or more dimensions are used, so do whatever is more convenient for you.
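For example, a 3D launch that fills the same 4096-element array only has to flatten the three ids back into a linear index, and it does exactly the same work as the 1D version (a sketch with an arbitrarily chosen 16 x 16 x 16 shape):
prg = cl.Program(ctx, """
__kernel void arange3d(__global int *res_g)
{
    // flatten (id0, id1, id2) into a row-major linear index
    int gid = (get_global_id(0) * get_global_size(1) + get_global_id(1))
              * get_global_size(2) + get_global_id(2);
    res_g[gid] = gid;
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange3d(queue, [16, 16, 16], None, res_g)
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all()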
For some reason, the assertion is not always true.
That's because there is a bug in your code: multiplying the two ids maps many work items to the same index (every item where either id is 0 writes to index 0, for example), while other indices are never written at all and keep whatever garbage the buffer held. The index calculation should be:
int gid = get_global_id(0) * get_global_size(1) + get_global_id(1);
(With your square 64 x 64 launch, get_global_size(0) happens to give the same value, but get_global_size(1) is the correct row-major stride in general.)
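With that fix the 2D version passes the assertion. The barrier can also be dropped, since no work item reads another item's result, and passing None for the work group size again lets the implementation pick one:
prg = cl.Program(ctx, """
__kernel void arange(__global int *res_g)
{
    // row-major flattening of the 2D global id
    int gid = get_global_id(0) * get_global_size(1) + get_global_id(1);
    res_g[gid] = gid;
}
""").build()
res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 4096)
prg.arange(queue, [64, 64], None, res_g)
res_np = np.empty(4096, dtype=np.int32)
cl.enqueue_copy(queue, res_np, res_g)
assert (res_np == np.arange(4096)).all()  # now always true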