arrays parallel-processing opencl gpu pyopencl

OpenCL - Element-wise operations on 4D array

I am trying to write an OpenCL code to do element-wise operations on multi-dimensional arrays.

I know that OpenCL buffers are flattened, which makes indexing a bit tricky. I succeeded when dealing with 2-dimensional arrays, but for 3+ dimensional arrays, I have either indexing errors or the wrong result.

It is all the more surprising so that I use the same indexing principle/formula as in the 2D case.

2D case:

__kernel void test1(__global int* a, __global int* b, __global int* c, const int height) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    c[i + height * j] = a[i + height * j] + b[i + height * j];
}

Correct.

3D case:

__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);

    int idx = i + dim1 * j + dim1 * dim2 * k;
    c[idx] = a[idx] + b[idx];
}

Wrong result (usually an output buffer filled with values very close to 0).

4D case:

__kernel void test1(__global int* a, __global int* b, __global int* c, const int dim1, const int dim2, const int dim3) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    int l = get_global_id(3);

    int idx = i + dim1 * j + dim1 * dim2 * k + l * dim1 * dim2 * dim3;
    c[idx] = a[idx] + b[idx];
}

Here is the indexing error: enqueue_knl_test1 pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_DIMENSION

Solution

In the 4D case, you are simply using the API wrongly. OpenCL does not support an infinite number of global / local dimensions. Just up to 3.

In the 2D case, your indexing seems wrong. Assuming row-major arrays. It should be i + j * width not i + j * height.

In the 3D case, the indexing inside the kernel seems OK, assuming row-major memory layout and that dim1 equals cols (width) and dim2 equals rows (height). But anyway, your question lacks context:

Input buffers allocation and initialization.
Kernel invocation code (parameters, work group and global size).
Result collection. synchronization.
You could be accessing beyond the buffer allocated size. It should be checked.

Doing these steps incorrectly can easily lead to unexpected results. Even if your kernel code is OK.

If you wish to debug indexing issues, the easiest thing to do is to write a simple kernel that output the calculated index.

__kernel void test1(__global int* c, const int dim1, const int dim2) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);

    int idx = i + dim1 * j + dim1 * dim2 * k;
    c[idx] = idx;
}

You should then expect a result with linearly increasing values. I would start with a single workgroup and then move on to using multiple workgroups.

Also, If you perform a simple element-wise operation between arrays, then it is much simpler to use 1D indexing. You could simply use a 1D workgroup and global size that equals the number of elements (rounded up to to fit workgroup dim):

__kernel void test1(__global int* a, __global int* b, __global int* c, const int total) {
    // no need for complex indexing for elementwise operations
    int idx = get_global_id(0);
    if (idx < total)
    {
       c[idx] = a[idx] + b[idx];
    }
}

You would probably set local_work_size to the max size the hardware allows (for instance 512 for Nvidia, 256 for AMD) and global_work_size to the total of elements rounded up to multiples of local_work_size. See clEnqueueNDRangeKernel.

2D & 3D dims are usually used for operations that access adjacent elements in 2D / 3D space. Such as image convolutions.