Tags: python, opencl, pyopencl

(Py)OpenCL: modify a value from all threads simultaneously


I have the following simple kernel which modifies the value of C[0], where C is an array with only one element.

__kernel void sigma(__global float *A, __global float *B, __global float *C) {
    int i = get_global_id(0);
    printf("Adding %.2f + %.2f", A[i], B[i]);
    C[0] += A[i] + B[i];
}

The problem is that, in the end, C[0] only holds the value written by the thread that finished last. Specifically, in this example I get the following output:

Adding 1.00 + 0.00
Adding 2.00 + 1.00
Adding 3.00 + 1.00
Adding 4.00 + 1.00
[5.]

In the end C[0] is 4.00 + 1.00. What I want is for C[0] to be (1.00 + 0.00) + (2.00 + 1.00) + (3.00 + 1.00) + (4.00 + 1.00). So I want each thread's A[i] + B[i] to be added to C[0].
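For reference, the target value can be checked on the host with plain NumPy, using the same M and V arrays as the host code below:

```python
import numpy as np

M = np.array([1, 2, 3, 4], dtype=np.float32)  # A
V = np.array([0, 1, 1, 1], dtype=np.float32)  # B

# The reduction the kernel should produce: the sum of all A[i] + B[i]
expected = np.sum(M + V)
print(expected)  # 13.0
```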

Also I am not just looking for addition, I want this to be compatible with any function or operation.

This may be redundant, but in the host code I am just doing the bare minimum to pass the data to the kernel. Is the problem related to the host code?

import pyopencl as cl
import numpy as np

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

M = np.array([1, 2, 3, 4]).astype(np.float32)  # A
V = np.array([0, 1, 1, 1]).astype(np.float32)  # B
a = np.array([0]).astype(np.float32)           # C

# Transfer data to the GPU
A_GPU = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=M)
B_GPU = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=V)
C_GPU = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)  # the kernel both reads and writes C[0]

c = np.zeros(shape=a.shape, dtype=np.float32)  # array to receive the result

kernel = cl.Program(ctx, kernel_src).build()  # kernel_src holds the sigma source above
kernel.sigma(queue, M.shape, None, A_GPU, B_GPU, C_GPU)
cl.enqueue_copy(queue, c, C_GPU).wait()

Solution

  • One way to perform a reduction in PyOpenCL, as @Elad Maimoni said, is to use the work_group_reduce_add function available since OpenCL 2.0.

    In Python it can be implemented like this:

    kernel = cl.Program(ctx, """
    __kernel void resum(__global float *A, __global float *B, __global float *a) {
        int i = get_global_id(0);
        a[0] = work_group_reduce_add(A[i] + B[i]);
    }
    """).build(options='-cl-std=CL2.0') # build with OpenCL 2.0
    
    # Some example arrays (float32 to match the kernel's float pointers)
    a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.float32)
    b = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1], dtype=np.float32)
    d = np.zeros(1, dtype=np.float32)  # host array to receive the result
    
    # Create GPU buffers
    a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    c = cl.Buffer(ctx, mf.WRITE_ONLY, d.nbytes)  # the kernel writes its result here
    
    # One work-group of 9 items, so work_group_reduce_add covers all elements
    kernel.resum(queue, (9, ), (9, ), a, b, c)
    cl.enqueue_copy(queue, d, c).wait()
    print(d)
    
    

    This would output [90.].
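Note that work_group_reduce_add only reduces within a single work-group. The pairwise tree reduction it performs on-device (and the one you would write by hand with local memory and barriers on pre-2.0 devices) can be sketched in plain Python; `tree_reduce` here is a hypothetical helper, not a PyOpenCL API, and its `op` parameter can be any associative operation, which addresses the "any function" requirement:

```python
import numpy as np

def tree_reduce(values, op=np.add):
    """Pairwise (stride-doubling) tree reduction: the pattern
    work_group_reduce_add implements on-device. `op` can be any
    associative binary operation, not just addition."""
    buf = np.array(values, dtype=np.float32)
    n = len(buf)
    stride = 1
    while stride < n:
        # Each "step" combines pairs stride apart, like one barrier round
        for i in range(0, n - stride, stride * 2):
            buf[i] = op(buf[i], buf[i + stride])
        stride *= 2
    return buf[0]

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.float32)
B = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1], dtype=np.float32)
print(tree_reduce(A + B))  # 90.0
```

Passing, say, op=np.maximum instead of np.add turns the same skeleton into a max-reduction, which is how the generalization to other operations works on the device side as well.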