Simple: I have the following kernel, which modifies the value of C[0], where C is an array with only one element.
__kernel void sigma(__global float *A, __global float *B, __global float *C) {
    int i = get_global_id(0);
    printf("Adding %.2f + %.2f\n", A[i], B[i]);
    C[0] += A[i] + B[i];
}
The problem is that, in the end, C[0] holds only the value from the thread that finished last; specifically, in this example I get the following output:
Adding 1.00 + 0.00
Adding 2.00 + 1.00
Adding 3.00 + 1.00
Adding 4.00 + 1.00
[5.]
In the end C[0] is 4.00 + 1.00. What I want is for C[0] to be (1.00 + 0.00) + (2.00 + 1.00) + (3.00 + 1.00) + (4.00 + 1.00) = 13.00. So I want each thread's A[i] and B[i] to be added into C[0].
Also, I am not looking just for addition; I want this to work with any function or operation.
This may be redundant, but in the host code I am just doing the bare minimum to pass the data to the kernel. Could the problem be in the host code?
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

kernel = cl.Program(ctx, src).build()  # src is a string holding the sigma kernel above

M = np.array([1, 2, 3, 4]).astype(np.float32)  # A
V = np.array([0, 1, 1, 1]).astype(np.float32)  # B
a = np.array([0]).astype(np.float32)           # C

# Transfer data to the GPU
A_GPU = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=M)
B_GPU = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=V)
C_GPU = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)  # C[0] is both read and written, so not WRITE_ONLY

c = np.zeros(shape=a.shape, dtype=np.float32)  # array to copy the result into
kernel.sigma(queue, M.shape, None, A_GPU, B_GPU, C_GPU)
cl.enqueue_copy(queue, c, C_GPU).wait()
To answer the host-code question first: the host code is not the root cause. The problem is a data race in the kernel, where every work item reads and writes C[0] concurrently. One way to perform a reduction in PyOpenCL, as @Elad Maimoni said, is to use the built-in function work_group_reduce_add, available since OpenCL 2.0. In Python it can be used like this:
kernel = cl.Program(ctx, """
__kernel void resum(__global const float *A, __global const float *B, __global float *a) {
    int i = get_global_id(0);
    // every work item receives the sum over its whole work group
    a[0] = work_group_reduce_add(A[i] + B[i]);
}
""").build(options='-cl-std=CL2.0')  # build with OpenCL 2.0

# Some example arrays; float32 so they match the kernel's float arguments
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=np.float32)
b = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1], dtype=np.float32)
d = np.zeros(1, dtype=np.float32)  # array to copy the result into

# Create GPU data
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, d.nbytes)  # the kernel only writes here

# Global and local size are both 9, so the whole array is a single
# work group and the reduction covers all nine elements
kernel.resum(queue, (9,), (9,), a_buf, b_buf, c_buf)
cl.enqueue_copy(queue, d, c_buf).wait()
print(d)
This outputs [90.].
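Note that work_group_reduce_add reduces only within a single work group, and OpenCL 2.0 only ships add, min and max variants of it. If the input is larger than one work group, or you need some other operation, a common pattern is a local-memory tree reduction that writes one partial result per work group and combines the partials afterwards. Here is a minimal sketch of that pattern, assuming a power-of-two work-group size that evenly divides the input; the kernel name reduce_op, the OP macro and the sizes are illustrative choices of mine, not something from the code above:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
// Swap OP for any associative operation, e.g. fmax or multiplication
#define OP(x, y) ((x) + (y))

__kernel void reduce_op(__global const float *A, __global const float *B,
                        __global float *partials, __local float *scratch) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    scratch[lid] = A[gid] + B[gid];   // each work item's element, as in the question
    barrier(CLK_LOCAL_MEM_FENCE);
    // halve the active range each step; assumes power-of-two local size
    for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2) {
        if (lid < offset)
            scratch[lid] = OP(scratch[lid], scratch[lid + offset]);
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partials[get_group_id(0)] = scratch[0];   // one result per work group
}
"""
prog = cl.Program(ctx, src).build()   # plain OpenCL 1.x, no -cl-std=CL2.0 needed

n, group = 1024, 64                   # assumes n is a multiple of group
A = np.arange(n, dtype=np.float32)
B = np.ones(n, dtype=np.float32)
partials = np.empty(n // group, dtype=np.float32)

A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
B_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=B)
P_buf = cl.Buffer(ctx, mf.WRITE_ONLY, partials.nbytes)

prog.reduce_op(queue, (n,), (group,), A_buf, B_buf, P_buf,
               cl.LocalMemory(group * np.dtype(np.float32).itemsize))
cl.enqueue_copy(queue, partials, P_buf).wait()
print(partials.sum())                 # final combine on the host, using the NumPy equivalent of OP
Swapping the OP macro (together with the matching NumPy combine at the end) gives sums, products, maxima and so on, which covers the "any function or operation" part of the question.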