openclpyopencl

OpenCL bincount


I am trying to implement a bincount operation in OpenCL which allocates an output buffer and uses indices from x to accumulate some weights at the same index (assume that num_bins == max(x)). This is equivalent to the following python code:

out = np.zeros_like(num_bins)
for i in range(len(x)):
    out[x[i]] += weight[i]
return out

What I have is the following:

import pyopencl as cl
import numpy as np

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
__kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
{
  int gid = get_global_id(0);
  res_g[x_g[gid]] += weight_g[gid];
}
""").build()

# test
x = np.arange(5, dtype=np.int32).repeat(2) # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
x_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=x)

weight = np.arange(10, dtype=np.int32) # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
weight_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=weight)

res_g = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 4 * 5)

prg.bincount(queue, [10], None, res_g, x_g, weight_g)

# transfer back to cpu
res_np = np.empty(5).astype(np.int32)
cl.enqueue_copy(queue, res_np, res_g)

Output in res_np:

array([1, 3, 5, 7, 9], dtype=int32)

Expected output:

array([1, 5, 9, 13, 17], dtype=int32)

How do I accumulate the elements that are indexed more than once?

EDIT

The above is a contrived example, in my real-world application x will be indices from a sliding window algorithm:

x = np.array([ 0,  1,  2,  4,  5,  6,  8,  9, 10,  1,  2,  3,  5,  6,  7,  9, 10,
              11,  4,  5,  6,  8,  9, 10, 12, 13, 14,  5,  6,  7,  9, 10, 11, 13,
              14, 15,  8,  9, 10, 12, 13, 14, 16, 17, 18,  9, 10, 11, 13, 14, 15,
              17, 18, 19, 20, 21, 22, 24, 25, 26, 28, 29, 30, 21, 22, 23, 25, 26,
              27, 29, 30, 31, 24, 25, 26, 28, 29, 30, 32, 33, 34, 25, 26, 27, 29,
              30, 31, 33, 34, 35, 28, 29, 30, 32, 33, 34, 36, 37, 38, 29, 30, 31,
              33, 34, 35, 37, 38, 39], dtype=np.int32)

weight = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
                   0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
                   1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
                   0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0], dtype=np.int32)

There is a pattern which becomes more apparent when reshaping x to (2,3,2,3,3). But I am having a hard time figuring out how the approach given by @doqtor can be used here and especially if it is easy enough to generalize.

The expected output is:

array([1, 1, 0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 2, 2, 0, 0, 3, 3, 0, 0, 2, 2, 0, 0, 1, 1, 0, 0], dtype=int32)

Solution

  • The problem is that OpenCL buffer to which weights are accumulated is not initialized (zeroed). Fixing that:

    res_np = np.zeros(5).astype(np.int32)
    res_g = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY | cl.mem_flags.COPY_HOST_PTR, hostbuf=res_np)
    
    prg.bincount(queue, [10], None, res_g, x_g, weight_g)
    
    # transfer back to cpu
    cl.enqueue_copy(queue, res_np, res_g)
    

    Returns correct results: [ 1 5 9 13 17]

    ====== Update ==========

    As @Kevin noticed there is race condition here too. If there is any pattern it could be addressed this way without using synchronization, for example processing every 2 elements by 1 work item:

    __kernel void bincount(__global int *res_g, __global const int* x_g, __global const int* weight_g)
    {
      int gid = get_global_id(0);
      for(int x = gid*2; x < gid*2+2; ++x)
          res_g[x_g[x]] += weight_g[x];
    }
    

    Then schedule 5 work items:

    prg.bincount(queue, [5], None, res_g, x_g, weight_g)