opencl gpgpu pyopencl

Strange OpenCL behavior


I've run into some very strange behavior in OpenCL. A minimal code sample is below.

Starting from some random index (usually divisible by 32), values are not written to the array if I add one extra operation beforehand (g_idata[ai] = g_idata[ai-1]). Notably, I get the correct result if I:

  1. just read the value and write a literal instead (see SHOW_BUG), or
  2. add if (ai >= n) g_idata[0]+=0; at the beginning of the kernel (see the commented-out lines).

Tested on Intel and NVIDIA.


import numpy as np
import pyopencl as cl


ctx = cl.create_some_context()

prg = cl.Program(ctx, """
__kernel void prescan(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    
    int ai = thid*2+1;
    
    // uncommenting the lines below makes the bug disappear
    //if (ai >= n){
    //    g_idata[0]+=0;
    //}
    
    bool SHOW_BUG=1;
    // make a dummy operation
    if (SHOW_BUG)
        g_idata[ai] = g_idata[ai-1];
    else {
        g_idata[ai-1]; //dummy read
        g_idata[ai] = 3.14f; //constant write
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
   
    //set 0,1,2,3... as result
    g_idata[thid] = thid;
}

""").build()

prescan_kernel = prg.prescan
prescan_kernel.set_scalar_arg_dtypes([None, np.int32])


def main():
    N = 512
    a_np = (np.random.random((N,))).astype(np.float32)
    queue = cl.CommandQueue(ctx)

    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)

    global_size = (512,)
    local_size = None
    prescan_kernel(queue, global_size, local_size, a_g, N)
    cl.enqueue_copy(queue, a_np, a_g)

    correct = np.array(range(N))
    #assert np.allclose(a_np, 3.14), np.where(3.14 != a_np)
    assert np.allclose(a_np, correct), np.where(correct != a_np)


if __name__ == '__main__':
    for i in range(25):
        main()

Solution

  • Several things in your code cause undefined behavior according to the OpenCL spec.

    These include:

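    1. Accessing out-of-range memory. Each work-item accesses index 2*thid+1, so N work-items need an array of at least 2*N elements, but yours has only N.
    2. Multiple work-items (threads) accessing the same index of the array, with at least one of them writing, and nothing that orders those accesses.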
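
Both points can be checked on the host without any OpenCL at all by tabulating which indices each work-item touches. The following is a minimal sketch in plain Python; the function name `prescan_accesses` is hypothetical, and it assumes only the three accesses visible in the kernel (read `ai-1`, write `ai`, write `thid`).

```python
def prescan_accesses(global_size, n):
    """Tabulate the buffer indices touched by each work-item of the prescan kernel.

    Returns (out_of_range, shared):
      out_of_range: work-item ids whose index ai = 2*thid + 1 is >= n
      shared: in-range buffer indices touched by more than one work-item
    """
    out_of_range = []
    touched_by = {}  # buffer index -> set of work-item ids that read or write it
    for thid in range(global_size):
        ai = 2 * thid + 1
        if ai >= n:
            out_of_range.append(thid)
        # The kernel reads ai-1, writes ai, then writes thid.
        for idx in (ai - 1, ai, thid):
            if idx < n:
                touched_by.setdefault(idx, set()).add(thid)
    shared = sorted(i for i, ids in touched_by.items() if len(ids) > 1)
    return out_of_range, shared


out_of_range, shared = prescan_accesses(global_size=512, n=512)
print("work-items indexing past the end:", len(out_of_range))
print("first offending work-item:", out_of_range[0])
print("indices touched by more than one work-item:", len(shared))
```

With the values from the question (512 work-items, 512 elements), half of the work-items index past the end of the buffer, starting at work-item 256, and every index except 0 is touched by two different work-items.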

    Furthermore, barriers only synchronize the work-items within a single work-group, so the barrier in your kernel does not order accesses made by work-items in different work-groups. Undefined behavior can show up differently on different platforms: sometimes as wrong results, sometimes as a driver crash, and sometimes by taking down the OS. Please fix these problems first and then describe whatever issue remains.
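
One possible way to fix both problems is sketched below. It assumes the intent is for the index fill to run only after every copy has finished: each access is guarded against the buffer length, and the kernel is split in two, because a kernel boundary on an in-order queue is a common way to order global-memory accesses across work-groups. The kernel names `copy_prev` and `fill_index` and the helper `run_fixed` are hypothetical; the PyOpenCL calls are the same ones used in the question.

```python
# Two small kernels instead of one, so that the second phase starts only
# after the first phase has completed on all work-groups.
KERNEL_SRC = """
__kernel void copy_prev(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    int ai = thid * 2 + 1;
    if (ai < n)                      // guard against out-of-range access
        g_idata[ai] = g_idata[ai - 1];
}

__kernel void fill_index(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    if (thid < n)                    // guard against out-of-range access
        g_idata[thid] = thid;
}
"""


def run_fixed(n=512):
    """Run both kernels back to back and return the resulting host array."""
    # Imported here so the kernel source above can be inspected without an
    # OpenCL runtime installed.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)  # in-order by default
    prg = cl.Program(ctx, KERNEL_SRC).build()

    a_np = np.random.random((n,)).astype(np.float32)
    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)

    # On an in-order queue the second kernel starts only after the first one
    # has finished, which orders the accesses across all work-groups.
    prg.copy_prev(queue, (n,), None, a_g, np.int32(n))
    prg.fill_index(queue, (n,), None, a_g, np.int32(n))
    cl.enqueue_copy(queue, a_np, a_g)
    return a_np
```

Assuming an OpenCL device is available, calling `run_fixed(512)` should return the values 0 through 511 on every run, since no two work-items touch the same index within either kernel.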