I've run into very strange behavior in OpenCL. A minimal code sample follows.
Starting from some random index (commonly divisible by 32), values are not written to the array if I add one extra operation beforehand (g_idata[ai] = g_idata[ai-1]). Also notably, I get the correct result if I add
if (ai >= n) g_idata[0]+=0;
at the beginning (see the commented-out lines). Tested on Intel and NVIDIA.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
prg = cl.Program(ctx, """
__kernel void prescan(__global float *g_idata, const int n) {
    int thid = get_global_id(0);
    int ai = thid*2+1;
    // if you uncomment the lines below, the bug disappears
    //if (ai >= n){
    //    g_idata[0]+=0;
    //}
    bool SHOW_BUG=1;
    // make a dummy operation
    if (SHOW_BUG)
        g_idata[ai] = g_idata[ai-1];
    else {
        g_idata[ai-1]; //dummy read
        g_idata[ai] = 3.14f; //constant write
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
    //set 0,1,2,3... as result
    g_idata[thid] = thid;
}
""").build()
prescan_kernel = prg.prescan
prescan_kernel.set_scalar_arg_dtypes([None, np.int32])

def main():
    N = 512
    a_np = (np.random.random((N,))).astype(np.float32)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    a_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a_np)
    global_size = (512,)
    local_size = None
    prescan_kernel(queue, global_size, local_size, a_g, N)
    cl.enqueue_copy(queue, a_np, a_g)
    correct = np.array(range(N))
    #assert np.allclose(a_np, 3.14), np.where(3.14 != a_np)
    assert np.allclose(a_np, correct), np.where(correct != a_np)

if __name__ == '__main__':
    for i in range(25):
        main()

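Several things in your code cause undefined behavior according to the OpenCL spec.
These include out-of-bounds accesses (ai = thid*2+1 reaches 1023 with a global size of 512, but the buffer holds only 512 floats) and unsynchronized reads and writes of the same global memory locations by different work-items.
Furthermore, barriers only synchronize the work-items within a single work-group, so the barrier does not give your code the ordering between work-groups that it relies on. Because this is undefined behavior, the result can differ from platform to platform: it may appear to work, crash the driver, or even take down the OS. Please fix these problems first, and then describe any problem that remains.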
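To make those points concrete, here is a small NumPy sketch that replays the kernel's index arithmetic on the host, without needing an OpenCL device. N and GLOBAL_SIZE are taken from the question. LOCAL_SIZE = 32 is purely an assumption for illustration, since the question passes local_size = None and lets the runtime pick the work-group size.

```python
import numpy as np

N = 512            # floats in the buffer (from the question)
GLOBAL_SIZE = 512  # work-items launched (from the question)
LOCAL_SIZE = 32    # ASSUMPTION: hypothetical work-group size; the runtime chooses the real one

thid = np.arange(GLOBAL_SIZE)
ai = 2 * thid + 1  # index written by g_idata[ai] = g_idata[ai-1]

# 1) Out-of-bounds writes: every ai >= N lies outside the buffer.
oob = thid[ai >= N]
print("work-items writing out of bounds:", oob.size)
print("first offender: work-item", int(oob[0]), "writes index", int(ai[oob[0]]))

# 2) Conflicting writes: index ai is written by work-item thid (first statement)
#    and again by work-item ai (g_idata[thid] = thid after the barrier).
writer_a = thid[ai < N]   # work-items whose first write is in bounds
writer_b = ai[ai < N]     # work-items that later write the same index

# The barrier only orders the two writes if both writers share a work-group.
cross_group = int(np.count_nonzero(writer_a // LOCAL_SIZE != writer_b // LOCAL_SIZE))
print("conflicting pairs:", writer_a.size)
print("pairs in different work-groups:", cross_group)
```

Under these assumptions, half of the work-items write past the end of the buffer, and most of the remaining write pairs involve two different work-groups, which the barrier does not order. One possible way to address both issues, offered only as a suggestion, is to guard the first write with a check such as if (ai < n) and to move the final g_idata[thid] = thid into a second kernel launch on the same queue.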