openclbarrier

OpenCL barrier usage within conditional


According to the specs:

If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.

If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

In my understanding this means, that in any Kernel:

if(0 == (get_local_id(0)%2)){
  //block a
  work_group_barrier(CLK_GLOBAL_MEM_FENCE);
  //block a part 2
}else{
  //block b
  work_group_barrier(CLK_GLOBAL_MEM_FENCE);
  //block b part 2
}
//common operations

Should one worker reach //block a, every other worker needs to reach it too.

By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.

Is this understanding correct?

What would be a good synchronisation strategy for a situation like this ?

Would a logic like this be an acceptable solution?

__local number_finished = 0;
if(0 == (get_local_id(0)%2)){
  //block a
  atomic_add(&number_finished, 1);
  while(number_finished < get_local_size(0));
  //block a part 2
}else{
  //block b
  atomic_add(&number_finished, 1);
  while(number_finished < get_local_size(0));
  //block b part 2
}
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//common operations

Solution

  • By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.

    Is this understanding correct?

    It is not possible. When using a work_group_barrier, you must ensure all work-items in the work group reach it. If there are paths within your code that don't reach the barrier, it may lead to undefined behavior.

    What would be a good synchronization strategy for a situation like this ?

    Usually, barriers are used outside of any conditional sections / loop:

    if (cond)
    {
       // all work items perform work that needs to be synchronized later 
       // such as local memory writes
    }
    
    barrier(CLK_LOCAL_MEM_FENCE); // note the CLK_LOCAL_MEM_FENCE flag
    
    // now every work item can read the data the other work items wrote before the barrier. 
    
    for (...)
    {
    }
    
    barrier(CLK_LOCAL_MEM_FENCE);
    

    Would a logic like this be an acceptable solution?

    It might work, but a barrier outside the conditional section would be more efficient than a busy wait.

    if(0 == (get_local_id(0)%2)){
      //block a
    }else{
      // block b
    }
    barrier(CLK_LOCAL_MEM_FENCE); // ensures number_finished == get_local_size(0)