According to the specs:
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
In my understanding this means, that in any Kernel:
if(0 == (get_local_id(0)%2)){
//block a
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//block a part 2
}else{
//block b
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//block b part 2
}
//common operations
Should one worker reach //block a
, every other worker needs to reach it too.
By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.
Is this understanding correct?
What would be a good synchronisation strategy for a situation like this ?
//block a part 2
and //block b part 2
the workers inside one working group be synced up.Would a logic like this be an acceptable solution?
__local number_finished = 0;
if(0 == (get_local_id(0)%2)){
//block a
atomic_add(&number_finished, 1);
while(number_finished < get_local_size(0));
//block a part 2
}else{
//block b
atomic_add(&number_finished, 1);
while(number_finished < get_local_size(0));
//block b part 2
}
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
//common operations
By this logic it is not possible to correctly synchronize every odd local worker with every even one ( blocks a and b ) to be run at the same time.
Is this understanding correct?
It is not possible. When using a work_group_barrier
, you must ensure all work-items in the work group reach it. If there are paths within your code that don't reach the barrier, it may lead to undefined behavior.
What would be a good synchronization strategy for a situation like this ?
Usually, barriers are used outside of any conditional sections / loop:
if (cond)
{
// all work items perform work that needs to be synchronized later
// such as local memory writes
}
barrier(CLK_LOCAL_MEM_FENCE); // note the CLK_LOCAL_MEM_FENCE flag
// now every work item can read the data the other work items wrote before the barrier.
for (...)
{
}
barrier(CLK_LOCAL_MEM_FENCE);
Would a logic like this be an acceptable solution?
It might work, but a barrier outside the conditional section would be more efficient than a busy wait.
if(0 == (get_local_id(0)%2)){
//block a
}else{
// block b
}
barrier(CLK_LOCAL_MEM_FENCE); // ensures number_finished == get_local_size(0)