I have just started OpenCL C programming. All work items of a work group update unique locations in local memory. Later, a private variable of a work item is updated based on local data written by two other work items. Something like this:
__kernel void MyKernel(__global int *in_ptr)
{
    /* Define a variable in private address space */
    int priv_data;
    /* Define two indices in private address space */
    int index1, index2;
    /* index1 and index2 are legitimate local work group indices */
    index1 = SOME_CORRECT_VALUE;
    index2 = ANOTHER_CORRECT_VALUE;
    /* Define storage in local memory large enough to cater to all work items of this work group */
    __local int tempPtr[WORK_GROUP_SIZE];
    tempPtr[get_local_id(0)] = SOME_RANDOM_VALUE;
    /* Do not proceed until the update of tempPtr by this WI has completed */
    mem_fence(CLK_LOCAL_MEM_FENCE);
    /* Do not proceed until all WI of this WG have updated tempPtr */
    barrier(CLK_LOCAL_MEM_FENCE);
    /* Update private data */
    priv_data = tempPtr[index1] + tempPtr[index2];
}
Although the snippet above is conservative, wouldn't the barrier alone have done the job, since it performs a fence internally?
Yes, barrier already does fencing.
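A barrier synchronizes execution at that point: every work item of the work group has to reach it before any of them continues, so all earlier instructions have been executed and local memory is consistent across the group there. A fence only orders the memory operations of the work item that executes it: its earlier reads/writes are finished before any of its later reads/writes are performed, but the other work items may be executing different instructions.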
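To make the barrier's guarantee concrete, here is a minimal host-side sketch in plain C (not OpenCL C) that emulates one work group sequentially. Each loop iteration stands for one work item, and the boundary between the two loops stands for barrier(CLK_LOCAL_MEM_FENCE). The function name emulate_work_group, the group size of 8, the values written and the way index1 and index2 are chosen are all assumptions made up for illustration; only tempPtr, index1, index2 and priv_data come from the question.

```c
#include <assert.h>

#define WORK_GROUP_SIZE 8  /* hypothetical group size for this sketch */

/* Emulate one work group on the host. priv_data[lid] receives the value
   that work item lid would hold in its private variable. */
static void emulate_work_group(int priv_data[WORK_GROUP_SIZE])
{
    int tempPtr[WORK_GROUP_SIZE];  /* stands in for __local memory */

    /* Phase 1: every work item writes its own slot. */
    for (int lid = 0; lid < WORK_GROUP_SIZE; ++lid)
        tempPtr[lid] = 10 * lid;   /* arbitrary value, assumption */

    /* Barrier point: phase 2 starts only after ALL phase-1 writes are done. */

    /* Phase 2: every work item reads two slots written by other work items. */
    for (int lid = 0; lid < WORK_GROUP_SIZE; ++lid) {
        int index1 = (lid + 1) % WORK_GROUP_SIZE;  /* assumption */
        int index2 = (lid + 2) % WORK_GROUP_SIZE;  /* assumption */
        /* Both indices must be legitimate local indices. */
        assert(index1 >= 0 && index1 < WORK_GROUP_SIZE);
        assert(index2 >= 0 && index2 < WORK_GROUP_SIZE);
        priv_data[lid] = tempPtr[index1] + tempPtr[index2];
    }
}
```

Calling emulate_work_group on an array of WORK_GROUP_SIZE ints fills in one result per emulated work item. On a real device the work items run concurrently, so the separation between the two phases is not automatic; it is what the barrier provides.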
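In some cases a single fence is all you need: when you do not care about work items getting out of step and just want this work item's previous memory writes/reads to be completed. That is not your case, though. Each work item reads values written by other work items, and only the barrier guarantees that those writes have already happened, so keep the barrier and drop the mem_fence. (If the code runs in a loop, or there is extra code you have not put in the example, you may need further barriers.)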