concurrency parallel-processing opencl gpgpu consistency

OpenCL 1.2: Global memory consistency surrounding atomic operations?

I'm trying to implement global synchronization in OpenCL 1.2 using atomics and was wondering if there's any way to ensure that reads from different work groups (that provably -- by the logic of the program -- occur after an atomic increment) see the updated result.

Details: I have a binary tree (stored as an array) of values I wish to compute. Values of internal nodes depend on those calculated for their children. There are n leaves, and n threads are spawned, which traverse from leaf -> root.

For any particular node, suppose two threads arrive at it, not necessarily in the same workgroup:

- Both threads atomically increment a counter in global memory (preinitialized to 0).
- If atomic_inc returns 0, the thread returns (first visitor).
- If atomic_inc returns 1, the thread calculates the value for the node (using the values of its children if it is an internal node), follows a pointer/index to its parent, and repeats this process there (second visitor).
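The first-visitor/second-visitor protocol above can be sketched on the host with C11 atomics. This is only an illustration under assumptions of mine, not the kernel from the question: the `Node` layout, the `combine` step (a plain sum), the name `climb_from_leaf`, and the choice to keep counters only on internal nodes with leaf values filled in beforehand are all hypothetical.

```c
#include <stdatomic.h>

#define NO_NODE (-1)

/* Hypothetical node layout for a binary tree stored as an array. */
typedef struct {
    atomic_uint visits; /* arrival counter, preinitialized to 0 */
    int parent;         /* index of the parent, NO_NODE at the root */
    int left, right;    /* child indices, NO_NODE for a leaf */
    long value;         /* leaf: input; internal: computed from children */
} Node;

/* Hypothetical combine step: an internal node is the sum of its children. */
static long combine(long a, long b) { return a + b; }

/* Climb from one leaf toward the root. Returns how many internal
   nodes this caller computed on the way up. */
int climb_from_leaf(Node *nodes, int leaf)
{
    int computed = 0;
    int i = nodes[leaf].parent;
    while (i != NO_NODE) {
        /* Like atomic_inc, atomic_fetch_add returns the previous count. */
        unsigned prev = atomic_fetch_add(&nodes[i].visits, 1u);
        if (prev == 0u)
            break; /* first visitor: the sibling subtree is not finished */
        /* second visitor: both children have been computed */
        nodes[i].value = combine(nodes[nodes[i].left].value,
                                 nodes[nodes[i].right].value);
        computed++;
        i = nodes[i].parent;
    }
    return computed;
}
```

Calling `climb_from_leaf` once per leaf, in any order, computes each internal node exactly once. On the host, the default sequentially consistent `atomic_fetch_add` also orders the plain writes to `value`. As far as I can tell, OpenCL 1.2's `atomic_inc` makes no such promise for non-atomic accesses, which is consistent with the stale reads described in the question.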

Thus, the thread processing any node is guaranteed that both of its children have already been processed in "real time". I have a working implementation of the same idea that uses a thread pool on the CPU.

Problem:

When an internal node reads the calculated values of its children, a minority of them are stale/uninitialized. In fact, even the children's counters (which are only modified by atomic operations) may give stale values (e.g., 0 or 1, as opposed to 2). From the procedure above, it should be impossible to be processing a node before each of its children has been visited twice.

This suggests to me that OpenCL 1.2 gives only eventual-consistency guarantees for reads of the same global memory from different workgroups, even when the writes are atomic. Is my assessment correct (and if so, are there any ways around this)?

I greatly appreciate any help with this matter!

TL;DR: I am aware OpenCL 1.2 generally speaking has a relaxed consistency memory model - does this also apply to atomic writes? (And if so, can one ensure reads which occur after an atomic write are not stale?)

(If this is not possible, this would seem to me to be a significant issue, since atomic increments - which involve a read - surely can synchronize with prior atomic writes (or else a lot of correctness guarantees would go out the window)? Surely reads can synchronize across work groups with atomic writes too?)

Solution

I ended up solving this myself. For anyone else with a similar issue: it appears that atomic operations are not synchronized with regular (non-atomic) reads, and atomic_load is not provided by the OpenCL 1.2 API. I implemented my own version, which does see writes from other workitems and workgroups, as follows:

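#define DUMMY_VALUE 0u  /* arbitrary; any value works */

/* Atomic read: a compare-exchange with identical compare and store values never changes *ptr. */
uint atomic_load(volatile global uint* ptr) {
    return atomic_cmpxchg(ptr, DUMMY_VALUE, DUMMY_VALUE);
}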

Here DUMMY_VALUE can be any value; the only important part is that the last two arguments are the same value.

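This works because atomic_cmpxchg always returns the old value of *ptr. If *ptr == DUMMY_VALUE, DUMMY_VALUE is stored back, which leaves the value unchanged; if they are not equal, nothing is stored. Either way, *ptr is unmodified and its current value is returned.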
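The same reasoning can be checked on the host with a C11 analogue. This is a sketch of the idea, not OpenCL code: `load_via_cas` is a hypothetical name, and C11 already has a real `atomic_load`, so the function exists only to show that a compare-and-swap with identical expected and desired values behaves as a read.

```c
#include <stdatomic.h>

#define DUMMY_VALUE 0u /* arbitrary; any value works */

/* Read *ptr through a compare-and-swap whose expected and desired
   values are the same, so the stored value can never change. */
unsigned load_via_cas(atomic_uint *ptr)
{
    unsigned expected = DUMMY_VALUE;
    /* Match: DUMMY_VALUE is written over DUMMY_VALUE (no visible change).
       Mismatch: nothing is written and the current value of *ptr is
       copied into `expected`. */
    atomic_compare_exchange_strong(ptr, &expected, DUMMY_VALUE);
    return expected;
}
```

In both branches `expected` ends up holding the value `*ptr` had at the time of the operation, which mirrors what `atomic_cmpxchg` returns in the OpenCL version.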