c++ multithreading x86-64 volatile lock-free

volatile variable updated from multiple threads C++


    volatile bool b;

    // Thread 1: only reads b
    void f1() {
        while (1) {
            if (b) { /* do something */ }
            else   { /* do something else */ }
        }
    }

    // Thread 2: only sets b to true if a certain local condition is met
    void f2() {
        while (1) {
            // some local condition evaluated -> local_cond
            if (!b && local_cond) b = true;
            // some other work
        }
    }

    // Thread 3: only sets b to false when it gets a message
    // on a socket it's listening to
    void f3() {
        while (1) {
            // select() on the socket
            if (expected_message_came) b = false;
            // do some other work
        }
    }

If thread2 updates b first at time t and later thread3 updates b at time t+5:

Will thread1 see the latest value "in time" whenever it reads b?

For example: reads from t+delta to t+5+delta should read true, and reads after t+5+delta should read false.

delta is the time it takes for the store to b to reach memory once thread 2 or 3 has updated it.


Solution

  • The effect of the volatile keyword is principally two things (I avoid scientifically strict formulations here):

    1. Its accesses can't be cached or combined. (UPD: as was suggested to me, I underline that this means caching in registers or another compiler-provided location, not the hardware RAM cache in the CPU.) For example, the following code:

       x = 1;
       x = 2;
      

      for a volatile x will never be combined into a single x = 2, whatever optimization level is requested; but if x is not volatile, even low optimization levels will likely cause this collapse into a single write. The same holds for reads: each read operation will access the variable's value without any attempt to cache it.
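
      The read side matters just as much for the question's thread 1, which polls b in a loop. A sketch (the flag names are mine) of why a plain bool may break such polling under the usual behavior of optimizing compilers:

       bool plain_flag;          // no volatile: the loop below may never re-read it
       volatile bool vol_flag;   // volatile: a fresh load on every iteration

       void wait_plain() {
           // Seeing no writes to plain_flag inside the loop, an optimizer
           // may hoist the load out of it, effectively producing
           // `if (!plain_flag) for (;;);` -- spinning forever on a stale value.
           while (!plain_flag) { /* spin */ }
       }

       void wait_volatile() {
           // Each iteration performs a real load of vol_flag.
           while (!vol_flag) { /* spin */ }
       }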

    2. All volatile operations are emitted at the machine-instruction level in the same order relative to each other (to underline: only relative to other volatile operations), exactly as they are defined in the source code.

      But this is not true for ordering between non-volatile and volatile accesses. For the following code:

      int *x;
      volatile int *vy;
      void foo()
      {
        *x = 1;
        *vy = 101;
        *x = 2;
        *vy = 102;
      }
      

      gcc (9.4) with -O2 and clang (10.0) with -O both produce something similar to:

              movq    x(%rip), %rax
              movq    vy(%rip), %rcx
              movl    $101, (%rcx)
              movl    $2, (%rax)
              movl    $102, (%rcx)
              retq
      

      so one access to x is already gone, despite its presence between two volatile accesses. If one needs the first x = 1 to complete before the first write to vy, one must use an explicit barrier (since C11, atomic_signal_fence is the platform-independent means for this).
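
      A sketch of where such a fence could be placed in the example above (C++ spelling shown; C11's atomic_signal_fence from <stdatomic.h> is equivalent). In practice gcc and clang then keep both stores to x, though strictly this fence constrains only the compiler, not the CPU:

      #include <atomic>

      int *x;
      volatile int *vy;

      void foo()
      {
        *x = 1;
        // compiler-level barrier: no CPU fence instruction is emitted, but
        // the compiler may not move or merge memory accesses across it
        std::atomic_signal_fence(std::memory_order_seq_cst);
        *vy = 101;
        *x = 2;
        *vy = 102;
      }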


    That was the general rule, but without regard to multithreading issues. What happens here with multithreading?

    Well, imagine thread 2 writes true to b: this is a write of the value 1 to a single-byte location. It is an ordinary write, without any memory ordering requirements. All that volatile specified is that the compiler won't optimize it away. But what about the processor?
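
    For illustration (typical codegen, not a guarantee): gcc on x86-64 compiles such a store to a volatile bool into a single plain byte move, with no fence and no lock prefix:

            movb    $1, b(%rip)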

    If this were a modern abstract processor, or one with relaxed rules, like ARM, I'd say nothing prevents it from postponing the real write for an indefinite time. (To clarify, "write" here means exposing the operation to the RAM-and-all-caches conglomerate.) It's fully up to the processor's deliberation. Of course, processors are designed to flush their stockpile of pending writes as fast as possible. But what determines the real delay you can't know: the processor could, for example, "decide" to fill the instruction cache with the next few lines first, or flush other queued writes... there are lots of variants. The only thing we know is that it makes a "best effort" to flush all queued operations, to avoid getting buried under previous results. That's entirely natural, and nothing more.

    With x86, there is an additional factor. Nearly every memory write (and, I guess, this one as well) is a "releasing" write on x86, so all previous reads and writes must be completed before it. But the crucial point is that the operations required to complete are the ones before this write. So when you write true to the volatile b, you can be sure all previous operations have already become visible to other participants... but this write itself could still be postponed for a while... how long? Nanoseconds? Microseconds? Any subsequent write to memory will flush, and so publish, this write to b... do you have writes in the loop iteration of thread 2?

    The same affects thread 3. You can't be sure this b = false will be published to other CPUs when you need it. The delay is unpredictable: unless this is a realtime-aware hardware system, the only guarantee is that the write becomes visible within some indefinite time; the ISA rules and barriers provide ordering, not exact timings. And x86 is definitely not built for such realtime.


    Well, all this means you also need an explicit barrier after the write, one that constrains not only the compiler but the CPU as well: a barrier ordering the preceding write against the following reads and writes. In C or C++, a full barrier satisfies this, so you have to add std::atomic_thread_fence(std::memory_order_seq_cst), or use an atomic variable (instead of a plain volatile one) with the same memory order for the write.
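
    A minimal sketch of the atomic-variable variant (the function names are mine, not from the question):

      #include <atomic>

      std::atomic<bool> b{false};

      void publish() {
          // seq_cst store: on x86-64 this typically compiles to xchg (or
          // mov + mfence), which makes the core wait until the store is
          // globally visible instead of leaving it queued in the store buffer
          b.store(true, std::memory_order_seq_cst);
      }

      bool poll() {
          // seq_cst load: on x86-64 an ordinary mov, but the compiler is
          // forbidden to cache it in a register or reorder it
          return b.load(std::memory_order_seq_cst);
      }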

    And all this still won't provide you with exact timings like those you described ("t" and "t+5"), because the visible "timestamps" of the same operation can differ for different CPUs! (Well, this resembles Einstein's relativity a bit.) All you can say in this situation is that something is written into memory, and typically (though not always) the inter-CPU order is what you expected (but an ordering violation will punish you).


    But I can't catch the general idea of what you want to implement with this flag b. What do you want from it; what state should it reflect? Let's return to the upper-level task and reformulate. Is this (I'm just guessing on coffee grounds) a green light to do something, which can be cancelled by an external order? If so, an internal permission ("we are ready") from thread 2 shall not override that cancellation. This can be done using different approaches, such as:

    1. Just separate flags and a mutex/spinlock around setting them. Easy, but a bit costly (or even substantially costly; I don't know your environment).

    2. An atomically modified analog. For example, you can use a bitfield variable which is modified using compare-and-swap. Assign bit 0 to "ready" but bit 1 for "cancelled". For C, atomic_compare_exchange_strong is what you'll need here on x86 (and most other ISAs). And, volatile is not needed any more if you keep accessing with memory_order_seq_cst.