Tags: java, x86-64, volatile, memory-barriers, java-memory-model

Java volatile memory ordering and its compilation on x86-64


Consider the following simple Java application:

public class Main {
    public int a;
    public volatile int b;

    public void thread1(){
        int b;
        a = 1;
        b = this.b;
    }

    public void thread2(){
        int a;
        b = 1;
        a = this.a;
    }

    public static void main(String[] args) throws Exception {
        Main m = new Main();
        while(true){
            m.a = 0;
            m.b = 0;
            Thread t1 = new Thread(() -> m.thread1());
            Thread t2 = new Thread(() -> m.thread2());
            t1.start();
            t2.start();
            t1.join();
            t2.join();
        }
    }
}

QUESTION: Is it possible that reading into local variables will result in thread1::b = 0 and thread2::a = 0?

I could not prove from the JMM standpoint that this cannot happen, so I went down to analyzing the compiled code for x86-64.
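For reference, getting disassembly like the listings below generally requires unlocking diagnostic VM options and having the hsdis disassembler library available to the JVM; an invocation along these lines (the exact flags may differ) is typical:

  java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Main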

Here is what the compiler ends up with for methods thread1 and thread2 (unrelated code and some of the comments generated by -XX:+PrintAssembly are omitted for simplicity):

thread1:

  0x00007fb030dca235: movl    $0x1,0xc(%rsi)    ;*putfield a
  0x00007fb030dca23c: mov     0x10(%rsi),%esi   ;*getfield b

thread2:

  0x00007fb030dcc1b4: mov     $0x1,%edi
  0x00007fb030dcc1b9: mov     %edi,0x10(%rsi)
  0x00007fb030dcc1bc: lock addl $0x0,0xffffffffffffffc0(%rsp) ;*putfield b 
  0x00007fb030dcc1c2: mov     0xc(%rsi),%esi    ;*getfield a

So what we have here is that the volatile read is done for free, while the volatile write requires an mfence (or lock add) after it.

So thread1's Store can still be forwarded after the Load and therefore thread1::b = 0 and thread2::a = 0 is possible.


Solution

  • Yeah, your analysis looks right. This is the StoreLoad litmus test with only one of the sides having a StoreLoad barrier (like C++ std::atomic with memory_order_seq_cst, or Java volatile). The barrier is needed on both sides to shut down this possibility. See Jeff Preshing's Memory Reordering Caught in the Act for details on the case where neither side has such a barrier.

    StoreLoad reordering of a=1 with b=this.b allows an effective order of

       thread1        thread2
        b=this.b                      // reads 0
                       b=1
                       a=this.a       // reads 0
        a=1
    

    (This mess of names is why it's normal for examples and reordering litmus tests to pick names like r0 and r1 for "registers" that hold the load results the threads observed: definitely not the same names as the shared variables, which would make the meaning of each statement context-sensitive and a pain to look at and think about in a reordering diagram.)
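    For example, the same litmus test written with distinct names for the load results might look like this (r1 and r2 are just illustrative names, not from the code in the question):

        // shared:  int a = 0;   volatile int b = 0;

        // Thread 1                // Thread 2
        a = 1;                     b = 1;
        int r1 = b;                int r2 = a;

        // outcome in question: r1 == 0 && r2 == 0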

    "So thread1's Store can still be forwarded after the Load and therefore thread1::b = 0 and thread2::a = 0 is possible."

    It seems you mean "reordered after", not forwarded. "Forwarding" in a memory-ordering context would mean store-to-load forwarding (where a load pulls data out of the store buffer before the store becomes globally visible, so a thread sees its own stores right away, in a different order relative to other operations than other threads would). But neither of your threads is reloading its own stores, so that's not happening here.
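    For contrast, store-to-load forwarding is about a thread reloading its own recent store, roughly like this (illustrative only; neither thread in the question does it):

        a = 1;        // goes into this core's store buffer
        int r = a;    // can be satisfied from the store buffer, so r == 1 here,
                      // possibly before any other core can see a == 1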

    x86's memory model is basically program-order + a store buffer with store-to-load forwarding, so StoreLoad reordering is the only kind that can happen.

    So yes, this is about the closest you can come to ruling out ra=rb=0 while still leaving a window open for it to happen: running on a strongly-ordered ISA (x86), with the barrier on only one side.
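    For completeness, a sketch of how to rule the outcome out at the source level (not code from the question): declare a volatile as well, so both stores are followed by the StoreLoad barrier and the JMM's total order over volatile accesses forbids the 0/0 result.

        public class MainBothVolatile {
            public volatile int a;   // volatile now, so its store also gets lock addl on x86
            public volatile int b;

            public void thread1() {
                a = 1;               // volatile store, followed by a StoreLoad barrier
                int rb = this.b;     // volatile load
            }

            public void thread2() {
                b = 1;               // volatile store, followed by a StoreLoad barrier
                int ra = this.a;     // volatile load
            }
        }

        // With both fields volatile, at least one of the loads must observe the
        // other thread's store, so rb == 0 && ra == 0 can no longer happen.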

    It's also going to be really unlikely to observe when you only make one test per thread startup; I'm not surprised it took 30 minutes of executions before the two threads ran close enough together in time across cores for this to show up. (Testing faster is non-trivial: maybe a third thread that resets things between tests and wakes both other threads? But doing something to make it more likely that both threads reach this code at the same time could help a lot, such as having them both spin-wait on the same variable so they'd likely wake within a hundred cycles of each other.)
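    A rough sketch of that idea (a hypothetical harness, not code from the question): both worker threads spin on a shared start flag, the main thread releases them together, and the results are checked after joining.

        public class RaceTest {
            public int a;
            public volatile int b;
            volatile boolean go;     // start flag both threads spin on
            int r1, r2;              // load results; written before join(), read after

            public static void main(String[] args) throws Exception {
                RaceTest m = new RaceTest();
                for (long i = 1; ; i++) {
                    m.a = 0; m.b = 0; m.go = false;
                    Thread t1 = new Thread(() -> {
                        while (!m.go) {}   // spin so both threads start nearly together
                        m.a = 1;
                        m.r1 = m.b;
                    });
                    Thread t2 = new Thread(() -> {
                        while (!m.go) {}
                        m.b = 1;
                        m.r2 = m.a;
                    });
                    t1.start();
                    t2.start();
                    Thread.sleep(1);       // let both threads reach their spin loops
                    m.go = true;           // release them at (nearly) the same time
                    t1.join();
                    t2.join();
                    if (m.r1 == 0 && m.r2 == 0) {
                        System.out.println("r1 == r2 == 0 at iteration " + i);
                    }
                }
            }
        }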