c multithreading language-lawyer atomic c11

Does the the 2+2W litmus test code in the 'Repairing Sequential Consistency' paper have a problem?

I was reading the paper Repairing Sequential Consistency in C/C++11 by Lahav et al and had a doubt regarding the 2+2W litmus test presented there.

Under the section Semantics of SC Atomics in C11, the authors present the following code for the 2+2W litmus test.

// thread 1
x :=sc 1
y :=sc 2
a :=rlx y

// thread 2
y :=sc 1
x :=sc 2
b :=rlx x

The text says that the conditions imposed by C11 SC semantics prevent the reads from both reading a value of 1. That is (a==1 && b==1) is false.

I looked up the intent of the 2+2W litmus test, and as per my understanding, it checks which of the two writes per thread end up contributing the final values of the shared variables. The test, as described here, just intends to check the values of the two shared variables as a post condition, not necessarily on the two concurrent threads. So, it seems, any specific placement of the reads isn't necessarily an inherent part of the test.

The way this test is encoded by Lahav, I find it queer, why should a read return a value different from the one written by the write just preceding it.

I understand, that the SC semantics enforced by C11, are intended to uphold the promises made to the users of the language. And to achieve this the compiler maintains operation ordering, and/or inserts specific barrier instructions, depending on the requirements of the architecture. And this specifically prevents both the reads from reading a value of 1.

All the same, what I have also read is that, all hardware maintain the semantics of a single thread of execution as per the program order. Then even in the absence of any enforcement made by C11, the compiler and hardware should maintain the semantics of y :=sc 2 followed by b :=rlx x and result in b==2.

My point of confusion is: In the absence of any semantic enforcing done by C11 on the above code, what hardware optimizations or run time situations would make b :=rlx x read any thing other than 2 (and similarly for a :=rlx y).

Assuming my confusion is not misplaced, part two of the question is: Would the code have been better with the two reads swapped across the two threads? Under C11 SC semantics, the modification_ordered_before and sequenced_before cycle across the four writes, that precludes the validity of the execution under question, would still be exemplified, for the sake of the discussion in the paper.

Solution

In the absence of any semantic enforcing done by C11 on the above code, what hardware optimizations or run time situations would make b :=rlx x read any thing other than 2

Simple: Thread 2 executes y := 1, then x := 2, then it observes thread 1's execution of x := 1, then it executes b := x. There's nothing inconsistent about that sequence of events. The order of operations within a single thread is preserved, but operations by another thread may or may not interpose — because we're making no guarantees.

This outcome is even possible with the full C++11 memory model; thread 1 may or may not write to x in between thread 2's x := 2 and thread 2's b := x, and thread 2 may or may not observe that write since it's making a relaxed read. But it's not allowed for a and b to both be 1 because that would be incompatible with a globally consistent order of the writes to x and y.