
Why is acquire semantics only for reads, not writes? How can an LL/SC acquire CAS take a lock without the store reordering with the critical section?


To start with, consider release semantics. Suppose a data set is protected with a spinlock (mutex, etc. - the exact implementation doesn't matter; for now, assume 0 means it's free and 1 means busy). After modifying the data set, a thread stores 0 to the spinlock address. To force visibility of all previous actions before this store, the store is executed with release semantics: all previous reads and writes must be made visible to other threads before it. Whether this is done with a full barrier or with a release mark on the single store operation is an implementation detail. That is (I hope) clear beyond any doubt.
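In C++ terms, that unlock is just a release store. A minimal sketch, assuming spin is the std::atomic<int> holding the lock word:

  // all earlier reads/writes of the data set become visible to other
  // threads no later than the store that marks the lock free
  spin.store(0, std::memory_order_release);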

Then, consider the moment when spinlock ownership is taken. To protect against races, this is some kind of compare-and-set operation. With a single-instruction CAS implementation (x86, SPARC...), this is a combined read and write; the same holds for x86 atomic XCHG. With LL/SC (most RISCs), it breaks down into:

  1. Read (LL) the spinlock location until it shows the free state. (This can be optimized with some kind of CPU stall.)
  2. Write (SC) the value "occupied" (1 in our case). The CPU reports whether the operation succeeded (condition flag, output register, etc.).
  3. Check the SC result and, if it failed, go to step 1.

In all cases, the operation that must become visible to other threads to show that the spinlock is occupied is the write of 1 to its location, and a barrier must take effect between this write and the subsequent manipulations of the data set protected by the spinlock. Reading the spinlock contributes nothing to the protection scheme, except permitting the CAS or LL/SC operation to proceed.

But all schemes actually implemented attach acquire semantics to reads (or to the CAS as a whole), not to writes. As a result, the LL/SC scheme would seem to require an additional final read-with-acquire operation on the spinlock to commit the required barrier. Yet there is no such instruction in typical compiler output. For example, compiling the following for ARM:

  #include <atomic>

  void lock(std::atomic<int>* p) {
    for (;;) {
      int e{0};   // expected: lock free
      int d{1};   // desired: lock taken
      if (std::atomic_compare_exchange_weak_explicit(p, &e, d,
            std::memory_order_acquire,
            std::memory_order_relaxed)) {
        return;
      }
    }
  }

its output contains first LDAXR == LL+acquire, then STXR == SC (with no barrier in it - so is there no guarantee that other threads will see it?). This is likely not an artifact of my setup; the same is generated e.g. in glibc: pthread_spin_trylock calls __atomic_compare_exchange_weak_acquire (with no further barriers), which falls through to the GCC builtin __atomic_compare_exchange_n with acquire on the mutex read and no release on the mutex write.
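For reference, the AArch64 output for the loop above looks schematically like this (my own annotation; exact registers and loop shape vary by compiler and options):

  mov     w3, #1            // desired value: lock taken
  .Lretry:
  ldaxr   w1, [x0]          // LL of the lock word, with acquire
  cbnz    w1, .Lretry       // lock busy: keep trying
  stxr    w2, w3, [x0]      // SC of 1; w2 == 0 on success; no release, no barrier
  cbnz    w2, .Lretry       // SC failed: retry
  ret                       // lock taken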

It seems I've missed some fundamental detail in this reasoning. Can anybody correct it?

This can also be split into two sub-questions:

SQ1: In an instruction sequence like:

(1) load_linked+acquire mutex_address     ; found it is free
(2) store_conditional mutex_address       ; succeeded
(3) read or write of mutex-protected area

what prevents the CPU from reordering (2) and (3), with the result that other threads won't see the mutex as locked?

SQ2: Is there a design factor that suggests having acquire semantics only on loads?

I've also seen some examples of lock-free code, such as:

// shared:
std::atomic<bool> flag{false};
int var;

thread 1:

var = value;
flag.store(true, std::memory_order_release);

thread 2:

if (flag.load(std::memory_order_acquire)) {
   // We can already access it!!!
   value = var;
   ... do something with value ...
}

but this pattern presumably could only be made to work after the mutex-protected style was already working reliably.


Solution

  • Yes, it is possible on real architectures for the store-conditional to reorder with the critical section. Nevertheless, implementing a mutex this way is safe and correct! You do not need an additional acquire-load or any other barriers.

    The basic idea is the following. Regardless of any reordering with other accesses, the LL/SC pair is still exclusive: if the SC succeeds, you know that no stores to the lock took place between it and the LL. So suppose the LL returns zero (lock available), and the SC of 1 (lock taken) succeeds. You now own the lock.

    But we can say more: for all intents and purposes, you have already owned it since the moment the LL returned zero, as nobody else could have taken it in between (or else the SC would have failed). Your ownership of the lock is effectively retroactive to the moment of the LL.

    Therefore, it was safe to start speculatively executing the critical section as soon as the LL returned. It was not necessary to wait for the SC, provided that everything remained speculative until we were certain that the SC would succeed.

    As such, it's sufficient to have an acquire barrier on the LL, since all we need to ensure is that the critical section is not reordered before the LL (which would obviously break everything). And of course, we need a release store when we unlock the mutex. It's this pairing between the acquire load on locking, and the prior release store on unlocking, that ensures the basic correctness property of a mutex: this instance of the critical section observes all the effects of the previous instance, and the previous instance observes none of the effects of this one.
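    To make that pairing concrete, here is a minimal C++ sketch (my own illustration, not a particular library's implementation) of a spinlock that relies only on the acquire CAS and the release store:

        #include <atomic>

        class spinlock {
            std::atomic<int> state{0};   // 0 == free, 1 == taken
        public:
            void lock() {
                int e = 0;
                // Acquire on success is all the ordering needed; on ARM this
                // compiles to an LDAXR/STXR loop like the ones shown above.
                while (!state.compare_exchange_weak(e, 1,
                           std::memory_order_acquire,
                           std::memory_order_relaxed))
                    e = 0;               // reset expected value and retry
            }
            void unlock() {
                // Release publishes the critical section before the lock
                // reads as free again.
                state.store(0, std::memory_order_release);
            }
        };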

    It's true that if some other core should simply test the lock, rather than actually trying to take it for themselves, they may observe the effects of the critical section on the protected objects before the lock appears as taken. But that isn't something that you would ever do when using a mutex properly. The other core should not be accessing the protected objects at all unless it actually owns the lock. Under the scenario discussed above, assuming our core succeeded in taking the lock, any other core that was trying must have failed, and therefore they should not even be looking at the protected objects which we may be accessing "early" before the SC.
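    As a concrete (hypothetical, my own) illustration of that improper pattern, consider an observer that merely tests the lock; its read of the protected object is a data race, and on ARM it could see data == 42 while the lock still reads 0:

        #include <atomic>

        std::atomic<int> lock_word{0};   // hypothetical names, not from the question
        int data = 0;

        void owner() {
            int e = 0;                   // take the lock properly (acquire CAS)
            while (!lock_word.compare_exchange_weak(e, 1,
                       std::memory_order_acquire,
                       std::memory_order_relaxed))
                e = 0;
            data = 42;                   // may become visible before lock_word == 1
            lock_word.store(0, std::memory_order_release);
        }

        void improper_observer() {       // WRONG: tests the lock without taking it
            if (lock_word.load(std::memory_order_acquire) == 0) {
                int x = data;            // data race (UB); may observe 42 even
                (void)x;                 // though the lock appeared free
            }
        }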

    Related to this topic, see also:

    Why does this spinlock require memory_order_acquire_release instead of just acquire?

    Is memory_order_acquire really sufficient for locking a spinlock?


    For ARMv8/v9 in particular, the architecture's memory model does indeed allow stxr, or even stlxr, to reorder with a following memory access, with some caveats and exceptions.

    At "For purposes of ordering, is atomic read-modify-write one operation or two?" you can see an example, reproducible on real hardware, where a later ldr (or even ldapr if available) becomes visible before an earlier stlxr.

    The question of whether a later store (str) could become visible before the stxr/stlxr is a little more delicate. We must only do the str if the stxr succeeded, with code such as:

    retry:
        ldaxr x0, [lock]
        cbnz x0, failed_to_take_lock
        mov x3, #1
        stxr w1, x3, [lock]     // stxr takes a register, not an immediate
        cbnz w1, retry
        str x2, [shared_variable]
        // ...
        stlr xzr, [lock]
    

    So there's an apparent control dependency between the stxr and the str, and normally ARMv8 doesn't allow stores to be reordered through a control dependency. I think this is implied in B2.3.7 of the Architecture Reference Manual, the third clause in the definition of "Dependency-ordered-before". After all, if it turns out that a store should not have been executed, we can't allow it to have been seen by any other core (unless we somehow had the ability to roll back the other core too, which ARMv8 certainly cannot do).

    However, there is an exception: in B2.3.6, in the first clause of the definition of "Dependency ordered through registers and memory", they exclude a register write generated by a Store Exclusive instruction. So if I understand this correctly, there is not considered to be a true control dependency here, and the str is allowed to become visible before the stxr.

    In practice, I suppose this can only happen if we somehow know in advance that the stxr will succeed - since if it fails, we might load nonzero on the next iteration, in which case we would bail out and the str should never have happened at all. Perhaps an implementation could take the cache line in exclusive state at the ldaxr, and promise to hold it for some reasonable number of clock cycles, denying any requests for ownership from other cores that might arrive before then. So if we can look ahead and see that the stxr will inevitably be executed within that time period, then we know it will succeed, and we could safely begin to execute past it non-speculatively. I don't know if any real implementation does this - I have not tried.

    The stxr obviously can't be reordered with a later stlr (the release ordering of the latter would forbid it). If the exclusive store is stlxr, then it can't be reordered with a later ldar, since despite their names, stlr/ldar actually provide sequentially consistent memory ordering: no StoreLoad reordering is allowed among them. But stlxr can still be reordered with a later ldapr, which has plain acquire ordering.