arm cpu-architecture memory-barriers memory-model armv8

ARMv8.3 meaning of rcpc

With ARMv8.3 a new instruction has been introduced: LDAPR.

When an STLR is followed by an LDAR to a different address, the two can't be reordered; hence this is called RCsc (release consistent, sequentially consistent).

When an STLR is followed by an LDAPR to a different address, the two can be reordered. This is called RCpc (release consistent, processor consistent).

My issue is with the PC part.

PC is a relaxation of TSO whereby TSO is multi-copy atomic and PC is non multi-copy atomic.

The memory model of ARMv8 has been improved to be multi-copy atomic because no supplier ever created a non multi-copy atomic microarchitecture and it made the memory model more complicated.

So I'm running into a contradiction.

The key question is: is every store (including relaxed) multi-copy atomic?

If so, then the PC part of rcpc doesn't make sense to me since PC is non multi-copy atomic. Could it be a legacy name due to ARM being non multi-copy atomic in the past?

There are multiple definitions of PC; so perhaps that is the cause.

Solution

In practice, STLR / LDAPR gives C++ std::memory_order_release and acquire,
as opposed to seq_cst from STLR / LDAR.
LDAR can't reorder with an earlier STLR, but LDAPR can.

LDAPR allows StoreLoad reordering even with earlier release and seq_cst (STLR) stores, whereas LDAR allows it only with earlier relaxed and non-atomic stores (STR). This applies to the same or different addresses. Waiting for the store buffer to drain is slow; that's the major thing that makes release / acquire faster than code with seq_cst stores and loads (or faster than plain acquire loads before ARMv8.3, when those had to use LDAR).
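The StoreLoad case above is the classic store-buffer litmus test. Here is a minimal C++ sketch of it, under some assumptions: `count_both_zero` is a hypothetical helper name, not part of any library, and spawning fresh threads per iteration is a crude harness that will rarely catch the reordering even where it is allowed. What the C++ memory model does guarantee is that the "both loads read 0" outcome is impossible with seq_cst, while it is permitted with release / acquire.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (illustration only): runs the store-buffer litmus test
// `iterations` times and returns how often both loads observed 0.
// Both loads reading 0 requires that at least one load took its value before
// the same thread's earlier store became visible to the other thread,
// i.e. StoreLoad reordering.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;  // read only after join(), so plain ints are fine

        std::thread t1([&] {
            x.store(1, StoreOrder);   // store to x ...
            r1 = y.load(LoadOrder);   // ... then load from a different address
        });
        std::thread t2([&] {
            y.store(1, StoreOrder);   // mirror image on the other thread
            r2 = x.load(LoadOrder);
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

Calling `count_both_zero<std::memory_order_seq_cst, std::memory_order_seq_cst>(n)` must return 0 on any conforming implementation. The `<std::memory_order_release, std::memory_order_acquire>` instantiation may return a non-zero count. On AArch64 with RCpc available, compilers would be expected to use STLR + LDAPR for that instantiation and STLR + LDAR for the seq_cst one, but that depends on the compiler version and target flags, so check the generated asm rather than assuming it.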

So "processor consistent" is presumably describing the fact that the current core sees its own operations in program order, and it also serves as a way to note that the result is not sequentially consistent, since they don't use that term for it. It doesn't mean that other parts of the memory model rules are removed.

Yes, ARMv8 is multi-copy atomic, so every plain store (str, stp, etc.) is multi-copy atomic: it becomes visible to all other cores at the same time via coherent cache, so all threads can agree on the order of two stores done by two independent writers (the IRIW litmus test). This is unlike POWER, where some threads can see stores early from other SMT threads on the same physical core.
(More precisely, ARMv8 is other-multi-copy atomic. In the terminology of the arch docs, multi-copy-atomic would imply becoming visible to all cores at the same time, including the one doing the store, i.e. it would forbid store-forwarding. Thanks @Nate for the correction.)
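For reference, the IRIW (independent reads of independent writes) litmus test can be sketched in C++ like this. The helper name `count_iriw_disagreements` is hypothetical, and the thread-per-iteration harness is only meant to show the shape of the test, not to be an effective way of provoking the reordering. In ISO C++ the "readers disagree" outcome is forbidden for seq_cst operations and allowed for release / acquire; whether real hardware ever produces it with release / acquire is a separate question that depends on the ISA being (other-)multi-copy atomic.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (illustration only): runs IRIW `iterations` times and
// returns how often the two readers disagreed about the order of the two
// independent stores.
template <std::memory_order StoreOrder, std::memory_order LoadOrder>
int count_iriw_disagreements(int iterations) {
    assert(iterations >= 0);
    int disagreements = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1, r3 = -1, r4 = -1;  // read only after join()

        // Two independent writers, each storing to its own variable.
        std::thread w1([&] { x.store(1, StoreOrder); });
        std::thread w2([&] { y.store(1, StoreOrder); });

        // Reader A loads x then y; reader B loads y then x.
        std::thread ra([&] {
            r1 = x.load(LoadOrder);
            r2 = y.load(LoadOrder);
        });
        std::thread rb([&] {
            r3 = y.load(LoadOrder);
            r4 = x.load(LoadOrder);
        });

        w1.join();
        w2.join();
        ra.join();
        rb.join();

        // Reader A saw the x store but not the y store, while reader B saw
        // the y store but not the x store: they disagree on the store order.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) ++disagreements;
    }
    return disagreements;
}
```

With `<std::memory_order_seq_cst, std::memory_order_seq_cst>` the result must be 0 everywhere. With `<std::memory_order_release, std::memory_order_acquire>` the language allows a non-zero result, but on a multi-copy-atomic machine such as ARMv8 the hardware itself should never produce one, whether the acquire loads are LDAR or LDAPR.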

LDAPR doesn't relax that guarantee.

(ARMv7 did not have this property, and I've heard that some of NVIDIA's 32-bit ARM designs did have IRIW reordering. But ARM's own designs didn't. ARM was able to strengthen their guarantees without actually changing how anything worked in their own microarchitectures, beyond adding support for the new instructions in ARMv8's 32-bit mode. "Shared Memory Consistency Models: A Tutorial" from 1995, linked in comments, uses the term RCpc to describe a category of memory models that does include some readers being able to see some stores before other readers, allowing IRIW. So maybe being multi-copy atomic is orthogonal to RCpc, and RCpc doesn't imply anything about whether IRIW reordering is allowed or not? Regardless, ARMv8's memory model does forbid IRIW reordering.)

Big caveat: I'm not a terminology expert on this, and I've never heard of "processor consistent" before so I'm just guessing from context what they mean by it, with an interpretation that would be consistent with all known facts. Please correct me if this is incompatible with an accepted definition of the term.