I was wondering what benefits MOESI has over the MESI cache coherency protocol, and which protocol is currently favored for modern architectures. Oftentimes benefits don't translate to implementation if the costs don't allow it. Quantitative performance results of MOESI over MESI would be nice to see also.
AMD uses MOESI, Intel uses MESIF. (I don't know about non-x86 cache details.)
MOESI allows sending dirty cache lines directly between caches instead of writing back to a shared outer cache and then reading from there. The linked wiki article has a bit more detail, but it's basically about sharing dirty data. The Owned state keeps track of which cache is responsible for writing back dirty the data.
MESIF allows caches to Forward a copy of a clean cache line to another cache, instead of other caches having to re-read it from memory to get another Shared copy. (Intel since Nehalem already uses a single large shared L3 cache for all cores, so all requests are ultimately backstopped by one L3 cache before checking memory anyway, but that's for all cores on one socket. Forwarding apply between sockets in a multi-socket system. Until Skylake-AVX512, the large shared L3 cache was inclusive. Which cache mapping technique is used in intel core i7 processor?)
Wikipedia's MESIF article (linked above) has some comparison between MOESI and MESIF.
AMD in some cases has lower latency for sharing the same cache line between 2 cores. For example, see this graph of inter-core latency for Ryzen vs. quad-core Intel vs. many-core Intel (ring bus: Broadwell) vs. Skylake-X (worst).
Obviously there are many other differences between Intel and AMD designs that affect inter-core latency, like Intel using a ring bus or mesh, and AMD using a crossbar / all-to-all design with small clusters. (e.g. Ryzen has clusters of 4 cores that share an L3. That's why the inter-core latency for Ryzen has another step from core #3 to core #4.)
BTW, notice that the latency between two logical cores on the same physical core is much lower for Intel and AMD. What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?.
I didn't look for any academic papers that simulated MESI vs. MOESI on an otherwise-similar model.
Choice of MESIF vs. MOESI can be influenced by other design factors; Intel's use of a large tag-inclusive L3 shared cache as a backstop for coherency traffic is their solution to the same problem that MOESI solves: traffic between cores is handled efficiently with write-back to L3 then sending the data from L3 to the requesting core, in the case where a core had the line in Modified state in a private L2 or L1d.
IIRC, some AMD designs (like some versions of Bulldozer-family) didn't have a last-level cache shared by all cores, and instead had larger L2 caches shared by pairs of cores. Higher-performance BD-family CPUs did also have a shared cache, though, so at least clean data could hit in L3.