Tags: x86, computer-science, cpu-architecture, cpu-cache, mesi

What cache coherence solution do modern x86 CPUs use?


I am somewhat confused about how cache coherence works in modern multi-core CPUs. I have seen that snooping-based protocols like MESIF/MOESI have been used in Intel and AMD processors; on the other hand, directory-based protocols seem to be a lot more efficient with many cores, because they don't broadcast but instead send messages to specific nodes.

What is the modern cache coherence solution in AMD or Intel processors? Is it a snooping-based protocol like MOESI or MESIF, is it purely directory-based, or is it a combination of both (snooping-based protocols for communication between elements inside the same node, and directory-based for node-to-node communication)?


Solution

  • MESI is defined in terms of snooping a shared bus, but no, modern CPUs don't actually work that way. MESI states for each cache line can be tracked / updated with messages and a snoop filter (basically a directory) to avoid broadcasting those messages, which is what Intel (MESIF) and AMD (MOESI) actually do.

    e.g. the shared inclusive L3 cache in Intel CPUs (before Skylake server) lets the L3 tags act as a snoop filter; as well as tracking the MESI state, they also record which core # (if any) has a private copy of a line. See also: Which cache mapping technique is used in intel core i7 processor?
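
    As a rough illustration of what "L3 tags acting as a snoop filter" means (my own sketch; the field names and layout are invented, not Intel's actual tag format), a per-line directory entry could look something like this:

    ```cpp
    // Hypothetical sketch of a snoop-filter / directory entry kept alongside
    // the L3 tags: coherence state plus a bitmap of which cores may hold the
    // line.  Field names and layout are invented for illustration only.
    #include <cstdint>
    #include <cstdio>

    enum class State : uint8_t { Invalid, Shared, Exclusive, Modified };

    struct DirectoryEntry {
        uint64_t tag;      // which line this entry describes
        State    state;    // coherence state as known to this L3 slice
        uint32_t sharers;  // bitmap: cores that may have a private copy
    };

    // On a request from `requester`, the slice consults the entry instead of
    // broadcasting: only an Exclusive/Modified owner needs to be snooped.
    bool needs_snoop(const DirectoryEntry& e, unsigned requester) {
        uint32_t others = e.sharers & ~(1u << requester);
        return others != 0 &&
               (e.state == State::Exclusive || e.state == State::Modified);
    }

    int main() {
        DirectoryEntry line{0x1234, State::Modified, 1u << 1};  // core #1 owns it dirty
        std::printf("snoop needed for core #0's read? %d\n", needs_snoop(line, 0));
    }
    ```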

    For example, consider a Sandybridge-family CPU with a ring bus (modern client chips, server chips up to Broadwell): core #0 reads a line that is currently in Modified state in core #1's private cache.

    This is super hand-wavy; do not take my word for it on the exact details, but the general concept of sending messages like share-request, RFO, or write-back is the right mental model. BeeOnRope has an answer with a similar breakdown into steps that covers uops and the store buffer, as well as MESI / RFO.
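
    To make that message-passing mental model concrete, here is a toy trace of the scenario above (core #0 reads a line that core #1 holds in Modified state). The message names are invented placeholders, not real protocol messages:

    ```cpp
    // Toy trace of the message flow for: core #0 reads a line that core #1
    // currently holds in Modified state.  Names and steps are illustrative only.
    #include <cstdio>

    int main() {
        // 1. Core #0 misses in its L1d and L2, so it sends a read request on
        //    the ring bus to the L3 slice selected by a hash of the address.
        std::printf("core0 -> L3 slice : ReadShared(addr)\n");

        // 2. The L3 tags / snoop filter show core #1 owns the line (M state),
        //    so the slice sends a snoop only to core #1, not a broadcast.
        std::printf("L3 slice -> core1 : SnoopWriteBack(addr)\n");

        // 3. Core #1 writes the dirty line back and downgrades its copy
        //    (to Shared for a read; to Invalid if the request had been an RFO).
        std::printf("core1 -> L3 slice : WriteBack(data), now Shared\n");

        // 4. The data is forwarded to core #0, which caches it in Shared
        //    state; the directory entry now records both cores as sharers.
        std::printf("L3 slice -> core0 : Data(addr), cache in Shared\n");
    }
    ```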


    In a similar case, core #1 could have silently dropped the line without having modified it, if it had only gotten Exclusive ownership but never written it. (Loads that miss in cache default to loading into Exclusive state so a separate store won't have to do an RFO for the same line.) In that case I assume the core that doesn't have the line after all has to send a message back to indicate that. Or maybe it sends a message directly to one of the memory controllers that are also on the ring bus, instead of a round trip back to the L3 slice to force it to do that.
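
    A small sketch of why loading into Exclusive state helps: a store to a line already held in E can be promoted to Modified locally with no coherence traffic, while a store to a Shared (or missing) line has to get ownership first. This is a toy state machine of mine, not how the hardware is actually structured:

    ```cpp
    // Why Exclusive matters: a store to a line held in E can go to Modified
    // locally, with no coherence messages; a store to a Shared or Invalid
    // line must send an RFO (read-for-ownership) first.  Illustrative only.
    #include <cstdio>

    enum class State { Invalid, Shared, Exclusive, Modified };

    State store_to_line(State s) {
        switch (s) {
            case State::Exclusive:               // silent E -> M upgrade
            case State::Modified:
                return State::Modified;
            case State::Shared:                  // other cores may have copies:
            case State::Invalid:                 // must get ownership first
                std::printf("send RFO, wait for ownership\n");
                return State::Modified;
        }
        return State::Modified;
    }

    int main() {
        store_to_line(State::Exclusive);  // no output: no RFO needed
        store_to_line(State::Shared);     // prints: send RFO, ...
    }
    ```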

    Obviously stuff like this can be happening in parallel for every core. (And each core can have multiple outstanding requests it's waiting for: memory-level parallelism within a single core. On Intel, the L2 superqueue has 16 entries on some microarchitectures, while there are 10 or 12 L1 LFBs.)
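
    As a toy illustration of memory-level parallelism (my own example; it doesn't model superqueue or LFB counts), independent loads let the core keep several misses in flight at once, while a pointer chase serializes them:

    ```cpp
    // Memory-level parallelism sketch: independent loads can have several
    // cache misses outstanding at once (limited by fill buffers / superqueue
    // entries), while a dependent pointer chase allows only one at a time.
    #include <cstddef>
    #include <cstdio>

    struct Node { Node* next; long payload; };

    // Dependent chain: each load's address comes from the previous load.
    long chase(const Node* n, size_t steps) {
        long sum = 0;
        for (size_t i = 0; i < steps; ++i) { sum += n->payload; n = n->next; }
        return sum;
    }

    // Independent accesses: addresses are known up front, so misses overlap.
    long gather(const long* a, const size_t* idx, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; ++i) sum += a[idx[i]];
        return sum;
    }

    int main() {
        Node c{nullptr, 3}, b{&c, 2}, a{&b, 1};
        long arr[] = {1, 2, 3};
        size_t idx[] = {2, 0, 1};
        std::printf("%ld %ld\n", chase(&a, 3), gather(arr, idx, 3));
    }
    ```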

    Quad-socket and larger systems have snoop filters between sockets; dual-socket Intel systems with E5-xxxx CPUs of Broadwell and earlier did just spam snoops to each other over the QPI links (unless you used a quad-socket-capable CPU, E7-xxxx, in a dual-socket system). Multi-socket is hard because missing in the local L3 doesn't necessarily mean it's time to hit DRAM; another socket might have the line modified.

    Also related: