
Why does CLFLUSH exist in x86?


I recently learned about the row hammer attack. To perform this attack, the programmer needs to flush a specific set of addresses from the CPU's entire cache hierarchy.

My question is: why is CLFLUSH necessary in x86? What are the reasons for ever using this instruction, if all L* caches act transparently (i.e., no explicit cache invalidation needed)? Besides that: isn't the CPU free to speculate memory access patterns, and thereby ignore the instruction altogether?


Solution

  • I think the main use case is non-volatile DIMMs, especially Intel's Optane DC PM. That memory is normally mapped WB-cacheable, so it requires explicit flushes (or movnt stores) to make sure data is persisted to non-volatile storage.

    (But clflush was introduced at the same time as SSE2, back in Pentium 4 days. I don't know what the idea was there; possibly explicit cache control for performance reasons, like the opposite of prefetch.)
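
    To make the "opposite of prefetch" idea concrete, here is a minimal sketch that evicts one line and times the next load. The helper names `timed_load` and `evict_line` are made up for illustration, the `<x86intrin.h>` header assumes GCC or Clang, and the TSC readings are noisy, so treat any numbers as indicative only.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence, _mm_lfence (SSE2) */
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc (assumption: GCC/Clang header) */

/* Hypothetical helper: time a single load of *p, in TSC ticks. */
static uint64_t timed_load(const volatile int *p)
{
    assert(p != NULL);
    _mm_mfence();               /* let earlier stores and flushes finish */
    _mm_lfence();               /* keep rdtsc from running ahead of them */
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    int sink = *p;              /* the load being timed */
    _mm_lfence();               /* wait for the load before reading the TSC */
    uint64_t t1 = __rdtsc();
    (void)sink;
    return t1 - t0;
}

/* Hypothetical helper: evict the cache line holding *p from every level. */
static void evict_line(const volatile int *p)
{
    assert(p != NULL);
    _mm_clflush((const void *)p);  /* writes the line back if dirty, then invalidates it */
    _mm_mfence();                  /* order the flush before whatever follows */
}
```

    Calling `timed_load` on a recently used variable, then `evict_line`, then `timed_load` again will usually show a noticeably larger second reading, because that load has to go all the way to DRAM. The stored value is unchanged, since clflush writes dirty data back before invalidating.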

    Skylake introduced the weakly-ordered, higher-performance CLFLUSHOPT because flushing is useful for non-volatile storage hooked up to the memory hierarchy directly. Flushing the cache makes sure data is written out to actual memory, rather than sitting dirty in a CPU cache.
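
    As a rough sketch of what "flush to persist" looks like in code, the helpers below copy data and then flush every cache line it touches. `flush_range` and `persist_copy` are hypothetical names, the 64-byte line size is an assumption that holds on current x86 CPUs, and the destination is assumed to be a WB-cacheable mapping of persistent memory (on ordinary DRAM the code still runs, it just persists nothing).

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence (SSE2) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical helper: flush every cache line overlapping [addr, addr+len).
 * Returns the number of lines flushed. */
static size_t flush_range(const void *addr, size_t len)
{
    assert(addr != NULL);
    if (len == 0)
        return 0;

    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    size_t lines = 0;

    for (; p < end; p += CACHE_LINE) {
        _mm_clflush((const void *)p);  /* write back and invalidate one line */
        lines++;
    }
    /* Needed if the loop is switched to the weakly-ordered clflushopt or
     * clwb; kept here so the pattern stays the same for all three. */
    _mm_sfence();
    return lines;
}

/* Hypothetical helper: ordinary cached stores followed by an explicit flush. */
static size_t persist_copy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
    return flush_range(dst, len);
}
```

    On a CPU that supports it, swapping `_mm_clflush` for `_mm_clflushopt` (from `<immintrin.h>`, built with `-mclflushopt`) keeps the same structure while letting the flushes of different lines overlap.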

    See also this SuperUser answer for some links and background on Optane DC PM (Persistent Memory). It's non-volatile storage in physical address-space, not just in virtual address space with software tricks.

    Dan Luu's article on clwb and pcommit is interesting. It covers the benefits of taking the OS out of the way for access to storage, and details Intel's plans at that point for clflush / clwb and their memory-ordering semantics. It was written while Intel was still planning to require an instruction called pcommit (persistent commit) as part of this process, but Intel later decided to remove that instruction. Intel's "Deprecating the PCOMMIT Instruction" has some interesting info about why, and how things work under the hood.


    It potentially also matters for non-cache-coherent DMA to devices, if anything can do that in x86. (But x86 has always had cache-coherent DMA, since the first x86 CPUs with caches, to avoid breaking existing software.)

    Apparently it's not possible to map MMIO / PCIe device memory regions as write-back (WB) cacheable; see "how to do mmap for cacheable PCIe BAR". Maybe the P4 architects were considering that future possibility when they introduced clflush.

    In that previous link, Dr. Bandwidth mentions a partial workaround that actually involves needing CLFLUSH to maintain correctness:

    map the MMIO range twice -- once for store operations from the processor to the FPGA using the Write-Combining (WC) memory type, and once for reads from the processor to the FPGA using the Write Protect (WP) or Write Through (WT) types. You will need to maintain coherence manually by using CLFLUSH on cache lines in the "read only" region when you write to the alias of that line in the "write only" region.
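
    The quoted workaround could look roughly like the sketch below. Everything here is an assumption for illustration: `aliased_write` is a made-up helper, `wc_alias` and `rd_alias` stand for two mappings of the same device range (WC for stores, WT or WP for loads) that a driver would have to set up, and cache lines are taken to be 64 bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush, _mm_sfence, _mm_mfence (SSE2) */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Hypothetical helper: store len bytes through the write-only (WC) alias,
 * then flush the matching lines of the read-only (WT/WP) alias so later
 * reads do not hit stale cached copies. */
static void aliased_write(volatile uint8_t *wc_alias,
                          const volatile uint8_t *rd_alias,
                          size_t off, const uint8_t *src, size_t len)
{
    assert(wc_alias != NULL && rd_alias != NULL && src != NULL);
    if (len == 0)
        return;

    for (size_t i = 0; i < len; i++)
        wc_alias[off + i] = src[i];    /* stores go out via the WC mapping */
    _mm_sfence();                      /* push write-combining buffers out */

    uintptr_t p   = ((uintptr_t)rd_alias + off) & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)rd_alias + off + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);  /* drop the possibly stale cached line */
    _mm_mfence();                      /* finish the flushes before later loads */
}
```

    Without real aliased mappings, both pointers can be aimed at the same ordinary buffer to exercise the code path; in that case the flushes are harmless and the data simply lands in the buffer.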

    So it is possible to create a situation where you might need clflush, other than for NV-DIMM.