Tags: multithreading, cpu-architecture, atomic, cpu-cache, mesi

Is it optimal to flush a low-contention atomic from caches?


If I have some atomic variable which is written by thread A and only read by some other thread B a while later (i.e. low contention), then, when A is done with its writes to the variable, should I in some way explicitly flush the variable out of the L1/L2/L3 cache of thread A's core, so that when thread B needs to access the variable some time later, it finds a clean cache line in RAM rather than a dirty cache line owned by another core?

Some subquestions:

Also, what documentation/etc should I read that covers this type of information?


Solution

  • TL;DR: on x86 you want cldemote. Other things are probably not worth doing, especially if your writer thread can be doing useful work after this store. And if it can't, and the OS doesn't have another thread to run on this core, putting the core into a deep sleep will involve the CPU writing back its dirty cache lines anyway, before powering down its private caches.
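
    A minimal sketch of the writer side, assuming GCC or Clang with Intel intrinsics and -mcldemote (the variable and function names here are made up for illustration, not taken from any real code):

    ```cpp
    #include <atomic>
    #include <immintrin.h>   // _cldemote; needs -mcldemote with GCC/Clang

    std::atomic<int> shared_flag{0};   // hypothetical low-contention variable

    void publish(int value) {
        // The release store is the part that matters for correctness.
        shared_flag.store(value, std::memory_order_release);

        // Pure performance hint: ask the CPU to demote this line from the
        // writer core's private caches toward shared L3, so a later reader on
        // another core can hit in L3 instead of pulling a dirty line from here.
        _cldemote(&shared_flag);
    }
    ```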


    I'd expect that reading from RAM is generally slower than reading a dirty line from another core's cache, especially in a single-socket system. (In a multi-socket NUMA system, if a remote core has a dirty copy of a cache line that's backed by local DRAM, that might change things, or at least make DRAM less far behind.)

    If a good (and cheap for the writer) write-back instruction doesn't exist, it's probably better to do nothing than to go too far.

    A load that has to go all the way to DRAM first has to miss in L3, then a message from that L3 slice has to get to a memory controller over the interconnect (e.g. ring bus or mesh), then you have to wait for DRAM command latency over the external memory bus. And there might already be some queued requests in the memory controller that yours has to wait behind. Then the data has to get back to the core that did the load.

    A dirty line in another core also involves another message after detecting an L3 miss, but to the core that owns the line. (The L3 tags themselves may indicate which core that is, as on Intel CPUs that use an inclusive L3 for that reason; otherwise a separate directory acts as a snoop filter.) It should be able to respond faster than a DRAM controller, since it just has to read the data from its fast L2 or L1d cache and send it to the L3 slice. (And also directly to the core that wanted it?)


    The ideal case is a hit in last-level (normally L3) cache which backstops coherency traffic. So you want the line evicted from the private L1d/L2 cache of the last core to write it. (Or at least written back to L3 and demoted to Shared state in those private caches, not Exclusive/Modified. So a read from the same core could hit in L1d, only needing off-core traffic (and RFO = Read For Ownership) if it writes again.)

    But not all ISAs have instructions to do that cheaply (without slowing the writer too much) or without going too far and forcing a write to RAM. An instruction like x86 clwb, which forces a write to RAM but leaves the line clean in L3 afterwards, could be worth considering in some use-cases, but wastes DRAM bandwidth. (Note that Skylake implements clwb as clflushopt; only in Ice Lake and later does it actually keep the data cached as well as writing it back to DRAM.)
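
    For comparison, a sketch of the clwb variant (using the _mm_clwb intrinsic from immintrin.h, which needs -mclwb; same caveats about illustrative names):

    ```cpp
    #include <atomic>
    #include <immintrin.h>   // _mm_clwb; needs -mclwb with GCC/Clang

    std::atomic<int> shared_flag{0};   // hypothetical low-contention variable

    void publish_and_writeback(int value) {
        shared_flag.store(value, std::memory_order_release);

        // Write the dirty line back toward memory, costing DRAM bandwidth.
        // On Skylake this behaves like clflushopt (the line is evicted);
        // only Ice Lake and later also keep the data cached.
        _mm_clwb(&shared_flag);
        // No sfence here: this is only a performance hint, so we don't need
        // to know when the write-back has actually completed.
    }
    ```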

    If it's not frequently accessed, some of the time it'll get written back to L3 just from ordinary activity on the last core to write (e.g. looping over an array), before any other core reads or writes it. That's great, and anything that forcibly evicts even from L3 will prevent this from happening. If the line is accessed frequently enough to normally stay hot in L3 cache, you don't want to defeat that.

    If the writer thread / core doesn't have anything else to do after writing, you could imagine accessing other cache lines to try to get the important write evicted by the normal pseudo-LRU mechanism. But that would only be worth it if load latency for the next reader mattered so much that it justified wasting a bunch of CPU time in the writing thread, and generating extra coherency traffic for other cache lines now, to optimize for that later access in some other thread.
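
    For completeness, a rough sketch of that eviction-by-LRU idea (again, probably not worth doing; the 32 KiB of filler is a guess at a typical L1d size and says nothing about L2 capacity, associativity, or the real replacement policy):

    ```cpp
    #include <cstddef>

    // Dummy storage to walk over; 64-byte cache lines assumed.
    alignas(64) static volatile char filler[32 * 1024];

    // Touch lots of other lines so pseudo-LRU (hopefully) evicts the line
    // holding the atomic from this core's private caches.  This burns writer
    // CPU time and generates extra cache traffic now, to maybe save the
    // reader some latency later.
    void try_to_evict_via_lru() {
        char sink = 0;
        for (std::size_t i = 0; i < sizeof(filler); i += 64) {
            sink ^= filler[i];   // volatile read, so it isn't optimized away
        }
        (void)sink;
    }
    ```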




    Instructions on various ISAs

    x86 - nothing good until the recent cldemote

    Unlike clwb, it has no guaranteed behaviour (it's just a hint), and no ordering even wrt. fence instructions, only wrt. stores to the same cache line. So the core doesn't have to track the request after it sends a message over the interconnect with the data to be written to L3 (and notifying it that this core's copy of the line was not clean).
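
    If you want to check for it at runtime, here is a sketch using GCC/Clang's <cpuid.h>; to the best of my knowledge the feature bit is CPUID.(EAX=7,ECX=0):ECX bit 25 (verify against the SDM), and on CPUs without the feature the cldemote encoding should just execute as a NOP anyway, since it lives in hint/NOP opcode space:

    ```cpp
    #include <cpuid.h>   // __get_cpuid_count (GCC/Clang)

    // Returns true if CPUID reports the CLDEMOTE feature.
    // Assumption: CLDEMOTE is CPUID.(EAX=7,ECX=0):ECX bit 25.
    bool cpu_has_cldemote() {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return false;
        return (ecx & (1u << 25)) != 0;
    }
    ```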