If I have some atomic variable which one thread (A) writes and some other thread (B) reads later: when A is done with its writes to the variable, should I in some way explicitly flush the variable out of the L1/L2/L3 cache of thread A's core, so that when thread B needs to access the variable some time later, it finds a clean cache line in RAM rather than a dirty cache line owned by another core?
Some subquestions:
Also, what documentation/etc should I read that covers this type of information?
TL:DR: on x86 you want cldemote. Other things are probably not worth doing, especially if your writer thread can be doing useful work after this store. Or if it doesn't have anything else to do, and the OS doesn't have another thread to run on this core, putting the core into a deep sleep will involve the CPU writing back its dirty cache lines before powering down its private caches.
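As a concrete illustration (not part of the original answer), here's a minimal sketch of what the writer side might look like with the GCC/Clang intrinsic, assuming x86 and building with -mcldemote; shared_value and writer_done are made-up names:

```c++
#include <atomic>
#include <immintrin.h>   // _mm_cldemote (needs -mcldemote to compile)

std::atomic<int> shared_value;   // hypothetical variable written by A, read later by B

void writer_done(int v) {
    shared_value.store(v, std::memory_order_release);  // A's final store
    // Hint: demote this line out of the private L1d/L2 toward shared L3,
    // so a later reader on another core is more likely to hit in L3.
    // cldemote is only a hint, and runs as a NOP on CPUs without it.
    _mm_cldemote(&shared_value);
}
```

Since cldemote is ordered only with respect to stores to the same cache line (see the quote from the manual further down), issuing it right after the final store is the natural placement.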
I'd expect that reading from RAM is generally slower than reading a dirty line out of another core's cache, especially in a single-socket system. (In a multi-socket NUMA system, if a remote core has a dirty copy of a cache line that's backed by local DRAM, that might change things, or at least make DRAM less far behind.)
If a good (and cheap for the writer) write-back instruction doesn't exist, it's probably better to do nothing than to go too far.
DRAM latency first has to miss in L3, then a message from that L3 slice has to get to a memory controller over the interconnect (e.g. ring bus or mesh), then you have to wait for DRAM command latency over the external memory bus. And there might already be some queued requests in the memory controller that yours waits behind. Then the data has to get back to the core that did the load.
A dirty line in another core also involves another message after detecting an L3 miss, but to the core that owns the line. (The L3 tags themselves may indicate that, as on Intel CPUs which use inclusive L3 for exactly that reason; otherwise a separate directory acts as a snoop filter.) That core should be able to respond faster than a DRAM controller, since it just has to read the data from its fast L2 or L1d cache and send it to the L3 slice. (And also directly to the core that wanted it?)
The ideal case is a hit in last-level (normally L3) cache which backstops coherency traffic. So you want the line evicted from the private L1d/L2 cache of the last core to write it. (Or at least written back to L3 and demoted to Shared state in those private caches, not Exclusive/Modified. So a read from the same core could hit in L1d, only needing off-core traffic (and RFO = Read For Ownership) if it writes again.)
But not all ISAs have instructions to do that cheaply (without slowing the writer too much) or without going too far and forcing a write to RAM. An instruction like x86 clwb that forces a write to RAM but leaves the line clean in L3 afterwards could be worth considering in some use-cases, but it wastes DRAM bandwidth. (Note that Skylake implements clwb as clflushopt; only in Ice Lake and later does it actually keep the data cached as well as writing it back to DRAM.)
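If you did go the clwb route (e.g. for code that actually needs data committed to DRAM or an NV-DIMM), a minimal sketch with the GCC/Clang intrinsics might look like the following; persist_store and record are made-up names, and it assumes building with -mclwb:

```c++
#include <immintrin.h>   // _mm_clwb, _mm_sfence (needs -mclwb to compile)
#include <cstdint>

void persist_store(std::uint64_t *record, std::uint64_t value) {
    *record = value;     // normal cached store
    _mm_clwb(record);    // write the dirty line back toward DRAM / NV-DIMM;
                         // on Ice Lake and later the line also stays cached
    _mm_sfence();        // orders the write-back before later stores
}
```

For the shared-variable use-case in the question, though, this is overkill: it spends DRAM bandwidth that the next reader doesn't need you to spend.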
If it's not frequently accessed, some of the time it'll get written back to L3 just from ordinary activity on the last core to write (e.g. looping over an array), before any other core reads or writes it. That's great, and anything that forcibly evicts even from L3 will prevent this from happening. If the line is accessed frequently enough to normally stay hot in L3 cache, you don't want to defeat that.
If the writer thread / core doesn't have anything else to do after writing, you could imagine accessing other cache lines to try to get the important line evicted by the normal pseudo-LRU mechanism. But that would only be worth it if load latency for the next reader were so important that it justified wasting a bunch of CPU time in the writing thread, and generating extra coherency traffic for other cache lines now, to optimize that later read in some other thread.
Related:
Is there any way to write for Intel CPU direct core-to-core communication code? - CPUs are pretty well optimized for write on one core, read on another core, because that's a common pattern in real code.
Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees? (i.e. does it change when this core commits to L1d, or when it invalidates other caches so they'll have to ask this one for the data?) No, it doesn't directly make that faster, and isn't worth doing.
x86 MESI invalidate cache line latency issue - proposes having a third thread read the shared data every millisecond to pull data out of the last writer, making it more efficient for a high-priority thread to eventually read it.
CPU cache inhibition (fairly x86 centric)
RISC-V instruction to write dirty cache line to next level of cache - none exist for RISC-V, at least not in 2020.
ARM/AArch64: I don't know, but I wouldn't be surprised if there's something. Edits welcome.
Any other ISA with interesting cache-management instructions?
The x86 options, ending with cldemote (the one you actually want):
NT stores: bypass all levels of cache (even L3), and forcibly evict the line if it was previously cached. So that's a disaster.
clflush / clflushopt - these evict all the way to DRAM; you don't want this. (Opposite of cache prefetch hint has some performance numbers for flushing small arrays.)
clwb - this Writes Back all the way to DRAM, but does leave the data cached on Ice Lake and later. (In Skylake / Cascade Lake it actually runs the same as clflushopt. At least it runs without faulting, so future persistent-memory libraries can just use it without checking ISA version stuff.) And the commit to DRAM (possibly to an NV-DIMM) can be ordered by sfence, so presumably the core has to track it all the way out, tying up space in its queues?
cldemote in Tremont and Sapphire Rapids - designed for exactly this use-case: it's a performance hint, like the opposite of a prefetch. It writes back to L3. It runs as a NOP on CPUs that don't support it, since they intentionally picked an encoding that existing CPUs already ran as a previously-undocumented NOP.
Hints to hardware that the cache line that contains the linear address specified with the memory operand should be moved (“demoted”) from the cache(s) closest to the processor core to a level more distant from the processor core. This may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line. Moving the line in such a manner is a performance optimization, i.e., it is a hint which does not modify architectural state. Hardware may choose which level in the cache hierarchy to retain the line (e.g., L3 in typical server designs). The source operand is a byte memory location.
Unlike clwb, it has no guaranteed behaviour (it's just a hint), and no ordering even wrt. fence instructions, only wrt. stores to the same cache line. So the core doesn't have to track the request after it sends a message over the interconnect with the data to be written to L3 (and notifying that this core's copy of the line is no longer dirty).
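Because it decodes as a NOP on older CPUs, you can use it unconditionally. But if you want to know whether the hint will actually do anything, the feature is (as far as I know) enumerated by CPUID.(EAX=7,ECX=0):ECX bit 25. A minimal sketch (not from the original answer) using GCC/Clang's <cpuid.h>; cpu_has_cldemote is a made-up name:

```c++
#include <cpuid.h>   // __get_cpuid_count (GCC/Clang helper)

bool cpu_has_cldemote() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;          // CPUID leaf 7 not supported at all
    return (ecx >> 25) & 1;    // ECX bit 25 = CLDEMOTE (assumed per Intel's documentation)
}
```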