Tags: x86, intel, cpu-cache, persistent-storage, persistent-memory

Persistent memory cache policy for writes and reads


Is anyone aware of any shortcomings in trying to use Intel Optane DC Persistent Memory (DCPMM) in App Direct mode (that is, as non-volatile memory) with Write-Through (WT) or Uncacheable (UC) memory policies for writes and reads? The idea is to use regular memory as non-volatile storage (data is not lost in case of failure), and having dirty cache lines is not ideal since the cache is volatile. There are multiple links that show examples using Write-Back (WB) or Write-Combining (WC) with non-temporal access (NTA) instructions, or using WB together with CLFLUSHOPT or CLWB cache-line write-back instructions. Are there any important drawbacks, other than the bandwidth lost by not writing an entire cache line at a time, when using WT/UC compared to WB/WC?


Solution

  • (This is mostly speculation; I haven't done any performance testing with Optane DC PM, and have only occasionally read about using UC or WT for DRAM. But I think enough is known about how they work in general to say it's probably a bad idea for many workloads.)

    Further reading about Optane DC PM DIMMs: https://thememoryguy.com/whats-inside-an-optane-dimm/ - they include a wear-leveling remapping layer like an SSD.

    Also related: a question on Intel's forums, "When I test AEP memory, I found that flushing a cacheline repeatedly has a higher latency than flushing different cachelines. I want to know what caused this phenomenon. Is it wear leveling mechanism?" That would indicate that repeated writes to the same cache line might be even worse than you might expect.
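
    As a rough way to test that effect, a sketch like the one below (untested; it assumes buf points into a DAX-mapped App Direct region, the CPU supports CLWB, and the build enables it, e.g. -mclwb) times repeated flushes of the same line against flushes of distinct lines:

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <x86intrin.h>

        // Coarse timing sketch: dirty and flush `iters` lines starting at `buf`,
        // either the same line every time (stride == 0) or a new line each time
        // (stride == 64). rdtsc + mfence is only a rough measurement, not a
        // precise latency probe.
        static uint64_t time_flushes(char *buf, size_t stride, int iters) {
            uint64_t t0 = __rdtsc();
            for (int i = 0; i < iters; i++) {
                buf[i * stride] = (char)i;     // dirty the line
                _mm_clwb(&buf[i * stride]);    // write it back toward the DIMM
                _mm_mfence();                  // keep each flush roughly serialized
            }
            return __rdtsc() - t0;
        }
        // time_flushes(buf, 0, N)  -> hammer one cache line
        // time_flushes(buf, 64, N) -> touch a different line each iteration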


    UC also implies strong ordering, which would hurt OoO exec, I think, and I think it also stops you from using NT stores for full-line writes. It would also totally destroy read performance, so I don't think it's worth considering.

    WT is maybe worth considering as an alternative to clwb (assuming it actually works with NV memory), but you'd still have to be careful about compile-time reordering of stores. _mm_clwb is presumably a compiler memory barrier that would prevent such problems.
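
    For reference, the WB + clwb pattern being compared against looks roughly like this (a minimal sketch; persist_u64 is a made-up helper name, and p is assumed to point into a DAX-mapped App Direct region):

        #include <immintrin.h>
        #include <stdint.h>

        // Minimal WB + CLWB persist sketch: an ordinary cached store, then an
        // explicit write-back of the dirty line and a fence before the data is
        // treated as durable.
        static inline void persist_u64(uint64_t *p, uint64_t v) {
            *p = v;          // plain store to write-back (WB) memory
            _mm_clwb(p);     // write the dirty line back without evicting it
            _mm_sfence();    // order the write-back before later stores
        }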

    In a store-heavy workload, you'd expect serious slowdowns in writes, though. Per-core memory bandwidth is very much limited by the number of outstanding requests. Making each request smaller (only 8 bytes or so instead of a whole line) doesn't make it appreciably faster. The vast majority of the time goes into getting the request through the memory hierarchy and waiting for the address lines to select the right place, not the actual burst transfer over the memory bus. (This is pipelined, so with multiple full-line requests to the same DRAM page a memory controller can spend most of its time transferring data rather than waiting, I think. Optane / 3DXPoint isn't as fast as DRAM, so there may be more waiting.)

    So for example, storing contiguous int64_t or double would take 8 separate stores per 64-byte cache line, unless you (or the compiler) vectorizes. With WT instead of WB + clwb, I'd guess that would be about 8x slower. This is not based on any real performance details about Optane DC PM; I haven't seen memory latency / bandwidth numbers, and I haven't looked at WT performance. I have seen occasional papers that compare synthetic workloads with WT vs. WB caching on real Intel hardware on regular DDR DRAM, though. I think it's usable if multiple writes to the same cache line aren't typical for your code. (But normally that's something you want to do and optimize for, because WB caching makes it very cheap.)
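
    As a concrete sketch of that scalar case (assuming dst is 64-byte aligned, n is a multiple of 8, and the buffer is in a DAX-mapped region), WB + clwb pays one explicit write-back per 8 stores, whereas WT would write through to the DIMM on every store:

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        // 8 int64_t stores fill one 64-byte line; flush each line once afterwards.
        void store_persist_scalar(uint64_t *dst, const uint64_t *src, size_t n) {
            for (size_t i = 0; i < n; i += 8) {      // one cache line per outer step
                for (size_t j = 0; j < 8; j++)
                    dst[i + j] = src[i + j];         // 8 cached stores into the line
                _mm_clwb(&dst[i]);                   // one write-back for the whole line
            }
            _mm_sfence();                            // order all the write-backs
        }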

    If you have AVX-512, that lets you do full-line 64-byte stores, if you make sure they're aligned (which you generally want for performance with 512-bit vectors anyway).
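
    A sketch of that full-line approach with AVX-512 NT stores (again assuming a 64-byte-aligned dst, n a multiple of 8, and an AVX-512 build; whether NT stores plus sfence are sufficient for persistence depends on the platform's ADR guarantees):

        #include <immintrin.h>
        #include <stddef.h>
        #include <stdint.h>

        // One aligned 64-byte NT store per cache line; the stores bypass the
        // cache, so no separate CLWB is needed for these lines.
        void store_persist_avx512(uint64_t *dst, const uint64_t *src, size_t n) {
            for (size_t i = 0; i < n; i += 8) {
                __m512i v = _mm512_loadu_si512((const void *)&src[i]);
                _mm512_stream_si512((__m512i *)&dst[i], v);
            }
            _mm_sfence();    // make the NT stores globally visible before continuing
        }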