Suppose I have a memory-mapped file and write into it from multiple threads (the writes never overlap and are independent of each other). I want to sync the already-written data with disk by calling msync or FlushViewOfFile, then unmap the file.
Do I need to synchronize the writer threads with the flushing thread, e.g. using release memory fences in the writers and acquire fences in the flusher? I worry that some of the writes might still be sitting in CPU caches, not in main memory, at that point.
Do the CPU and OS guarantee that writes still in CPU caches will eventually get to disk, or should I first ensure the writes reach main RAM, and only then flush and unmap the file?
My threads never read from the mapped pages, and I want to use only relaxed atomic operations to track my data, if that's possible.
Pseudo-code of what I'm trying to do:
static NUM_WRITERS: AtomicUint = 0;
static CURRENT_OFFSET: AtomicUint = 0;

fn some_thread_fn(buffer: *mut byte, payload: &[byte]) {
    NUM_WRITERS.fetch_add(1, memory_order_relaxed);
    let payload_size = payload.len();
    let offset = CURRENT_OFFSET.fetch_add(payload_size, memory_order_relaxed);
    let dst = buffer + offset;
    memcpy(dst, payload, payload_size);
    // Do I need a memory fence or fetch_sub(Release) here?
    compiler_fence(memory_order_release); // prevent the compiler from reordering instructions
    NUM_WRITERS.fetch_sub(1, memory_order_relaxed);
    if offset + payload_size < buffer_size {
        // Some other thread is responsible for unmapping.
        return;
    }
    // Last writer: wait for the others, then flush and unmap.
    while NUM_WRITERS.load(memory_order_relaxed) > 0 {
        mm_pause();
    }
    // Do I need an acquire memory fence here?
    compiler_fence(memory_order_acquire); // prevent the compiler from reordering instructions
    flush_async(buffer); // msync(MS_ASYNC) / FlushViewOfFile
    unmap(buffer);
}
All stores need to happen-before the munmap, otherwise they could fault. They will make it to disk unless the system crashes before that happens. For data to be affected by msync, make sure the writes (assignments / memcpy) happen-before the msync on that address range.
fdatasync on the fd after munmap would be a simpler way to make sure all dirty data makes it to disk ASAP, unless you have other regions of the file that you don't want to sync. Without any manual syncing, dirty pages in the page cache get queued for write-back to disk after some timeout, like 15 seconds.
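A minimal sketch of that simpler approach in Rust, assuming a POSIX target and the libc crate (the fd and the mapping come from wherever you set up the mmap):

use libc::{fdatasync, munmap};

// SAFETY: addr/len describe a live mapping of fd, and no thread touches
// the mapping after this point.
unsafe fn unmap_and_sync(addr: *mut libc::c_void, len: usize, fd: libc::c_int) {
    // munmap doesn't discard dirty pages; they stay queued in the page
    // cache for write-back.
    assert_eq!(munmap(addr, len), 0);
    // Block until all dirty data for this file has reached the disk.
    assert_eq!(fdatasync(fd), 0);
}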
Sequenced-before is a sufficiently strong form of happens-before, since a call to munmap or msync does the "observing" from this thread. For example, within a single thread, the "dirty" flag bits in the page-table entries will be seen by any later kernel code (such as during the msync syscall) for pages modified by store instructions from this thread. (Or from any other threads that you've synced-with.)
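For instance, no fences are needed between a plain store and a later msync in the same thread (a sketch; the pointer and length are assumed to come from an existing mapping):

// SAFETY: page is page-aligned and points into a live mapping at least
// len bytes long.
unsafe fn write_then_sync(page: *mut u8, len: usize) {
    *page = 1; // the store marks the page dirty via the hardware D bit
    // Sequenced after the store, so the kernel sees the dirty page-table
    // entry and queues the page for write-back.
    libc::msync(page as *mut libc::c_void, len, libc::MS_ASYNC);
}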
I think in your case, yes, you do need NUM_WRITERS.fetch_sub(1, memory_order_release); in every thread, and while (NUM_WRITERS.load(acquire) > 0) { pause } in the one thread that reaches that spin-wait loop to do the cleanup. mm_pause() is x86-specific; on x86, acquire loads are free, the same asm as relaxed. And all RMWs need lock, e.g. lock add, the same asm that's strong enough for seq_cst. If you plan to port to other ISAs, rest assured that AArch64 has relatively efficient acquire and release.
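Applied to the question's pseudo-code, the tail end would look something like this in real Rust (a sketch; is_last_writer corresponds to the offset + payload_size >= buffer_size check, and flush_async/unmap stand in for msync(MS_ASYNC)/munmap as before):

use std::sync::atomic::{AtomicUsize, Ordering};

static NUM_WRITERS: AtomicUsize = AtomicUsize::new(0);

fn finish_write(is_last_writer: bool) {
    // Release: pairs with the Acquire loads below, making this thread's
    // memcpy into the mapping happen-before the cleanup.
    NUM_WRITERS.fetch_sub(1, Ordering::Release);
    if !is_last_writer {
        return; // some other thread is responsible for unmapping
    }
    // Once this loop observes 0, every writer's stores happen-before
    // the flush and unmap.
    while NUM_WRITERS.load(Ordering::Acquire) > 0 {
        std::hint::spin_loop(); // portable; compiles to pause on x86
    }
    // flush_async(buffer);
    // unmap(buffer);
}

No compiler_fence is needed here: the Release/Acquire orderings on the atomic RMW and loads themselves already forbid the problematic reorderings.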
relaxed may work most of the time, but would in theory allow this thread to munmap and invalidate the page-table entries before other threads have even reached their store instructions, leading to a fault. Or, more plausibly, it could let some of the stores happen after the msync.
With cache-coherent DMA (like on x86), the actual DMA read of the memory by the device is the only deadline for stores to have committed to cache. (But for msync to notice the page was dirty in the first place and queue it for writing to disk, at least one byte of it has to have been written before the OS checked the hardware page tables.)
When a store instruction runs on a page whose TLB entry shows the D (Dirty) bit = 0, on x86 that core takes a microcode assist to atomically RMW the page-table entry to have D=1. (https://wiki.osdev.org/Paging#Page_Directory) There's also an A (Accessed) bit which gets set even by reads. (The OS can clear it and see which pages have it set again soon; those pages are bad choices for eviction.)
You don't need a manual atomic_signal_fence (aka compiler barrier), because compilers already can't move stores past a call to a function that might read the stored data. (Like arr[i] = 1; foo(arr) where foo is msync or munmap, for exactly the same reason it's safe when foo is a user-defined function the compiler can't see into.) CPUs that do out-of-order exec preserve the illusion of a single thread running in program order.
If each write has page granularity, you could have each thread do its own msync on the pages it wrote. That would not be great if writes are smaller than pages, though, since you'd trigger multiple disk I/Os for the same page.
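A sketch of that per-thread variant (PAGE_SIZE is an assumption; query it with sysconf(_SC_PAGESIZE) in real code):

const PAGE_SIZE: usize = 4096; // assumption; query sysconf(_SC_PAGESIZE)

// SAFETY: base is the page-aligned start of a live mapping, and
// offset..offset+payload.len() stays within it.
unsafe fn write_and_sync_own_pages(base: *mut u8, offset: usize, payload: &[u8]) {
    std::ptr::copy_nonoverlapping(payload.as_ptr(), base.add(offset), payload.len());
    // msync needs a page-aligned start address, so round the range
    // out to page boundaries.
    let start = offset & !(PAGE_SIZE - 1);
    let end = (offset + payload.len() + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
    libc::msync(base.add(start) as *mut libc::c_void, end - start, libc::MS_ASYNC);
}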