c++ performance cpu-architecture atomic memory-barriers

Memory barriers in virtual environments - do they interrupt other cores?


Let's say I call a memory barrier like:

std::atomic_thread_fence(std::memory_order_seq_cst);

From the documentation, I understand that this implements strong ordering among all cores, even for non-atomic operations, and that it is very expensive, so it should be used sparingly.

My questions are:


Solution

  • Fences are local, affecting only the current thread. In hardware terms, that means only the logical core executing this thread. The cost of one thread executing a fence doesn't scale with the number of cores in the machine. (It can scale with the size of the core, i.e. the number of in-flight loads and stores the fence has to wait for.)

    this implements strong ordering among all cores

    Only if all threads of your program use seq_cst memory order for operations and fences. Per the C++ standard, a non-relaxed operation, or a relaxed operation paired with a fence, only synchronizes with another non-relaxed operation or operation+fence pair in another thread. (See https://preshing.com/20130922/acquire-and-release-fences/ for example.)
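The pairing described above can be sketched with two fences and relaxed accesses to a flag. This is a minimal illustration, not code from the question: the names `payload`, `ready`, and `run_message_passing` are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                    // plain, non-atomic data
std::atomic<bool> ready{false};     // flag, accessed only with relaxed operations

int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;

    std::thread producer([] {
        payload = 42;                                         // non-atomic write
        std::atomic_thread_fence(std::memory_order_release);  // release fence ...
        ready.store(true, std::memory_order_relaxed);         // ... then an atomic store
    });

    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_relaxed)) {      // atomic load ...
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);  // ... then an acquire fence
        seen = payload;  // ordered after the producer's write to payload
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Each fence only constrains the thread that executes it. The ordering between the two threads exists because both sides take part: if either fence is dropped, the read of `payload` becomes a data race.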

    The C++ guarantee of sequential consistency for data-race free programs only applies if all atomic operations are seq_cst, or if you use equivalent fences. One thread using a fence can't necessarily recover sequential consistency when other threads are using relaxed or release and acquire operations on std::atomic. std::mutex operations are only acquire and release, but that's fine because the semantics of a lock provide additional constraints on what orders can happen.
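As a rough illustration of fences on both sides, here is the classic store-buffer litmus test. The helper names `both_read_zero` and `count_forbidden_outcomes` are made up for this sketch. With a seq_cst fence in each thread, the standard forbids the outcome where both loads return 0.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: runs one round of the store-buffer litmus test and
// reports whether both threads read 0.
bool both_read_zero() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread a([&] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier in thread a
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([&] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // full barrier in thread b
        r2 = x.load(std::memory_order_relaxed);
    });

    a.join();
    b.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the test and counts how often the forbidden outcome showed up.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        if (both_read_zero()) {
            ++forbidden;
        }
    }
    return forbidden;
}
```

If only one of the two threads had the fence, the other thread's store and load could still be reordered, and both loads returning 0 would be allowed. A single thread's fence does not impose order on anyone else.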


    An SC fence (full barrier) is local to the (logical) core executing it: it drains the store buffer and finishes earlier loads before any later loads can execute or later stores can commit. It doesn't even have to block out-of-order execution of ALU work on that core. But memory loads are part of most dependency chains, so a full barrier is usually pretty expensive, badly hurting instruction-level parallelism around it on the one logical core that executed it.

    It has zero effect on other cores, though, so it doesn't interrupt anything. Fence instructions also run in user-space; they aren't system calls.


    What you're thinking of is something like the Linux membarrier(2) system call, which does indeed have to interrupt every other core to run a barrier there. That lets you make some threads very fast (they need only compiler barriers against compile-time reordering, such as atomic_signal_fence in C++ terms), at the cost of making the slow path very expensive.
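A Linux-only sketch of that asymmetric pattern might look like the following. It assumes kernel headers new enough to define `MEMBARRIER_CMD_PRIVATE_EXPEDITED` (Linux 4.14 or later). The wrapper and function names (`membarrier_call`, `register_expedited_membarrier`, `fast_side_barrier`, `slow_side_barrier`) are hypothetical. The system call is made through syscall(2), as in the man page example, rather than through a library wrapper.

```cpp
#include <atomic>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical thin wrapper around the raw system call.
static long membarrier_call(int cmd, unsigned int flags) {
    return syscall(SYS_membarrier, cmd, flags, 0);
}

// Call once per process before using the expedited command.
// Returns false if the kernel does not support it.
bool register_expedited_membarrier() {
    return membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) == 0;
}

// Fast path: blocks compile-time reordering only; no fence instruction is needed.
inline void fast_side_barrier() {
    std::atomic_signal_fence(std::memory_order_seq_cst);
}

// Slow path: asks the kernel to run a full barrier on the cores running this
// process's other threads. Returns false if the call failed; in that case the
// fast path above is NOT sufficient and real fences are needed on both sides.
bool slow_side_barrier() {
    if (membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) == 0) {
        return true;
    }
    // Fallback orders only the calling thread, not the others.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return false;
}
```

The design trade-off is the one described above: threads calling `fast_side_barrier()` pay almost nothing, while every call to `slow_side_barrier()` pays for a system call plus the interruption of other cores.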


    Related: