[SOLVED] Why do separate arrive and wait exist in C++20 barrier?

C++20's std::barrier has an arrive_and_wait method, which is what pretty much every synchronization barrier implementation offers.

But it also has separate arrive and wait. Why do these functions exist?

Solution

OK, so you've got a bunch of threads that have to do some kind of synchronized tasks. These tasks are grouped into phases: the tasks from one phase will use data produced by tasks from a previous phase, and all previous phase work must be done before any next-phase work can start. Let us call any work that requires data from a previous phase "in-phase" work.

However, let's say that not everything you need to do actually requires data from a previous phase. There could be some individual work items that a thread could perform that don't read data from a previous phase. Let's call this "out-of-phase" work.

Note that this is "out-of-phase" relative to a particular barrier. The same work could be in-phase for a different barrier.

If you do this out-of-phase work before calling arrive_and_wait, you could be blocking all of the other threads even though you have already finished the work they are waiting on. Depending on the balance between in-phase and out-of-phase work, that could be a lot of wasted performance.

So if a thread has finished its in-phase work and has some out-of-phase work to do, it can call arrive. If the other threads have also finished their in-phase work, this releases them. The thread can then process its out-of-phase work, potentially concurrently with work from the next phase. Once the out-of-phase work is done, the thread calls wait with the token returned by its call to arrive; if the next phase has already started, wait returns without blocking.

Indeed, if the amount of in-phase work is much less than the amount of out-of-phase work, then this pattern means that threads almost never block. The barrier just acts as a multi-thread atomic ordering operation, never a blocking one.