Tags: c++, multithreading, lock-free, lockless, aba

How can I implement an ABA counter with C++11 CAS?


I am implementing a lock-free queue based on this algorithm, which uses a counter to solve the ABA problem. But I don't know how to implement this counter with C++11 CAS. For example, from the algorithm:

E9:    if CAS(&tail.ptr->next, next, <node, next.count+1>)

This is a single atomic operation: if tail.ptr->next is equal to next, make tail.ptr->next point to node and simultaneously (atomically) increment next.count. However, with C++11 CAS, I can only write:

std::atomic_compare_exchange_weak(&tail.ptr->next, &next, node);

which cannot increment next.count at the same time.


Solution

  • To atomically modify two things at once with a single atomic operation, you need to put them in adjacent memory, e.g. in a two-member struct. Then you can use std::atomic<my_struct> to get gcc to emit lock cmpxchg16b on x86-64, for example.

    You don't need inline asm for this, and it's worth a bit of C++ syntax pain to avoid it. https://gcc.gnu.org/wiki/DontUseInlineAsm.
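    Before looking at the full two-pointer-size version, the principle can be sketched at a width that is lock-free on essentially every mainstream target: two 32-bit members in one 8-byte atomic struct. The names here are illustrative stand-ins (a real queue node would hold a pointer and a pointer-sized count), but the mechanism is the same one the answer uses.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdint>

    // Stand-in for {node*, count}: two members packed into one atomic object,
    // so a single CAS updates both, or neither.
    struct Pair {
        uint32_t value;   // would be the pointer in the real queue
        uint32_t count;   // ABA counter
    };

    int main() {
        std::atomic<Pair> p{Pair{10, 0}};

        Pair expected = p.load();
        Pair desired  = {20, expected.count + 1};

        // One atomic operation changes both members together.
        bool ok = p.compare_exchange_strong(expected, desired);
        assert(ok);

        Pair now = p.load();
        assert(now.value == 20 && now.count == 1);
        return 0;
    }
    ```

    Note that std::atomic<T> compares the object representation bytewise, so a struct with no padding (like two uint32_t here) avoids spurious CAS failures from uninitialized padding bits.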

    Unfortunately, with current compilers, you need to use a union to get efficient code for reading just one of the pair. The "obvious" way of doing an atomic load of the struct and then only using one member still leads to a lock cmpxchg16b to read the whole struct even though we only need one member. (Much slower, and dirties the cache line so readers contend with other readers). I'm confident that a normal 64b load of the pointer would still correctly implement the acquire memory-ordering semantics on x86 (as well as atomicity), but current compilers don't do that optimization even for std::memory_order_relaxed, so we trick them into it with a union.

    (submitted GCC bug 80835 about this. TODO: same for clang if this is a useful idea.)
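    The union trick itself can be sketched (and tested) at 8-byte size, independent of cmpxchg16b. Names are illustrative; reading through the inactive union member is UB in ISO C++ but works under GNU C++'s union-punning rules, as discussed above.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdint>

    struct Pair {
        uint32_t value;   // stand-in for the pointer
        uint32_t count;   // stand-in for the ABA counter
    };

    // Same layout, but each member individually atomic,
    // so we can load just one member cheaply.
    struct PairSeparate {
        std::atomic<uint32_t> value;
        std::atomic<uint32_t> count;
    };

    union AtomicPair {
        AtomicPair() : whole(Pair{0, 0}) {}
        std::atomic<Pair> whole;     // CAS both members in one operation
        PairSeparate      separate;  // plain atomic load of just one member
    };

    static_assert(sizeof(std::atomic<Pair>) == sizeof(PairSeparate),
                  "sizes must match or the type-punning is bogus");

    int main() {
        AtomicPair ap;                  // 'whole' is the active member
        ap.whole.store(Pair{42, 7});

        // Read one member through the other union member: not sanctioned by
        // ISO C++, but this is exactly the trick the answer relies on.
        assert(ap.separate.value.load(std::memory_order_relaxed) == 42);
        assert(ap.separate.count.load(std::memory_order_relaxed) == 7);
        return 0;
    }
    ```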



    Here's an example of code with a CAS retry-loop that compiles to asm that looks right, and I think is free of UB or other unsafe C++ for implementations that allow union type punning. It's written in C style (non-member functions, etc.) but it would be the same if you wrote member functions.

    See the code with asm output from gcc6.3 on the Godbolt compiler explorer. With -m32, it uses cmpxchg8b the same way 64-bit code uses cmpxchg16b. With -mx32 (32-bit pointers in long mode) it can simply use a 64-bit cmpxchg, and normal 64-bit integer loads to grab both members in one atomic load.

    This is portable C++11 (except the union type-punning), with nothing x86-specific. It's only efficient on targets that can CAS an object the size of two pointers, though. e.g. it compiles to a call to a __atomic_compare_exchange_16 library function for ARM / ARM64 and MIPS64, as you can see on Godbolt.

    It doesn't compile on MSVC, where atomic<counted_ptr> is larger than counted_ptr_separate, so the static_assert catches it. Presumably MSVC includes a lock member in the atomic object.

    #include <atomic>
    #include <stdint.h>
    
    using namespace std;
    
    struct node {
      // This alignas is essential for clang to use cmpxchg16b instead of a function call
      // Apparently just having it on the union member isn't enough.
      struct alignas(2*sizeof(node*)) counted_ptr {
        node *    ptr;
        uintptr_t count;  // use pointer-sized integers to avoid padding
      };
      
      // hack to allow reading just the pointer without lock-cmpxchg16b,
      // but still without any C++ data race
      struct counted_ptr_separate {
        atomic<node *>    ptr;
        atomic<uintptr_t> count_separate;  // var name emphasizes that accessing this way isn't atomic with ptr
      };
    
      static_assert(sizeof(atomic<counted_ptr>) == sizeof(counted_ptr_separate), "atomic<counted_ptr> isn't the same size as the separate version; union type-punning will be bogus");
      //static_assert(std::atomic<counted_ptr>{}.is_lock_free());
      
      union {  // anonymous union: the members are directly part of struct node
        alignas(2*sizeof(node*)) atomic<counted_ptr> next_and_count;
        counted_ptr_separate  next;
      };
      // TODO: write member functions to read next.ptr or read/write next_and_count
    
      int data[4];
    };
    
    
    // make sure read-only access is efficient.
    node *follow(node *p) {   // good asm, just a mov load
      return p->next.ptr.load(memory_order_acquire);
    }
    node *follow_nounion(node *p) {  // really bad asm, using cmpxchg16b to load the whole thing
      return p->next_and_count.load(memory_order_acquire).ptr;
    }
    
    
    void update_next(node &target, node *desired)
    {
      // read the old value efficiently to avoid overhead for the no-contention case
      // tearing (or stale data from a relaxed load) will just lead to a retry
      node::counted_ptr expected = {
          target.next.ptr.load(memory_order_relaxed),
          target.next.count_separate.load(memory_order_relaxed) };
    
      bool success;
      do {
        node::counted_ptr newval = { desired, expected.count + 1 };
        // x86-64: compiles to cmpxchg16b
        success = target.next_and_count.compare_exchange_weak(
                               expected, newval, memory_order_acq_rel);
        // compare_exchange_weak updates expected on failure
      } while( !success );
    }
    

    The asm output from clang 4.0 -O3 -mcx16 is:

    update_next(node&, node*):
        push    rbx             # cmpxchg16b uses rbx implicitly so it has to be saved/restored
        mov     rbx, rsi
        mov     rax, qword ptr [rdi]     # load the pointer
        mov     rdx, qword ptr [rdi + 8] # load the counter
    .LBB2_1:                        # =>This Inner Loop Header: Depth=1
        lea     rcx, [rdx + 1]
        lock
        cmpxchg16b      xmmword ptr [rdi]
        jne     .LBB2_1
        pop     rbx
        ret
    

    gcc does some clunky store/reloads, but is basically the same logic.

    follow(node*) compiles to mov rax, [rdi] / ret, so read-only access to the pointer is as cheap as it should be, thanks to the union hack.


    It does depend on writing a union through one member and reading it through another, to do efficient reads of just the pointer without using a lock cmpxchg16b. This is guaranteed to work in GNU C++ (and ISO C99/C11), but not in ISO C++. Many other C++ compilers do guarantee that union type-punning works, but even without that it would probably still work: We're always using std::atomic loads which have to assume that the value was modified asynchronously. So we should be immune from aliasing-like problems where values in registers are still considered live after writing the value through another pointer (or union member). Compile-time reordering of things the compiler thinks are independent could be a problem, though.

    Atomically reading just the pointer after an atomic cmpxchg of the pointer+counter should still give you acquire/release semantics on x86, but I don't think ISO C++ says anything about it. I would guess that a wide release-store (as part of the compare_exchange_weak) will synchronize with a narrower load from the same address on most architectures (like it does on x86), but AFAIK C++ std::atomic doesn't guarantee anything about type-punning.

    Not relevant for pointer + ABA-counter, but could come up for other applications of using a union to allow access to subsets of larger atomic object: Don't use the union to allow atomic stores to just the pointer or just the counter. At least not if you care about synchronization with an acquire-load of the pair. Even strongly-ordered x86 can reorder a narrow store with a wider load that fully contains it. Everything is still atomic, but you get into weird territory as far as memory-ordering goes.

    On x86-64, an atomic 16B load requires a lock cmpxchg16b (which is a full memory barrier, preventing a preceding narrow store from becoming globally visible after it). But you could easily have a problem if you were using this with 32-bit pointers (or 32-bit array indices), since both halves could be loaded with a regular 64b load. And I have no idea what problems you could see on other architectures, if you need synchronization with other threads and not just atomicity.
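    If you do go the 32-bit-array-index route mentioned above, the whole counted value fits in one lock-free 64-bit atomic, so loads and the CAS are all the same width and neither cmpxchg16b nor the wide-load concern applies. A hedged sketch of the same retry loop as update_next at that width (names and the node-pool indexing scheme are assumptions, not part of the original algorithm):

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstdint>

    // 32-bit index into a node pool + 32-bit ABA counter: fits in a
    // lock-free 64-bit atomic on mainstream targets.
    struct CountedIndex {
        uint32_t idx;    // index into a node array, standing in for the pointer
        uint32_t count;  // bumped on every successful update to defeat ABA
    };

    void update_next(std::atomic<CountedIndex> &next, uint32_t desired_idx) {
        CountedIndex expected = next.load(std::memory_order_relaxed);
        CountedIndex newval;
        do {
            newval = CountedIndex{desired_idx, expected.count + 1};
            // On failure (including spurious failure), compare_exchange_weak
            // reloads `expected` and we retry with a freshly bumped count.
        } while (!next.compare_exchange_weak(expected, newval,
                                             std::memory_order_acq_rel,
                                             std::memory_order_relaxed));
    }

    int main() {
        std::atomic<CountedIndex> next;
        next.store(CountedIndex{0, 0});

        update_next(next, 5);
        CountedIndex now = next.load();
        assert(now.idx == 5 && now.count == 1);

        update_next(next, 9);
        now = next.load();
        assert(now.idx == 9 && now.count == 2);
        return 0;
    }
    ```

    The trade-off is capacity: nodes must live in a pre-allocated pool addressable by a 32-bit index, and the counter wraps sooner, so the ABA window is narrower-but-nonzero compared with a pointer-sized counter.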


    To learn more about std::memory_order acquire and release, see Jeff Preshing's excellent articles.