Tags: c++, c++11, stdatomic, lockless, seqlock

Implementing a 64-bit atomic counter with 32-bit atomics


I would like to cobble together a uint64 atomic counter from atomic uint32s. The counter has a single writer and multiple readers. The writer is a signal handler so it must not block.

My idea is to use a generation count with the low bit as a read lock. The reader retries until the generation count is stable across the read, and the low bit is unset.

Is the following code correct in design and use of memory ordering? Is there a better way?

#include <atomic>
#include <cstdint>
using namespace std;

class counter {
    atomic<uint32_t> lo_{};
    atomic<uint32_t> hi_{};
    atomic<uint32_t> gen_{};

public:
    uint64_t read() const {
        auto acquire = memory_order_acquire;
        uint32_t lo, hi, gen1, gen2;
        do {
            gen1 = gen_.load(acquire);
            lo = lo_.load(acquire);
            hi = hi_.load(acquire);
            gen2 = gen_.load(acquire);
        } while (gen1 != gen2 || (gen1 & 1));
        return (uint64_t(hi) << 32) | lo;
    }

    void increment() {
        auto release = memory_order_release;
        gen_.fetch_add(1, release);
        uint32_t newlo = 1 + lo_.fetch_add(1, release);
        if (newlo == 0) {
            hi_.fetch_add(1, release);
        }
        gen_.fetch_add(1, release);
    }
};

edit: whoops, fixed auto acquire = memory_order_release; to memory_order_acquire.


Solution

  • This is a known pattern, called a SeqLock. https://en.wikipedia.org/wiki/Seqlock. (With the simplification that there's only one writer, so no extra support for excluding simultaneous writers is needed.) It's not lock-free; a writer sleeping at the wrong time will leave readers spinning until the writer finishes. But in the common case where that doesn't happen, it has excellent performance with no contention between readers, which are truly read-only.

    You don't need or want the increment of the payload to use atomic RMW operations. (Unless you're on a system that can cheaply do a 64-bit atomic add or load; then do that instead of a SeqLock.)
    You can just load both halves with atomic 32-bit loads, increment the 64-bit value, and atomically store the halves of the result. (With cheap relaxed or release memory order for the payload, and using a release store for the 2nd update of the sequence counter, what you're calling the "generation" counter.)

    Similarly, the sequence counter doesn't need to be an atomic RMW either. (Unless you're using it as a spinlock with multiple writers.)

    The single writer only needs pure loads and pure stores with only release ordering, which are (much) cheaper than atomic RMWs or stores with seq_cst ordering:

    • store the new sequence counter value (old+1), making it odd
    • store the new payload values (both 32-bit halves)
    • store the final sequence counter value (old+2), making it even again

    The ordering of the stores in those 3 bullet points is the only thing that matters. A write fence after the first store could be good, because we don't really want the cost of making the stores of both halves of the value release operations, on CPUs where that's more expensive than relaxed. (A sketch of such a writer follows.)
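
    For example, a minimal sketch of such a writer, reusing the question's lo_/hi_/gen_ members (untested; assumes a single writer, which can re-read its own previous stores with plain relaxed loads):

    void increment() {   // single writer only, e.g. the signal handler
        uint32_t gen = gen_.load(std::memory_order_relaxed);
        gen_.store(gen + 1, std::memory_order_relaxed);        // counter becomes odd
        std::atomic_thread_fence(std::memory_order_release);   // payload stores can't move before the odd counter

        uint64_t val = (uint64_t(hi_.load(std::memory_order_relaxed)) << 32)
                       | lo_.load(std::memory_order_relaxed);
        ++val;
        lo_.store(uint32_t(val), std::memory_order_relaxed);
        hi_.store(uint32_t(val >> 32), std::memory_order_relaxed);

        gen_.store(gen + 2, std::memory_order_release);        // even again: publish the new value
    }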


    Unfortunately, to satisfy C++'s rules against data races, the value has to be atomic<T>, which makes it inconvenient to get the compiler to generate the most efficient code possible for loading both halves. e.g. ARM ldrd, or ldp / stp load-pair, aren't guaranteed atomic until ARMv8.4a, but that doesn't matter here: the SeqLock retry loop rejects torn reads anyway. (And compilers often won't optimize two separate atomic 32-bit loads into one wider load.)

    Values other threads read while the sequence counter is odd are irrelevant, but we'd like to avoid undefined behaviour. Maybe we could use a union of a volatile uint64_t and an atomic<uint64_t>.


    I wrote this C++ SeqLock<class T> template for another question I didn't finish writing an answer for (figuring out which versions of ARM have 64-bit atomic load and store).

    This tries to check whether the target already supports lock-free atomic operations on atomic<T>, to stop you from using this when it's pointless. (Disable that for testing purposes by defining IGNORE_SIZECHECK.) TODO: transparently fall back to doing that, maybe with a template specialization, instead of using a static_assert.

    I provided an inc() function for types T that support a ++ operator. A TODO would be an apply() that accepts a lambda to do something to a T, and stores the result between sequence counter updates.

    // **UNTESTED**
    
    #include <atomic>
    
    #ifdef UNIPROCESSOR
    // all readers and writers run on the same core (or same software thread)
    // ordering instructions at compile time is all that's necessary
    #define ATOMIC_FENCE std::atomic_signal_fence
    #else
    // A reader can be running on another core while the writer is writing.
    // Memory barriers or ARMv8 acquire / release loads / stores are needed.
    #define ATOMIC_FENCE std::atomic_thread_fence
    #endif
    // Using fences instead of .store(std::memory_order_release) will stop the compiler
    // from taking advantage of a release-store instruction instead of a separate fence, like on AArch64.
    // But fences allow it to be optimized away to just compile-time ordering for the single-thread or uniprocessor case.
    
    
    // SINGLE WRITER only.
    // uses volatile + barriers for the data itself, like pre-C++11
    template <class T>
    class SeqLocked
    {
    #ifndef IGNORE_SIZECHECK
        // sizeof(T) > sizeof(unsigned)
        static_assert(!std::atomic<T>::is_always_lock_free, "A Seq Lock with a type small enough to be atomic on its own is totally pointless, and we don't have a specialization that replaces it with a straight wrapper for atomic<T>");
    #endif
    
           // C++17 doesn't have a good way to express a load that doesn't care about tearing
           //  without explicitly writing it as multiple small parts and thus gimping the compiler if it can use larger loads
        volatile T data;          // volatile should be fine on any implementation where pre-C++11 lockless code was possible with volatile,
                                  //  even though Data Race UB does apply to volatile variables in ISO C++11 and later.
                                // even non-volatile normally works in practice, being ordered by compiler barriers.
    
        std::atomic<unsigned> seqcount{0};  // Even means valid, odd means modification in progress.
                                            //  unsigned definitely wraps around at a power of 2 on overflow
    
    public:
        T get() const {
            unsigned c0, c1;
            T tmp;
            // READER RETRY LOOP
            do {
                c0 = seqcount.load(std::memory_order_acquire);     // or for your signal-handler use-case, relaxed load followed by ATOMIC_FENCE(std::memory_order_acquire);
    
                tmp = (T)data;       // load
    
                ATOMIC_FENCE(std::memory_order_acquire);  // LoadLoad barrier
                c1 = seqcount.load(std::memory_order_relaxed);
            } while(c0&1 || c0 != c1);     // retry if the counter changed or is odd
    
            return tmp;
        }
    
        // TODO: a version of this that takes a lambda for the operation on tmp
        T inc()     // WRITER
        {
            unsigned orig_count = seqcount.load(std::memory_order_relaxed);
            // we're the only writer, avoid an atomic RMW.
    
            seqcount.store(orig_count+1, std::memory_order_relaxed);
            ATOMIC_FENCE(std::memory_order_release);     // 2-way barrier *after* the store, not like a release store.  Or like making data=tmp a release operation.
            // make sure the counter becomes odd *before* any data change
    
            T tmp = data;  // load into a non-volatile temporary
            ++tmp;         // make any change to it
            data = tmp;    // store
    
            seqcount.store(orig_count+2, std::memory_order_release);  // or use ATOMIC_FENCE(std::memory_order_release); *before* this, so the UNIPROCESSOR case can just do compile-time ordering
    
            return tmp;
        }
    
        void set(T newval) {
            unsigned orig_count = seqcount.load(std::memory_order_relaxed);
    
            seqcount.store(orig_count+1, std::memory_order_relaxed);
            ATOMIC_FENCE(std::memory_order_release);
            // make sure the data stores appear after the first counter update.
    
            data = newval;    // store
    
            ATOMIC_FENCE(std::memory_order_release);
            seqcount.store(orig_count+2, std::memory_order_relaxed);  // Or use mo_release here, better on AArch64
        }
    
    };
    
    
    /***** test callers *******/
    #include <stdint.h>
    
    struct sixteenbyte {
        //unsigned arr[4];
        unsigned long  a,b,c,d;
        sixteenbyte() = default;
        sixteenbyte(const volatile sixteenbyte &old)
             : a(old.a), b(old.b), c(old.c), d(old.d) {}
        //arr(old.arr) {}
    };
    
    void test_inc(SeqLocked<uint64_t> &obj) {  obj.inc(); }
    sixteenbyte test_get(SeqLocked<sixteenbyte> &obj) { return obj.get(); }
    //void test_set(SeqLocked<sixteenbyte> &obj, sixteenbyte val) { obj.set(val); }
    
    uint64_t test_get(SeqLocked<uint64_t> &obj) {
        return obj.get();
    }
    
    // void atomic_inc_u64_seq_cst(std::atomic<uint64_t> &a) { ++a; }
    uint64_t u64_inc_relaxed(std::atomic<uint64_t> &a) {
        // same but without dmb barriers
        return 1 + a.fetch_add(1, std::memory_order_relaxed);
    }
    
    uint64_t u64_load_relaxed(std::atomic<uint64_t> &a) {
        // gcc uses LDREXD, not just LDRD?
        return a.load(std::memory_order_relaxed);
    }
    
    void u64_store_relaxed(std::atomic<uint64_t> &a, uint64_t val) {
        // gcc uses a LL/SC retry loop even for a pure store?
        a.store(val, std::memory_order_relaxed);
    }
    

    It compiles to the asm we want on the Godbolt compiler explorer for ARM and other ISAs. At least for int64_t; larger struct types may be copied less efficiently because of the cumbersome volatile rules.
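
    For the question's use-case, usage could look like this sketch (hypothetical names; the signal handler must be the only writer):

    #include <csignal>

    SeqLocked<uint64_t> event_count;      // hypothetical global counter

    extern "C" void on_signal(int) {
        event_count.inc();                // single writer; pure loads/stores, never blocks
    }

    uint64_t read_count() {               // callable from any reader thread
        return event_count.get();
    }

    // If readers only ever run in the same thread the signal interrupts,
    // compiling with -DUNIPROCESSOR makes the fences compile-time only.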

    It uses non-atomic volatile T data for the shared data. This is technically data-race undefined behaviour, but all compilers that were usable in practice for pre-C++11 multi-threaded code were fine with lockless access to volatile objects. And pre-C++11, people even depended on atomicity for some sizes. We don't: we check the counter and only use the value we read if there were no concurrent writes. (That's the whole point of a SeqLock.)

    One problem with volatile T data is that in ISO C++, T foo = data won't compile for struct objects unless you provide a copy-constructor from a volatile object, like

    sixteenbyte(const volatile sixteenbyte &old)
             : a(old.a), b(old.b), c(old.c), d(old.d) {}
    

    This is really annoying for us, because we don't care about the details of how memory is read, just that multiple reads aren't optimized into one.

    volatile is really the wrong tool here; plain T data with sufficient fencing to ensure that the read actually happens between the reads of the atomic counter would be better. e.g. we could do that in GNU C with an asm("" ::: "memory"); compiler barrier against reordering before/after the accesses. That would let the compiler copy larger objects with SIMD vectors, or whatever, which it won't do with separate volatile accesses.

    I think std::atomic_thread_fence(mo_acquire) would also be a sufficient barrier, but I'm not 100% sure.
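
    As a sketch of that alternative (untested): assume a plain non-volatile member T data_plain in place of the volatile one, and GNU C extended asm for the compiler barriers:

    T get_plain() const {
        unsigned c0, c1;
        T tmp;
        do {
            c0 = seqcount.load(std::memory_order_acquire);
            asm("" ::: "memory");    // compiler barrier: keep the copy after the 1st counter load
            tmp = data_plain;        // plain copy: the compiler may use SIMD or wide loads
            asm("" ::: "memory");    // ...and before the 2nd
            std::atomic_thread_fence(std::memory_order_acquire);  // runtime LoadLoad ordering for SMP
            c1 = seqcount.load(std::memory_order_relaxed);
        } while (c0 & 1 || c0 != c1);
        return tmp;
    }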


    In ISO C, you can copy a volatile aggregate (struct), and the compiler will emit whatever asm it normally would to copy that many bytes. But in C++, we can't have nice things apparently.


    Related: single-core systems with a writer in an interrupt handler

    In an embedded system with one core, and some variables that are only updated by interrupt handlers, you may have a writer that can interrupt the reader but not vice versa. That allows some cheaper variations that use the value itself to detect torn reads.

    See Reading a 64 bit variable that is updated by an ISR. Especially for a monotonic counter, Brendan's suggestion is to read the most-significant half first, then the low half, then the most-significant half again. If the two high-half reads match, your read wasn't torn in a way that matters. (A write that didn't change the high half isn't a problem even if it interrupted the reader to change the low half right before or after the reader read it.)

    Or in general, re-read the whole value until you see the same value twice in a row.
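
    A sketch of Brendan's hi/lo/hi read (untested; NOT SMP-safe, only valid when the writer is an ISR or signal handler that's atomic wrt. this reader):

    uint64_t read_isr_counter(const volatile uint32_t &hi, const volatile uint32_t &lo)
    {
        uint32_t h1, l, h2;
        do {
            h1 = hi;         // most-significant half first
            l  = lo;
            h2 = hi;         // ...and again afterwards
        } while (h1 != h2);  // retry only if a carry into the high half raced with us
        return (uint64_t(h1) << 32) | l;
    }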

    Neither of these techniques is SMP-safe: the read retry only guards against torn reads, not torn writes if the writer stored the halves separately. That's why a SeqLock uses a 3rd atomic integer as a sequence counter. They work in any case where the writer is atomic wrt. the reader, but the reader isn't atomic: interrupt handler vs. main code is one such case, and a signal handler vs. the thread it interrupts is equivalent.

    You could potentially use the low half of a monotonic counter as a sequence number, if you don't mind incrementing by 2 instead of 1. (That would require readers to do a 64-bit right shift by 1 to recover the actual number, so that's not good.)