Tags: c++, c++14, lock-free, stdatomic

Lock-free single-producer multiple-consumer data structure using atomics


I have sample code like the code below (the real code is much more complicated). After watching Hans Boehm's CppCon 2016 talk on atomics, I am a bit anxious about whether my code works.

produce is called by a single producer thread, and consume is called by multiple consumer threads. The producer only publishes data at even sequence numbers like 2, 4, 6, 8, ..., but sets the sequence to the preceding odd number (1, 3, 5, 7, ...) before updating the data, to indicate that the data might be dirty. Consumers try to fetch data at the same even sequence numbers (2, 4, 6, ...).

A consumer double-checks the sequence number after reading to make sure the data is good (i.e., was not updated by the producer during the read).

I think my code works fine on x86_64 (my target platform) because x86_64 does not reorder stores with other stores, or loads with other loads or stores, but I suspect it's wrong on other platforms.

Am I correct that the data assignment (in produce) can be moved above the store(n-1), so a consumer reads corrupted data but the t == t2 check still succeeds?

struct S 
{
    std::atomic<int64_t> seq;
    // data members of primitive type int, double etc    
    ...
};

S s;

void produce(int64_t n, ...) // ... for above data members
{
    s.seq.store(n-1, std::memory_order_release); // indicates it's working on data members

    // assign data members of s
    ...

    s.seq.store(n, std::memory_order_release); // complete updating
}

bool consume(int64_t n, ...) // ... for interested fields passed as reference
{
    auto t = s.seq.load(std::memory_order_acquire);

    if (t == n)
    {
        // read fields
        ...

        auto t2 = s.seq.load(std::memory_order_acquire);
        if (t == t2)
            return true;
    }        

    return false;
}

Solution

  • Compile-time reordering can still bite you when targeting x86, because the compiler optimizes to preserve the behaviour of the program on the C++ abstract machine, not any stronger architecture-dependent behaviour. Since we want to avoid memory_order_seq_cst, reordering is allowed.

    Yes, your stores can reorder as you suggest. Your loads can also reorder with the t2 load, since an acquire-load is only a one-way barrier. It would be legal for a compiler to optimize away the t2 check entirely. If a reordering is possible, the compiler is allowed to decide that it's what always happens and apply the as-if rule to make more efficient code. (Current compilers usually don't, but this is definitely allowed by the current standard as written. See the conclusion of a discussion about this, with links to standards proposals.)

    One option for preventing this reordering is to use explicit fences:


    void produce(int64_t n, ...) // ... for above data members
    {
        /*********** changed lines ************/
        std::atomic_signal_fence(std::memory_order_release);  // compiler-barrier to make sure the compiler does the seq store as late as possible (to give the reader more time with it valid).
        s.seq.store(n-1, std::memory_order_relaxed);          // changed from release
        std::atomic_thread_fence(std::memory_order_release);  // StoreStore barrier prevents reordering of the above store with any below stores.  (It's also a LoadStore barrier)
        /*********** end of changes ***********/
    
        // assign data members of s
        ...
    
        // release semantics prevent any preceding stores from being delayed past here
        s.seq.store(n, std::memory_order_release); // complete updating
    }
    
    
    
    bool consume(int64_t n, ...) // ... for interested fields passed as reference
    {
        if (n == s.seq.load(std::memory_order_acquire))
        {
            // acquire semantics prevent any reordering with following loads
    
            // read fields
            ...
    
        /*********** changed lines ************/
            std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier (and LoadStore)
            auto t2 = s.seq.load(std::memory_order_relaxed);    // relaxed: it's ordered by the fence and doesn't need anything extra
            // std::atomic_signal_fence(std::memory_order_acquire);  // compiler barrier: probably not useful on the load side.
        /*********** end of changes ***********/
            if (n == t2)
                return true;
        }
    
        return false;
    }
    

    Notice the extra compiler-barrier (signal_fence only affects compile-time reordering) to make sure the compiler doesn't merge the second store from one iteration with the first store from the next iteration, if this is run in a loop. Or more generally, to make sure the store that invalidates the region is done as late as possible, to reduce false positives. (Probably not necessary with real compilers, and with plenty of code between calls to this function. But signal_fence never compiles to any instructions, and seems like a better choice than keeping the first store as mo_release. On architectures where a release-store and the thread-fence both compile to extra instructions, a relaxed store avoids having two separate barrier instructions.)

    I was also worried about the possibility of the first store reordering with the release-store from the previous iteration. But I don't think that can ever happen, because both stores are to the same address. (At compile-time, maybe the standard allows a hostile compiler to do it, but any sane compiler would instead just not do one of the stores at all if it thought one could pass the other.) At run-time on a weakly-ordered architecture, I'm not sure if stores to the same address can ever become globally visible out of order. This shouldn't be a problem in real life since the producer presumably isn't called back-to-back.
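    Putting the fenced version together as a self-contained sketch: the two fields `a` and `b` are hypothetical stand-ins for the elided data members (and, for strict correctness on the C++ abstract machine, they should themselves be relaxed atomics; plain fields are shown here for brevity):

    ```cpp
    #include <atomic>
    #include <cstdint>

    struct S
    {
        std::atomic<int64_t> seq{0};
        // Hypothetical data members standing in for the "..." in the question.
        int    a = 0;
        double b = 0.0;
    };

    S s;

    void produce(int64_t n, int a, double b)
    {
        std::atomic_signal_fence(std::memory_order_release); // compiler barrier: do the dirty-store as late as possible
        s.seq.store(n - 1, std::memory_order_relaxed);       // odd value: region is dirty
        std::atomic_thread_fence(std::memory_order_release); // StoreStore barrier

        s.a = a;                                             // assign data members
        s.b = b;

        s.seq.store(n, std::memory_order_release);           // even value: publish
    }

    bool consume(int64_t n, int& a, double& b)
    {
        if (n == s.seq.load(std::memory_order_acquire))
        {
            a = s.a;                                         // read fields
            b = s.b;
            std::atomic_thread_fence(std::memory_order_acquire); // LoadLoad barrier before re-check
            if (n == s.seq.load(std::memory_order_relaxed))
                return true;                                 // seq unchanged: snapshot is consistent
        }
        return false;                                        // writer interfered (or wrong seq): caller retries
    }
    ```

    A consumer would typically call consume in a retry loop until it returns true for the sequence number it wants.
    
    
    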


    BTW, the synchronization technique you're using is a Seqlock, but with only a single writer. You have only the sequence part, not the lock part that synchronizes separate writers. In a multi-writer version, writers would take the lock before reading/writing the sequence numbers and data. (And instead of taking the seq no as a function arg, you'd read it under the lock.)
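    A multi-writer variant might look like the sketch below; the std::mutex (`writer_lock`) and the field names are illustrative assumptions, not from the question. Only writers take the lock, and the next sequence number is derived from the current value rather than passed in:

    ```cpp
    #include <atomic>
    #include <cstdint>
    #include <mutex>

    std::atomic<uint64_t> seq{0};
    std::mutex writer_lock;                  // serializes writers; readers never take it
    std::atomic<int> data1{0}, data2{0};     // atomic data items, per the caveat above

    void write(int d1, int d2)
    {
        std::lock_guard<std::mutex> guard(writer_lock);
        uint64_t s0 = seq.load(std::memory_order_relaxed);   // always even under the lock
        seq.store(s0 + 1, std::memory_order_relaxed);        // odd: region is dirty
        std::atomic_thread_fence(std::memory_order_release); // StoreStore barrier
        data1.store(d1, std::memory_order_relaxed);
        data2.store(d2, std::memory_order_relaxed);
        seq.store(s0 + 2, std::memory_order_release);        // even again: publish
    }
    ```
    
    
    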

    C++ standards-discussion paper N4455 (about compiler-optimizations of atomics, see the second half of my answer on Can num++ be atomic for 'int num'?) uses it as an example.

    Instead of a StoreStore fence, they use release-stores for the data items in the writer. (Using atomic data items is required for this to really be correct on the C++ abstract machine; with plain fields, the concurrent reads are technically a data race.)

    void writer(T d1, T d2) {
      unsigned seq0 = seq.load(std::memory_order_relaxed);  // note that they read the current value because it's presumably a multiple-writers implementation.
      seq.store(seq0 + 1, std::memory_order_relaxed);
      data1.store(d1, std::memory_order_release);
      data2.store(d2, std::memory_order_release);
      seq.store(seq0 + 2, std::memory_order_release);
    }
    

    They talk about letting the reader's second load of the sequence number potentially reorder with later operations, if it's profitable for the compiler to do so, and using t2 = seq.fetch_add(0, std::memory_order_release) in the reader as a potential way to get a load with release semantics. With current compilers, I would not recommend that: you're likely to get a lock-prefixed operation on x86, where the way I suggested above doesn't need any (or any actual barrier instructions at all, because only full-barrier seq_cst fences need an instruction on x86).
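    For completeness, a reader matching that N4455-style writer might look like the sketch below (my own illustrative version, not code from the paper), using the acquire-fence approach from above rather than the fetch_add trick:

    ```cpp
    #include <atomic>
    #include <utility>

    // Assumed globals matching the N4455-style writer (T = int here).
    std::atomic<unsigned> seq{0};
    std::atomic<int> data1{0}, data2{0};

    // Retries until it gets a consistent (data1, data2) snapshot.
    std::pair<int, int> reader()
    {
        while (true)
        {
            unsigned s0 = seq.load(std::memory_order_acquire);
            if (s0 & 1)
                continue;                                        // odd: a write is in progress
            int d1 = data1.load(std::memory_order_relaxed);
            int d2 = data2.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire); // order the re-check after the data loads
            if (seq.load(std::memory_order_relaxed) == s0)
                return {d1, d2};                                 // no writer intervened
        }
    }
    ```
    
    
    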