Tags: c++, stdatomic, memset, auto-vectorization

Efficient way to reset array of structs which contain a std::atomic member?


Let's say I have a struct Foo such that

struct alignas(64) Foo {
    std::atomic<int> value;
    Foo* other;
};

Then, say I have an array Foo array[2048]; of Foos. The array is already initialized, but it contains some dirty values, and I want to reset it back to the zeroed state.

If I wanted to reset this array to zeroed Foos without violating the standard, there would be an efficient way to do it in C (if Foo's value were volatile int instead of std::atomic<int>): memset.

However, what is a safe, yet efficient, way to do it in C++?


Solution

  • The object representation of a null pointer is not guaranteed to be the all-zero bit-pattern, so even in C it's not fully portable to use memset instead of assigning ptr = 0;. But all modern mainstream systems do use 0 as the bit-pattern for null pointers. In the rest of this answer, I'll assume that you only care about optimizing for such platforms.

    volatile int is also not guaranteed by the standards to be thread-safe in C or C++ (although it does work on most compilers as a legacy way to get something like atomic with memory_order_relaxed). memset on a struct containing volatile members doesn't respect their volatility, so your proposed C equivalent was potentially not even safe in practice, let alone guaranteed by anything. The C equivalent of std::atomic<int> / std::atomic_int is _Atomic int / atomic_int. C11 doesn't have a portable atomic_ref, only compiler-specific stuff like GNU C __atomic_load_n etc.


    To allow efficient zeroing during single-threaded phases of your program, consider using C++20 std::atomic_ref on a plain int member of your struct instead of a std::atomic<int> member. This makes your struct only have primitive types as members, making it safe to use memset as long as no other threads are reading or writing simultaneously. (Or std::fill.)

    struct Foo {
       alignas (std::atomic_ref<int>::required_alignment) int value;
       Foo *other;
    };
    

    It would still be data-race UB to memset a struct while any other threads could be reading or writing it, so this only helps if you have phases of your program where synchronization guarantees that no other threads will be accessing your array, unless you want to rely on non-portable behaviour.
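
    As a minimal sketch of that single-threaded reset (assuming the plain-int struct above, and that all-zero bytes are a valid object representation of int 0 and of a null pointer, which holds on mainstream platforms; the function name is mine):

    #include <cstring>

    Foo array[2048];

    // Only call this while synchronization guarantees no other thread is
    // reading or writing `array` -- otherwise it's a data race.
    void reset_array_single_threaded() {
        std::memset(array, 0, sizeof(array));  // zeroes value, other, and any padding
    }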


    One thing that would make it faster is not padding your structs to 4x the size. Your struct would be 16 bytes (on a typical 64-bit architecture) without alignas(64), or 8 bytes total with 32-bit pointers. It can make sense to align the whole array of structs by 64, but aligning every individual struct puts each one in a separate cache line. Perhaps you're doing that to avoid false-sharing? That does make it slower to zero them since it's more cache-lines to write, so test to see if (and how much) it speeds up your program on various hardware to have each pointer+counter pair in its own cache line.

    With 3/4 of the space being padding (assuming 16-byte Foo objects), memset on the whole array would typically be doing at least 2 stores per struct (x86-64 with AVX2 for 32-byte stores). Worse on AArch64 without SVE (16-byte vectors), much worse on RISC-V32 without vector extensions (just 4-byte scalar stores, AFAIK).

    So if you are going to use this much padding, it's not bad to just loop manually and do normal assignments to the two members of each struct. If this has to be thread-safe (so you can't just access a plain int), use memory_order_relaxed for the atomic member, unless you need something stronger. You certainly don't want to be draining the store buffer for every struct, which is what would happen on x86 with arr[i].value = 0; (which defaults to seq_cst), so use arr[i].value.store(0, std::memory_order_relaxed).
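
    For reference, a minimal sketch of that manual loop with the original std::atomic<int> member; the struct is renamed Foo_atomic here only to avoid clashing with the plain-int Foo above, and the array / function names are mine:

    #include <atomic>

    struct alignas(64) Foo_atomic {
        std::atomic<int> value;
        Foo_atomic *other;
    };

    Foo_atomic shared_array[2048];

    void zero_array_relaxed() {
        for (auto &s : shared_array) {
            s.value.store(0, std::memory_order_relaxed);  // no store-buffer drain, unlike the default seq_cst
            s.other = nullptr;
        }
    }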

    Looping manually would result in two stores per struct when you really only need one 16-byte store. Compilers don't optimize atomics, and they won't turn .value.store(0, relaxed); .other = nullptr; into a single 16-byte store, unfortunately. Even without an atomic member, GCC has a missed-optimization (bug 82142) where it avoids writing padding, stopping it from coalescing stores when doing struct assignment. Using size_t value or ptrdiff_t value could avoid that.

    With no atomic members, store coalescing should result in a loop that does one 16-byte store (or 8-byte in 32-bit code) per 64-byte struct. Unfortunately GCC fails at that for x86-64, but clang succeeds. GCC gets it right for AArch64.

    #include <atomic>
    #include <stddef.h>
    
    struct alignas(64) Foo_atomic_ref {
       alignas (std::atomic_ref<ptrdiff_t>::required_alignment) ptrdiff_t value;  // pointer sized to avoid padding between members which trips up GCC's store coalescing.
       Foo_atomic_ref *other;
    };
    
    Foo_atomic_ref shared_array_aref[1024];
    
    // if no other threads could be accessing the array
    void zero_array_nonatomic(){
        for (auto &s : shared_array_aref){
            s.value = 0;     // GCC for x86-64 fails to use movdqa; AArch64 uses stp though.
            s.other = nullptr;
        }
    }
    
    // if some other threads could be accessing the array
    void zero_array_atomic_ref(){
        for (auto &s : shared_array_aref){
            std::atomic_ref ref(s.value);  // free to construct
            ref.store(0, std::memory_order_relaxed);
            s.other = nullptr;             // separate stores from both compilers
        }
    }
    

    Godbolt - GCC for x86-64 fails to coalesce the two 8-byte stores even in the non-atomic version, where that would be possible. But GCC for AArch64 does use stp xzr, xzr, [x0], storing the zero register twice for a 16-byte store in total. Most microarchitectures run that as a single 16-byte store, at least on ARMv8.4 and later where it's guaranteed atomic, so it's efficient.

    Clang compiles these to asm at least as good as memset: no branching to handle tiny sizes, just an unrolled loop. The non-atomic version only does one movaps xmmword ptr [rax], xmm0 per struct; the atomic_ref loop does separate stores for each member. Neither spends instructions storing the padding, the way memset on the whole array would.

    # clang -O3 
    zero_array_atomic_ref():
            lea     rax, [rip + shared_array_aref]
            lea     rcx, [rip + shared_array_aref+65536]
    .LBB1_1:
            mov     qword ptr [rax], 0
            mov     qword ptr [rax + 8], 0        # first struct
            mov     qword ptr [rax + 64], 0
            mov     qword ptr [rax + 72], 0       # second struct
        ....
            mov     qword ptr [rax + 448], 0
            mov     qword ptr [rax + 456], 0
            add     rax, 512
            cmp     rax, rcx
            jne     .LBB1_1
            ret
    

    On real hardware, the atomic_ref version would also be safe with 16-byte movaps stores, but hardware vendors don't guarantee it. See Per-element atomicity of vector load/store and gather/scatter? - it's not plausible that a 16-byte aligned store could have tearing at 8-byte boundaries on x86, especially since 8-byte atomicity is guaranteed.

    On x86 with AVX, 16-byte stores are guaranteed atomic, so it would actually be fully safe for GCC and clang to coalesce the atomic 8-byte store to .value with the non-atomic 8-byte store to .other. But compilers aren't that smart, treating atomic a lot like volatile.

    It's frustrating to know that most (all?) hardware can do stuff C++ won't let you do portably, but such is life in writing portable programs.

    You could manually vectorize with SIMD intrinsics like _mm_store_si128. Or modern compilers will inline a 16-byte memset as a single instruction, but there's no guarantee of that. Or use offsetof(Foo, other) + sizeof(Foo::other) as the size for each memset, to only write the parts of the struct that contain non-padding data. offsetof isn't guaranteed on structs that aren't "standard layout", but C++17 makes that "conditionally supported" instead of UB. But of course memset on objects being read+written by other threads is data-race UB, so I don't recommend that unless you want to carefully check your code for every compiler version, and for any change to where this zeroing gets inlined into, to make sure it always compiles to safe asm.
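
    Hedged sketches of the intrinsic and per-struct-memset options, using the Foo_atomic_ref struct from above (function names are mine; both are still data-race UB in ISO C++ if other threads access the array concurrently, so treat them as "check the generated asm" territory):

    #include <cstddef>
    #include <cstring>
    #include <immintrin.h>

    // Per-struct memset covering only the non-padding bytes.
    void zero_array_memset_members(Foo_atomic_ref (&arr)[1024]) {
        for (auto &s : arr) {
            std::memset(&s, 0, offsetof(Foo_atomic_ref, other) + sizeof(s.other));
        }
    }

    // x86-specific: one aligned 16-byte store per struct zeroes both members at
    // once. On hardware with AVX this is a single atomic store in practice, but
    // ISO C++ doesn't bless it while other threads are accessing the array.
    void zero_array_sse(Foo_atomic_ref (&arr)[1024]) {
        const __m128i zero = _mm_setzero_si128();
        for (auto &s : arr) {
            _mm_store_si128(reinterpret_cast<__m128i *>(&s), zero);
        }
    }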

    (std::fill wouldn't be usable this way since it takes an iterator range of a single type: you can't pass it a ptrdiff_t * as the start and a Foo ** as the end iterator. You could pass it Foo * iterators and a whole Foo value, but then each element might get a full 64-byte struct assignment, unless the compiler decides to skip storing the padding. If you care a lot about what asm you're getting, it's probably not a good choice.)
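
    (For completeness, a sketch of what that whole-struct std::fill call would look like; whether the compiler skips storing the padding of each 64-byte element is entirely up to it:)

    #include <algorithm>
    #include <iterator>

    void zero_array_fill() {
        // Each element is assigned a zeroed Foo_atomic_ref; only safe with no
        // concurrent readers/writers, and the padding bytes may get stored too.
        std::fill(std::begin(shared_array_aref), std::end(shared_array_aref),
                  Foo_atomic_ref{});
    }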


    With std::atomic<int>, not plain int + atomic_ref

    In this case, std::fill won't compile because std::atomic is not trivially-copyable (deleted copy-constructor). In C++, struct assignment goes per-member. memset will bypass that and let you do things you're not "supposed" to do.

    If no other threads are running, memset on a std::atomic<int> (or a struct containing one) works in practice on mainstream C++ implementations because std::atomic<int> is lock-free and its object representation is the same as an int. But I wouldn't recommend writing code this way since ISO C++ doesn't guarantee it.
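
    If you do go that route during a quiescent phase, here's a hedged sketch that at least static_asserts the assumptions it relies on (lock-free, and the same size as a plain int), reusing the Foo_atomic / shared_array from the earlier sketch:

    #include <atomic>
    #include <cstring>

    static_assert(std::atomic<int>::is_always_lock_free);
    static_assert(sizeof(std::atomic<int>) == sizeof(int));

    // Only valid while no other thread can access the array: zeroing the bytes
    // of a lock-free std::atomic<int> works on mainstream implementations, but
    // that's an implementation detail, not an ISO C++ guarantee.
    void zero_quiescent() {
        std::memset(shared_array, 0, sizeof(shared_array));
    }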

    On GCC/Clang and libstdc++ or libc++, std::atomic<int> will have a plain int member, and the member functions are wrappers for compiler built-ins like GNU C __atomic_load_n and __atomic_fetch_add. So there'd be no UB in using memset to change the bytes of the object representation. But again, ISO C++ doesn't guarantee anything about the internals of std::atomic so this would be relying on implementation details.

    For larger non-lock-free atomic objects like atomic<big_struct>, some compilers (like MSVC: Godbolt) include a spinlock inside the std::atomic object. IDK if any compilers include a full std::mutex which shouldn't be zeroed even if you know it's unlocked (no concurrent readers or writers). (Most other compilers use a separate hash table of spinlocks, and people have said MSVC is planning that change for their next C++ ABI break.)

    MSVC uses zero as the unlocked state so static zero-initialized storage is usable directly, but in theory an implementation could have 0 meaning locked, so memset would create a deadlock.
    Zeroing a spinlock while there are concurrent readers or writers could let some threads see torn values and maybe cause some memory_order weirdness.


    It's still data-race UB to memset while other threads might be reading or writing the object. In practice it will probably just work since any decent memset will work internally as if storing in chunks at least as wide as int, especially on a large aligned array, and in asm a relaxed atomic store doesn't need any special instructions on mainstream platforms.

    But without actual guarantees, there's a lot of "this probably won't break" being relied on, so I wouldn't recommend it for portable code. Only consider memset while other threads are running if you need to squeeze every drop of performance out of something for a specific compiler / standard library and architecture, and are prepared to verify your assumptions by checking the asm of the standard library implementation, or checking what gets inlined into your code.

    And as I said earlier, padding your structs to one per cache line means only 1/4 or 1/8th of the bytes actually need to get stored, so an optimized libc memset isn't such an advantage. If you only aligned your structs by their size, memset could be a lot better since it's able to use 32-byte stores even if the rest of your program isn't compiled to depend on x86 AVX or ARM SVE or whatever.

    On some microarchitectures, storing every byte of a cache line could have advantages in allowing the CPU to just invalidate other copies of the cache line without doing a Read For Ownership. Normally a store needs to merge some bytes into the old value of a cache line, but storing every byte of it can avoid that. But a CPU would have to detect that across multiple stores, not just looking at the first store to a line, unless you're using AVX-512 to store a whole cache line at once. (And even then IDK if x86 CPUs do anything special for aligned vmovdqa64 [mem], zmm, but they might.) See also Enhanced REP MOVSB for memcpy for more about no-RFO store protocols on x86 CPUs. It's normally most relevant for larger arrays that don't fit in cache, but could be relevant here. NT stores like movntps / _mm_stream_ps also avoid RFOs, but they fully evict from cache, including from L3, so that makes the next read slower. (Shared L3 cache is normally a backstop for coherency traffic.)