Tags: c, multithreading, concurrency, atomic, c11

Using = operator on atomic variable?


Is newNode->next = NULL; undefined behavior in this case?

struct node {
  int value;
  _Atomic(struct node*) next;
};

//at the initialization stage
struct node* newNode = malloc(sizeof(struct node));
newNode->next = NULL; //other threads are not able to access this node at this point

Solution

  • Assuming malloc doesn't return NULL, there's no UB. In fact there's room for optimization by avoiding an atomic seq_cst store (newNode->next = NULL; is equivalent to atomic_store).

    malloc returns memory that no other thread has a pointer to. (Unless your program already has UB, like use-after-free, including via insufficient memory-ordering of stores with free().)

    You're storing to that memory before giving other threads a pointer to that object, so data-race UB would be impossible even if the member wasn't _Atomic.

    foo = val assignment to an _Atomic object is the same as atomic_store(&foo, val);, which in turn is equivalent to atomic_store_explicit(&foo, val, memory_order_seq_cst);.
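
    Spelled out, these three stores to the same _Atomic member are equivalent:

    newNode->next = NULL;                                               // plain assignment
    atomic_store(&newNode->next, NULL);                                 // same thing
    atomic_store_explicit(&newNode->next, NULL, memory_order_seq_cst);  // fully explicit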

    You don't need any ordering wrt. other operations, because any way of publishing the newNode pointer to other threads will need to synchronize-with them (via release/acquire) so our malloc happens-before their access to the pointed-to memory. Any other operations we do between malloc and atomic_store_explicit(&shared, newNode, release) will also happen-before anything in a reader thread.
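
    For concreteness, a minimal publish/consume sketch of that reasoning; shared_head, writer, and reader are assumed names for illustration, not from the question:

    #include <stdatomic.h>
    #include <stdlib.h>
    
    struct node {
      int value;
      _Atomic(struct node*) next;
    };
    
    static _Atomic(struct node*) shared_head = NULL;  // assumed publication point
    
    void writer(void){
        struct node* n = malloc(sizeof(struct node));
        if (!n) return;
        n->value = 42;                                                 // plain store: n is still thread-private
        atomic_store_explicit(&n->next, NULL, memory_order_relaxed);  // no ordering needed yet
        atomic_store_explicit(&shared_head, n, memory_order_release); // publish: release store
    }
    
    void reader(void){
        // This acquire load synchronizes-with the release store above, so the
        // writer's malloc and initialization happen-before these reads.
        struct node* p = atomic_load_explicit(&shared_head, memory_order_acquire);
        if (p) { int v = p->value; (void)v; }
    }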

    atomic_store_explicit(&newNode->next, NULL, memory_order_relaxed); // cheapest atomic way
    *newNode = (struct node){0, NULL};  // assignment of whole non-atomic struct
    

    So we could use a relaxed store, but that's still atomic, which will prevent real compilers from combining it with initialization of the int value member. (Especially on 32-bit machines or ILP32 ABIs where the whole struct is only 8 bytes. Or if we use long value so there's no padding on most 64-bit ABIs, other than Windows, or intptr_t so it's always a pair of same-size members. Compilers often avoid storing to padding, sometimes stopping themselves from using one wider store like AArch64 stp (store pair).)
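
    For instance, the intptr_t variant mentioned above (a sketch; intptr_t is always pointer-sized, so the struct is exactly two same-size words with no padding on any ABI):

    #include <stdint.h>
    struct node {
        intptr_t value;               // same size as a pointer everywhere
        _Atomic(struct node*) next;
    };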

    #include <stdlib.h>
    struct node {
      long value;
      _Atomic(struct node*) next;
    };
    
    struct node* alloc_and_init_orig(long val){
        struct node* newNode = malloc(sizeof(struct node));
        if (!newNode)
            return NULL;
        newNode->next = NULL; // atomic seq_cst store
        return newNode;
    }
    

    Clang for x86-64 uses xchg to do a seq_cst store, which is a full memory barrier, because x86 has no store instruction that's just strong enough for seq_cst without being much stronger and more expensive. (AArch64 does: stlr.)

    # clang 19 -O3 -fno-plt
    alloc_and_init_orig:
            push    rax           # align the stack
            mov     edi, 16
            call    qword ptr [rip + malloc@GOTPCREL]
            test    rax, rax
            je      .LBB0_2
            xor     ecx, ecx
            xchg    qword ptr [rax + 8], rcx      # Slow.  The seq_cst store
    .LBB0_2:
            pop     rcx
            ret
    

    A relaxed atomic store would compile as cheaply as a non-atomic init of the pointer. But if we wanted to also init value, especially with 0, for maximum efficiency we want to let the compiler initialize both members non-atomically.

    struct node* alloc_and_init_zero(){
        struct node* newNode = malloc(sizeof(struct node));
        if (!newNode)
            return NULL;
        *newNode = (struct node){0, NULL};
        // equivalent to
        //struct node tmp = {0, NULL};
        //*newNode = tmp;
        return newNode;
    }
    

    The whole struct node pointed to by newNode is not itself _Atomic, so this struct-assignment is non-atomic. There happens to be an _Atomic member, but C struct assignment just copies the whole thing, ignoring qualifiers like volatile or _Atomic on members. (So it's like a memcpy. I think it's well-defined to copy the object-representation of an _Atomic type, as long as you don't expect the copy itself to be _Atomic. It certainly works in practice on compilers where the object-representation of _Atomic T is the same as plain T, with non-lock-free types using a separate hash table of spinlocks or mutexes.)
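
    As a sketch of the "like a memcpy" point (relying on the object-representation assumption above, which the standard doesn't spell out):

    #include <string.h>
    
    struct node tmp = {0, NULL};
    memcpy(newNode, &tmp, sizeof *newNode);  // copies the object representation;
                                             // the copy itself is not atomic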

    Clang is pretty clever, compiling the whole function into calloc(1, 16). (With int value instead of long, this optimization to calloc only happens with current nightly builds of clang, not clang 19.)

    If you used an atomic store, current compilers wouldn't optimize it away, which would defeat this optimization. (Why don't compilers merge redundant std::atomic writes?)
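
    For illustration, a hypothetical variant (not among the answer's examples) where the atomic store blocks the calloc optimization:

    struct node* alloc_and_init_zero_atomic(void){
        struct node* newNode = malloc(sizeof(struct node));
        if (!newNode)
            return NULL;
        newNode->value = 0;  // plain store
        atomic_store_explicit(&newNode->next, NULL, memory_order_relaxed);
        return newNode;      // compilers keep malloc + separate stores, not calloc(1, 16)
    }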

    With a non-zero initializer, Clang for AArch64 compiles it to a single 16-byte stp (store-pair), which again doesn't happen with atomic_store_explicit(&p->next, NULL, relaxed) and a separate assignment to p->value. (That would be a legal optimization, but compilers don't do it.)
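
    The C source for the listing below isn't shown above; reconstructed from the asm (the 123 matches mov w8, #123), it's presumably along these lines:

    struct node* alloc_and_init_const(void){
        struct node* newNode = malloc(sizeof(struct node));
        if (!newNode)
            return NULL;
        *newNode = (struct node){123, NULL};  // non-atomic init of both members
        return newNode;
    }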

    # clang -O3 -Wall -fomit-frame-pointer -mcpu=cortex-a77
    alloc_and_init_const:
            str     x30, [sp, #-16]!   # save return address (link register)
            mov     w0, #16
            bl      malloc
            cbz     x0, .LBB3_2       # compare-and-branch if zero NULL check
            mov     w8, #123
            stp     x8, xzr, [x0]     # store 123 and the zero-register
    .LBB3_2:
            ldr     x30, [sp], #16    # restore return address
            ret
    

    All of these and a couple of other examples are on the Godbolt compiler explorer. Clang for x86-64 makes the weird choice to load a 16-byte vector constant from .rodata and store that, instead of doing two separate 8-byte mov stores like GCC does, or mov ecx, 123 / movd xmm0, ecx / movaps [rax], xmm0. So compilers are fully capable of shooting themselves in the foot when given more freedom to optimize.