Is `newNode->next = NULL;` undefined behavior in this case?
```c
struct node {
    int value;
    _Atomic(struct node*) next;
};

// at the initialization stage
struct node* newNode = malloc(sizeof(struct node));
newNode->next = NULL; // other threads are not able to access this node at this point
```
Assuming `malloc` doesn't return NULL, there's no UB, and there's actually room for optimization by avoiding an atomic seq_cst store (`newNode->next = NULL;` is equivalent to `atomic_store`).
`malloc` returns memory that no other thread has a pointer to. (Unless your program already has UB, like a use-after-free, including via insufficient memory-ordering of stores around `free()`.) You're storing to that memory before giving other threads a pointer to that object, so data-race UB would be impossible even if the member weren't `_Atomic`.
`foo = val` assignment to an `_Atomic` object is the same as `atomic_store(&foo, val);`, which in turn is equivalent to `atomic_store_explicit(&foo, val, memory_order_seq_cst);`.
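Spelled out with the node from the question, these three statements do the same thing:

```c
newNode->next = NULL;                                               // plain assignment to _Atomic member
atomic_store(&newNode->next, NULL);                                 // same thing
atomic_store_explicit(&newNode->next, NULL, memory_order_seq_cst);  // also the same thing
```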
You don't need any ordering wrt. other operations, because any way of publishing the `newNode` pointer to other threads will need to synchronize-with them (via release/acquire), so our `malloc` happens-before their access to the pointed-to memory. Any other operations we do between `malloc` and `atomic_store_explicit(&shared, newNode, release)` will also happen-before anything in a reader thread.
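A minimal sketch of what such publication might look like (the `shared` slot and the function names are hypothetical, not from the question):

```c
#include <stdatomic.h>

_Atomic(struct node*) shared;   // hypothetical shared slot, assumed for illustration

void publish(struct node* newNode) {
    // Release store: everything sequenced before this (the malloc, the init
    // stores) happens-before a reader's access that sees this pointer.
    atomic_store_explicit(&shared, newNode, memory_order_release);
}

struct node* reader(void) {
    // Acquire load synchronizes-with the release store above.
    struct node* n = atomic_load_explicit(&shared, memory_order_acquire);
    // If n is non-NULL here, n->value and n->next are safe to read.
    return n;
}
```

The cheap ways to do the init itself: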
```c
atomic_store_explicit(&newNode->next, NULL, memory_order_relaxed); // cheapest atomic way
*newNode = (struct node){0, NULL};  // assignment of whole non-atomic struct
```
So we could use a `relaxed` store, but that's still atomic, which will prevent real compilers from combining it with init of the `int value;` member. (Especially on 32-bit machines or ILP32 ABIs where the whole struct is only 8 bytes. Or if we use `long value` so there's no padding on most 64-bit ABIs other than Windows, or `intptr_t` so it's always a pair of same-size members. Compilers often avoid storing to padding, sometimes stopping themselves from using one wider store like AArch64 `stp` (store pair).)
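To make the layout point concrete, here's a sketch of the member-type choices (padding claims assume a typical LP64 ABI like x86-64 Linux; the struct names are mine):

```c
#include <stdint.h>

struct node_int  { int value;      _Atomic(struct node_int*)  next; };
// 4 bytes of padding after value on LP64; compilers often won't store to
// padding, which can block merging both inits into one 16-byte store.

struct node_long { long value;     _Atomic(struct node_long*) next; };
// no padding on most 64-bit ABIs (but long is still 4 bytes on 64-bit Windows)

struct node_ptr  { intptr_t value; _Atomic(struct node_ptr*)  next; };
// always a pair of same-size members, so no padding on any ABI
```

The full function using the default seq_cst store: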
```c
#include <stdlib.h>

struct node {
    long value;
    _Atomic(struct node*) next;
};

struct node* alloc_and_init_orig(long val){
    struct node* newNode = malloc(sizeof(struct node));
    if (!newNode)
        return NULL;

    newNode->next = NULL;   // atomic seq_cst store
    return newNode;
}
```
Clang for x86-64 uses `xchg` to do a seq_cst store, which is a full memory barrier, because x86 doesn't have anything that's only strong enough without being much stronger and more expensive. (AArch64 does: `stlr`.)
```asm
# clang 19 -O3 -fno-plt
alloc_and_init_orig:
        push    rax                         # align the stack
        mov     edi, 16
        call    qword ptr [rip + malloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_2
        xor     ecx, ecx
        xchg    qword ptr [rax + 8], rcx    # Slow. The seq_cst store
.LBB0_2:
        pop     rcx
        ret
```
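For comparison, a minimal sketch of the relaxed-store variant (the name `alloc_and_init_relaxed` is mine, reusing the same `struct node` as above):

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node* alloc_and_init_relaxed(void){
    struct node* newNode = malloc(sizeof(struct node));
    if (!newNode)
        return NULL;

    // Still atomic, but no barrier: compiles to a plain store on mainstream ISAs.
    atomic_store_explicit(&newNode->next, NULL, memory_order_relaxed);
    return newNode;
}
```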
A `relaxed` atomic store would compile as cheaply as a non-atomic init of the pointer. But if we want to also init `value`, especially with `0`, for maximum efficiency we want to let the compiler do both members non-atomically.
```c
struct node* alloc_and_init_zero(){
    struct node* newNode = malloc(sizeof(struct node));
    if (!newNode)
        return NULL;

    *newNode = (struct node){0, NULL};
    // equivalent to
    //struct node tmp = {0, NULL};
    //*newNode = tmp;
    return newNode;
}
```
The whole `struct node` pointed to by `newNode` is not itself `_Atomic`, so this struct-assignment is non-atomic. There happens to be an `_Atomic` member, but C struct assignment just copies the whole thing, ignoring qualifiers like `volatile` or `_Atomic` on members. (So it's like a `memcpy`. I think it's well-defined to copy the object-representation of an `_Atomic` type, as long as you don't expect the copy itself to be atomic. It certainly works in practice on compilers where the object-representation of `_Atomic T` is the same as plain `T`, with non-lock-free types using a separate hash table of spinlocks or mutexes.)
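For instance, copying a whole node this way is fine as long as no other thread is accessing either object (a sketch):

```c
struct node a = { 1, NULL };
struct node b;
b = a;                         // non-atomic whole-struct copy, despite the _Atomic member
// memcpy(&b, &a, sizeof a);  // essentially equivalent
```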
Clang is pretty clever, compiling the whole function into `calloc(1, 16)`. (With `int value` instead of `long`, this optimization to `calloc` only happens with current nightly builds of clang, not clang 19.) If you had an atomic store, current compilers wouldn't optimize it away, defeating this optimization. (Why don't compilers merge redundant std::atomic writes?)
With a non-zero initializer, Clang for AArch64 compiles it to a single 16-byte `stp` (store-pair), which again doesn't happen with `atomic_store_explicit(&p->next, NULL, relaxed)` and a separate assignment to `p->value`. (That would be a legal optimization, but compilers don't do it.)
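The source for `alloc_and_init_const` isn't shown above; reconstructed from the asm below (which stores the constant 123), it would be something like:

```c
struct node* alloc_and_init_const(void){
    struct node* newNode = malloc(sizeof(struct node));
    if (!newNode)
        return NULL;

    *newNode = (struct node){123, NULL};   // non-zero init, so no calloc trick
    return newNode;
}
```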
```asm
# clang -O3 -Wall -fomit-frame-pointer -mcpu=cortex-a77
alloc_and_init_const:
        str     x30, [sp, #-16]!    # save return address (link register)
        mov     w0, #16
        bl      malloc
        cbz     x0, .LBB3_2         # compare-and-branch-if-zero NULL check
        mov     w8, #123
        stp     x8, xzr, [x0]       # store 123 and the zero-register
.LBB3_2:
        ldr     x30, [sp], #16      # restore return address
        ret
```
All of these, and a couple of other examples, are on the Godbolt compiler explorer. Clang for x86-64 makes the weird choice to load a 16-byte vector constant from `.rodata` and store that, instead of doing two separate 8-byte `mov` stores like GCC does, or `mov ecx, 123` / `movd xmm0, ecx` / `movaps [rax], xmm0`. So compilers are fully capable of shooting themselves in the foot when given more freedom to optimize.