Background
I'm building a cross-platform atomic abstraction layer to support 64-bit and 128-bit atomic operations for the following types:
int64_t, uint64_t
__int128 (on Clang platforms)
A custom MyInt128 struct (used on MSVC to emulate 128-bit integers as two int64_t values)
The platforms I need to support:
Implementation Strategy
I've based my solution on this helpful post: Cross-platform support for 128-bit atomic operations in Clang
I'm using the following platform-specific setup:
MSVC (x64 / ARM64)
• Type: MyInt128
• Mechanism: std::atomic_ref<T>
• Notes: Not lock-free, but works via fallback locking
Clang (Linux x64)
• Type: __int128, __int64
• Mechanism: std::atomic_ref<T>
• Notes: Works perfectly
Clang (macOS ARM64)
• Type: __int128, __int64
• Mechanism: std::atomic<T>
• Notes: std::atomic_ref is unavailable on this platform
I'm choosing std::atomic_ref<T> where possible because:
It works on both trivially-copyable fundamental types and structs.
It lets me write APIs that take T& (a reference to the actual variable), rather than requiring users to declare all variables as std::atomic<T>.
Questions
Is this a safe and correct approach overall?
For MSVC, I understand that 128-bit atomic operations are not lock-free and fall back to internal locks. I tested std::atomic_ref<MyInt128> with two threads over 10 million iterations, and the CAS loop completed without issues.
Why does std::atomic_ref<T> fail to compile on macOS ARM64 (Clang), even though std::atomic<T> works?
The error I get: no template named 'atomic_ref' in namespace 'std'. Am I missing an include? Or is this a known limitation in Apple's libc++ implementation?
Is it safe to use std::atomic_ref<MyInt128> on MSVC?
MyInt128 is a trivially-copyable 16-byte struct (two int64_t members). Will all atomic operations work correctly (even if not lock-free)?
Is there a better pattern than std::atomic_ref<T> for cross-platform atomic APIs that operate on external references?
I want to avoid requiring all shared variables in the codebase to be declared as std::atomic<T>. Instead, I want to write generic atomic operations that take T& and use atomics internally.
Example Snippet
#include <atomic>
#include <type_traits>

template<typename T>
T CompareAndSwap(T& dest, T desired, T expected)
{
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert(sizeof(T) == 8 || sizeof(T) == 16);
#if defined(__APPLE__) && defined(__aarch64__)
    std::atomic<T>* atomicPtr = reinterpret_cast<std::atomic<T>*>(&dest);
    atomicPtr->compare_exchange_strong(expected, desired);
#else
    std::atomic_ref<T> atomicRef(dest);
    atomicRef.compare_exchange_strong(expected, desired);
#endif
    return expected;
}
Test Observation
T = __int128 on Clang (Linux, macOS)
T = MyInt128 on MSVC (Windows)
Goal
I want to validate:
[...] atomic_ref.
[...] std::atomic_ref is intentionally missing from Apple Clang's libc++.
- Is this a safe and correct approach overall? - For MSVC, I understand that 128-bit atomic operations are not lock-free, but fallback to internal locks. I tested std::atomic_ref with two threads over 10 million iterations, and the CAS loop completed without issues.
It's "safe and correct" in that it has well-defined behavior under the C++ standard, if the required conditions are met. So unless one of those implementations has bugs (I don't know of any), that's the semantics you'll get.
But with atomic_ref, you have to remember that all non-atomic accesses to the object (i.e. not through atomic_ref) are still subject to the usual C++ data race rules. In particular:
You may not perform a non-atomic read that is potentially concurrent with a write, even if the write is atomic.
You may not perform a non-atomic write that is potentially concurrent with any other access (either read or write), even if the other access is atomic.
Doing either of those things is a data race and causes undefined behavior (anything can happen including nasal demons). The effects of a data race are not limited to simply reading or writing incorrect values to the object; more severe failures are possible. They will not be detected at compile time, nor reliably at run time, and may be very difficult to reproduce. (Tools like Thread Sanitizer can help but are not 100% effective.)
So while using atomic_ref does mean you don't have to change declarations elsewhere in your codebase, you still have to audit that entire codebase to ensure it is free of data races.
- Why does std::atomic_ref fail to compile on macOS ARM64 (Clang), even though std::atomic works? - The error I get: no template named 'atomic_ref' in namespace 'std' Am I missing an include? Or is this a known limitation in Apple's libc++ implementation?
You need to compile with -std=c++20 or higher. Apple's clang currently defaults to C++14.
With this option, using the Xcode command line tools version 16.4 (clang-1700.0.13.5), your example code with atomic_ref compiles fine for me, and produces the expected assembly with a caspal instruction.
Your approach of type-punning via a reinterpret_cast of pointers is undefined behavior, formally speaking. It will probably work in practice on this platform for the time being, but it's hard to be sure. It may certainly break on other platforms (e.g. on MSVC, std::atomic<T> actually contains a lock in some cases, and so is a different size from T), or possibly in future OS updates.
If atomic_ref were really not available, a better option in this setting would be the GCC/Clang __atomic builtins, e.g. __atomic_compare_exchange_n. (These replace the older __sync builtins, which are deprecated and do not allow you to specify memory ordering.) They take pointers to non-atomic T, and emit the expected assembly (caspal). Indeed, this is presumably how the standard library implements atomic_ref. They're still non-portable, but at least they're documented to behave correctly on this platform, and should be forward compatible.
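For illustration, a sketch of your CompareAndSwap written against the builtin might look like the following. The wrapper name is hypothetical, it only compiles on GCC/Clang, and __atomic_compare_exchange_n only accepts integer and pointer types (so __int128 would fit, a struct would need the generic __atomic_compare_exchange instead). I've only exercised it with int64_t here; on x86-64 the 16-byte case typically also needs -mcx16 and, depending on the compiler, linking libatomic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper around the GCC/Clang builtin (not available on MSVC).
// Returns the value observed in dest, which equals `expected` on success.
template <typename T>
T CompareAndSwapBuiltin(T& dest, T desired, T expected) {
    static_assert(sizeof(T) == 8 || sizeof(T) == 16);
    // On failure the builtin writes the current value of dest into expected.
    __atomic_compare_exchange_n(&dest, &expected, desired,
                                /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

inline void DemoCompareAndSwapBuiltin() {
    std::int64_t value = 5;

    // expected matches the current value: the swap happens.
    std::int64_t seen = CompareAndSwapBuiltin<std::int64_t>(value, 9, 5);
    assert(seen == 5 && value == 9);

    // expected is stale: nothing is written, the current value is returned.
    seen = CompareAndSwapBuiltin<std::int64_t>(value, 1, 5);
    assert(seen == 9 && value == 9);
}
```

Unlike the reinterpret_cast version, this never pretends that dest is a std::atomic<T>; it operates on the plain object directly.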
- Is it safe to use std::atomic_ref on MSVC? - MyInt128 is a trivially-copyable 16-byte struct (two int64_t members). Will all atomic operations work correctly (even if not lock-free)?
Again, unless MSVC has bugs, then it will work correctly. But that's sort of tautological, so I don't really know what answer you were expecting.
You should make sure you are aware of the implications of a non-lock-free atomic object. If one thread gets scheduled out while holding the lock on an object, then any other threads trying to access that object atomically will block until the first thread is scheduled back in and releases the lock. Under heavy system load, this could involve a significant delay (many milliseconds at least).
To make things worse, MSVC uses spinlocks for at least some non-lock-free atomics. So in the above situation, not only will the other threads be blocked until the first thread is scheduled back in, but they will consume 100% CPU in the meantime, making the system load even heavier.
Moreover, non-lock-free atomics are often implemented with a hash table mapping the object's address to a lock. So in case of hash collisions, you may have accesses to unrelated objects that contend for the same lock.
As such, you should carefully consider whether using a non-lock-free atomic object is actually better than implementing your own locking.
- Is there a better pattern than std::atomic_ref for cross-platform atomic APIs that operate on external references? - I want to avoid requiring all shared variables in the codebase to be declared as std::atomic. Instead, I want to write generic atomic operations that take T& and use atomics internally.
Given those constraints, there aren't really any alternatives offered by the C++ language. But again, consider the implications carefully.
Whether any edge cases are known — e.g., invalid assumptions about struct layout or hidden pitfalls with atomic_ref.
Nothing I know of that would apply to a simple pair of int64_t, which has no padding under the ABIs of any of the implementations you care about, and for which bitwise and numerical comparison are equivalent.
There are some caveats if you use a struct that contains padding. For compare_exchange, C++20 and later promises that the value representation is compared (i.e. the padding bits are not compared), so everything is defined to work as expected, albeit with a performance penalty, if your implementation is fully C++20 compliant. Previous C++ standards would compare the padding bits as well, so you had to be careful to copy the structs in a way that preserves padding (e.g. memcpy rather than member-by-member copy).
That said, there have been bugs in at least one implementation of this, and so I'd be somewhat less than comfortable that every implementation gets it correct in all possible corner cases, especially with atomic_ref where the compiler can't apply special handling to non-atomic writes of the object. I'd want to test very carefully, and/or ensure that all padding is manually cleared.
Even with a correct C++20 implementation, it's still true that the structs (excluding padding) are compared bitwise, which may differ from the result of a member-by-member comparison (e.g. for floating-point types or overloaded operator==). There are also tricky issues if your type is a union where the members are of different sizes or contain different amounts of padding.
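One way to keep these cases out of a generic API is a compile-time guard. The sketch below uses std::has_unique_object_representations (C++17), which is false for types with padding and, on the mainstream implementations, for floating-point members. The trait name kBitwiseCasIsUnsurprising and the example structs are hypothetical, and the check is conservative: it says nothing about an overloaded operator==.

```cpp
#include <cstdint>
#include <type_traits>

// Assumed layout of the question's MyInt128; member names are illustrative.
struct MyInt128 {
    std::int64_t lo;
    std::int64_t hi;
};

// Padding after `tag` on any ABI where int64_t is aligned to more than 1 byte.
struct WithPadding {
    char tag;
    std::int64_t value;
};

// No padding, but equal values can have different bit patterns (+0.0f, -0.0f).
struct WithFloat {
    float x;
    float y;
};

// Hypothetical guard: true only if equal values always have equal bytes,
// so a bitwise compare-exchange matches value comparison of the members.
template <typename T>
inline constexpr bool kBitwiseCasIsUnsurprising =
    std::is_trivially_copyable_v<T> &&
    std::has_unique_object_representations_v<T>;

static_assert(kBitwiseCasIsUnsurprising<MyInt128>);
static_assert(!kBitwiseCasIsUnsurprising<WithPadding>);
static_assert(!kBitwiseCasIsUnsurprising<WithFloat>);
```

Adding a static_assert on such a trait next to the existing is_trivially_copyable check in CompareAndSwap would reject the problematic types up front instead of relying on every implementation handling padding correctly.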