c++ multithreading clang atomic stdatomic

Cross-platform 128-bit atomic support: std::atomic vs std::atomic_ref on Clang/MSVC (macOS ARM64, Windows x64, Linux)


Background

I'm building a cross-platform atomic abstraction layer to support 64-bit and 128-bit atomic operations for the following types:

The platforms I need to support:

Implementation Strategy

I've based my solution on this helpful post: Cross-platform support for 128-bit atomic operations in Clang

I'm using the following platform-specific setup:

MSVC (x64 / ARM64)

• Type: MyInt128

• Mechanism: std::atomic_ref<T>

• Notes: Not lock-free, but works via fallback locking

Clang (Linux x64)

• Type: __int128, __int64

• Mechanism: std::atomic_ref<T>

• Notes: Works perfectly

Clang (macOS ARM64)

• Type: __int128, __int64

• Mechanism: std::atomic<T>

• Notes: std::atomic_ref is unavailable on this platform

I’m choosing std::atomic_ref<T> where possible because:

It works on both trivially-copyable fundamental types and structs.

It lets me write APIs that take T& (reference to the actual variable), rather than requiring users to declare all variables as std::atomic<T>.

Questions

  1. Is this a safe and correct approach overall? - For MSVC, I understand that 128-bit atomic operations are not lock-free, but fall back to internal locks. I tested std::atomic_ref<MyInt128> with two threads over 10 million iterations, and the CAS loop completed without issues.
  2. Why does std::atomic_ref<T> fail to compile on macOS ARM64 (Clang), even though std::atomic<T> works? - The error I get: no template named 'atomic_ref' in namespace 'std'. Am I missing an include? Or is this a known limitation in Apple's libc++ implementation?
  3. Is it safe to use std::atomic_ref<MyInt128> on MSVC? - MyInt128 is a trivially-copyable 16-byte struct (two int64_t members). Will all atomic operations work correctly (even if not lock-free)?
  4. Is there a better pattern than std::atomic_ref<T> for cross-platform atomic APIs that operate on external references? - I want to avoid requiring all shared variables in the codebase to be declared as std::atomic<T>. Instead, I want to write generic atomic operations that take T& and use atomics internally.

Example Snippet

template<typename T>
T CompareAndSwap(T& dest, T desired, T expected)
{
    static_assert(std::is_trivially_copyable_v<T>);
    static_assert(sizeof(T) == 8 || sizeof(T) == 16);

#if defined(__APPLE__) && defined(__aarch64__)
    std::atomic<T>* atomicPtr = reinterpret_cast<std::atomic<T>*>(&dest);
    atomicPtr->compare_exchange_strong(expected, desired);
#else
    std::atomic_ref<T> atomicRef(dest);
    atomicRef.compare_exchange_strong(expected, desired);
#endif

    return expected;
}

Test Observation

Goal I want to validate:


Solution

    1. Is this a safe and correct approach overall? - For MSVC, I understand that 128-bit atomic operations are not lock-free, but fall back to internal locks. I tested std::atomic_ref<MyInt128> with two threads over 10 million iterations, and the CAS loop completed without issues.

    It's "safe and correct" in that it has well-defined behavior under the C++ standard, if the required conditions are met. So unless one of those implementations has bugs (I don't know of any), that's the semantics you'll get.

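    But with atomic_ref, you have to remember that all non-atomic accesses to the object (i.e. not through atomic_ref) are still subject to the usual C++ data race rules. In particular:

    • You must not write to the object non-atomically while any other thread may be accessing it, atomically or not.

    • You must not read the object non-atomically while another thread may be writing it, even if that write goes through atomic_ref.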

    Doing either of those things is a data race and causes undefined behavior (anything can happen including nasal demons). The effects of a data race are not limited to simply reading or writing incorrect values to the object; more severe failures are possible. They will not be detected at compile time, nor reliably at run time, and may be very difficult to reproduce. (Tools like Thread Sanitizer can help but are not 100% effective.)

    So while using atomic_ref does mean you don't have to change declarations elsewhere in your codebase, you still have to audit that entire codebase to ensure it is free of data races.

    2. Why does std::atomic_ref<T> fail to compile on macOS ARM64 (Clang), even though std::atomic<T> works? - The error I get: no template named 'atomic_ref' in namespace 'std'. Am I missing an include? Or is this a known limitation in Apple's libc++ implementation?

    You need to compile with -std=c++20 or higher. Apple's clang currently defaults to C++14.

    With this option, using the Xcode command line tools version 16.4 (clang-1700.0.13.5), your example code with atomic_ref compiles fine for me, and produces the expected assembly with a caspal instruction.
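A quick way to tell a missing language mode apart from a genuinely missing library feature is the standard feature-test macro `__cpp_lib_atomic_ref`. This is a minimal sketch; `kHasAtomicRef` is a name made up for illustration.

```cpp
#include <atomic>

// <version> is a C++20 header; __has_include lets older modes skip it.
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>
#  endif
#endif

// __cpp_lib_atomic_ref is defined by the standard library only when
// std::atomic_ref is available in the current language mode.
#if defined(__cpp_lib_atomic_ref)
constexpr bool kHasAtomicRef = true;
#else
constexpr bool kHasAtomicRef = false;
#endif
```

If `kHasAtomicRef` is false without `-std=c++20` and true with it, the toolchain is fine and only the flag was missing.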

    Your approach of type-punning via a reinterpret_cast of pointers is undefined behavior, formally speaking. It will probably work in practice on this platform for the time being, but it's hard to be sure. It may certainly break on other platforms (e.g. on MSVC, std::atomic<T> actually contains a lock in some cases, and so is a different size from T), or possibly in future OS updates.

    If atomic_ref were really not available, a better option in this setting would be the GCC/Clang __atomic builtins, e.g. __atomic_compare_exchange_n. (These replace the older, deprecated __sync builtins, which do not let you specify memory ordering.) They take pointers to non-atomic T, and emit the expected assembly (caspal). Indeed, this is presumably how the standard library implements atomic_ref. They're still non-portable, but at least they're documented to behave correctly on this platform, and should be forward compatible.
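As a rough illustration of the builtin route, here is a sketch of the question's `CompareAndSwap` written against `__atomic_compare_exchange_n`. It assumes GCC or Clang (MSVC does not provide these builtins), uses sequentially consistent ordering as an arbitrary choice, and `CompareAndSwapBuiltin` is a hypothetical name. The `_n` variant accepts only integer and pointer types, so a struct such as `MyInt128` would need the generic `__atomic_compare_exchange` instead.

```cpp
#include <cstdint>

// Hypothetical GCC/Clang-only counterpart of the question's CompareAndSwap.
// T must be an integer or pointer type (the _n builtins reject structs).
template <typename T>
T CompareAndSwapBuiltin(T& dest, T desired, T expected)
{
    static_assert(sizeof(T) == 8 || sizeof(T) == 16);
    // On failure, the builtin stores the value it observed into `expected`;
    // on success, `expected` already equals the previous value.
    __atomic_compare_exchange_n(&dest, &expected, desired,
                                /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

The return value is the value that was in `dest` before the call, so the caller can compare it with what it expected to learn whether the swap happened.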

    3. Is it safe to use std::atomic_ref<MyInt128> on MSVC? - MyInt128 is a trivially-copyable 16-byte struct (two int64_t members). Will all atomic operations work correctly (even if not lock-free)?

    Again, unless MSVC has bugs, then it will work correctly. But that's sort of tautological, so I don't really know what answer you were expecting.

    You should make sure you are aware of the implications of a non-lock-free atomic object. If one thread gets scheduled out while holding the lock on an object, then any other threads trying to access that object atomically will block until the first thread is scheduled back in and releases the lock. Under heavy system load, this could involve a significant delay (many milliseconds at least).

    To make things worse, MSVC uses spinlocks for at least some non-lock-free atomics. So in the above situation, not only will the other threads be blocked until the first thread is scheduled back in, but they will consume 100% CPU in the meantime, making the system load even heavier.

    Moreover, non-lock-free atomics are often implemented with a hash table mapping the object's address to a lock. So in case of hash collisions, you may have accesses to unrelated objects that contend for the same lock.

    As such, you should carefully consider whether using a non-lock-free atomic object is actually better than implementing your own locking.

    4. Is there a better pattern than std::atomic_ref<T> for cross-platform atomic APIs that operate on external references? - I want to avoid requiring all shared variables in the codebase to be declared as std::atomic<T>. Instead, I want to write generic atomic operations that take T& and use atomics internally.

    Given those constraints, there aren't really any alternatives offered by the C++ language. But again, consider the implications carefully.

    Whether any edge cases are known — e.g., invalid assumptions about struct layout or hidden pitfalls with atomic_ref.

    Nothing I know of that would apply to a simple pair of int64_t, which has no padding under the ABIs of any of the implementations you care about, and for which bitwise and numerical comparison are equivalent.
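The no-padding assumption can be checked at compile time with the C++17 trait `std::has_unique_object_representations`. A minimal sketch follows; the member names and `alignas(16)` of `MyInt128` are assumptions about the question's struct, and `Padded` is a hypothetical counter-example.

```cpp
#include <cstdint>
#include <type_traits>

// Assumed layout of the question's MyInt128; alignas(16) is an assumption.
struct alignas(16) MyInt128 {
    std::int64_t lo;
    std::int64_t hi;
};

// Hypothetical counter-example: padding bytes follow `tag`.
struct Padded {
    char tag;
    std::int64_t value;
};

static_assert(sizeof(MyInt128) == 16);
// True only if equal values always have identical bytes (so no padding).
static_assert(std::has_unique_object_representations_v<MyInt128>);
static_assert(!std::has_unique_object_representations_v<Padded>);
```

The trait is also false for floating-point members, so it doubles as a guard for the case where bitwise and numerical comparison disagree.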

    There are some caveats if you use a struct that contains padding. For compare_exchange, C++20 and later promises that the value representation is compared (i.e. the padding bits are not compared), so everything is defined to work as expected, albeit with a performance penalty, if your implementation is fully C++20 compliant. Previous C++ standards would compare the padding bits as well, so you had to be careful to copy the structs in a way that preserves padding (e.g. memcpy rather than member-by-member copy).

    That said, there have been bugs in at least one implementation of this, and so I'd be somewhat less than comfortable that every implementation gets it correct in all possible corner cases, especially with atomic_ref where the compiler can't apply special handling to non-atomic writes of the object. I'd want to test very carefully, and/or ensure that all padding is manually cleared.

    Even with a correct C++20 implementation, it's still true that the structs (excluding padding) are compared bitwise, which may differ from the result of a member-by-member comparison (e.g. for floating-point types or overloaded operator==). There are also tricky issues if your type is a union where the members are of different sizes or contain different amounts of padding.