atomic_flag_test_and_set
Yes!atomic_flag_clear
Yes!atomic_flag_test_and_clear
nopeatomic_flag_set
nopeIf you wanted to do something like set a flag on a event in some context, and in some other context check and clear the event, C/C++ doesn't allow you to do in single atomic call per context.
You'd have to invert the flag, so clear the flag on the event, check and set the flag when checking the event.
Not a huge deal, but seems backwards in this scenario, especially considering the default state of the flag is false, which in the inverted sense means the event is asserted by default.
I supposed alternatively, an atomic bool
with atomic_exchange
could be used instead.
Practical advice: use atomic<bool>
or atomic<unsigned char>
instead of atomic_flag
in normal code if its limited set of operations makes your code less efficient.
Unless you care about being lock-free on very primitive machines where atomic<bool>
and atomic<int>
might not be lock-free. (Or perhaps use C++20 std::atomic_unsigned_lock_free
.)
TAS (Test And Set) is one of the well-known primitive atomic RMW operations in computer science that can be a building block for mutual-exclusion (locking). On some old hardware (like Motorola 68000), it's the only atomic RMW.
There are no machines where the only atomic RMW is a test-and-clear. (Zero-initializing memory is the norm, so a spinlock or mutex in static storage will start out in the unlocked state if 0 means unlocked, and TAS takes the lock. You need an atomic RMW when acquiring a spinlock but not when releasing.)
TAS can also be efficiently implemented in terms of an atomic exchange, aka swap, which some other old hardware provides as its only atomic RMW. (local = foo.exchange(true)
and test the result.)
But neither TAS nor exchange work as building blocks for arbitrary lock-free atomic RMWs like fetch_xor
or CAS (Compare-and-Swap e.g. compare_exchange_weak/strong
). A machine with just TAS or a way to emulate it can't provide lock-free std::atomic<bool>
, but can provide lock-free std::atomic_flag
.
(LL/SC or CAS itself are building blocks for arbitrary lock-free RMWs on a single variable. All(?) modern machines that support multiple cores have at least one of those, and sometimes also direct support for some of the common integer operations like fetch_add
, fetch_or
, exchange
, etc. like on x86-64 and ARMv8.1. And of course atomicity guarantees for pure-load and pure-store operations so .load()
can be done as an actual load, not an RMW which would fault on read-only memory and have contention between readers.)
Hypothetically, on a CPU with only a test-can-clear, you could still implement atomic_flag
with the C++ false
state having a non-zero object-representation; before C++20 it wasn't required that static initialization made it false. But if a CPU had TAS and "TAC" (or exchange which can do both) but not enough functionality to implement lock-free atomic<bool>
, you can't take full advantage of it via atomic_flag
.
atomic_flag
is required to be lock-free, unlike any atomic<T>
. It exists to expose a minimal set of lock-free functionality that can be implemented across a wide range of hardware that's able to implement ISO C++11 at all (including std::mutex
and the locking for non-lock-free std::atomic<T>
.)
Some things it avoids requiring:
Read-only access (before C++20 .test()
)
Some hardware might have race detection where it could fault on concurrent read + write. (The same condition that would be undefined behaviour in C++ on non-atomic variables.) Presumably such hardware would have special instructions that are allowed to be concurrent, which may only include writes and RMWs.
Write-only access other than .clear()
.
Writing true
can be done with .test_and_set()
and ignoring the return value. But that's not efficient. It's still an atomic RMW and thus continues a release-sequence and has to see the "latest value" in the modification order so it's linearizable, so it's non-trivial for compilers to optimize it to just a store if the return value isn't used.
IDK what hardware reason might exist for not providing a .set()
API. If the underlying hardware doesn't allow plain stores of values other than 0
(or whatever the actual bit-pattern is for clear
), the implementation can always use a TAS instruction and ignore the result. Being stronger isn't a problem.
So this might just be a case of keeping the API minimal because it mostly doesn't matter; most code should just use atomic<bool>
because that's also lock-free on the real platforms most people are programming for.
Store of a variable value: you could always if(x) flag.TAS(); else flag.clear()
, but the API doesn't provide that.
Toggle of an existing value: the only available RMW (TAS) stores a new value that doesn't depend on what was already there. This allows implementation in terms of instructions like ARM swp
(Swap) which pre-dated LL/SC support on ARM. As well as in terms of TAS itself.
Presumably this is one of the factors that makes TAS and swap easier to implement in hardware: a read+write could maybe happen in the same cycle, unlike with operations where the value to be written has to be computed from the load result so won't be ready until at least a cycle later.
If any operation supported by atomic<bool>
needs a mutex (due to lack of CAS or LL/SC), all operations have to respect the mutex. (Except pure loads, if tearing and memory ordering aren't a problem, e.g. for relaxed loads of a byte or bool, or atomic int on some systems, but that's a corner case for obsolete systems that modern compilers probably don't bother with.) Then atomic<bool>::is_always_lock_free
has to be false
.
C++20 added std::atomic_signed_lock_free
and std::atomic_unsigned_lock_free
integer types that are guaranteed lock-free (and are "most efficient" for wait/notify). Those are optional in freestanding implementations (not hosted under an OS), but I think this rules out C++20 hosted implementations on 386 or 68000; you'd need 486 or 68020 or later. I think C++20 decided to add useful stuff that people want even if that means some retro hardware can't implement C++20. Modern C++ has been making choices like that in other areas, like requiring signed integers to be two's complement, dropping the implementation choice of one's complement or sign/magnitude.
In ISO C, the thread/mutex/atomics stuff is optional, unlike ISO C++, so modern C can still be implemented on ancient hardware by leaving out the thread support library entirely.
68000 TAS instruction 68000 / http://www.easy68k.com/paulrsm/doc/dpbm68k3.htm
68020 and later have CAS (and even CAS2 on 68020/030/040, an operation on two separate memory locations, although that proved very difficult to support efficiently so was dropped from later CPUs).
ARMv5 (just swp
Swap)
SparcV8 (mentioned by https://llvm.org/docs/Atomics.html#atomics-and-codegen)
80386: cmpxchg
was officially new in Pentium, although 486 had a different undocumented opcode for it.
(See also a retrocomputing 486 SMP question for some mention of some early x86 SMP systems.)
80386 has xchg
(which is an atomic RMW even without a lock
prefix when used with a memory operand, whether you want it or not), and lock bts
/ btr
/ btc
to test-and-set, test-and-reset (clear), or test-and-complement (flip) a bit without disturbing surrounding bits.
It also has lock add
, lock or
/ lock and
etc. (fetch_or
/ fetch_and
without a return value, other than setting FLAGS from the result so you can branch on the result being all-zero or having its MSB set. Or the parity of the low byte.) lock xadd
was new in 486 (fetch_add
including the return value) If you use the return value of fetch_or
,
compilers have to implement it using a lock cmpxchg
retry loop.
But none of these are cmpxchg
or sufficient to emulate it.
Technically, mutual exclusion is possible without an atomic RMW via Peterson's algorithm with just seq_cst
loads and stores, but that requires an array with an entry for each possible thread, and C++ allows starting new threads after mutex objects already exist and are in use. So that's not really viable. Lamport's bakery algorithm has the same requirement for an array with size = NUM_THREADS
.
C11 and C++11 weren't interested in trying to support threads on such ancient CPUs where multiprocessing is such a struggle, so they went ahead and required atomic_flag
to support lock-free RMWs.
On a uniprocessor machine, one way to implement any atomic RMW is just to disable/re-enable interrupts around it. This is still lock-free in a sense: it can't create a deadlock between an interrupt or signal handler and the main thread like an actual spinlock or mutex could. This would require a system call in a hosted implementation.
Or, on uniprocessor machines where interrupts can only happen at instruction boundaries, do it with a single instruction. e.g. x86 add [mem], eax
is atomic wrt. interrupts and thus context switches on the same core, but not in a multi-core system unless you also use the lock
prefix. (Is incrementing an int effectively atomic in specific cases?).
So some CISCs with memory-destination RMW instructions can use those for operations that only have to be atomicity wrt. other threads (or interrupt handlers) that could only run on the same core (because this is a uniprocessor system or because we only care about interrupt / signal handlers). Even though the instructions don't have any special RMW guarantees wrt. concurrent writers like DMA or other bus-master I/O devices. Or other cores if any existed.
Some ISAs like m68k can save partial progress of executing an instruction and resume after an interrupt, defeating this. But x86 isn't like that; interrupts are only processed at instruction boundaries. (See Is x86 CMPXCHG atomic, if so why does it need LOCK?)
How is atomic_flag implemented? - for x86, current compilers use xchg
because it's a bit more efficient than lock bts
. clear(memory_order_release)
or weaker is much more efficient than lock btr
, using plain mov
without mfence
or xchg
.
TAS instruction 68000 - example asm for a spinlock
http://www.easy68k.com/paulrsm/doc/dpbm68k3.htm discusses the design behind TAS
Is incrementing an int effectively atomic in specific cases? - how modern microarchitectures with cache handle atomic RMWs without locking the whole memory bus, so separate cores can be doing atomic RMWs on separate memory locations at the same time.