#include <atomic>
#include <thread>

alignas(sizeof(void*)) int n1;
std::atomic<int> n2;

int main() {
    std::thread threads[32];
    for (auto& thread : threads) {
        thread = std::thread([] {
            while (true) {
                // randomly load and/or store n1, n2
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
}
Consider the code above:

n1 is aligned to the native word boundary, so it can be loaded and stored atomically without a LOCK prefix at the assembly instruction level.

n2 is a std::atomic<int>; I'm not sure whether it will use a LOCK prefix at the assembly instruction level.

My question is: Is it always safe to use an aligned-to-native-word int variable, instead of a std::atomic<int> variable, for the best performance gain?
"Is it always safe to use an aligned-to-native-word int varible , instead of an std::atomic variable, for the best performance gain?"
It's definately not safe:
According to the C++ memory model, accessing n1 from multiple threads in this case constitutes a data race. And as you can see in the link above:

If a data race occurs, the behavior of the program is undefined.

Undefined behavior means anything can happen, and you should always avoid it. One possible result is that the compiler optimizes away reads of n1 on the assumption that it is not modified by another thread. (Note that de-facto atomicity of reads and writes due to alignment to a word boundary etc. does not preclude a data race.)
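As an illustration (my own sketch, not taken from the question's code): with a plain int, the compiler is free to hoist the load out of a spin loop, so a waiting thread may never observe another thread's write, while std::atomic<int> forces a real load on every iteration.

#include <atomic>
#include <thread>

int plain_ready = 0;               // data race if read while another thread writes it
std::atomic<int> atomic_ready{0};  // well-defined concurrent access

void wait_plain() {
    // The compiler may read plain_ready once and spin forever on the cached value.
    while (plain_ready == 0) { }
}

void wait_atomic() {
    // Every iteration performs a genuine load; the store below is eventually observed.
    while (atomic_ready.load(std::memory_order_acquire) == 0) { }
}

int main() {
    std::thread t(wait_atomic);  // wait_plain() could hang, so only the atomic variant runs here
    atomic_ready.store(1, std::memory_order_release);
    t.join();
}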
Regarding performance:
std::atomic<T> is part of the standard library and is meant for exactly such cases. The standard library usually supplies the most efficient implementation of these primitives, and you need a really special reason (which you did not specify) to avoid using it.

If you are concerned about the overhead of locking, you can check std::atomic<T>::is_lock_free and std::atomic<T>::is_always_lock_free (since C++17). But keep in mind that if it is not lock-free, a lock really is required.
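A minimal sketch of how to query this, assuming a C++17 compiler:

#include <atomic>
#include <iostream>

int main() {
    // Compile-time guarantee (C++17); expected to hold on mainstream 64-bit targets.
    static_assert(std::atomic<int>::is_always_lock_free,
                  "std::atomic<int> is not lock-free on this target");

    std::atomic<int> n{0};
    // Per-object run-time query; relevant for types whose lock-freedom depends on size/alignment.
    std::cout << std::boolalpha << "lock-free: " << n.is_lock_free() << '\n';
}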
Additional info regarding performance (courtesy of Peter Cordes):
std::atomic<int> is lock-free on all modern mainstream systems (except 8-bit microcontrollers), so you don't really have to worry about that.
The default memory_order for operations on std::atomic is std::memory_order_seq_cst. On x86, that costs extra for .store, but not for loads or RMWs. Use .load(std::memory_order_acquire) and .store(val, std::memory_order_release) so the compiler doesn't have to emit any extra barrier instructions (or an xchg for the seq-cst store), unless something you're doing actually needs some of the guarantees that sequential consistency gives but acquire/release don't. (The expensive parts are StoreLoad ordering, and forbidding IRIW reordering, which even on PowerPC costs extra barriers to prevent.)