It seems that Intel's Transactional Synchronization Extensions (TSX-NI) work on a per-CPU basis.
This applies both to the _InterlockedXxx_HLE{Acquire,Release} Hardware Lock Elision (HLE) functions and to the _xbegin/_xend/etc. Restricted Transactional Memory (RTM) functions.
What is the "proper" way to use these functions on multi-core systems?
Given their correctness guarantees, I assume I only need to be worried about performance here.
So, how should I structure and write my code for the best performance, given that threads might switch cores at any moment and these instructions might therefore need to fall back to slower code paths?
For example, should I try to set thread CPU affinities explicitly, or is that bad practice?
Is there anything else I should worry about?
The transaction will abort if the CPU takes an interrupt in the middle. The abort is processed before RIP is saved, so the saved RIP already points at the fallback path given to xbegin. After an interrupt and a possible CPU migration, the thread therefore can't resume, on this or another core, and run xend without being inside a transaction.
Thus there's no correctness problem.
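To make the abort behaviour concrete, here is a minimal sketch of the usual RTM pattern: try the critical section as a transaction a few times, then fall back to an ordinary lock. It assumes GCC or Clang on x86-64; the names fallback_lock, counter, increment and the retry count are hypothetical and only for illustration. Any abort, including one caused by an interrupt, makes _xbegin return a status other than _XBEGIN_STARTED, so _xend is only ever reached from inside a live transaction.

```c
#include <assert.h>
#include <cpuid.h>      /* __get_cpuid_count (GCC/Clang, x86 only) */
#include <immintrin.h>  /* _xbegin, _xend, _xabort, _XBEGIN_STARTED */
#include <stdatomic.h>

static atomic_int fallback_lock;  /* hypothetical spinlock: 0 = free, 1 = held */
static long counter;              /* hypothetical shared data it protects */

/* CPUID.(EAX=7,ECX=0):EBX bit 11 reports RTM support. */
static int cpu_has_rtm(void) {
    unsigned eax, ebx, ecx, edx;
    return __get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && ((ebx >> 11) & 1);
}

/* One transactional attempt; returns 1 if the transaction committed. */
__attribute__((target("rtm")))
static int try_transactional_increment(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        /* Reading the lock puts it in our read set, so a thread that takes
           the fallback lock later aborts this transaction. */
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
            _xabort(0xff);
        counter++;
        _xend();
        return 1;
    }
    /* Any abort (conflict, capacity, interrupt, explicit _xabort) resumes
       here with all transactional writes discarded. */
    return 0;
}

/* Increment the counter, preferring a transaction over the lock. */
void increment(int max_attempts) {
    assert(max_attempts >= 0);
    if (cpu_has_rtm()) {  /* real code would cache this result */
        for (int attempt = 0; attempt < max_attempts; attempt++)
            if (try_transactional_increment())
                return;
    }
    /* Slow path: plain spinlock, correct on any CPU. */
    while (atomic_exchange_explicit(&fallback_lock, 1, memory_order_acquire))
        ;
    counter++;
    atomic_store_explicit(&fallback_lock, 0, memory_order_release);
}

long get_counter(void) { return counter; }
```

Because the transactional function carries the target("rtm") attribute, this should build without -mrtm, and the CPUID check keeps xbegin from ever executing on hardware that lacks RTM. Note that nothing in the retry loop cares which core the thread is running on.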
Pinning threads to cores can help performance for cache-locality reasons, if the OS's process scheduler would otherwise be tempted to bounce threads around in a way that's sub-optimal for your workload.
But it won't help TSX specifically: resuming on the same core after an interrupt doesn't rescue the transaction, because it has already aborted. What the same core does give you is that the cache lines you need are probably still hot in L1d, and hopefully still in Exclusive or Modified state.
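If you do decide to pin threads for cache-locality reasons, a minimal Linux-only sketch looks like this. It assumes glibc's sched_setaffinity and sched_getaffinity; the helper names first_allowed_cpu and pin_current_thread_to_cpu are hypothetical. Other operating systems have their own APIs for this.

```c
#define _GNU_SOURCE   /* exposes cpu_set_t, the CPU_* macros and sched_setaffinity */
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling thread to a single CPU.
   Returns 0 on success, -1 on failure (errno is set by the kernel). */
int pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}
```

Picking the CPU from the current affinity mask rather than hard-coding CPU 0 avoids failures inside containers or cpusets where CPU 0 is not available to the process.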
CPU migrations can only happen for user-space tasks when an interrupt puts them to sleep, and the kernel on another core decides to grab that task.
In kernel code, obviously don't call schedule() inside a transaction. Not that it matters for correctness: either the transaction aborts (likely), or execution returns to this task sooner or later and we reach xend, successfully committing everything that happened as a single large transaction (including everything the scheduler and potentially another task did).
I haven't actually played around with this, but I don't think there's any reason to expect thread-affinity performance considerations for TSX to be significantly different from non-TSX.