I am working on a fiber-based job system for my latest project, which will depend on spinlocks for proper functionality. I had intended to use the PAUSE instruction, as that seems to be the gold standard for the waiting portion of a modern spinlock. However, while researching how to implement my own fibers, I came across the fact that the cycle duration of PAUSE has increased to an adverse extent on recent machines.
I found this out from here, where it says, quoting the Intel Optimization Manual, "The latency of PAUSE instruction in prior generation microarchitecture is about 10 cycles, whereas on Skylake microarchitecture it has been extended to as many as 140 cycles," and "As the PAUSE latency has been increased significantly, workloads that are sensitive to PAUSE latency will suffer some performance loss."
As a result, I'd like to find an alternative to the PAUSE instruction for use in my own spinlocks. I've read that in the past, PAUSE has been preferred because it somehow saves energy, which I'm guessing relates to the other often-quoted claim that PAUSE signals to the processor that it's in the middle of a spinlock. I'm also guessing that this is at the opposite end of the power spectrum from doing some dummy calculation for the desired number of cycles.
Given this, is there a solution that comes close to PAUSE's apparent energy efficiency while having the flexibility and low cycle count of a repeated 'throwout' calculation?
> I'm guessing is due to the other often quoted factoid that using PAUSE somehow signals to the processor that it's in the midst of a spinlock.
Yes, `pause` lets the CPU avoid memory-order mis-speculation when leaving a read-only spin-wait loop, which is how you should spin to avoid creating contention for the thread trying to unlock that location. (Don't spam `xchg`.) See also:
- How does x86 pause instruction work in spinlock *and* can it be used in other scenarios?
- Does cmpxchg write destination cache line on failure? If not, is it better than xchg for spinlock? - spin read-only (with `pause`) if the first atomic RMW fails to take the lock. You do want to start with an RMW attempt so you don't get a Shared copy of the cache line and then have to wait for another off-core request; if the first access is an RMW like `xchg` or `lock cmpxchg`, the first request the core makes will be an RFO.
- Locks around memory manipulation via inline assembly - a minimal x86 asm spinlock using `pause` (but no fallback to OS-assisted sleep/wake, so it's inappropriate if the locking thread might ever sleep while holding the lock).
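As a concrete illustration of that RMW-first, read-only-spin pattern, here is a minimal C++ sketch (assuming x86-64 and GCC/Clang for the `_mm_pause` intrinsic; the class and member names are my own, not from any of the linked answers):

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause

class Spinlock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        // Start with an atomic RMW so the core's first request for the
        // cache line is an RFO, not a read that gets a Shared copy.
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Lost the race: spin read-only so we don't keep bouncing the
            // cache line away from the owner, with pause to avoid
            // memory-order mis-speculation when we finally leave the loop.
            while (locked.load(std::memory_order_relaxed))
                _mm_pause();
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```

The inner read-only loop is what keeps the lock holder's `unlock` store cheap; only when the load sees `false` does the outer `exchange` retry the RFO.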
If you have to wait for another core to change something in memory, it doesn't help much to check more frequently than the inter-core latency, especially if you'll stall right when you finally do see the change you've been waiting for.
**`pause` being slow (newer Intel) or fast (AMD and old Intel)**

If you have code that uses multiple `pause` instructions between checks of the shared value, you should change the code to do fewer.
See also Why have AMD-CPUs such a silly PAUSE-timing, with Brendan's suggestion of checking `rdtsc` in spin loops to better adapt to unknown delays from `pause` on different CPUs.
Basically, try to make your workload less sensitive to `pause` latency. That may also mean waiting for data from other threads less often, or having other useful work to do until a lock becomes available.
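Brendan's `rdtsc` idea can be sketched like this (x86-64 with GCC/Clang assumed; `spin_wait_for` and the deadline value are my own illustration): by measuring the timeout in TSC ticks rather than counting `pause` iterations, the wait budget means the same thing whether `pause` takes ~10 cycles (AMD, pre-Skylake Intel) or ~140 cycles (Skylake and later).

```cpp
#include <atomic>
#include <x86intrin.h>   // __rdtsc, _mm_pause

// Spin until `flag` becomes true, or give up after roughly
// `deadline_cycles` reference (TSC) ticks have elapsed.
bool spin_wait_for(const std::atomic<bool>& flag,
                   unsigned long long deadline_cycles) {
    unsigned long long start = __rdtsc();
    while (!flag.load(std::memory_order_acquire)) {
        _mm_pause();
        if (__rdtsc() - start > deadline_cycles)
            return false;   // timed out; caller can fall back to an OS sleep
    }
    return true;
}
```

Note that `rdtsc` counts at the fixed reference frequency, not the core clock, so the budget is wall-clock-ish rather than an exact core-cycle count; that's usually what you want for a timeout anyway.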
On CPUs new enough to have the WAITPKG extension (Tremont, or Alder Lake / Sapphire Rapids), `umonitor` / `umwait` could be viable to wait in user-space for a memory location to change: like spinning, but the CPU wakes itself up when it sees cache-coherency traffic about a change, or something like that. Although that may be slower than `pause` if it has to enter a sleep state.
(You can ask `umwait` to only go into C0.1, not C0.2, with bit 0 of the register you specify, and EDX:EAX as a TSC deadline. Intel says the C0.2 sleep state improves performance of the other hyperthread, so presumably that means switching back to single-core-active mode, de-partitioning the store buffer, ROB, etc., and having to wait for re-partitioning before this core can wake up. The C0.1 state doesn't do that.)
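A hedged sketch of how that might look with the WAITPKG intrinsics (GCC/Clang on x86-64 assumed; `wait_for_change` and the TSC deadline constant are my own invention, and the feature bit is checked at runtime so the code falls back to `pause` on CPUs without WAITPKG):

```cpp
#include <cstdint>
#include <cpuid.h>       // __get_cpuid_count (GCC/Clang)
#include <x86intrin.h>   // __rdtsc, _mm_pause
#include <immintrin.h>   // _umonitor, _umwait

static bool has_waitpkg() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return ecx & (1u << 5);   // CPUID.(EAX=7,ECX=0):ECX bit 5 = WAITPKG
}

// Sleep lightly until *addr changes from `old` (or a deadline passes).
__attribute__((target("waitpkg")))
static void wait_for_change_umwait(const volatile uint32_t* addr, uint32_t old) {
    while (*addr == old) {
        _umonitor((void*)addr);          // arm the monitor on this cache line
        if (*addr != old) break;         // re-check: may have changed already
        // control = 1 requests C0.1 only (no deep C0.2 sleep);
        // second arg is an EDX:EAX TSC deadline so we wake up eventually.
        _umwait(1, __rdtsc() + 100000);
    }
}

static void wait_for_change(const volatile uint32_t* addr, uint32_t old) {
    if (has_waitpkg())
        wait_for_change_umwait(addr, old);
    else
        while (*addr == old) _mm_pause();   // portable fallback: plain spin
}
```

The re-check between `umonitor` and `umwait` matters: if the store happened in that window, the monitor may already be triggered, but checking explicitly avoids sleeping on a stale value.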
Even in the worst case, `pause` is only about 140 core clock cycles. That's still much faster than a Linux system call on modern x86-64, especially with Spectre / Meltdown mitigations. (Thousands to tens of thousands of clock cycles, up from a couple hundred just for `syscall` + `sysret`, let alone calling `schedule()` and maybe running something else.)
So if you're aiming to minimize wakeup latency at the expense of wasting CPU time spinning longer, `nanosleep` is not an option. It might be good for other use-cases though, as a fallback after spinning on `pause` a couple of times.
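That spin-then-sleep fallback might look something like this (POSIX `nanosleep`; the spin count and sleep duration here are arbitrary tuning knobs for illustration, not recommendations):

```cpp
#include <atomic>
#include <time.h>        // nanosleep, timespec (POSIX)
#include <immintrin.h>   // _mm_pause

// Spin briefly with pause for low wakeup latency, then back off to
// nanosleep so we stop burning CPU (and power) if the wait is long.
void wait_until_set(const std::atomic<bool>& flag) {
    for (int i = 0; i < 64; ++i) {        // short spin: wins if the wait is short
        if (flag.load(std::memory_order_acquire)) return;
        _mm_pause();
    }
    timespec ts{0, 1000};                 // ~1 microsecond per sleep
    while (!flag.load(std::memory_order_acquire))
        nanosleep(&ts, nullptr);          // give up the CPU between checks
}
```

Keep in mind the actual sleep granularity is at the mercy of the kernel's timer resolution, so the post-spin wakeup latency can be much worse than 1 µs.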
Or use `futex` to sleep until a value changes, or until a notify from another process. (It doesn't guarantee that the kernel will use `monitor` / `mwait` to sleep until a change; it will let other tasks run. So you do need to have the unlocking thread make a `futex` system call if any waiters have gone to sleep with `futex`. But you can still make a lightweight mutex that avoids any system calls in the locker and unlocker when there's no contention and no threads go to sleep.)
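A lightweight futex-backed mutex along those lines might be sketched like this (Linux-only; a simplified version of the scheme from Ulrich Drepper's "Futexes Are Tricky", with my own naming): the uncontended fast path makes no syscalls, and `unlock` only calls `FUTEX_WAKE` when the state says some waiter actually went to sleep.

```cpp
#include <atomic>
#include <linux/futex.h>   // FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE
#include <sys/syscall.h>   // SYS_futex
#include <unistd.h>        // syscall

// State: 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
class FutexMutex {
    std::atomic<int> state{0};
    static long futex(std::atomic<int>* uaddr, int op, int val) {
        return syscall(SYS_futex, reinterpret_cast<int*>(uaddr), op, val,
                       nullptr, nullptr, 0);
    }
public:
    void lock() {
        int expected = 0;
        if (state.compare_exchange_strong(expected, 1,
                                          std::memory_order_acquire))
            return;                               // fast path: no syscall
        do {
            // Mark the lock as contended (1 -> 2), then sleep while it
            // stays 2. FUTEX_WAIT returns immediately if state != 2.
            if (expected == 2 || state.compare_exchange_strong(expected, 2))
                futex(&state, FUTEX_WAIT_PRIVATE, 2);
            expected = 0;
        } while (!state.compare_exchange_strong(expected, 2,
                                                std::memory_order_acquire));
    }
    void unlock() {
        // Only pay for a syscall if a waiter may be asleep.
        if (state.exchange(0, std::memory_order_release) == 2)
            futex(&state, FUTEX_WAKE_PRIVATE, 1);   // wake one sleeper
    }
};
```

A woken thread re-acquires with state 2 rather than 1, which can cause one spurious `FUTEX_WAKE` with nobody waiting; that's the standard trade-off for not tracking an exact waiter count.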
But at that point, with `futex` you're probably reproducing what glibc's `pthread_mutex` does.