I'm measuring costs of busy-waiting implemented with two approaches:
// 1
while(run) { // looping on volatile flag
}
// 2
while(run) { // looping on volatile flag
Thread.onSpinWait();
}
Complete code of the examples is available on GitHub via link1 and link2.
I'm running the benchmarks with -prof perfnorm
and they yield:
Benchmark Mode Cnt Score Error Units
WhileTrueBenchmark.whileTrue avgt 20 6460.700 ± 109.333 ns/op
WhileTrueBenchmark.whileTrue:CPI avgt 4 0.444 ± 0.005 clks/insn
WhileTrueBenchmark.whileTrue:IPC avgt 4 2.252 ± 0.026 insns/clk
WhileTrueBenchmark.whileTrue:L1-dcache-loads avgt 4 51523.529 ± 3556.009 #/op
WhileTrueBenchmark.whileTrue:branches avgt 4 13981.285 ± 958.249 #/op
WhileTrueBenchmark.whileTrue:cycles avgt 4 36407.576 ± 2434.292 #/op
WhileTrueBenchmark.whileTrue:instructions avgt 4 81985.523 ± 6300.983 #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait avgt 20 6463.334 ± 49.922 ns/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:CPI avgt 4 2.143 ± 0.056 clks/insn
ThreadOnSpinWaitPlainBenchmark.onSpinWait:IPC avgt 4 0.467 ± 0.012 insns/clk
ThreadOnSpinWaitPlainBenchmark.onSpinWait:L1-dcache-loads avgt 4 7262.587 ± 324.600 #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:branches avgt 4 2951.111 ± 162.867 #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:cycles avgt 4 36307.064 ± 1516.787 #/op
ThreadOnSpinWaitPlainBenchmark.onSpinWait:instructions avgt 4 16943.396 ± 820.446 #/op
So from the output we see that for the same time elapsed we have the same cycle count, but Thread.onSpinWait()
executed almost 5 times less instructions. This is understandable and expected behavior.
What is unexpected to me is that it produced much less branches
and L1-dcache-loads
. The benchmarked code doesn't have much branching and reads one and the same flag from memory.
So why are these two metrics (branches
and L1-dcache-loads
) so different?
On x86, Thread.onSpinWait()
intrinsic is translated to the PAUSE
instruction. PAUSE
delays the execution of the next instruction for an implementation-specific amount of time. Because of this delay, the second loop executes the less number of times comparing to the loop without onSpinWait
.
Extra delay per each loop iteration => less number of iterations => less number of retired instructions (including cmp
and jne
) => less branches and memory loads.