I have the following Swift code that takes an iteration count and another argument and performs some computation in a loop.
@inline(never)
func rawComputeAll(iterCount: Int, timeStep: Double) -> Double {
    // most things up here are constants, scroll down to the loop
    let rho = 995.59
    let c = 0.04017672438703612,
        Rt = 0.0042879141883543715
    let s = 4184.0
    let T_in = 300.0,
        S_in = 1000.0
    var h_st = 1.0e5,
        f_st = 0.5,
        m = c * rho,
        sv = s
    var T_out = 300.0,
        h_out = 1.0e5
    for _ in 0..<iterCount {
        T_out = T_in - Rt * S_in
        let T_st = 273.15 + h_st / sv
        let deltaT = T_out - T_st
        let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT
        let deltaS = sign * S_in * timeStep
        let deltaU = deltaS * 1.0
        let m_in = f_st * timeStep
        let E_in = m_in * h_out
        let m_total = m + m_in
        let E_state = m * h_out + E_in + deltaU
        h_st = E_state / m_total
    }
    return h_st
}
I benchmark it in two ways.

The first is a straightforward single call with a large iterCount, like so:

let startTime1 = DispatchTime.now()
let result1 = rawComputeAll(iterCount: iterCount, timeStep: timeStep)
let endTime1 = DispatchTime.now()

(I actually used a proper benchmarking framework that reported the distribution of runtimes, but I am using this simple timing code here for brevity and ease of reproducibility.)

The second calls the function with iterCount: 1 inside an outer loop, perturbing timeStep on every call so that the Swift compiler does not optimize the loop out:

let scaling = 1.0e-6
var result2 = Double.random(in: 0.1...10.0)
let startTime2 = DispatchTime.now()
for _ in 1...iterCount {
    result2 = rawComputeAll(iterCount: 1, timeStep: timeStep + scaling*result2)
}
let endTime2 = DispatchTime.now()
Now, I was expecting the second method to be slower because of the overhead of function calls in the loop (note the @inline(never) used to ensure this), and also because it does a couple of extra FLOPs in the loop body.
But it turns out that this method with the loop on the outside clocks significantly faster than the straightforward benchmark -- on my machine there is a 30% difference between them. Here is a Godbolt link showcasing this problem.
So, I have two questions.
I did all measurements (profiling and benchmarking) on release builds.
Is for _ in 0..<iterCount just there to repeat the same work for benchmarking, so a compiler could at least in theory optimize it to if iterCount > 0 { }? (Spoiler alert: no, the two programs are doing different work.)

Does the run time scale anywhere near linearly with the iteration count for high counts? If not, the compiler may be optimizing away the loop. I'd have expected an optimizing compiler to inline the function at least in the second version if this was just an apples-to-apples throughput benchmark, but your Godbolt link confirms that's not happening.
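A quick way to check the scaling question is something like this (a sketch; the iteration counts and the timeStep value are arbitrary placeholders):

import Dispatch

// Sketch: if the time per iteration doesn't stay roughly constant as iterCount
// grows 10x, the compiler has probably collapsed or restructured the loop.
for n in [1_000_000, 10_000_000, 100_000_000] {
    let t0 = DispatchTime.now().uptimeNanoseconds
    _ = rawComputeAll(iterCount: n, timeStep: 1.0e-3)   // 1.0e-3 is a placeholder timeStep
    let t1 = DispatchTime.now().uptimeNanoseconds
    print("iterCount \(n): \(Double(t1 - t0) / Double(n)) ns per iteration")
}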
Copy/pasting the loop from .LBB1_4: to dec rdi / jne .LBB1_4 of output.rawComputeAll into https://uica.uops.info/ to have it analyze for bottlenecks, apparently there's a loop-carried dependency bottleneck of 65 cycles on Skylake.
According to the dependency analysis, it's through XMM11, which starts out initialized to 100000.00 aka 1e5 (from movsd xmm11, qword ptr [rip + .LCPI1_0], and the Godbolt mouseover for the .quad 0x40f86a0000000000 bit-pattern shows the value). So it's h_st. Ok yes, we can see in your code that variables are computed from h_st and then at the end of the loop stuff is assigned back to h_st.

So that's why the compiler wasn't able to optimize away the looping.
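To make that concrete, here is a hypothetical distillation of rawComputeAll (the function name and the manual hoisting of the loop-invariant values are mine): the only value that survives from one iteration to the next is h_st, and the divisions sit on that path, so iteration N+1 can't start its work until iteration N's h_st is ready.

@inline(never)
func dependencySkeleton(iterCount: Int, timeStep: Double) -> Double {
    // Everything except h_st is loop-invariant, so it can be computed once up front.
    let rho = 995.59, c = 0.04017672438703612, Rt = 0.0042879141883543715
    let sv = 4184.0, T_in = 300.0, S_in = 1000.0, h_out = 1.0e5, f_st = 0.5
    let T_out = T_in - Rt * S_in
    let m = c * rho, m_in = f_st * timeStep
    let E_in = m_in * h_out, m_total = m + m_in
    var h_st = 1.0e5
    for _ in 0..<iterCount {
        let T_st = 273.15 + h_st / sv          // needs the previous iteration's h_st
        let deltaT = T_out - T_st
        let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT
        h_st = (m * h_out + E_in + sign * S_in * timeStep) / m_total  // feeds the next iteration
    }
    return h_st
}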
Throughput and latency are separate things on out-of-order exec CPUs.
For your loop, if we break the dependency by editing the asm loop to add xorps xmm11, xmm11 right before the divsd xmm11, xmm4, then uiCA predicts it would run in about 11 cycles per iteration on Skylake instead of 66.5, with a bottleneck on divider throughput. But with all the overhead of a function call and reloading the constants every time, that probably takes up too much space in the Reorder Buffer (ROB) for out-of-order exec to see far enough ahead to find the instruction-level parallelism across iterations (https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/). Or perhaps we're getting into a front-end bottleneck. (At 2 loads per clock on modern CPUs when it hits in cache, reloading the constants is no big deal.)
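A source-level way to approximate that experiment (my own sketch, not the asm edit above and not code from the question) is to advance several independent h_st chains in the same loop, so the out-of-order core has divides from different chains to overlap. If this runs at close to the single-chain speed per loop iteration despite doing 4x the work, that confirms the bottleneck is latency, not divider throughput.

@inline(never)
func fourChains(iterCount: Int, timeStep: Double) -> Double {
    // Loop-invariant setup, same constants as rawComputeAll.
    let rho = 995.59, c = 0.04017672438703612, Rt = 0.0042879141883543715
    let sv = 4184.0, T_in = 300.0, S_in = 1000.0, h_out = 1.0e5, f_st = 0.5
    let T_out = T_in - Rt * S_in
    let m = c * rho, m_in = f_st * timeStep
    let E_in = m_in * h_out, m_total = m + m_in

    // One step of the recurrence; forced inline so the four states stay in registers.
    @inline(__always) func step(_ h_st: Double) -> Double {
        let T_st = 273.15 + h_st / sv
        let deltaT = T_out - T_st
        let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT
        return (m * h_out + E_in + sign * S_in * timeStep) / m_total
    }

    // Four independent states: chain k only ever depends on its own previous value,
    // so the out-of-order core can overlap the divide chains of the four states.
    var h0 = 1.0e5, h1 = 1.1e5, h2 = 1.2e5, h3 = 1.3e5
    for _ in 0..<iterCount {
        h0 = step(h0); h1 = step(h1); h2 = step(h2); h3 = step(h3)
    }
    return h0 + h1 + h2 + h3
}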
Benchmarking on Godbolt itself is also going to be very noisy due to other code running on the same CPUs of the AWS instance, and will suffer from unknown amounts of distortion from CPU frequency ramping up to max turbo if it wasn't there already. Running on your own desktop, CPU frequency warm-up might be an even bigger factor if you don't run any warm-up loop, unless you run the benchmark many times in a row from a shell-script loop or something (not manually; with gaps in between it wouldn't be continuous load). See also Idiomatic way of performance evaluation?
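A minimal warm-up-and-repeat harness along those lines (a sketch; the helper name bench, the rep count, and reporting the minimum are my own choices):

import Dispatch

// Run one untimed pass so the CPU reaches its steady clock speed, then report the
// best (minimum) of several back-to-back timed runs.
func bench(_ label: String, reps: Int = 10, _ body: () -> Void) {
    body()                                          // warm-up, not timed
    var best = UInt64.max
    for _ in 0..<reps {
        let t0 = DispatchTime.now().uptimeNanoseconds
        body()
        let t1 = DispatchTime.now().uptimeNanoseconds
        best = min(best, t1 - t0)
    }
    print("\(label): best of \(reps) runs = \(best) ns")
}

// Example usage (iterCount and timeStep as in the question):
// bench("loop inside")  { _ = rawComputeAll(iterCount: iterCount, timeStep: timeStep) }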
Division has bad throughput, and even worse latency. (On modern CPUs it's partially pipelined; in CPUs like Core 2, division wasn't pipelined at all.) See Floating point division vs floating point multiplication.
let T_st = 273.15 + h_st / sv could be written as multiplication by a constant or loop-invariant, like let T_st = 273.15 + h_st * (1.0 / sv), or simply define sv_inverse = 1.0/s and use that. Compilers can't do this optimization for you (without -ffast-math or equivalent), except when the divisor has an inverse that's exactly representable in binary floating-point (e.g. when it's a power of 2: 4.0 and 0.25 are both exactly representable).
FP multiply or FMA has about 4-cycle latency, vs. about 15 cycles for double-precision division on recent Intel (https://uops.info/). If you can't avoid a loop-carried dependency chain, at least try to keep division and square root out of it.
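Applying that to this loop: both divisors, sv and m_total, are loop-invariant, so their reciprocals can be hoisted out of the loop. Here is a sketch of my own (the function name rawComputeAllRecip, m_total_inverse, and the manual hoisting are mine, and multiplying by a reciprocal is not bit-identical to dividing, so results may differ in the last bit); it removes two of the three divisions from the loop-carried chain, leaving only the deltaT / abs(deltaT) one.

@inline(never)
func rawComputeAllRecip(iterCount: Int, timeStep: Double) -> Double {
    let rho = 995.59, c = 0.04017672438703612, Rt = 0.0042879141883543715
    let s = 4184.0, T_in = 300.0, S_in = 1000.0, h_out = 1.0e5, f_st = 0.5
    let m = c * rho, m_in = f_st * timeStep
    let E_in = m_in * h_out, m_total = m + m_in
    let T_out = T_in - Rt * S_in
    // Hoist the reciprocals: one divide each, outside the loop.
    let sv_inverse = 1.0 / s
    let m_total_inverse = 1.0 / m_total
    var h_st = 1.0e5
    for _ in 0..<iterCount {
        let T_st = 273.15 + h_st * sv_inverse       // multiply instead of divide
        let deltaT = T_out - T_st
        let sign = deltaT != 0.0 ? deltaT / abs(deltaT) : deltaT  // this divide is still on the chain
        let deltaU = sign * S_in * timeStep
        let E_state = m * h_out + E_in + deltaU
        h_st = E_state * m_total_inverse            // multiply instead of divide
    }
    return h_st
}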