My mental model for out-of-order execution is a sliding window over the instruction stream: any instruction in the window whose inputs have already been computed can be started immediately, even if other instructions precede it in the stream, provided CPU resources are available.
However, I am trying to understand the impact of ordering between instructions that don't have a dependency on one another. Say I have two sets of three instructions that each form an independent dependency chain: `A,B,C` and `X,Y,Z`. None of these instructions change control flow. Because the instructions within a set form a dependency chain, they can't be executed out of order (`C,B,A` is invalid), but since the chains are independent they can be interleaved with each other (`A,X,B,Y,C,Z` is valid). Further assume that all six instructions fit inside the sliding window and are in cache. Will all valid orderings of the six instructions have the same performance, or will some be faster than others? What about N chains of M instructions?
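For concreteness, a sketch of what such a pair of chains might look like (NASM-style; the instruction mix and registers are arbitrary, any dependent ALU ops would do), written in the interleaved order `A,X,B,Y,C,Z`:

```
; chain 1 lives in rax, chain 2 in rbx; the two chains share no registers
add  rax, 1        ; A
add  rbx, 2        ; X  (independent of the rax chain)
imul rax, rax      ; B  (needs A's result)
imul rbx, rbx      ; Y  (needs X's result)
add  rax, 3        ; C  (needs B's result)
add  rbx, 4        ; Z  (needs Y's result)
```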
My suspicion is that they won't all be the same, because ideal scheduling seems complex to do in hardware, but maybe there are good-enough-in-practice heuristics that real CPUs use?
Assuming that some orderings are faster than others, how can you detect, using performance counters or other tools, that your code is running slower than it could due to a bad ordering? Is there a way to eyeball that the ordering in the assembly of a hot loop is bad, as opposed to the more usual suspects like branch mispredicts and cache misses?
The concept you're asking about is "scheduling", either statically by the compiler (program order of the asm) or dynamically in the CPU (by the "scheduler", aka RS = Reservation Stations).
See How are x86 uops scheduled, exactly? For each execution port, it's oldest-ready first, without detecting which uops are part of a long critical-path dependency chain. Usually that mostly works out naturally after a few iterations of a loop, since the un-executed uops are the ones that are part of the loop-carried dependency chain.
For something like integer `a*b*c + d*e`, you would want to schedule the `imul` for `a*b` before `d*e`, since it's part of a chain of two `imul`s (3 cycle latency, 1/cycle throughput). If the `d*e` `imul` is statically first in program order (and all 5 inputs are ready in registers), it will execute first, so the longer dependency chain can't start until the next cycle, leading to a critical path length of 1 (resource conflict) + 3 + 3 (imuls) + 1 (add) = 8 cycles instead of 7. If the critical-path latency of that whole expression is part of a longer dep chain which is the overall bottleneck (yet somehow we still had all 5 inputs ready at once, or at least d, e, a, and b), then statically scheduling the critical-path uops first would help, avoiding the resource conflict for the one execution port that has an integer multiplier.
(I picked integer multiply for a simpler example because modern x86 CPUs still only have one execution port that can handle it. SIMD shuffles on Intel CPUs are another common example of that, at least for some shuffles that can't run on Ice Lake's port 1, or with 512-bit vector width.)
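As a concrete sketch of those two static orderings (NASM-style, with a hypothetical register assignment; assumes all five inputs are ready and only one port has an integer multiplier):

```
; inputs: a=rax, b=rbx, c=rcx, d=rdx, e=rsi  (hypothetical register assignment)

; sub-optimal static order: the off-critical-path multiply is first in program
; order, so oldest-ready-first scheduling gives it the multiply port in cycle 0
imul rdx, rsi      ; d*e       starts cycle 0
imul rax, rbx      ; a*b       starts cycle 1 (lost the port to d*e)
imul rax, rcx      ; (a*b)*c   starts cycle 4
add  rax, rdx      ; + d*e     starts cycle 7, result ready at end of cycle 7
; critical path = 1 (resource conflict) + 3 + 3 (imuls) + 1 (add) = 8 cycles

; better static order: the critical-path chain starts first
imul rax, rbx      ; a*b       starts cycle 0
imul rdx, rsi      ; d*e       starts cycle 1, done long before the add needs it
imul rax, rcx      ; (a*b)*c   starts cycle 3
add  rax, rdx      ; + d*e     starts cycle 6, result ready at end of cycle 6
; critical path = 3 + 3 (imuls) + 1 (add) = 7 cycles
```

With all inputs ready at once, oldest-ready-first picks whichever multiply is older in program order, so the static order decides which chain gets the port first.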
uops are assigned to ports during issue/rename/alloc (when they're moved from the in-order front-end to the out-of-order back-end, i.e. the RS and the ROB = ReOrder Buffer). This is done in parallel for all uops in an issue group (e.g. 4 to 6 on recent x86), based on the number of un-executed uops queued for each port at the start of the cycle. Scheduling can be sub-optimal in the first couple of iterations of a loop; see x86_64 haswell instruction scheduled on already used port instead of unused one: Intel uses some tie-break heuristics to distribute uops across ports when the queue lengths are close, without checking port restrictions between uops issuing in the same cycle.
For block diagrams of CPUs, see
> How can you detect, using performance counters or other tools, that your code is running slower than it could due to a bad ordering?
Hard to detect in general. If you've run the numbers for the expected latency bottleneck of a loop-carried dependency chain but it's actually running slower than that, and there aren't stalls from cache misses or the front-end, then uop scheduling is something to look at.
On Intel CPUs, `idq_uops_not_delivered.cycles_fe_was_ok` counts cycles where there were uops ready to issue from the front-end (IDQ = Instruction Decode Queue), but they didn't issue because of back-end stalls.
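For example (a hypothetical invocation; `./a.out` stands in for your program, and exact event names vary by microarchitecture), `perf stat -e cycles,idq_uops_not_delivered.cycles_fe_was_ok ./a.out` lets you compare that count against total cycles: if it's a large fraction, the front-end was mostly not the limitation, which points at the back-end (latency chains, port contention, scheduling) for any remaining slowness.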
Unrolling to hide latency (by overlapping multiple dep chains instead of having one longer one) is hugely important for higher-latency operations like floating-point math, e.g. dot products or summing arrays. The same idea applies to `imul` dep chains (3c latency, 1c throughput) and the question of how fully multiple such chains can be overlapped.
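A minimal sketch of that idea for summing an array of doubles (NASM-style, SysV x86-64 calling convention; label names and the choice of exactly two accumulators are just for illustration, and real code would want more accumulators plus SIMD to saturate FP add throughput):

```
section .text
global  sum_single_acc
global  sum_two_acc

; double sum_single_acc(const double *p /*rdi*/, size_t n /*rsi*/)
sum_single_acc:
    xorpd   xmm0, xmm0
.loop:
    addsd   xmm0, [rdi]        ; one loop-carried dep chain through xmm0:
    add     rdi, 8             ;   ~1 element per addsd latency (3-4 cycles)
    dec     rsi
    jnz     .loop
    ret                        ; result in xmm0

; double sum_two_acc(const double *p /*rdi*/, size_t n /*rsi*/)  (n assumed even)
sum_two_acc:
    xorpd   xmm0, xmm0
    xorpd   xmm1, xmm1
.loop:
    addsd   xmm0, [rdi]        ; two independent dep chains that can overlap,
    addsd   xmm1, [rdi+8]      ;   so ~2 elements per addsd latency
    add     rdi, 16
    sub     rsi, 2
    jnz     .loop
    addsd   xmm0, xmm1         ; combine the partial sums once at the end
    ret
```

Note that the multi-accumulator version sums in a different order than the naive loop, which is exactly why compilers won't make this transformation for FP code on their own without something like `-ffast-math`.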