I use this code to test the impact of a dependency within a loop iteration on IvyBridge:
global _start
_start:
mov rcx, 1000000000
.for_loop:
inc rax ; uop A
inc rax ; uop B
dec rcx ; uop C
jnz .for_loop
xor rdi, rdi
mov rax, 60 ; _exit(0)
syscall
Since dec and jnz will be macro-fused into a single uop, there are 3 uops in my loop; they are labeled in the comments.
uop B depends on uop A, so I think the execution would be like this:
A C
B A C ; the previous B and current A can be in the same cycle
B A C
...
B A C
B
Therefore the loop could execute at 1 cycle per iteration.
However, the perf tool shows:
2,009,704,779 cycles
1,008,054,984 stalled-cycles-frontend # 50.16% frontend cycles idle
So it's 2 cycles per iteration, and the frontend is idle for 50% of the cycles.
What causes the frontend to be 50% idle? Why can't the hypothetical execution diagram be realized?
B and A form a loop-carried dependency chain. A in the next iteration can't run until it has the result of B in the previous.
Any given B can never run in the same cycle as an A: what input would the later one use, if the earlier one hasn't produced a result yet?
This chain is 2 cycles long per iteration, because the latency of inc is 1 cycle. This creates a latency bottleneck in the back-end that out-of-order execution can't hide. (Except for very low iteration counts, where it can overlap the chain with code after the loop.)
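Concretely, the schedule that can actually happen looks more like this (a sketch; exactly which cycle the independent C lands in doesn't matter, since it's off the critical path):
A C
B
A C
B
...
B
Each B has to wait a cycle for its A, and the next iteration's A has to wait for that B, so the RAX chain alone costs 2 cycles per iteration.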
Just like if you fully unrolled it into a huge chain of times 102400 inc eax, there's no instruction-level parallelism for the CPU to find in a chain of instructions that each depend on the previous one.
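For illustration, the fully-unrolled equivalent would look like this in NASM (a sketch; the repeat count is arbitrary):
times 102400 inc eax   ; 102,400 serial 1-cycle increments: each one waits for the
                       ; previous result, so this takes ~102,400 cycles regardless
                       ; of how many execution ports the CPU has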
The macro-fused dec rcx/jnz uop is independent of the RAX chain, and is a shorter chain (only 1 cycle per iteration, being a single dec-and-branch uop with 1c latency). So it can run in parallel with the A or B uops.
See my answer on another question for more about the concept of instruction-level parallelism and dependency chains, and how CPUs exploit that parallelism to run instructions in parallel when they're independent.
Agner Fog's microarch PDF shows this with examples in an early chapter: Chapter 2: Out-of-order execution (All processors except P1, PMMX).
If you started a new 2-cycle dep chain every iteration, it would run as you expect. A new chain forking off every iteration would expose instruction-level parallelism for the CPU to keep A and B from different iterations in flight at the same time.
.for_loop:
xor eax,eax ; dependency-breaking for RAX
inc rax ; uop A
inc rax ; uop B
dec rcx ; uop C
jnz .for_loop
Sandybridge-family handles xor-zeroing without an execution unit, so the loop is still only 3 unfused-domain uops that need an ALU port, and IvyBridge has enough ALU execution ports to run all 3 in a single cycle. It's also exactly 4 fused-domain uops, which maxes out the front-end at 4 fused-domain uops per clock.
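With the dependency broken every iteration, the overlap the question hoped for becomes possible. A sketch of the steady state (the eliminated xor isn't shown, since it needs no execution port):
B A C
B A C
...
B
Each cycle runs B from the previous iteration plus A and C from the current one, using 3 of IvyBridge's ALU ports.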
Or if you changed A to start a new dep chain in RAX, with any instruction that unconditionally overwrites RAX without depending on the result of the inc, you'd be fine:
lea rax, [rdx + rdx] ; no dependency on B from last iter
inc rax ; uop B
Except for a couple instructions with an unfortunate output dependency: Why does breaking the "output dependency" of LZCNT matter?
popcnt rax, rdx ; false dependency on RAX, 3 cycle latency
inc rax ; uop B
On Intel CPUs, only popcnt and lzcnt/tzcnt have an output dependency for no reason. It's because they use the same execution unit as bsf/bsr, which leave the destination unmodified if the input is zero, on Intel and AMD CPUs. Intel still only documents BSF/BSR on paper as leaving the destination undefined if the input is zero, but they build hardware that implements the stronger guarantee. (AMD even documents this BSF/BSR behaviour.) Anyway, so Intel's BSF/BSR are like CMOV and need the destination as an input in case the source reg is 0, and popcnt (and lzcnt/tzcnt on pre-Skylake) suffer from this too.
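A standard workaround (a sketch, not part of the question's code) is to break the false output dependency yourself with xor-zeroing, which Sandybridge-family handles at rename:
xor eax, eax           ; dependency-breaking idiom: no execution unit needed
popcnt rax, rdx        ; its false output dependency now reads the just-zeroed RAX,
                       ; not the RAX written in the previous iteration
inc rax                ; uop B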
If you made the loop more than 4 fused-domain uops, SnB/IvB could issue it at best 1 iteration per 2 cycles from the front-end. Haswell and later "unroll" in the loop buffer or something so a 5-uop loop can run at ~1.25 cycles per iteration, but SnB/IvB don't. Is performance reduced when executing loops whose uop count is not a multiple of processor width?
The front-end issue/rename stage is 4 fused-domain uops wide in Intel CPUs since Core 2.
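For example, padding the xor-zeroing version of the loop with one extra filler instruction (a hypothetical variant, not from the question) makes it 5 fused-domain uops:
.for_loop:
xor eax, eax           ; dependency-breaking; no execution unit needed
inc rax                ; uop A
inc rax                ; uop B
nop                    ; filler: the loop is now 5 fused-domain uops
dec rcx
jnz .for_loop
On SnB/IvB the front-end issues each iteration as a group of 4 uops followed by a group of 1, so this runs at 2 cycles per iteration, while Haswell and later can manage roughly 1.25 cycles per iteration.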