Can the LSD issue uOPs from the next iteration of the detected loop?

I was playing investigating the capabilities of the branch unit on port 0 of my Haswell starting with a very simple loop:

BITS 64
GLOBAL _start

SECTION .text

_start:

 mov ecx, 10000000

.loop:

 dec ecx             ;|
  jz .end            ;| 1 uOP (call it D)

jmp .loop            ;| 1 uOP (call it J)

.end:
 mov eax, 60
 xor edi, edi
 syscall

Using perf we see that the loop runs at 1c/iter

Performance counter stats for './main' (50 runs):

        10,001,055      uops_executed_port_port_6   ( +-  0.00% )
         9,999,973      uops_executed_port_port_0   ( +-  0.00% )
        10,015,414      cycles:u                    ( +-  0.02% )
                23      resource_stalls_rs          ( +- 64.05% )

My interpretations of these results are:

Both D and J are dispatched in parallel.
J has a reciprocal throughput of 1 cycle.
Both D and J are dispatched optimally.

However, we can also see that the RS never gets full.
It can dispatch uOPs at a rate of 2 uOPs/c at most but can theoretically get 4 uOPs/c, leading to a full RS in about 30 c (for an RS with a size of 60 fused-domain entries).

To my understanding, there should be very few branch mispredictions and the uOPs should all come from the LSD.
So I looked at the FE:

     8,239,091      lsd_cycles_active ( +-  3.10% )
       989,320      idq_dsb_cycles    ( +- 23.47% )
     2,534,972      idq_mite_cycles   ( +- 15.43% )
         4,929      idq_ms_uops       ( +-  8.30% )

   0.007429733 seconds time elapsed   ( +-  1.79% )

which confirms that the FE is issuing from the LSD¹.
However, the LSD never issues 4 uOPs/c:

     7,591,866      lsd_cycles_active ( +-  3.17% )
             0      lsd_cycles_4_uops

My interpretation is that the LSD cannot issue uOPs from the next iteration² thereby only sending D J pairs to the BE each cycle.
Is my interpretation correct?

The source code is in this repository.

¹ There is a bit of variance, I think this is due to the high number of iterations that allows for some context switch.
² This is sound quite complex to do in hardware with limited circuits depth.

Solution

All of the uops in your loop are branches (2 per iteration). I think that the reason that `lsd_cycles_4_uops is zero is because of a limitation in the renamer. According to the Intel Optimization Manual Section 2.4.3.1:

The renamer can allocate two branches each cycle, compared to one branch each cycle in the previous microarchitecture. This can eliminate some bubbles in execution.

That is a subsection of a section on the Sandy bridge microarchitecture. But to my knowledge, this applies to all later microarchitectures. The maximum renaming throughput is 4 uops per cycle. But at most two uops can be branches. So in this example where all uops are branches, the LSD can never deliver more than 2 uops at any given cycle even in the first iteration of the loop.

Therefore, 2 branch uops will be allocated in the RS per cycle, and both (one predicated taken and one not taken) can be dispatched per cycle. So the RS occupancy does not grow.

This limitation does not impact the performance of your program. Executing 2 branch uops per cycle, giving an IPC of 3 per cycle, is already optimal.

I tried to find a performance event that can capture allocator stalls due to that limitation. The events RESOURCE_STALLS.ANY and UOPS_ISSUED.ANY (with cmask=1 and inv=1) don't seem to be relevant in this case. @IwillnotexistIdonotexist suggested to use IDQ_UOPS_NOT_DELIVERED.CORE. I present the results below for the performance event and all of its supported variants. I also provide the correct meaning of these events because the manual is wrong. T denotes the number of iterations.

IDQ_UOPS_NOT_DELIVERED.CORE: Counts the number of slots that were not utilized by the allocator. If the program ran for C core cycles, then the total number of slot is 4*C. The measured value is almost equal to 2*T. Since the the number of cycles is T, the number of slots is 4*T, which means that about half the issue slots were not utilized.

IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE: Counts the number of cycles where zero uops were delivered from the IDQ. The measured value is negligible.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE: Counts the number of cycles where at most 1 uops were delivered from the IDQ. The measured value is negligible.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE: Counts the number of cycles where at most 2 uops were delivered from the IDQ: The measured value is almost equal to T.

IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE: Counts the number of cycles where at most 3 uops were delivered from the IDQ: The measured value is almost equal to T.

Therefore, since the execution time is almost equal to T core cycles, we can conclude that the allocator only allocates exactly 2 uops per cycle in most cycles., which is equal to the dispatch rate.

Note that the RS in Haswell and Skylake holds unfused uops. So each entry can hold a single unfused uop. See Footnote 2. But this doesn't matter here because there is no microfusion.