Tags: assembly, x86, cpu-architecture, instructions, micro-architecture

Execute operations of the same instruction separately in an OoO processor


Imagine that we have an instruction which has been divided into 3 micro-operations, and we have an out-of-order processor. My question is: must these 3 uops be executed sequentially, or can the processor interleave them with uops from other instructions?

I mean, in an OoO processor you can execute instructions out of order, but if we divide an instruction into several micro-operations, can these micro-operations be executed non-sequentially?

For example, say we have 3 instructions: A, B and C. A and C decode to 1 uop each (A1 and C1), and B decodes to 3 uops (B1, B2, B3). Can the OoO processor execute, for example, B1 - A1 - B2 - C1 - B3, or must it execute B1 - B2 - B3 back to back?


Solution

  • Yes, every uop is scheduled independently, subject only to waiting for its inputs to be ready (and for a free cycle on the execution port it was assigned when it issued into the out-of-order back end). How are x86 uops scheduled, exactly? Instruction boundaries aren't relevant to the RS, aka the scheduler.
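    As a concrete version of the A/B/C example from the question, here's a hypothetical instruction sequence (my choice of instructions, NASM syntax, not from the question) where the middle instruction decodes to several uops and its neighbours are single-uop and independent of it:

    ```asm
        add   dword [rdi], eax   ; "B": memory-destination RMW; decodes to multiple
                                 ; uops (load + add + store) on typical Intel CPUs
        inc   ecx                ; "A": 1 uop, no dependency on B
        imul  edx, esi           ; "C": 1 uop, no dependency on B
    ; The scheduler is free to run A's and C's uops in between B's uops, e.g.
    ; while B's ALU uop is still waiting for the load result.
    ```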

    For many multi-uop instructions, later uops have a data dependency on earlier ones. But sometimes an earlier uop only needs one of the instruction's inputs to be ready, so there are separate minimum latencies from each input to the output. What do multiple values or ranges mean as the latency for a single instruction?

    e.g. add eax, [rdi] only needs EAX to be ready after the load uop finishes, so the critical-path latency through EAX is only 1 cycle. But if RDI isn't ready, or the memory pointed to by RDI isn't ready, then the load can't complete and the add ALU uop can't execute either. Still, letting the load run as early as its own inputs allow is rather the point of decoding to uops, unlike P5 Pentium which had to do the load and add together in its in-order pipeline (footnote 1).
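    A minimal sketch (NASM syntax, hypothetical test loops, no timing harness shown) of how the 1-cycle EAX->EAX latency shows up in practice, assuming RDI points at memory that stays hot in L1d cache, and at itself for the pointer-chasing loop:

    ```asm
        mov   ecx, 100000000
    .eax_chain:                  ; loop-carried dependency only through EAX
        add   eax, [rdi]         ; the load uop doesn't depend on EAX, so it can run
                                 ; ahead; only the 1-cycle add is on the critical path
        dec   ecx
        jnz   .eax_chain

        mov   ecx, 100000000
    .ptr_chain:                  ; by contrast, pointer chasing puts load-use latency
        mov   rdi, [rdi]         ; (several cycles even for an L1d hit) on the
        dec   ecx                ; loop-carried chain every iteration
        jnz   .ptr_chain
    ```

    The first loop should run at roughly 1 cycle per iteration (the loads and the dec/jnz have plenty of spare throughput); the second is bounded by load-use latency every iteration.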

    (Or take variable-count shifts on Intel CPUs: the extra uops are only for the FLAGS output; the GP-integer part is ready with 1-cycle latency, but the FLAGS result is ready later. The uop that produces the GP-integer result is essentially the same as the single uop that BMI2 shlx decodes to.)
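    For example (a sketch, assuming a BMI2-capable Intel CPU), a dependency chain through the shifted register sees the same 1-cycle latency either way; the legacy encoding just spends extra uops on its FLAGS semantics:

    ```asm
        shl   eax, cl            ; multi-uop on Intel because of the count==0 FLAGS
                                 ; behaviour, but the EAX result still has 1c latency
        shlx  eax, eax, ecx      ; BMI2: single uop, 1c latency, doesn't write FLAGS
    ```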

    But some multi-uop instructions do have some internal ILP (well, really uop-level parallelism). For example xchg eax, ecx decodes to 3 register-copy uops on Intel CPUs, and we can measure the latency separately for the EAX->ECX and ECX->EAX directions, at 1 and 2 cycles respectively. Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures?
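    A hypothetical breakdown of xchg eax, ecx's 3 uops, written as if each were a plain mov (the real internal uops aren't documented; this is just one design consistent with the latencies above, with tmp standing in for a scratch physical register, not an architectural one):

    ```asm
        mov   tmp, ecx           ; save the old ECX in a non-architectural temporary
        mov   ecx, eax           ; EAX -> ECX direction: result ready after 1 cycle
        mov   eax, tmp           ; ECX -> EAX direction: 2 cycles, via the tmp copy
    ```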

    Another example is phaddd; it decodes pretty much like two shufps-style uops (2-input shuffles) plus one paddd uop that depends on both shuffles, with each shuffle uop reading both input registers. Ice Lake has shuffle units on 2 ports and can actually run the two shuffle uops in parallel, giving it 2-cycle latency (uops.info), down from 3 cycles on earlier Intel CPUs because of the resource conflict on their single shuffle port. (Ice Lake's extra shuffle port only runs some integer shuffles, so haddps is still just as bad as ever on Ice Lake.)

    Note that we can't prove exactly what each uop does, but given the measured latencies and the per-port uop counts, for many instructions there's only one sane design that explains the behaviour. e.g. for phaddd, we know the CPU has SIMD-integer add execution units and integer shuffle units, so the most obvious way to implement it as 3 uops is to decode to two hard-wired shuffle patterns plus a plain paddd uop.
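    To make that concrete, here's an architecturally equivalent SSE sequence for phaddd xmm0, xmm1 using the kinds of operations those uops presumably map to (the internal uops would be 3-operand and wouldn't need the extra register copy; this is just an illustration):

    ```asm
        movaps   xmm2, xmm0          ; copy only because shufps is destructive
        shufps   xmm2, xmm1, 0x88    ; even dword elements of xmm0 and xmm1
        shufps   xmm0, xmm1, 0xDD    ; odd dword elements of xmm0 and xmm1
        paddd    xmm0, xmm2          ; the add depends on both shuffle results
    ```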


    Footnote 1: Optimizing for P5 apparently involved using a RISCier subset of x86, like avoiding memory-source operands except for mov, and definitely avoiding memory-destination instructions. That's because it was an in-order pipeline, and it couldn't crack multi-uop instructions apart to schedule their parts independently.
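    A quick sketch of what that RISCier subset looks like for a read-modify-write of memory (instruction choice mine, just for illustration):

    ```asm
        ; P5-friendly: memory touched only by plain mov loads and stores
        mov   edx, [rdi]
        add   edx, eax
        mov   [rdi], edx

        ; vs. the memory-destination form that a later OoO CPU cracks into uops
        ; and schedules independently, but that P5 tuning advice said to avoid:
        add   [rdi], eax
    ```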

    Further reading re: P5 vs. later microarchitectures: https://agner.org/optimize/. Also https://www.realworldtech.com/sandy-bridge/ is very good.

    http://www.lighterra.com/papers/modernmicroprocessors/ is a great intro if you haven't read it, but it doesn't go into the level of detail your question is about.