assemblyx86cpu-architectureintel

Can addition be done in less than a cycle when outputs depend on each other?


As you probably know modern Intel CPUs can execute 4 add instructions in one cycle assuming they are independent, and in fact Alder Lake and later can do 5 add instructions in one cycle. They also have a big reorder buffer so can find additions further in the assembly.

Is there any chance addition can be done in less than a cycle even if the dependency happens and input of second add depends on output of first add? And if not why not? Maybe the answer is that add operation will have less than 1 latency then and that is not allowed since addition is measuring the frequency of the CPU? Then what about making it zero latency then? https://stackoverflow.com/a/58862982/11173412

Same question can be asked on Arrow Lake multiplication, that has inverse throughput of 3 (only on P cores, not on E cores, first time this is available).

Pinging @PeterCordes

P.S. Apparently, that was already done before: "throughput of 0.5 cycles on both processors, but a latency of 0.5 cycles on Northwood" Was there a P4 model with double-pumped 64-bit operations?

The question is then why is it not done on modern CPUs and why is it not done for 5 adds?


Solution

  • Only Pentium 4 ever had execution units with non-zero latency less than 1 cycle, with their double-pumped ALUs giving add in general 0.5 cycle latency.

    But Alder Lake P-cores (Golden Cove) and later can handle small-constant additions with 64-bit operand-size (add/sub/dec/inc) with zero latency during rename, instead of sending them to an execution unit at all. Also for mov r64, imm with small values.
    It's only for add-immediate, not add rax, rcx even if the register values are small numbers.

    This allows execution of dependent instructions like add rax, 12 at up to 5 per clock cycle (with one of them not being eliminated).

    Even with 6 different registers for inc instructions, we only get 5 IPC. 4 eliminations per cycle seems to be a throughput limit in the renamer. (Also, 4 dependent eliminations would need 3 adds, for a tree depth of 2 to update the counter, if that's how the internals work.)
    In real code, it's rare to have many dependent add-constant ops back to back, but it's not rare to have 3 pointer increments all together in a loop. So a limit as high as 4 eliminations per cycle makes sense.


    https://tavianator.com/2025/shlxplained.html has details: there's an 11-bit signed counter [-1024, +1023] associated with each renamed register. Every uop which can can read an integer register needs to be able to do that small addition to produce the real value. For bit-shifts, that makes them take 3 cycles instead of 1 if any input has been renamed with the offset in-use (including from mov rax, 0 or inc+dec, but not from xor rax,rax which is eliminated with no offset by a different older mechanism.) Other instructions seem to be able to do the addition as well as their usual work for free as part of their usual latency.

    See also https://patents.google.com/patent/US20170123799A1/en

    We can verify that it's actually zero latency, despite latency being better than throughput for inc rax alone, by constructing a dependency chain involving other instructions with known non-zero latency. For example, 6*inc rcx + 3*movsx rcx, ecx in a loop runs at 3 cycles per iteration (exactly like with just the 3 movsx instructions). But 7*inc rcx pushes it to 3.04 cycles per iteration on average: occasionally the renaming doesn't succeed. (These numbers are from @TavianBarnes, the article's author, via the OP, @Валерий Заподовников)


    The article says that 5 IPC throughput can be achieved assuming you do not go above 1023.

    ; repeated  add rax, 1
    ; initially (rax -> pr1)
            ; rename    (rax -> pr1 + 1)
            ; rename    (rax -> pr1 + 2)
            ; rename    (rax -> pr1 + 3)
            ...
    

    If it overflows 11 bit buffer then no: a uop has to execute (on an execution port with 1 cycle latency) to zero out produce a real value with 0 offset. (This can be detected at rename time while renaming an add uop, so it's just a matter of actually putting it in the scheduler and marking it un-executed when adding to the ROB (reorder buffer). Not like stack-engine sync uops that need to add an extra uop that wasn't from the front-end when that similar counter overflows.)

    ; repeated   add rax, 256
    ; initially (rax -> pr1)
            ; rename    (rax -> pr1 + 256)
            ; rename    (rax -> pr1 + 512)
            ; rename    (rax -> pr1 + 768)
            add pr2, (pr1 + 768), 256      ; renaming would overflow
            ; rename    (rax -> pr2 + 256)
            ; rename    (rax -> pr2 + 512)
            ; rename    (rax -> pr2 + 768)
            add pr3, (pr2 + 768), 256
    

    The article doesn't mention how Golden Cove's more general renaming interacts with the stack engine; it would be nice if instructions like sub rsp, 8 or mov eax, [rsp+8] would no longer cost an extra stack-sync uop right after a call or push makes the RSP offset non-zero. But instructions like push don't include a uop that updates RSP, so if the counter overflows the rename stage would still need to invent one.


    (As requested, I've rewritten most of the OP's self-answer into my own words. At this point it seemed like I should just post it as my own answer, but the original structure / outline of the answer is theirs.)