x86 x86-64 intel cpu-architecture

Was there a P4 model with double-pumped 64-bit operations?


I recall that one of the interesting features of the initial P4 micro-architecture was its double-pumped ALU. I think Intel called it something like the Rapid Execution Unit, but basically it meant that each execution unit in the ALU was effectively running at twice the frequency, and could handle two simple ALU operations in a single cycle, even if they were dependent.

This feature disappeared at some point (before or at the same time as the P4 itself), but was there ever a 64-bit P4 with a double-pumped ALU? The 64-bit variants of the P4 came out in 2004, about four years after the initial 32-bit release, but it isn't clear to me whether the double-speed ALU had disappeared by then. It seems like the width-pipelined approach used to double the speed would be difficult for 64-bit, which is what piqued my curiosity.

Since one may still need to support some (evidently quite old) 64-bit P4 hardware, knowing the ALU behavior is interesting for optimization.


Solution

  • I found the 2005 edition of the Intel Optimization Manual, which covers both 32-bit and 64-bit NetBurst processors. Refer to Table C-8 on page C-17. According to the first comment on this blog post, the 32-bit Northwood's model is 02h and the 64-bit Nocona's model is 03h. The table shows that ADD/SUB/AND/OR/XOR have a throughput of 0.5 cycles on both processors, but a latency of 0.5 cycles on Northwood and 1 cycle on Nocona. This means that double-pumping is supported on Nocona, but only if the back-to-back instructions are not dependent. The rest of the table also shows that some instructions that were not double-pumped on Northwood were double-pumped on Nocona.

    (Intel's optimization manual was wrong about throughput: add/sub (unlike and/or/xor) can run on either of the ALU ports, 0 or 1, and have 0.25 cycle reciprocal throughput. This was measured by Agner Fog for both 02h and 04h (Prescott/Nocona) P4 NetBurst. This at least confirms that Nocona could run four 64-bit add operations per clock on its two integer ALU ports, whatever the actual internals were. Agner tested some 64-bit instructions, so it's likely he tested add r64,r64 and found it the same as r32,r32. Instlatx64 also confirmed better than 2/clock throughput for ADD r64, r64, although they only measured 0.35c throughput, probably due to the front-end only being 3-wide. Agner didn't say how he measured 0.25c throughput!)
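
    To make the latency versus throughput distinction concrete, here is a minimal sketch of the arithmetic, using only the Table C-8 figures quoted above (not Agner Fog's measurements). The function names are made up for illustration, and the model deliberately ignores the front-end, port assignment, and everything else in the pipeline.

```c
#include <assert.h>

/* Main-clock cycles for n back-to-back ops where each op consumes the
   previous result: the chain is bound by the per-op latency. */
double dependent_chain_cycles(int n, double latency)
{
    assert(n >= 0);
    return n * latency;
}

/* Main-clock cycles for n independent ops: bound by the reciprocal
   throughput (the drain time of the last op is ignored). */
double independent_ops_cycles(int n, double recip_throughput)
{
    assert(n >= 0);
    return n * recip_throughput;
}
```

    With the table's numbers, 100 dependent adds cost about 50 cycles on Northwood (0.5c latency) but about 100 cycles on Nocona (1c latency), while 100 independent adds cost about 50 cycles on both (0.5c reciprocal throughput).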


    Summary: There is ample evidence that some NetBurst-based processors (whether released or canceled) could perform at least 2 64-bit ALU operations per cycle using either 2 32-bit staggered ALUs or at least a single 64-bit staggered ALU (which would be enabled by smaller feature sizes such as 90nm at that time).


    Figure 7 of the original paper (1) on the Intel Pentium 4 Willamette (2) processor discusses how the double-pumped (3) ALU works in some detail (at the logic design level).

    [Figure 7 of the paper: the staggered ALU]

    The figure shows a single 32-bit staggered ALU unit. This confirms that the ALU can perform two fully dependent (both input operands are dependent) simple ALU operations in three fast cycles (where a fast cycle is one half of the main clock cycle). The result of the operation itself is available after 2 fast cycles (1 main cycle), but the new flags are only available after the third fast cycle (1.5 main cycles). Note that there are two such ALUs, on ports 0 and 1, and both are staggered. So the design could execute 2 ALU dependency chains with a total throughput of 4 operations per slow cycle.
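
    The width-pipelining idea can be sketched in C. This is only a functional model, assuming (as I recall from the Willamette paper) that a 32-bit operation is split into two 16-bit halves, with the flags produced one fast cycle later. The type and function names are mine, and a carry-out bit stands in for the full flags.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t sum;       /* complete after fast cycle 2 */
    int      carry_out; /* stands in for the flags, fast cycle 3 */
} stag_result;

stag_result staggered_add32(uint32_t a, uint32_t b)
{
    /* Fast cycle 1: add the low 16-bit halves only. */
    uint32_t lo = (a & 0xFFFFu) + (b & 0xFFFFu);
    uint32_t carry = lo >> 16;
    assert(carry <= 1u); /* the low half produces at most one carry bit */

    /* Fast cycle 2: add the high halves plus the carry from cycle 1. */
    uint32_t hi = (a >> 16) + (b >> 16) + carry;

    /* Fast cycle 3: the flags (here just carry-out) become available. */
    stag_result r;
    r.sum = ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
    r.carry_out = (int)(hi >> 16);
    return r;
}
```

    A dependent add only needs the low half of its input to start, so it can begin in fast cycle 2 while the high half of the previous add is still being computed. That is what lets a dependency chain advance once per fast cycle.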

    That paper was published in 2001. Intel published another paper (4) in 2005 that discusses in great detail, at the circuit level, how the staggered integer core in the Intel Pentium 4 Prescott (5) processor works. It's not clear to me whether the paper discusses the 64-bit version of Prescott or the 32-bit version. However, this paper clearly states that the staggered ALU units can only perform additions, Boolean operations, shifts, and rotations (the other paper discussed the design of pre-Prescott cores, in which the two fast ALU units did not support shifting and rotating). The other important difference is this statement from the paper:

    There are two distinct 32-bit FCLK execution data paths staggered by one clock to implement 64-bit operations.

    So it seems that the two fast ALU units on ports 0 and 1 are staggered together, enabling 64-bit fast integer operations such as additions. Therefore, the design could execute either two 32-bit ALU dependency chains with a throughput of 4 operations per slow cycle, or one 64-bit ALU dependency chain with a throughput of 2 operations per slow cycle. This is even more powerful than a single staggered 64-bit ALU that can do only 64-bit operations, not 32-bit ones. This is most probably the design used in the 64-bit variants of the NetBurst microarchitecture.
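
    As a functional illustration of the quoted sentence, a 64-bit add can be built from two 32-bit data paths, with the second one starting one clock later and consuming the carry of the first. This is my own sketch (hypothetical function name), not the circuit from the paper.

```c
#include <assert.h>
#include <stdint.h>

uint64_t add64_two_paths(uint64_t a, uint64_t b)
{
    /* First data path: low 32 bits, kept in 64 bits so the carry is visible. */
    uint64_t lo = (a & 0xFFFFFFFFu) + (b & 0xFFFFFFFFu);
    uint32_t carry = (uint32_t)(lo >> 32);
    assert(carry <= 1u); /* at most one carry bit crosses between the paths */

    /* Second data path, one clock later: high 32 bits plus that carry.
       The uint32_t arithmetic wraps, which discards the final carry-out. */
    uint32_t hi = (uint32_t)(a >> 32) + (uint32_t)(b >> 32) + carry;

    return ((uint64_t)hi << 32) | (lo & 0xFFFFFFFFu);
}
```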

    Another (6) paper (7) from Intel confirms that Intel was indeed able to design a double-pumped 64-bit ALU. I quote from the paper:

    In this paper, we describe a single-cycle integer ALU fabricated in 90nm dual-Vt CMOS technology operating at 4GHz in the 64b mode, with a 32b mode latency of 7GHz (measured at 1.3V, 25°C).

    The paper doesn't mention whether this design has actually been used in any particular processor. But considering that the paper was published in 2004, there is a good chance that all of the 64-bit NetBurst cores (whether released or canceled) used the design.

    Many 64-bit NetBurst-based processors have been released by Intel. For example, see this list of server-grade processors. One of the cores is called Nocona. There is some experimental evidence that the design mentioned earlier (2 staggered 32-bit ALUs) was actually used in Nocona. Refer to these slides from a course on code optimization taught at CMU in 2008. The slides compare the performance of Nocona (64-bit NetBurst), Intel Core (also 64-bit), and AMD Opteron (also 64-bit and apparently implements the same 64-bit staggered ALU design). This is the code used in a loop:

    x = x + d[i];
    

    where all elements are 32-bit integers (unfortunately, 64-bit integers were not used).

    On slide 35, you can see the 32-bit integer addition throughput achieved on Nocona and Opteron. Since each operation requires a load and Nocona only supports a single load per cycle, Nocona's performance maxed out at around 1 operation per cycle. Opteron, however, which supports two loads per cycle, was close to the theoretical maximum of 2 operations per cycle. This experiment of course does not take advantage of staggering, but only of the fact that there are two 32-bit simple ALUs.
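
    For reference, the loop can be written out as a complete function. The sketch below is my reconstruction, not the code from the slides: the function names are hypothetical, and the two-accumulator variant is an assumption about how a throughput close to 2 adds per cycle would be reached, since a single accumulator forms one dependency chain.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: every add depends on the previous one. */
int sum_one_acc(const int *d, size_t n)
{
    assert(d != NULL || n == 0);
    int x = 0;
    for (size_t i = 0; i < n; i++)
        x = x + d[i];
    return x;
}

/* Two accumulators: two independent dependency chains that two ALUs
   can work on in parallel, provided the loads keep up. */
int sum_two_acc(const int *d, size_t n)
{
    assert(d != NULL || n == 0);
    int x0 = 0, x1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        x0 = x0 + d[i];
        x1 = x1 + d[i + 1];
    }
    if (i < n)          /* odd element left over */
        x0 = x0 + d[i];
    return x0 + x1;
}
```

    Either way, each add still needs one load, so a core limited to one load per cycle cannot go beyond about 1 add per cycle here regardless of how many ALUs it has.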

    However, later in the slides, SSE3 is used instead of scalar integer registers. The results for all three processors are shown on slide 44. With SSE3, there is only one 128-bit load per 4 elements. Nocona can perform one 64-bit load from the L1D per cycle (see the article cited below), while Core can perform one 128-bit L1D load per cycle. However, Core has a feature called Advanced Digital Media Boost (ADMB) that enables it to perform 4 32-bit additions per cycle. That same paper also mentions that pre-Core architectures supported only 2 32-bit SSE3 ALU operations per cycle. But if there are two 32-bit staggered ALUs in Nocona, the low SSE3 throughput implies that an SSE3 operation makes use of only one of the staggered ALUs. ADMB can be implemented in two ways. One is to expand each ALU to 64 bits, keep them staggered, and use both ALUs to perform 2 64-bit ALU operations per cycle. The other is to expand each ALU to 128 bits and eliminate staggering.

    There is a patent filed by Intel in 1998 and granted in 2001 on the staggered execution of an instruction, basically any instruction, not just ALU operations. That patent is still active. There is a lot of discussion there on how staggered execution can be useful for 128-bit SIMD instructions. Based on this patent, it's very possible that Intel Core uses two 64-bit staggered ALUs to achieve its throughput. Each of the 64-bit ALUs can actually be made using two of the staggered 32-bit ALUs shown in the figure above.

    In 2002, Intel filed a patent for a generic staggered ALU design. It was generic in the sense that it was not about any specific ALU operation, number of clock cycles, or clock period. The interesting thing here is that one of the figures there shows a staggered 64-bit ALU design! That was in 2002. The patent also discusses some of the challenges in designing staggered ALUs.

    The patent says that it was both granted and abandoned on the same day in 2006. Then, a few months later, another identical patent application was filed.

    This article shows that Potomac (another server-grade Pentium 4) is a 64-bit architecture and supports 4 64-bit operations per cycle. Yamhill and Jayhawk were canceled by Intel. (There is an error in the article: Nocona is a 64-bit CPU.)


    (1) In case the link goes down, the paper is titled "The Microarchitecture of the Pentium® 4 Processor" and authored by Glenn Hinton, et al.

    (2) Also known as the first-gen Pentium 4.

    (3) Also known as staggered ALU.

    (4) In case the link goes down, the paper is titled "Low-Voltage Swing Logic Circuits for a Pentium® 4 Processor Integer Core" and authored by Daniel J. Deleganes, et al.

    (5) Also known as the third-gen Pentium 4.

    (6) In case the link goes down, the paper is titled "A 4GHz 300mW 64b Integer Execution ALU with Dual Supply Voltages in 90nm CMOS" and authored by Sanu K. Mathew, et al.

    (7) In case the link goes down, the paper is titled "HIGH-PERFORMANCE ENERGY-EFFICIENT DUAL-SUPPLY ALU DESIGN" and authored by Sanu K. Mathew, et al.