Tags: x86, cpu, alu, 128-bit, int128

Is there hardware support for 128-bit integers in modern processors?


Do we still need to emulate 128-bit integers in software, or is there hardware support for them in your average desktop processor these days?


Solution

  • I'm going to explain it by comparing desktop processors to simple microcontrollers, because the operation of their arithmetic logic units (ALUs), the calculators inside the CPU, is similar, and by comparing the Microsoft x64 calling convention to the System V calling convention. For the short answer, scroll to the end; the long answer is that the difference is easiest to see by comparing x86/x64 to ARM and AVR:

    Long Answer

    Native Double Word Integer Multiply Architecture Support Comparison

    | CPU               | word x word => dword           | dword x dword => dword          |
    | ----------------- | ------------------------------ | ------------------------------- |
    | M0                | No (only 32x32 => 32)          | No                              |
    | AVR               | 8x8 => 16 (some versions only) | No                              |
    | M3/M4/A           | Yes (32x32 => 64)              | No                              |
    | x86/x64           | Yes (up to 64x64 => 128)       | Yes (up to 64x64 => 64 for x64) |
    | SSE/SSE2/AVX/AVX2 | Yes (32x32 => 64 SIMD elements) | No (at most 32x32 => 32 SIMD elements) |

    If you understand this chart, skip to Short Answer

    CPUs in smartphones, PCs, and servers have multiple ALUs that perform calculations on registers of various widths. Microcontrollers, on the other hand, usually have only one ALU. The word size of the CPU is not necessarily the same as the word size of the ALU, though they may match; the Cortex-M0 is a prime example of both being 32 bits.

    ARM Architecture

    The Cortex-M0 is a Thumb-2 processor, i.e. it uses a compact (mostly 16-bit) instruction encoding for a 32-bit architecture (32-bit registers and ALU). The Cortex-M3/M4 add more instructions, including smull / umull, 32x32 => 64-bit widening multiplies that are helpful for extended precision. Despite these differences, all ARM CPUs share the same set of architectural registers, which makes it easy to upgrade from the M0 to the M3/M4 and to the faster Cortex-A series smartphone processors with NEON SIMD extensions.

    ARM Architectural Registers


    When performing a binary operation, it is common for the value to overflow a register (i.e. get too large to fit in the register). An ALU has n-bit inputs and an n-bit output, plus a carry-out (i.e. overflow) flag.

    [image: diagram of an ALU with n-bit inputs, an n-bit output, and a carry-out flag]

    Multi-word addition cannot be performed in one instruction, but it requires relatively few instructions. Multiplication is harder: the result needs double the word size, and the ALU only has n-bit inputs and n-bit outputs when you need 2n output bits. For example, multiplying two 32-bit integers produces a 64-bit result, and two 64-bit integers produce up to a 128-bit result spanning 4 word-sized registers; 2 registers is not bad, but 4 gets complicated and you run out of registers. How the CPU handles this differs by architecture: the Cortex-M0 has no instruction for it at all, while the Cortex-M3/M4 have a 32x32 => 64-bit register multiply instruction that takes 3 clock cycles.

    (You can use Cortex-M0's 32x32 => 32-bit muls as a 16x16=>32-bit building block for larger multiplies; this is obviously inefficient but probably still better than manually shifting and conditionally adding.)
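The building-block idea above can be sketched in C: a 32x32 => 64-bit unsigned multiply composed from 16x16 => 32-bit partial products, the way a compiler must do it on a core whose widest multiply instruction is non-widening. (The function name is mine, not a standard API, and the final accumulation uses 64-bit adds for brevity; a real M0 would do those with adds/adcs pairs.)

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 => 64 built from 16x16 => 32 partial products (schoolbook method).
   Illustrative sketch, not a standard API. */
uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t lo   = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t mid1 = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t mid2 = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t hi   = a_hi * b_hi;   /* contributes to bits 32..63 */

    uint64_t r = ((uint64_t)hi << 32) | lo;
    r += (uint64_t)mid1 << 16;     /* 64-bit adds here stand in for */
    r += (uint64_t)mid2 << 16;     /* the adds/adcs carry chain     */
    return r;
}
```

Four narrow multiplies plus a carry chain replace the one widening multiply that M3/M4 class cores get in hardware.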

    AVR Architecture

    The AVR microcontroller has 131 instructions that work on 32 8-bit registers. It is classified as an 8-bit processor by its register width, but it has both an 8-bit and a 16-bit ALU. It cannot do 16x16 => 32-bit calculations with two 16-bit register pairs, or 64-bit integer math, without a software hack. This is the opposite of the x86/x64 design in both the organization of registers and the ALU's overflow behavior, and it is why AVR is classified as an 8/16-bit CPU. Why do you care? It affects performance and interrupt behavior.

    AVR "tiny" parts and other devices without the "enhanced" instruction set don't have hardware multiply at all. Where it is supported, the mul instruction is an 8x8 => 16-bit hardware multiply; https://godbolt.org/z/7bbqKn7Go shows how GCC uses it.

    AVR Architectural Registers


    x86 Architecture

    On x86, multiplying two 32-bit integers with the MUL instruction produces an unsigned 64-bit result in the EDX:EAX pair; on x64, a 64x64 multiply produces a 128-bit result in the RDX:RAX pair.

    Adding 64-bit integers on x86 requires only two instructions (add/adc thanks to the carry flag), same for 128-bit on x86-64. But multiplying two-register integers takes more work.
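The add/adc pattern can be sketched in portable C: a 128-bit addition done as two 64-bit additions, with the carry out of the low half fed into the high half. (The struct and function names are illustrative, not a standard API; on x86-64 a compiler typically turns this into exactly an add followed by an adc.)

```c
#include <assert.h>
#include <stdint.h>

/* Two-register (double-word) addition with manual carry propagation.
   Illustrative sketch of the add/adc idiom, not a standard API. */
typedef struct { uint64_t lo, hi; } u128;

u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    uint64_t carry = r.lo < a.lo;  /* unsigned wraparound => carry flag */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

The same two-instruction shape works at any width where the hardware gives you a carry flag, which is why extended-precision addition is cheap even when multiplication is not.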

    On 32-bit x86, for example, 64x64 => 64-bit multiplication (long long) requires A LOT of instructions, including 3 multiplies (the low x low multiply is widening, the cross products are not, because we don't need the high 64 bits of the full result). Here is an example of 32x64 => 64-bit signed multiply assembly for x86:

     movl 16(%ebp), %esi    ; get y_l
     movl 12(%ebp), %eax    ; get x_l
     movl %eax, %edx
     sarl $31, %edx         ; get x_h, (x >>a 31), higher 32 bits of sign-extension of x
     movl 20(%ebp), %ecx    ; get y_h
     imull %eax, %ecx       ; compute s: x_l*y_h
     movl %edx, %ebx
     imull %esi, %ebx       ; compute t: x_h*y_l
     addl %ebx, %ecx        ; compute s + t
     mull %esi              ; compute u: x_l*y_l
     leal (%ecx,%edx), %edx ; u_h += (s + t), result is u
     movl 8(%ebp), %ecx
     movl %eax, (%ecx)
     movl %edx, 4(%ecx)
    
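The listing above can be rendered in C so the three multiplies are easier to see. The names s, t, and u mirror the comments in the assembly: s and t are the non-widening cross products, u is the widening low x low multiply (x86's single mull instruction). The function itself is illustrative, not from the original post.

```c
#include <assert.h>
#include <stdint.h>

/* 32x64 => 64-bit signed multiply from 32-bit pieces, mirroring the
   x86 listing above. Illustrative sketch. */
int64_t mul32x64(int32_t x, int64_t y)
{
    uint32_t x_l = (uint32_t)x;
    uint32_t x_h = (uint32_t)(x >> 31);           /* sarl $31: sign extension  */
    uint32_t y_l = (uint32_t)y;
    uint32_t y_h = (uint32_t)((uint64_t)y >> 32);

    uint32_t s = x_l * y_h;                       /* imull: low 32 bits only   */
    uint32_t t = x_h * y_l;                       /* imull: low 32 bits only   */
    uint64_t u = (uint64_t)x_l * y_l;             /* mull: widening 32x32 => 64 */

    /* leal: fold the cross products into the high half of u */
    return (int64_t)(u + ((uint64_t)(s + t) << 32));
}
```

Only the low x low product needs to be widening because the cross products land entirely in the upper half of the 64-bit result, where their own high bits are discarded anyway.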

    x86 supports pairing two registers to store the full multiply result (including the high half), but you can't use the register pair to do the job of a 64-bit ALU. This is the primary reason x64 software runs faster than x86 software for 64-bit or wider integer math: the work gets done in a single instruction! You can imagine that 128-bit multiplication in x86 mode would be very computationally expensive; it is. x64 is very similar to x86, just with twice the number of bits.

    x86 Architectural Registers

    x86 Architectural Registers

    x64 Architectural Registers

    x64 Architectural Registers

    When a CPU pairs two word-sized registers to hold a single double-word value, the resulting value is aligned to a word boundary when stored in RAM on the stack. Beyond a two-register pair, four-word math is a software hack. For x64 this means two 64-bit registers may be combined into a 128-bit register pair whose value gets aligned to a 64-bit word boundary in RAM, but 128x128 => 128-bit math is a software hack.

    The x86/x64, however, is a superscalar CPU, and the registers you know of are merely the architectural registers. Behind the scenes there are many more registers that help optimize the CPU pipeline to execute instructions out of order using multiple ALUs.

    SSE/SSE2 introduced 128-bit SIMD registers, but no instructions treat them as a single wide integer. There's paddq that does two 64-bit additions in parallel, but no hardware support for 128-bit addition, or even support for manually propagating carry across elements. The widest multiply is two 32x32=>64 operations in parallel, half the width of what you can do with x86-64 scalar mul. See Can long integer routines benefit from SSE? for the state of the art, and the hoops you have to jump through to get any benefit from SSE/AVX for very big integers.
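The point about paddq can be demonstrated with SSE2 intrinsics (this sketch assumes an x86 target with SSE2; the function name is mine): the two 64-bit lanes are added independently, and a carry out of the low lane is simply lost, so the instruction is not a 128-bit add.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* paddq via _mm_add_epi64: two independent 64-bit adds in one XMM register.
   Illustrative sketch; requires an x86 target with SSE2. */
void two_64bit_adds(uint64_t out[2])
{
    __m128i a = _mm_set_epi64x(0, UINT64_MAX);  /* args: high lane, low lane */
    __m128i b = _mm_set_epi64x(0, 1);
    __m128i r = _mm_add_epi64(a, b);  /* low lane wraps to 0; no carry into high */
    _mm_storeu_si128((__m128i *)out, r);        /* out[0] = low, out[1] = high */
}
```

If paddq were a 128-bit add, the high lane would come out as 1; instead both lanes are 0, because the carry out of the low element disappears.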

    Even with AVX-512 (512-bit registers), the widest add / mul element size is still 64 bits. AVX-512 did at least introduce a 64x64 => 64-bit multiply on SIMD elements (vpmullq).

    Short Answer

    The way C++ applications handle 128-bit integers differs based on the operating system's (or bare-metal) calling convention. Microsoft has its own convention in which, much to my dismay, a 128-bit result CANNOT be returned from a function as a single value. The Microsoft x64 calling convention dictates that a function may return one 64-bit integer or two 32-bit integers. For example, you can do word * word = dword, but in Visual C++ you must use _umul128 and receive the HighProduct through a pointer, even though the hardware puts it in the RDX:RAX pair. I cried, it was sad. :-(
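The shape of MSVC's _umul128 interface looks like this: the low 64 bits come back as the return value and the high 64 bits through an out-pointer, because the convention won't return a 128-bit value in RDX:RAX. To keep the sketch self-contained and portable it is re-implemented here with GCC/clang's unsigned __int128 (the function name my_umul128 is mine; on MSVC you would call _umul128 from <intrin.h> directly).

```c
#include <assert.h>
#include <stdint.h>

/* Same interface shape as MSVC's _umul128: full 64x64 => 128 multiply,
   low half returned, high half via out-pointer. Portable re-implementation
   using the GCC/clang unsigned __int128 extension. */
uint64_t my_umul128(uint64_t a, uint64_t b, uint64_t *high)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *high = (uint64_t)(p >> 64);  /* the RDX half */
    return (uint64_t)p;           /* the RAX half */
}
```

On x86-64 the body compiles down to a single mul instruction; the out-pointer exists purely to satisfy the calling convention.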

    The System V calling convention, however, does allow returning 128-bit types in RDX:RAX: https://godbolt.org/z/vdd8rK38e. (And GCC / clang have __int128 to get the compiler to emit the necessary 2-register add/sub/mul instructions inline, plus a helper function for division - Is there a 128 bit integer in gcc?)
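With GCC or clang on an x86-64 System V target, __int128 lets you return the full product by value; the compiler places it in RDX:RAX with no out-parameter needed. (The function name mul_full is illustrative.)

```c
#include <assert.h>

/* Full 64x64 => 128-bit signed multiply, returned by value. Under the
   System V x86-64 ABI the __int128 result comes back in RDX:RAX; the
   compiler emits a single one-operand imul for the body. Requires the
   GCC/clang __int128 extension. */
__int128 mul_full(long long a, long long b)
{
    return (__int128)a * b;
}
```

This is exactly the interface the Microsoft x64 convention forbids, which is why the _umul128 intrinsic exists on that side.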

    As for whether you should count on 128-bit integer support: it's extremely rare these days to come across a user on a 32-bit x86 CPU, because they are too slow, so it is not best practice to design software around them; it increases development costs and can degrade the user experience. Expect an Athlon 64 or Core 2 Duo as the minimum spec. You can also expect the code to not perform as well under the Microsoft convention as on Unix OSes.

    The Intel architectural registers are set in stone, but Intel and AMD are constantly rolling out new architecture extensions; compilers and apps take a long time to adopt them, so you can't count on them for cross-platform code. You'll want to read the Intel 64 and IA-32 Architectures Software Developer's Manual and the AMD64 Architecture Programmer's Manual.