assembly integer x86-64 multiplication division

Working with 64-bit products and quotients of 32-bit integers in assembly x86-64

Starting to learn assembly x86-64, I'm writing a program that gets an array of integers and does some calculations on it. The purpose isn't relevant to the question, but the calculations include multiplication and division, the size and sign of the integers are not known, and the program should handle any case.
I wonder what is the correct way to do it? Should I extend the integers to the x64 register size or should I work with the EDX:EAX extensions?

My first try is:

    movl (%r8), %edi #r8 holds the memory address of the array
    movl 4(%r8), %r10d # assume those are valid elements in the array, the check is done above 
    movl 8(%r8), %eax 
    imul %edi, %eax
    imul %r10d, %r10d
    cdq
    idiv %r10d

but from my understanding, with these instructions if the multiplication causes overflow, eax becomes 0 and the carry flag turned on, so dividing by r10d can cause SIGFPE due to division by 0.
What is the better approach to that? I tend towards working with the 64-bit registers so it will be easier to handle all the cases but I may be wrong.

Solution

x86-64 is a 64-bit architecture, so efficient bigint code should normally work in 64-bit chunks. e.g. imul %rdi to set 128-bit RDX:RAX to the product of RDI * RAX. Use the one-operand form of imul for that (https://www.felixcloutier.com/x86/imul). The forms with more operands like you're using don't implicitly write EDX or RDX.

If 64 bits is wide enough, you can use non-widening 64-bit math after sign-extending your inputs with loads like movslq (%r8), %rax. Then 2-operand imul to multiply without disturbing RDX. You'll want cqo / idiv if you want to do 64-bit division. AT&T syntax calls it cqto but GAS accepts either mnemonic.

(If values in memory can simply be 64-bit, you can do stuff like imulq 8(%r8), %rax instead of using mov to load.)

If you know that the quotient will fit in 32 bits, mov 8(%r8), %eax / imull (%r8) will get your 64-bit product in EDX:EAX (note the explicit l operand-size suffix on the imul), which sets up for 32-bit operand-size idivl 4(%r8). In that case you don't use cdq, that would replace your 64-bit product with the sign-extension of the low half. (When and why do we sign extend and use cdq with mul/div? / Why should EDX be 0 before using the DIV instruction?).

But using the full 64-bit width of the dividend for 32-bit operand-size division makes it possible for the quotient to not fit in the operand-size (for cases other than INT_MIN / -1 or division by 0), in which case you'll get a #DE exception. (And POSIX OSes including Linux will deliver SIGFPE to your process.)

Also, that limits your divisor to 32-bit. You say that squaring r10d can overflow, so that's another reason it's not an option.

64-bit operand-size for division is a lot slower on Intel CPUs before Ice Lake, but if your divisor might not fit in 32 bits its your only option. (Other than branching on the size of the divisor, which could be a win on older Intel CPUs; some clang versions / optimization options do that.)

With signed 32-bit inputs, using 64-bit integers for temporaries. Keeping the 64-bit integers in single registers for simplicity and efficiency. (two-operand imul is a single uop, vs. one-operand is 2 or 3 uops on Intel since it has to write 2 output registers and split up the output of the multiplier unit. https://uops.info/)

    movslq (%r8), %rdi   # sign-extend long (32-bit) to quad (64-bit)
    movslq 4(%r8), %r10
    movslq 8(%r8), %rax 

    imul   %rdi, %rax    # 64-bit non-widening multiplies can't overflow since the values are 32-bit
    imul   %r10, %r10

    cqto                # signed division of RAX by R10
    idiv   %r10

Even INT_MIN (-2^31) squared still fits in int64_t, so these multiplies can't result in overflow. There was no need for imul %rdi which would write the full 128-bit product into RDX:RAX. But that would avoid the need for cqto in this case since x86 integer division does need a double-width dividend.

That would actually give us smaller machine code and be break-even for uop count on recent Intel and AMD CPUs (https://uops.info/) where imul r64 is "only" 2 uops, and latency for writing RDX is just one cycle after the RAX result is ready (so same as cqo).

If you make your original inputs 64-bit, using widening multiply has the advantage that the dividend can be larger than 64-bit without faulting, as long as the divisor is large enough for the quotient to fit in 64 bits.