Starting to learn assembly x86-64, I'm writing a program that gets an array of integers and does some calculations on it. The purpose isn't relevant to the question, but the calculations include multiplication and division, the size and sign of the integers are not known, and the program should handle any case.
I wonder what is the correct way to do it? Should I extend the integers to the x64 register size or should I work with the EDX:EAX extensions?
My first try is:
movl (%r8), %edi #r8 holds the memory address of the array
movl 4(%r8), %r10d # assume those are valid elements in the array, the check is done above
movl 8(%r8), %eax
imul %edi, %eax
imul %r10d, %r10d
cdq
idiv %r10d
but from my understanding, with these instructions if the multiplication causes overflow, eax
becomes 0 and the carry flag turned on, so dividing by r10d
can cause SIGFPE due to division by 0.
What is the better approach to that? I tend towards working with the 64-bit registers so it will be easier to handle all the cases but I may be wrong.
x86-64 is a 64-bit architecture, so efficient bigint code should normally work in 64-bit chunks. e.g. imul %rdi
to set 128-bit RDX:RAX
to the product of RDI * RAX
. Use the one-operand form of imul
for that (https://www.felixcloutier.com/x86/imul). The forms with more operands like you're using don't implicitly write EDX or RDX.
If 64 bits is wide enough, you can use non-widening 64-bit math after sign-extending your inputs with loads like movslq (%r8), %rax
. Then 2-operand imul
to multiply without disturbing RDX. You'll want cqo
/ idiv
if you want to do 64-bit division. AT&T syntax calls it cqto
but GAS accepts either mnemonic.
(If values in memory can simply be 64-bit, you can do stuff like imulq 8(%r8), %rax
instead of using mov
to load.)
If you know that the quotient will fit in 32 bits, mov 8(%r8), %eax
/ imull (%r8)
will get your 64-bit product in EDX:EAX (note the explicit l
operand-size suffix on the imul
), which sets up for 32-bit operand-size idivl 4(%r8)
. In that case you don't use cdq
, that would replace your 64-bit product with the sign-extension of the low half. (When and why do we sign extend and use cdq with mul/div? / Why should EDX be 0 before using the DIV instruction?).
But using the full 64-bit width of the dividend for 32-bit operand-size division makes it possible for the quotient to not fit in the operand-size (for cases other than INT_MIN
/ -1
or division by 0), in which case you'll get a #DE exception. (And POSIX OSes including Linux will deliver SIGFPE to your process.)
Also, that limits your divisor to 32-bit. You say that squaring r10d
can overflow, so that's another reason it's not an option.
64-bit operand-size for division is a lot slower on Intel CPUs before Ice Lake, but if your divisor might not fit in 32 bits its your only option. (Other than branching on the size of the divisor, which could be a win on older Intel CPUs; some clang versions / optimization options do that.)
With signed 32-bit inputs, using 64-bit integers for temporaries. Keeping the 64-bit integers in single registers for simplicity and efficiency. (two-operand imul
is a single uop, vs. one-operand is 2 or 3 uops on Intel since it has to write 2 output registers and split up the output of the multiplier unit. https://uops.info/)
movslq (%r8), %rdi # sign-extend long (32-bit) to quad (64-bit)
movslq 4(%r8), %r10
movslq 8(%r8), %rax
imul %rdi, %rax # 64-bit non-widening multiplies can't overflow since the values are 32-bit
imul %r10, %r10
cqto # signed division of RAX by R10
idiv %r10
Even INT_MIN
(-2^31
) squared still fits in int64_t
, so these multiplies can't result in overflow. There was no need for imul %rdi
which would write the full 128-bit product into RDX:RAX. But that would avoid the need for cqto
in this case since x86 integer division does need a double-width dividend.
That would actually give us smaller machine code and be break-even for uop count on recent Intel and AMD CPUs (https://uops.info/) where imul r64
is "only" 2 uops, and latency for writing RDX is just one cycle after the RAX result is ready (so same as cqo
).
If you make your original inputs 64-bit, using widening multiply has the advantage that the dividend can be larger than 64-bit without faulting, as long as the divisor is large enough for the quotient to fit in 64 bits.