assembly x86-64 att

imulw from memory, but into a larger register – in one command?


I've been playing around with assembly a little (specifically: AT&T, x86-64).
My data section looks like so:

.section .data
num: .short 0b1111111111111111  #16 ones, so the maximum unsigned short value
junk: .quad 0x5555555555555555

Within %rax, I have a zero-extended 16-bit value for which it is guaranteed that its product with the value of num fits in 32 bits. I want to do something along these lines:
imulw (num), %eax
In other words: grab the word (short) within num, multiply it by the value within %ax, but then have the result – which could require up to 32 bits, being the size of %eax – stored within %eax. I want to do it with a single instruction, without storing anything in an intermediate register or anything of the sort. In order to have %eax as my destination register I have to use imull, but num is just a short, which means I would fetch too many bytes from memory if I did it this way (in particular, I'd be grabbing some of those 0x55 bytes following the appropriately named "junk" label).

Using mul or the single-operand version of imul is also (seemingly?) out of the question because I need the final result in %rax.

Is there some way to accomplish the above?

TL;DR: Is there a single instruction for multiplying 2 bytes stored in the memory by a 2 byte register, and storing the result in a 4 byte register, rather than truncating it?


Solution

  • Almost all x86 instructions require their operands to be the same size, except movzx / movsx (AT&T movsbl / movswl etc.), SIMD broadcasts, and some other special cases.

    x86-64 is already cramped for opcode bits; there aren't enough to also give every instruction the ability to zero- or sign-extend a narrower 8- or 16-bit source. That would take 2 bits for those possibilities, plus another bit to signal that the source was narrow at all. (Or a second opcode for some instructions to allow a narrow-source form.) 8086 used most of the coding space (possible 1-byte opcodes), and only left a little room for future expansion in simple/sensible ways.


    Multiply is special only in that a widening form is available, but that's usually not what you want when the product could fit in a single register. As you say, mulw num(%rip) would do DX:AX = AX*num, with a false dependency writing the low 16 bits of RDX while leaving the upper bits unmodified. That also ignores the upper 16 bits of the EAX source, so it's not equivalent.

    And you'd need something like shl $16, %edx / mov %ax, %dx to get a 32-bit value in EDX, zero-extended to RDX by the SHL writing 32-bit EDX. (The upper bytes of RAX would still hold the original garbage, so or %eax, %edx or or %edx, %eax wouldn't be correct). So it's more efficient to just do a zero-extending load first.

       movzwl num(%rip), %edx
       imul   %edx, %eax          # This is your best option.
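
    For comparison, the widening-multiply route described above could be spelled out roughly as follows. This is only a sketch: the label mul_num_wide and the convention that the 16-bit input arrives in %di (as in the System V calling convention) are assumptions for illustration, not something from the question.

    ```asm
            .section .data
    num:    .short 0xFFFF

            .section .text
            .globl  mul_num_wide        # hypothetical label, for illustration only
    mul_num_wide:
            movzwl  %di, %eax           # assumed: the 16-bit input arrives in DI
            mulw    num(%rip)           # DX:AX = AX * num; bits 63:16 of RDX and RAX are left unmodified
            shll    $16, %edx           # high half -> bits 31:16; the 32-bit write zero-extends into RDX
            movw    %ax, %dx            # merge the low half into bits 15:0
            movl    %edx, %eax          # return the 32-bit product in EAX
            ret
    ```

    Compared with the movzwl / imul pair, this needs more instructions and a partial-register merge, and it clobbers %rdx either way.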
    

    If you really want to avoid the zero-extending load, you'll need to reserve more memory for num with two 0 bytes after the part of the value you want to use. Beware that a 32-bit load right after a 16-bit store will have extra latency, a store-forwarding stall. But if the narrow store has time to commit to cache there's no penalty.
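
    A minimal sketch of that padded layout, assuming you control how num is defined. The label mul_num_padded and the %di input convention are hypothetical, chosen only to make the fragment a complete function.

    ```asm
            .section .data
    num:    .short 0xFFFF               # the 16-bit value you actually care about
            .short 0                    # two zero bytes of padding
    junk:   .quad 0x5555555555555555

            .section .text
            .globl  mul_num_padded      # hypothetical label, for illustration only
    mul_num_padded:
            movzwl  %di, %eax           # assumed: the 16-bit input arrives in DI
            imull   num(%rip), %eax     # the 32-bit load reads num plus the zero padding, not junk
            ret
    ```

    Since x86 is little-endian, declaring num as .long 0xFFFF would produce the same four bytes in memory.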


    @fuz points out the fun fact that legacy x87 floating point can use narrow integer source operands, like fimuls num(%rip) to do %st(0) *= int16_to_longdouble(num) (s for short, unlike the usual w for word).
    fimull uses a 32-bit integer source.
    Even with x86-64 there is no form that takes a 64-bit integer source (https://www.felixcloutier.com/x86/fmul:fmulp:fimul), so you'd have to fildq / fmulp.
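
    Purely as an illustration of the x87 narrow-source form, a sketch might look like the following. The labels factor, result, and x87_mul_num are hypothetical, and the values are arbitrary. Note that fimuls treats its memory operand as a signed 16-bit integer, so the 0xFFFF from the question would be read as -1; a small positive value is used here instead.

    ```asm
            .section .data
    num:    .short 1234                 # read by fimuls as a signed 16-bit integer
    factor: .long 5678                  # hypothetical 32-bit multiplicand
    result: .quad 0

            .section .text
            .globl  x87_mul_num         # hypothetical label, for illustration only
    x87_mul_num:
            fildl   factor(%rip)        # push factor, converted from a 32-bit integer
            fimuls  num(%rip)           # st(0) *= num, converted from a 16-bit integer
            fistpll result(%rip)        # store st(0) as a 64-bit integer and pop
            movq    result(%rip), %rax  # return the product in RAX
            ret
    ```

    This round trip through memory and the x87 stack is far slower than movzwl / imul, so it is a curiosity rather than a practical alternative.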