
How to do a 32×32-bit multiplication on Cortex-M3


I'm doing 32-bit multiplication on a Cortex-M3 controller using the ARM "umull" instruction. I wrote a function that takes two 32-bit values and multiplies them with "umull", which delivers the result in two 32-bit registers, RdLo and RdHi. How can I get the complete 64-bit result?

I want to return a 64-bit result from that function.

long mulhi1, mullo1;

unsigned long Multiply(long i, long j)
{
  asm ("umull  %0, %1, %[input_i], %[input_j];"
       : "=r" (mullo1), "=r" (mulhi1)
       : [input_i] "r" (i), [input_j] "r" (j)
       : /* No clobbers */
  );
}

I'm expecting a 64-bit result as the return value from that function, but "umull" gives the result in mullo1 and mulhi1, two separate 32-bit values.

What changes do I have to make to get a 64-bit result?


Solution

  • Method 1:

    Use a union of a 64-bit integer and two 32-bit integers to receive the result:

    #include <stdint.h>
    
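    /* Overlay a 64-bit integer with its two 32-bit halves. */
    union dw {
        uint64_t dword;
        struct {            /* anonymous struct: C11 or GNU extension */
    #ifdef __ARMEB__        /* big-endian ARM: high word comes first in memory */
            uint32_t high_word;
            uint32_t low_word;
    #else                   /* little-endian: low word comes first in memory */
            uint32_t low_word;
            uint32_t high_word;
    #endif
        };
    };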
    
    uint64_t umull(uint32_t op1, uint32_t op2) {
        union dw result;
        asm volatile(
            "umull %[result_low], %[result_high], %[operand_1], %[operand_2]"
            :[result_low] "=r" (result.low_word), [result_high] "=r" (result.high_word)
            :[operand_1] "r" (op1), [operand_2] "r" (op2)
        );
        return result.dword;
    }
    

    Also note that unsigned long is a 32-bit type on 32-bit ARM, so it cannot hold the 64-bit product. Use unsigned long long, or uint64_t from <stdint.h>, as the return type.


  • Method 2:

    Use the Q and R operand modifiers on a 64-bit integer in the assembly template to select its low and high halves, respectively.

    (This is based on an answer to "How can I get the ARM MULL instruction to produce its output in a uint64_t in gcc?", of which this question might be a duplicate. Answers there also show how to combine the 32-bit halves into a uint64_t with a shift and an OR, which the compiler optimizes away on a 32-bit machine.)

    uint64_t umull(uint32_t op1, uint32_t op2) {
        uint64_t result;
        asm volatile(
            "umull %Q[dwresult], %R[dwresult], %[operand_1], %[operand_2]"
            :[dwresult] "=r" (result)
            :[operand_1] "r" (op1), [operand_2] "r" (op2)
        );
        return result;
    }
    

    This is not exactly well documented; one has to browse the GCC sources to find it.
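
The shift-and-OR approach mentioned under Method 2 can be sketched roughly as follows. This is a minimal illustration, not code taken from the linked answers: the names combine_halves and umull_shift_or are made up for this example, and the non-ARM fallback branch is an assumption added only so that the snippet also builds and runs with a host compiler.

```c
#include <stdint.h>

/* Hypothetical helper: join two 32-bit halves into one 64-bit value. */
static inline uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    /* Widen the high half before shifting, otherwise the shift by 32
       would be done in 32-bit arithmetic and be undefined. */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical wrapper: unsigned 32x32 -> 64-bit multiply. */
uint64_t umull_shift_or(uint32_t op1, uint32_t op2)
{
#if defined(__GNUC__) && defined(__arm__)
    /* 32-bit ARM with GCC-style inline asm: umull writes the low and
       high words of the product into two separate registers. */
    uint32_t lo, hi;
    __asm__("umull %[lo], %[hi], %[a], %[b]"
            : [lo] "=r" (lo), [hi] "=r" (hi)
            : [a] "r" (op1), [b] "r" (op2));
    return combine_halves(lo, hi);
#else
    /* Assumed fallback for other targets: plain C widening multiply. */
    return (uint64_t)op1 * op2;
#endif
}
```

On a 32-bit target the shift and OR typically cost nothing, because the two halves already sit in the registers that make up the 64-bit return value. The fallback branch also hints at a simpler option: a compiler targeting Cortex-M3 will usually turn the plain C expression (uint64_t)op1 * op2 into a single umull without any inline assembly.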