gcc, floating-point, cortex-m, fpu

No FPU support with gcc for ARM Cortex M?


I have the following function from a well-known benchmark, which I am compiling with gcc-arm-none-eabi-10-2020-q4-major:

#include <unistd.h>

double b[1000], c[1000];

void tuned_STREAM_Scale(double scalar)
{
    ssize_t j;
    for (j = 0; j < 1000; j++)
        b[j] = scalar* c[j];
}

I am using the following compiler options:

arm-none-eabi-gcc -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-sp-d16 -c test.c

However, when I inspect the compiled code, the compiler does not emit an FPU multiply instruction and instead calls the __aeabi_dmul function from libgcc (although we can see that an FPU vmov is used):

00000000 <tuned_STREAM_Scale>:
   0:   e92d 41f0       stmdb   sp!, {r4, r5, r6, r7, r8, lr}
   4:   4c08            ldr     r4, [pc, #32]   ; (28 <tuned_STREAM_Scale+0x28>)
   6:   4d09            ldr     r5, [pc, #36]   ; (2c <tuned_STREAM_Scale+0x2c>)
   8:   f504 58fa       add.w   r8, r4, #8000   ; 0x1f40
   c:   ec57 6b10       vmov    r6, r7, d0
  10:   e8f4 0102       ldrd    r0, r1, [r4], #8
  14:   4632            mov     r2, r6
  16:   463b            mov     r3, r7
  18:   f7ff fffe       bl      0 <__aeabi_dmul>
  1c:   4544            cmp     r4, r8
  1e:   e8e5 0102       strd    r0, r1, [r5], #8
  22:   d1f5            bne.n   10 <tuned_STREAM_Scale+0x10>
  24:   e8bd 81f0       ldmia.w sp!, {r4, r5, r6, r7, r8, pc}

By comparison, the code generated by another compiler is far more efficient:

00000000 <tuned_STREAM_Scale>:
   0:   4808            ldr     r0, [pc, #32]   ; (24 <tuned_STREAM_Scale+0x24>)
   2:   b580            push    {r7, lr}
   4:   4b06            ldr     r3, [pc, #24]   ; (20 <tuned_STREAM_Scale+0x20>)
   6:   27c8            movs    r7, #200        ; 0xc8
   8:   c806            ldmia   r0!, {r1, r2}
   a:   ec42 1b11       vmov    d1, r1, r2
   e:   ee20 1b01       vmul.f64        d1, d0, d1
  12:   1e7f            subs    r7, r7, #1
  14:   ec52 1b11       vmov    r1, r2, d1
  18:   c306            stmia   r3!, {r1, r2}
  1a:   d1f5            bne.n   8 <tuned_STREAM_Scale+0x8>
  1c:   bd80            pop     {r7, pc}

If I inspect the libgcc object files shipped in the gcc package for the various CPU and FPU options, I cannot find any FPU instructions in __aeabi_dmul or in any other function.

I find it very strange that gcc cannot use a basic FPU multiplication, and I could not find this limitation mentioned in any documentation or README, so I am wondering whether I am doing something wrong. I have checked older gcc versions and they show the same problem. Is this due to gcc itself or to the binaries compiled by ARM?


Solution

  • The clue is in the compiler options you already posted:

    -mfpu=fpv5-sp-d16: the "sp" means single precision.

    You told it not to generate hardware double-precision instructions, which is correct for the Cortex-M7 parts whose FPU is single-precision only, because they can't execute them. If your M7 has a double-precision FPU, you need to set the matching -mfpu argument, for example -mfpu=fpv5-d16.
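
One way to check what a given -mfpu setting actually enables is to look at the ACLE predefined macro __ARM_FP. The sketch below is not from the original post and makes two assumptions: the toolchain defines __ARM_FP as the ACLE describes (bit 0x04 for hardware single precision, bit 0x08 for hardware double precision), and the helper names fp_has_single, fp_has_double, scale_f32 and report_fp_support are hypothetical. With -mfpu=fpv5-sp-d16 the double-precision bit should be clear, so a float kernel such as scale_f32 can use the FPU while a double kernel falls back to __aeabi_dmul.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* ACLE: __ARM_FP is a bit mask of the precisions supported in hardware.
   It is only defined when targeting Arm with an FPU; use 0 otherwise. */
#ifdef __ARM_FP
#define TARGET_ARM_FP ((unsigned)__ARM_FP)
#else
#define TARGET_ARM_FP 0u
#endif

#define ARM_FP_SINGLE 0x04u /* 32-bit float supported in hardware  */
#define ARM_FP_DOUBLE 0x08u /* 64-bit double supported in hardware */

/* Hypothetical helpers: decode an __ARM_FP-style mask. */
int fp_has_single(unsigned mask) { return (mask & ARM_FP_SINGLE) != 0; }
int fp_has_double(unsigned mask) { return (mask & ARM_FP_DOUBLE) != 0; }

/* Single-precision variant of the scale kernel from the question
   (hypothetical name). It only needs a single-precision FPU. */
void scale_f32(float *dst, const float *src, float scalar, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = scalar * src[i];
}

/* Print what the current compiler flags enable. */
void report_fp_support(void)
{
    printf("hardware float:  %s\n",
           fp_has_single(TARGET_ARM_FP) ? "yes" : "no");
    printf("hardware double: %s\n",
           fp_has_double(TARGET_ARM_FP) ? "yes" : "no");
}
```

Calling report_fp_support() from a test program built with the same flags as the benchmark shows which precisions the compiler may use in hardware. On a bare-metal target printf needs a working C library or semihosting setup, so there it may be simpler to inspect TARGET_ARM_FP in a debugger.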