I have the following function from a well known benchmark that I am compiling with gcc-arm-none-eabi-10-2020-q4-major
:
#include <unistd.h>
double b[1000], c[1000];
void tuned_STREAM_Scale(double scalar)
{
ssize_t j;
for (j = 0; j < 1000; j++)
b[j] = scalar* c[j];
}
I am using the following compiler options:
arm-none-eabi-gcc -O3 -mcpu=cortex-m7 -mthumb -mfloat-abi=hard -mfpu=fpv5-sp-d16 -c test.c
However, if I check the compiled code, the compiler seems unable to use a basic FPU multiply instruction, and just uses the __aeabi_dmul
function from libgcc
(we can however see that a FPU vmov
is used):
00000000 <tuned_STREAM_Scale>:
0: e92d 41f0 stmdb sp!, {r4, r5, r6, r7, r8, lr}
4: 4c08 ldr r4, [pc, #32] ; (28 <tuned_STREAM_Scale+0x28>)
6: 4d09 ldr r5, [pc, #36] ; (2c <tuned_STREAM_Scale+0x2c>)
8: f504 58fa add.w r8, r4, #8000 ; 0x1f40
c: ec57 6b10 vmov r6, r7, d0
10: e8f4 0102 ldrd r0, r1, [r4], #8
14: 4632 mov r2, r6
16: 463b mov r3, r7
18: f7ff fffe bl 0 <__aeabi_dmul>
1c: 4544 cmp r4, r8
1e: e8e5 0102 strd r0, r1, [r5], #8
22: d1f5 bne.n 10 <tuned_STREAM_Scale+0x10>
24: e8bd 81f0 ldmia.w sp!, {r4, r5, r6, r7, r8, pc}
If I compare with another compiler, the code is incomparably more efficient:
00000000 <tuned_STREAM_Scale>:
0: 4808 ldr r0, [pc, #32] ; (24 <tuned_STREAM_Scale+0x24>)
2: b580 push {r7, lr}
4: 4b06 ldr r3, [pc, #24] ; (20 <tuned_STREAM_Scale+0x20>)
6: 27c8 movs r7, #200 ; 0xc8
8: c806 ldmia r0!, {r1, r2}
a: ec42 1b11 vmov d1, r1, r2
e: ee20 1b01 vmul.f64 d1, d0, d1
12: 1e7f subs r7, r7, #1
14: ec52 1b11 vmov r1, r2, d1
18: c306 stmia r3!, {r1, r2}
1a: d1f5 bne.n 8 <tuned_STREAM_Scale+0x8>
1c: bd80 pop {r7, pc}
If I check inside gcc package the various libgcc
object files depending on CPU or FPU options, I cannot find any FPU instructions in __aeabi_dmul
or any other function.
I find very strange that gcc is not able to use a basic FPU multiplication, and I could not find in any documentation or README this limitation, so I am wondering if I am not doing anything wrong. I have checked older gcc versions and I still have this problem. Would it be due to gcc or to the compiled binaries from ARM?
The clue is in the compiler options you already posted:
-mfpu=fpv5-sp-d16
"sp" means single precision.
You told it not to generate hardware double instructions, which is correct for most Cortex-M7 processors because they can't execute them. If you have an M7 which can then you need to set the correct fpu argument.