I'm confused by an auto-vectorization result. The following code addtest.c
#include <stdio.h>
#include <stdlib.h>

#define ELEMS 1024

int
main()
{
    float data1[ELEMS], data2[ELEMS];
    for (int i = 0; i < ELEMS; i++) {
        data1[i] = drand48();
        data2[i] = drand48();
    }
    for (int i = 0; i < ELEMS; i++)
        data1[i] += data2[i];
    printf("%g\n", data1[ELEMS-1]);
    return 0;
}
is compiled with gcc 11.1.0 by
gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c
and the add-to loop is auto-vectorized as
.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3
This is clear: load from data1, load and add from data2, store to data1, and in between, advance the indices.
If I pass the same code to https://godbolt.org, select x86-64 gcc-11.1 and options -O3 -march=haswell, I get the following assembly code:
.L3:
    vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
    vaddps ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
    vmovaps YMMWORD PTR [rbp-8240], ymm1
    vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
    add rax, 32
    cmp rax, 4096
    jne .L3
One surprising thing is the different address handling, but the thing that confuses me completely is the additional store to [rbp-8240]. This location is never used again, as far as I can see.
If I select gcc 7.5 on godbolt, the superfluous store disappears (but from 8.1 upwards, it is produced).
So my questions are:
Thanks a lot for your help!
The difference-maker is -fpie, which is on by default in most distros but not Godbolt. This doesn't make a lot of sense, but compilers are complex pieces of machinery, not "smart".
It's not specific to -march=haswell or AVX either; the same difference happens with just -O3.
Godbolt configures GCC with simpler options than distros, e.g. without default-pie, and without -fstack-protector-strong. To match Godbolt locally, use at least -fno-pie -no-pie -fno-stack-protector. There might be others I'm forgetting about.
IDK why this would trigger or avoid a missed-optimization, but I can confirm it does on my Arch GNU/Linux system with GCC 11.1.
Locally with gcc -O3 -march=haswell -fno-stack-protector -fno-pie (and -masm=intel -S -o- vec.c | less) it matches Godbolt:
.L3:
    vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
    vaddps ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
    vmovaps YMMWORD PTR [rbp-8240], ymm1
    vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
    add rax, 32
    cmp rax, 4096
    jne .L3
But with the distro-configured GCC defaults and only -O3 -march=haswell:
.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3
The same missed-opt happens without -march=haswell; we get a movaps XMMWORD PTR [rsp], xmm1 store to a fixed address inside the loop. (Since GCC doesn't need to over-align the stack to spill a 32-byte vector, it didn't use RBP as a frame pointer.)
For no apparent reason, using -fpie on the Godbolt compiler explorer gets GCC to use two pointer increments instead of indexed addressing modes, also avoiding the redundant store. (That produces the same asm you get locally.) -fpie forces GCC to do that for arrays in static storage, because [arr + rax] would require the symbol address as a 32-bit absolute (see "32-bit absolute addresses no longer allowed in x86-64 Linux?"), but the arrays here are locals on the stack, so that reason doesn't apply.
You can and should report this on GCC's bugzilla with the keyword "missed-optimization".