This C source:
typedef struct {
    unsigned long one;
    unsigned long two;
} twin;
twin function( twin t ) {
    return (twin){ 0,0 };
}
generates this assembly:
    .file "p.c"
    .option nopic
    .attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0"
    .attribute unaligned_access, 0
    .attribute stack_align, 16
    .text
    .align 1
    .globl function
    .type function, @function
function:
    addi sp,sp,-32 # <<< WHY?
    li a0,0
    li a1,0
    addi sp,sp,32 # <<< WHY?
    jr ra
    .size function, .-function
    .ident "GCC: (g04696df09) 14.2.0"
    .section .note.GNU-stack,"",@progbits
when run through riscv64-unknown-elf-gcc (g04696df09) 14.2.0 with either -O3, -O2, -O1, or even -Os.
So why is the code creating room on the stack (32 bytes) for stuff that is and will be kept in registers a0 and a1?
Is this a bug, or am I missing something? The SP instructions seem useless.
[UPDATE] It is a bug, indeed!
Yeah, looks like a missed-optimization bug, which you could report on GCC's bugzilla (https://gcc.gnu.org/bugzilla) if it's not already reported. Update: turns out it is; sorry, I should have mentioned checking for duplicates first (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016). The original had a terrible title (not specific at all about what kind of badness or what conditions cause it), so it would have been hard to find anyway.
It does the same thing targeting Linux with RV32 GCC (Godbolt), even with -fomit-frame-pointer. The wasted stack-pointer instructions are present with GCC8 (the earliest on Godbolt) through trunk.
I'm pretty confident no ABI requires it, and Clang doesn't emit them.
# Clang -O2 or -Os for RV64, same for RV32 where unsigned long is only 32 bits
test1:
    or a2, a1, a0
    xor a1, a1, a0
    mv a0, a2
    ret
One mv is unavoidable since we need to replace both a0 and a1 with values that each depend on both original inputs, so we can't overwrite either a0 or a1 with the first instruction. But it certainly doesn't need to spill anything, and it's a leaf function so saving the return address isn't needed. And we're not using a frame pointer, so saving the caller's FP isn't needed either.
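To make the point about the mv concrete, here is a small C model of the three Clang instructions, one statement per instruction. The regs type and both function names are made up for illustration; the fields just stand in for the a0 and a1 registers, and the local a2 for the scratch register.

```c
/* Hypothetical model: fields stand in for the RISC-V registers a0 and a1. */
typedef struct {
    unsigned long a0;
    unsigned long a1;
} regs;

/* Mirrors the Clang sequence, one C statement per instruction. */
regs shuffle(regs r) {
    unsigned long a2 = r.a1 | r.a0;  /* or  a2, a1, a0 : park the result in a scratch register */
    r.a1 = r.a1 ^ r.a0;              /* xor a1, a1, a0 : still needs the original a0 */
    r.a0 = a2;                       /* mv  a0, a2     : now a0 can be replaced */
    return r;
}

/* What goes wrong if the first instruction writes straight into a0. */
regs shuffle_clobbering(regs r) {
    r.a0 = r.a1 | r.a0;  /* original a0 is lost here */
    r.a1 = r.a1 ^ r.a0;  /* xor sees (a1 | a0) instead of the original a0 */
    return r;
}
```

With a0 = 0xc and a1 = 0xa going in, shuffle leaves a0 = 0xe and a1 = 0x6, while shuffle_clobbering leaves a1 = 0x4 because the XOR reads the already-overwritten a0. Neither version touches memory, which is the point: a scratch register is enough, no stack space is needed.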
The key ingredient for reproducing this is a struct local; it doesn't happen with int r = u1^u2; for example. So maybe GCC is failing to optimize away the stack space for this struct even though it optimizes the struct itself into registers. ret r = { 0, 0 }; still reproduces it.
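For anyone who wants to try this themselves, here is a guess at the shape of the two cases being compared. The names ret, u1, u2 and test1 appear in the snippets and Clang output above; the member names, the parameter types, the test_int name, and the exact function bodies are assumptions reconstructed from the or/xor in the Clang output, not the original test source.

```c
/* Assumed layout: two register-sized members, like twin in the question. */
typedef struct {
    unsigned long one;
    unsigned long two;
} ret;

/* Struct local: the case that reportedly gets the useless sp adjustment from GCC. */
ret test1(unsigned long u1, unsigned long u2) {
    ret r = { u1 | u2, u1 ^ u2 };
    return r;
}

/* Scalar local (hypothetical name): reportedly compiles without touching sp. */
int test_int(unsigned long u1, unsigned long u2) {
    int r = u1 ^ u2;
    return r;
}
```

Compiling this with the question's toolchain at -O2 with -S and looking for an addi sp,sp,-N / addi sp,sp,N pair in test1 but not in test_int should show the difference, assuming the compiler version is affected.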