assembly gcc optimization riscv

Why is RISC-V GCC uselessly reserving stack space in a function that returns a small struct?


This C source:

typedef struct {
  unsigned long one;
  unsigned long two;
} twin;

twin function( twin t ) {
  return (twin){ 0,0 };
}

generates this assembly:

        .file   "p.c"
        .option nopic
        .attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0"
        .attribute unaligned_access, 0
        .attribute stack_align, 16
        .text
        .align  1
        .globl  function
        .type   function, @function
function:
        addi    sp,sp,-32  # <<< WHY?
        li      a0,0
        li      a1,0
        addi    sp,sp,32   # <<< WHY?
        jr      ra
        .size   function, .-function
        .ident  "GCC: (g04696df09) 14.2.0"
        .section        .note.GNU-stack,"",@progbits

when run through riscv64-unknown-elf-gcc (g04696df09) 14.2.0 with any of -O3, -O2, -O1, or even -Os.

So why does the code reserve 32 bytes of stack for values that are, and will be, kept in registers a0 and a1?

Is this a bug, or am I missing something? The SP instructions seem useless.

[UPDATE] It is a bug, indeed!


Solution

  • Yeah, looks like a missed-optimization bug, which you could report on GCC's bugzilla (https://gcc.gnu.org/bugzilla) if it's not already reported. Update: it turns out it already is (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108016); sorry, I should have mentioned checking for duplicates first. The original report had a terrible title (not at all specific about what kind of badness or what conditions cause it), so it would have been hard to find anyway.


    It does the same thing targeting Linux with RV32 GCC (Godbolt), even with -fomit-frame-pointer. The wasted stack-pointer instructions are present from GCC 8 (the earliest version on Godbolt) through trunk.

    I'm pretty confident no ABI requires it, and Clang doesn't emit them.

    # Clang -O2 or -Os for RV64, same for RV32 where unsigned long is only 32 bits
    test1:
            or      a2, a1, a0
            xor     a1, a1, a0
            mv      a0, a2
            ret
    

    One mv is unavoidable since we need to replace both a0 and a1 with values that each depend on both original inputs, so we can't overwrite either a0 or a1 with the first instruction. But it certainly doesn't need to spill anything, and it's a leaf function, so saving the return address isn't needed. And we're not using a frame pointer, so saving the caller's FP isn't needed either.

    The key ingredient for reproducing this is a struct local; it doesn't happen with int r = u1^u2; for example. So maybe GCC is failing to optimize away the stack space for the struct even though it keeps the struct entirely in registers. ret r = { 0, 0 }; still reproduces it.
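
For reference, here is a minimal sketch of the kind of test case the discussion above seems to be based on. Only the names test1, ret, u1 and u2 appear in the answer. The field names, the exact function body and the scalar variant test2 are assumptions, inferred from the Clang output (OR of the inputs in a0, XOR in a1). Compiling both functions with the same flags should make it easy to compare the struct-local case against the scalar case.

```c
/* Hypothetical reconstruction of the answer's test case.
   The field names `first`/`second` and the function bodies are assumptions;
   only `ret`, `test1`, `u1` and `u2` are named in the answer. */
typedef struct {
    unsigned long first;
    unsigned long second;
} ret;

/* Struct local: the pattern the answer identifies as the trigger.
   A two-word struct like this is returned in a0 (first) and a1 (second). */
ret test1(unsigned long u1, unsigned long u2) {
    ret r = { u1 | u2, u1 ^ u2 };
    return r;
}

/* Scalar local for comparison: per the answer, no stack adjustment
   is expected when no struct local is involved. */
unsigned long test2(unsigned long u1, unsigned long u2) {
    unsigned long r = u1 ^ u2;
    return r;
}
```

Something like riscv64-unknown-elf-gcc -O2 -S on this file, as in the question, would show whether the addi sp,sp pair appears only in test1.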