c++ gcc arm compiler-optimization code-size

Which GCC optimization flags affect binary size the most?


I am developing a C++ application for ARM using GCC. I have run into an issue where, if no optimizations are enabled, I am unable to create a binary (ELF) for my code because it will not fit in the available space. However, if I simply enable optimization for debugging (-Og), which is the lowest optimization level available to my knowledge, the code easily fits.

In both cases, -ffunction-sections, -fdata-sections, -fno-exceptions, and -Wl,--gc-sections are enabled.

This is a huge difference in binary size even with minimal optimizations.

I took a look at 3.11 Options That Control Optimization for details on what optimizations are performed with the -Og flag, to see if that would give me any insight.

What optimization flags affect binary size the most? Is there anything I should be looking for to explain this massive difference?


Solution

  • Most of the extra code-size for an un-optimized build comes from the fact that the default -O0 also means a debug build: nothing is kept in registers across statements, for consistent debugging even if you use a GDB j command to jump to a different source line in the same function. -O0 means a huge amount of store/reload vs. even the lightest level of optimization, which is especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? applies to GCC equally.
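
    You can see the pattern even with a trivial scalar function; the snippet below is just a sketch to illustrate the statement-by-statement store/reload, not code from the question:

    // At -O0, GCC typically spills the incoming args to stack slots and
    // reloads them around every statement, so a debugger can inspect or even
    // modify each variable between source lines.  At -Og or -O1 the whole
    // function usually compiles to a couple of instructions with everything
    // kept in registers.
    int sum3(int a, int b, int c) {
        int sum = a + b;   // -O0: load a and b from the stack, add, store sum
        sum += c;          // -O0: reload sum and c, add, store sum again
        return sum;        // -O0: reload sum into the return-value register
    }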

    Especially for modern C++, a debug build is disastrous because simple template wrapper functions that normally inline and optimize away to nothing in simple cases (or maybe one instruction) instead compile to actual function calls that have to set up args and run a call instruction. e.g. for a std::vector, the operator[] member function can normally inline to a single ldr instruction, assuming the compiler has the .data() pointer in a register. But without inlining, every call-site takes multiple instructions1.


    Options that affect code-size in the actual .text section2 the most: GCC's -Os optimizes for size rather than purely for speed, enabling most of the -O2 optimizations minus ones that tend to increase code size. Alignment of branch-targets in general, or just loops, also costs some code-size.

    Clang has similar options, including -Os. It also has a clang -Oz option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like push 1; pop rax (3 bytes total) instead of mov eax, 1 (5 bytes).

    GCC's -Os unfortunately chooses to use div instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size. (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC -Os still uses an inverse if you don't use a -mcpu= that implies udiv is even available, otherwise it uses udiv: https://godbolt.org/z/f4sa9Wqcj .

    Clang's -Os still uses a multiplicative inverse with umull, only using udiv with -Oz. (or a call to __aeabi_uidiv helper function without any -mcpu option). So in that respect, clang -Os makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.
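
    For reference, the kind of function those Godbolt links compare is just an unsigned division by a compile-time constant, something like the sketch below (the divisor of 10 is picked for illustration; the exact constant in the linked snippets may differ):

    // With optimization enabled, compilers can turn this into a multiply by a
    // fixed-point "magic" inverse (umull) plus a shift, instead of a udiv
    // instruction or a call to the __aeabi_uidiv helper function.  Which way
    // -Os leans is the GCC-vs-clang tradeoff described above.
    unsigned div_by_10(unsigned x) {
        return x / 10;
    }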


    Footnote 1: inlining or not for std::vector

    #include <vector>
    int foo(std::vector<int> &v) {
        return v[0] + v[1];
    }
    

    Godbolt with GCC at the default -O0 vs. -Os, with -mcpu=cortex-m7 just to randomly pick something. IDK if it's normal to use dynamic containers like std::vector on an actual microcontroller; probably not.

    # -Os (same as -Og for this case, actually, omitting the frame pointer for this leaf function)
    foo(std::vector<int, std::allocator<int> >&):
            ldr     r3, [r0]                @ load the _M_start member of the reference arg
            ldrd    r0, r3, [r3]            @ load a pair of words (v[0..1]) from there into r0 and r3
            add     r0, r0, r3              @ add them into the return-value register
            bx      lr
    

    vs. a debug build (with name-demangling enabled for the asm)

    # GCC -O0 -mcpu=cortex-m7 -mthumb
    foo(std::vector<int, std::allocator<int> >&):
            push    {r4, r7, lr}             @ non-leaf function requires saving LR (the return address) as well as some call-preserved registers
            sub     sp, sp, #12
            add     r7, sp, #0              @ Use r7 as a frame pointer.  -O0 defaults to -fno-omit-frame-pointer
            str     r0, [r7, #4]            @ spill the incoming register arg to the stack
    
    
            movs    r1, #0                  @ 2nd arg for operator[]
            ldr     r0, [r7, #4]            @ reload the pointer to the control block as the first arg
            bl      std::vector<int, std::allocator<int> >::operator[](unsigned int)
            mov     r3, r0                  @ useless copy, but hey we told GCC not to spend any time optimizing.
            ldr     r4, [r3]                @ deref the reference (pointer) it returned, into a call-preserved register that will survive across the next call
    
    
            movs    r1, #1                  @ arg for the v[1]  operator[]
            ldr     r0, [r7, #4]
            bl      std::vector<int, std::allocator<int> >::operator[](unsigned int)
            mov     r3, r0
            ldr     r3, [r3]                @ deref the returned reference
    
            add     r3, r3, r4              @ v[1] + v[0]
            mov     r0, r3                  @ and copy into the return value reg because GCC didn't bother to add into it directly
    
            adds    r7, r7, #12             @ tear down the stack frame
            mov     sp, r7
            pop     {r4, r7, pc}            @ and return by popping saved-LR into PC
    
    @ and there's an actual implementation of the operator[] function
    @ it's 15 instructions long.  
    @ But only one instance of this is needed for each type your program uses (vector<int>, vector<char*>, vector<my_foo>, etc.)
    @ so it doesn't add up as much as each call-site
    std::vector<int, std::allocator<int> >::operator[](unsigned int):
            push    {r7}
            sub     sp, sp, #12
      ...
    

    As you can see, un-optimized GCC cares more about fast compile times than about even the simplest things, like avoiding useless mov reg,reg instructions within the code for evaluating a single expression.


    Footnote 2: metadata

    If you count a whole ELF executable with metadata, not just the .text + .rodata + .data you'd need to burn to flash, then of course -g debug info is very significant for the size of the file. But it's basically irrelevant because it's not mixed in with the parts that are needed while running; it just sits there on disk.

    Symbol names and debug info can be stripped with gcc -s or strip.

    Stack-unwind info is an interesting tradeoff between code-size and metadata. -fno-omit-frame-pointer wastes extra instructions and a register as a frame pointer, leading to larger machine-code size, but smaller .eh_frame stack-unwind metadata. (strip does not consider that "debug" info by default, even for C programs, which, unlike C++ with its exception handling, don't need it in non-debugging contexts.)

    How to remove "noise" from GCC/clang assembly output? mentions how to get the compiler to omit some of that: -fno-asynchronous-unwind-tables omits .cfi directives in the asm output, and thus the metadata that goes into the .eh_frame section. Also -fno-exceptions -fno-rtti with C++ can reduce metadata. (Run-Time Type Information for dynamic_cast and typeid takes space.)
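
    For a sense of what -fno-rtti actually removes, here is a minimal sketch of the only kind of code that needs RTTI; if your program never does a dynamic_cast or typeid on polymorphic types, the flag is safe to use:

    #include <typeinfo>

    struct Base    { virtual ~Base() = default; };
    struct Derived : Base { int value = 0; };

    // dynamic_cast and typeid on polymorphic types are what RTTI exists for:
    // they rely on per-class std::type_info objects (and their mangled
    // type-name strings), which the compiler generally emits alongside each
    // vtable whether or not anything uses them.  -fno-rtti omits that
    // metadata and makes this function a compile error.
    int uses_rtti(Base &b) {
        if (typeid(b) == typeid(Derived))               // run-time type comparison
            return dynamic_cast<Derived &>(b).value;    // checked downcast
        return 0;
    }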

    Linker options that control alignment of sections / ELF segments can also take extra space, which is relevant for tiny executables, but it's basically a constant amount of space, not scaling with the size of the program. See also Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?