Tags: c, gcc, optimization, compiler-optimization, c99

Which gcc optimization flags should I use?


If I want to minimize the time my C programs take to run, what optimization flags should I use? (I want to keep the code standard-conforming too.)

Currently I'm using:

 -Wall -Wextra -pedantic -ansi -O3

Should I also use

-std=c99

for example?

And is there a specific order I should put those flags in my makefile? Does it make any difference?

Also, is there any reason not to use every optimization flag I can find? Do they ever counteract each other, or something like that?


Solution

  • I'd recommend compiling new code with -std=gnu11, or -std=c11 if needed. (Or, these days, a more recent version like -std=gnu23.) Enable -Wall; it's usually a good idea for style and clarity to "fix" any such warnings even if they aren't a real correctness problem, such as by adding parens to make the order of operations easier to see in cases where operator precedence can be tricky. -Wextra warns about some things you might not want to change, but it's still useful to see what it finds.
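
    As a small illustration of the kind of thing -Wall catches (this snippet and the file name example.c are just made up for illustration):

    /* Compile with: gcc -std=gnu11 -Wall -Wextra -O3 -c example.c */
    int is_even(int x) {
        return x & 1 == 0;      /* -Wall warns "suggest parentheses": == binds tighter than &,
                                   so this is x & (1 == 0), which is always 0. */
    }
    int is_even_fixed(int x) {
        return (x & 1) == 0;    /* parenthesized version: no warning, and it says what you meant */
    }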


    A good way to check how something compiles is to look at the compiler asm output. http://gcc.godbolt.org/ formats the asm output nicely (stripping out the noise). Putting some key functions up there and looking at what different compiler versions do is useful if you understand asm at all.
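
    For example, a small self-contained function like this (just a made-up example) is the kind of thing worth pasting into godbolt to compare what different optimization levels and compiler versions do with it:

    /* Does the compiler auto-vectorize this loop, and with which flags?  Check the asm. */
    void scale_add(int *restrict dst, const int *restrict src, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = 2 * src[i] + 1;
    }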


    Use a new compiler version. gcc and clang have both improved significantly in newer versions. gcc 5.3 and clang 3.8 are the current releases. gcc5 makes noticeably better code than gcc 4.9.3 in some cases.


    If you only need the binary to run on your own machine, you should use -O3 -march=native.

    If you need the binary to run on other machines, choose the baseline for instruction-set extensions with stuff like -mssse3 -mpopcnt. You can use -mtune=haswell to optimize for Haswell even while making code that still runs on older CPUs (as determined by -march).
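
    Concretely, that might look like one of these two command lines (my_program and the source files are placeholders):

    # binary only needs to run on this machine: use everything the CPU supports
    gcc -O3 -march=native *.c -o my_program
    # binary needs to run on older x86-64 CPUs too: pick a baseline of extensions, but tune for Haswell
    gcc -O3 -mssse3 -mpopcnt -mtune=haswell *.c -o my_program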


    If your program doesn't depend on strict FP rounding behaviour, use -ffast-math. If it does, you can usually still use -fno-math-errno and stuff like that, without enabling -funsafe-math-optimizations. Some FP code can get big speedups from fast-math, like auto-vectorization.
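
    A classic case (hypothetical example, with sum.c as a made-up file name): a float sum reduction. gcc won't auto-vectorize it at plain -O3 because FP addition isn't associative, but -ffast-math (or -fassociative-math) lets it reorder the additions and use SIMD:

    /* Compare: gcc -O3 -march=native -S sum.c   vs.   gcc -O3 -march=native -ffast-math -S sum.c */
    float sum(const float *a, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];          /* reassociating this loop changes rounding, so it needs fast-math */
        return s;
    }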


    If you can usefully do a test-run of your program that exercises most of the code paths that need to be optimized for a real run, then use profile-directed optimization:

    gcc  -fprofile-generate -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
    ./my_program -option1 < test_input1
    ./my_program -option2 < test_input2
    gcc  -fprofile-use      -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
    

    -fprofile-use enables -funroll-loops, since it has enough information to decide when to actually unroll. Unrolling loops all over the place can make things worse. However, it's worth trying -funroll-loops to see if it helps.
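
    Outside of PGO, trying it is just a matter of adding the flag to a normal build and benchmarking both binaries (program name is a placeholder):

    gcc -O3 -march=native                *.c -o my_program
    gcc -O3 -march=native -funroll-loops *.c -o my_program_unroll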

    If your test runs don't cover all the code paths, then some important ones will be marked as "cold" and optimized less.


    -O3 enables auto-vectorization, which -O2 doesn't. This can give big speedups.
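
    An easy local way to check whether -O3 is actually vectorizing something (without godbolt) is to compare the asm output, or ask gcc directly with -fopt-info-vec (loop.c is a made-up file name):

    gcc -O2 -S loop.c -o loop_O2.s
    gcc -O3 -S loop.c -o loop_O3.s        # look for packed SIMD instructions (e.g. paddd, addps)
    gcc -O3 -fopt-info-vec -c loop.c      # gcc reports which loops it vectorized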

    -fwhole-program allows cross-file inlining, but only works when you put all the source files on one gcc command-line. -flto is another way to get the same effect. (Link-Time Optimization). clang supports -flto but not -fwhole-program.
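
    -flto works with separate compilation, which is usually more makefile-friendly than one big command line; note that the flag has to be passed at link time as well (file names are placeholders):

    gcc -O3 -flto -c foo.c
    gcc -O3 -flto -c bar.c
    gcc -O3 -flto foo.o bar.o -o my_program   # cross-file optimization happens at link time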

    -fomit-frame-pointer has been the default for a while now for x86-64, and more recently for x86 (32bit).


    As well as gcc, try compiling your program with clang. Clang sometimes makes better code than gcc, sometimes worse. Try both and benchmark.
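
    A rough way to compare (very much a sketch; for real measurements use representative inputs and repeat the runs):

    gcc   -O3 -march=native *.c -o prog_gcc
    clang -O3 -march=native *.c -o prog_clang
    time ./prog_gcc   < test_input1
    time ./prog_clang < test_input1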