
Is gcc's -march=native a convenience option? Is there magic you can't get manually?


When using gcc's -march=native option it sets a number of flags/options, but could this be replicated by setting everything manually, or are there things which are set that are not exposed to the user when doing so manually?

By replicated I mean could you set everything yourself without native and produce the same binary, given of course you are allowed to specify the microarchitecture with -march.


Solution

  • According to gcc -v -march=native foo.c, the options passed to the actual C to asm compiler (cc1) don't include "native": it gets a concrete arch/tune, explicit ISA-extension options, and some cache-size parameters. So yes, you could pass the same options to a cross-compiler and not be missing out on anything. I don't know how much effect those cache-size options have; perhaps they feed loop thresholds for using NT stores, if GCC ever does that for some targets.

    The cc1 command line also has an explicit -mabc or -mno-xyz option for every ISA extension GCC knows about. That lets it optimize correctly e.g. for a VM where CPUID doesn't expose some features, for Pentium/Celeron CPUs before Ice Lake which lack the AVX/FMA/BMI extensions, or for Ice Lake Pentium / Celeron which lack AVX-512, unlike what -march=icelake-client assumes.

    I've manually line-wrapped the terminal output from x86-64 GCC on Arch GNU/Linux:

    $ gcc -v -march=native spawn.c
    ...
    gcc version 13.2.1 20230801 (GCC) 
    ...
    COLLECT_GCC_OPTIONS='-v' '-march=native' '-dumpdir' 'a-'
     /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/cc1 -quiet -v spawn.c -march=skylake  \
    -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a \
    -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl \
    -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi \
    -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2\
    -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 \
    -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb \
    -mno-clzero -mcx16 -mno-enqcmd  -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt\
    -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 \
    -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -msgx -mno-sha \
    -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec \
    -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset \
    -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 \
    -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint \
    -mno-amx-complex \
    \
    --param l1-cache-size=32 --param l1-cache-line-size=64 \
    --param l2-cache-size=8192 -mtune=skylake \
    -quiet -dumpdir a- -dumpbase spawn.c -dumpbase-ext .c -version -o /tmp/ccj4afQ8.s
    ...
     as -v --64 -o /tmp/ccb6usSK.o /tmp/ccj4afQ8.s
    ...
    

    So "native" is still there in the environment, but I don't expect cc1 pulls it out. All the -m options between -mmmx and -mno-amx-complex are Intel and AMD ISA extensions my i7-6700k CPU does/doesn't have. So the -march=skylake passed to cc1 is probably redundant; everything it sets is overridden by -m feature options and -mtune=skylake.
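To reuse those options with a cross-compiler, one approach is to pull the target-related flags out of the cc1 line that gcc -v prints. The following is a minimal sketch, not part of GCC: extract_target_flags is a hypothetical helper name, the sample line is shortened from the output above, and /tmp/out.s is a made-up path. It assumes that everything starting with -m, plus each --param pair, is what you want to keep.

```python
import shlex

def extract_target_flags(cc1_line):
    """Return the -m... and --param options found on one cc1 command line."""
    # Undo manual backslash line-wrapping before tokenizing.
    tokens = shlex.split(cc1_line.replace("\\\n", " "))
    flags = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "--param" and i + 1 < len(tokens):
            # "--param name=value" is two tokens; keep them together.
            flags += [tok, tokens[i + 1]]
            i += 2
            continue
        if tok.startswith("-m") or tok.startswith("--param="):
            flags.append(tok)
        i += 1
    return flags

# Shortened sample; a real line has many more -m / -mno- options.
sample = ("/usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/cc1 -quiet -v spawn.c "
          "-march=skylake -mmmx -mno-avx512f --param l1-cache-size=32 "
          "-mtune=skylake -quiet -dumpdir a- -o /tmp/out.s")
print(" ".join(extract_target_flags(sample)))
```

The printed list can then be pasted onto the command line of another GCC build for the same target. Driver-only options such as -quiet, -dumpdir and -o are deliberately dropped.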

    In the last couple lines of options with --param, 8192 KiB is actually the size of the L3 last-level cache on my i7-6700k (4c8t). When this part of GCC was designed, the last-level shared cache typically was L2, but three levels of cache are common these days (with two levels of per-core private cache, so cache-blocking for L2 size could be reasonable in some cases.) So anyway, presumably nobody bothered to rename the option to llc-size, and the way it's used for tuning heuristics by cc1 / cc1plus works if the gcc "driver" front-end just passes the last-level cache size.

    Skylake has 32 KiB L1d and 32 KiB L1i caches. Line size is 64 bytes in all levels. (L2 has a "spatial prefetcher" that likes to complete an aligned pair of lines, so there's a weak behaviour a bit like a 128-byte line there).

    Presumably -mtune=skylake has unrolling and inlining heuristics appropriate for the known I-cache / uop-cache sizes, and --param l1-cache-size=32 is based on L1d. The two L1 sizes differ on some CPUs: Ice Lake and Alder Lake P-cores have 48K L1d / 32K L1i, and Zen 1 had 32K L1d / 64K L1i.
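As an illustration of how the three cache numbers map onto the options seen above, here is a small sketch that builds the --param arguments by hand. cache_params is a hypothetical helper; the mapping (L1d size, line size, and last-level size passed as l2-cache-size) only mirrors what was observed in this one gcc -v output, not GCC's actual driver logic.

```python
def cache_params(l1d_kib, line_bytes, last_level_kib):
    """Build the cache-related --param options for a cc1/gcc command line."""
    return [
        "--param", f"l1-cache-size={l1d_kib}",         # L1 data cache, KiB
        "--param", f"l1-cache-line-size={line_bytes}",  # line size, bytes
        # Named "l2", but the driver passed the last-level (L3) size here.
        "--param", f"l2-cache-size={last_level_kib}",
    ]

# Values for the i7-6700k discussed above.
print(" ".join(cache_params(32, 64, 8192)))
```

For a cross-compile you would fill in the numbers for the target machine rather than the build host.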


    LTO is similar, I think

    I only see --param l1-cache-size=32 and other cache options getting passed to cc1 (the C to GIMPLE + asm compiler), not to lto1 (GIMPLE to asm re-optimizer). I suspect they're not very important, and IDK if anything in modern GCC still depends on them, at least for x86.

    More options get passed via the environment in COLLECT_GCC_OPTIONS=, which still includes the original command-line args at the end. So -march=native is in there, after the -march=skylake -mabc -mno-xyz -mtune=skylake options. But the command line of lto1 itself doesn't include that; it stops after -mtune=skylake. So I think the actual LTO optimization pass of -flto is still fully controlled by its command line, not by the machine it's running on.
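The check described here can be automated on saved gcc -v output. This is a rough sketch under assumptions: backend_invocations is a hypothetical helper, it only recognizes lines whose first word is a program named cc1, cc1plus or lto1, and the lto1 line in the sample is invented for illustration (the output quoted above does not show one).

```python
import os
import shlex

BACKENDS = {"cc1", "cc1plus", "lto1"}

def backend_invocations(verbose_output):
    """List (program, has_march_native) for each backend command line."""
    found = []
    for line in verbose_output.splitlines():
        try:
            tokens = shlex.split(line)
        except ValueError:
            continue  # skip lines with unbalanced quotes
        if tokens and os.path.basename(tokens[0]) in BACKENDS:
            name = os.path.basename(tokens[0])
            found.append((name, "-march=native" in tokens))
    return found

# The lto1 line below is made up for illustration.
sample = """COLLECT_GCC_OPTIONS='-v' '-march=native' '-dumpdir' 'a-'
 /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/cc1 -quiet -v spawn.c -march=skylake -mtune=skylake
 /usr/lib/gcc/x86_64-pc-linux-gnu/13.2.1/lto1 -quiet -march=skylake -mtune=skylake
"""
for name, has_native in backend_invocations(sample):
    print(name, "gets -march=native" if has_native else "gets only expanded options")
```

The COLLECT_GCC_OPTIONS line is skipped because its first word is not a backend program, which matches the distinction drawn above between the environment and the actual command line.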