Tags: gcc, clang, simd

Is having too many SIMD instructions in a row bad?


I just played around with memcpy() on 32-byte-aligned data of various lengths, using GCC 15.1 and Clang 20.1 with -march=skylake-avx512 -mtune=skylake-avx512.

I noticed that GCC decides that if the length is greater than 512 bytes, it will actually call memcpy() instead of emitting a sequence of SIMD instructions inline, while Clang's threshold is 256 bytes.

Why don't both compilers use inline SIMD all the way? Does that mean it's heuristically bad to have too many SIMD instructions in a row? If so, what's the downside?
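A minimal version of the kind of test (function names are just illustrative) looks like this; compile with -O2 plus the flags above and inspect the asm with -S or on Compiler Explorer:

    #include <string.h>

    /* Constant-size copies of 32-byte-aligned buffers. */
    void copy256(void *restrict dst, const void *restrict src) {
        dst = __builtin_assume_aligned(dst, 32);
        src = __builtin_assume_aligned(src, 32);
        memcpy(dst, src, 256);   /* inlined as SIMD loads/stores by both compilers */
    }

    void copy1024(void *restrict dst, const void *restrict src) {
        dst = __builtin_assume_aligned(dst, 32);
        src = __builtin_assume_aligned(src, 32);
        memcpy(dst, src, 1024);  /* above both thresholds: compiles to a call to memcpy */
    }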


Solution

  • Libc's memcpy itself uses SIMD instructions. In glibc, the big-copy loop is unrolled by 8 vectors and uses NT stores (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#759), or by 4 vectors for medium-sized copies (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#481). That file is #included with macros defined for 64-byte (512-bit) ZMM vectors.
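
    As a rough sketch of that strategy (not glibc's actual code, which is hand-written asm and uses ZMM vectors on AVX-512 targets), an unrolled copy loop with NT stores looks something like this, assuming AVX2, 32-byte-aligned pointers, and a size that's a multiple of 128 bytes:

        #include <immintrin.h>
        #include <stddef.h>

        /* Simplified illustration of an unrolled-by-4 copy loop with
           non-temporal (streaming) stores, in the spirit of glibc's
           large-copy path. Assumes aligned dst/src and n % 128 == 0. */
        static void big_copy_nt(char *dst, const char *src, size_t n) {
            for (size_t i = 0; i < n; i += 128) {
                __m256i v0 = _mm256_load_si256((const __m256i *)(src + i));
                __m256i v1 = _mm256_load_si256((const __m256i *)(src + i + 32));
                __m256i v2 = _mm256_load_si256((const __m256i *)(src + i + 64));
                __m256i v3 = _mm256_load_si256((const __m256i *)(src + i + 96));
                _mm256_stream_si256((__m256i *)(dst + i),      v0);  /* NT store bypasses cache */
                _mm256_stream_si256((__m256i *)(dst + i + 32), v1);
                _mm256_stream_si256((__m256i *)(dst + i + 64), v2);
                _mm256_stream_si256((__m256i *)(dst + i + 96), v3);
            }
            _mm_sfence();  /* NT stores are weakly ordered; fence before the data is used */
        }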

    The upside of inlining is avoiding the call/ret overhead, which is small relative to the total cost of a large copy. It also avoids branching to sort out the copy size when it's a compile-time constant, setting up the args, and clobbering call-clobbered registers.

    A long run of SIMD instructions isn't any worse than an equally large block of other kinds of instructions.

    One downside of inlining is I-cache pressure and larger binaries.
    Another downside is that inlined code can't take advantage of future ISA extensions: e.g. if rep movsb becomes even faster than AVX-512 instructions on some future CPU, or if a fast alignment-required version of it is introduced, binaries with inlined memcpy will still be using 512-bit vectors. That's fine for small copies, where avoiding call/ret overhead still wins.
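
    For comparison, the rep movsb alternative is a single instruction whose microcoded implementation CPU vendors can keep improving across generations; a minimal GNU C sketch (x86-64 only):

        #include <stddef.h>

        /* memcpy via rep movsb: dst in RDI, src in RSI, count in RCX.
           A binary using this keeps benefiting from faster microcode
           on future CPUs, unlike inlined fixed-width SIMD. */
        static void copy_rep_movsb(void *dst, const void *src, size_t n) {
            __asm__ volatile("rep movsb"
                             : "+D"(dst), "+S"(src), "+c"(n)
                             :
                             : "memory");
        }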

    Another downside of inlining is that it leaves no room for tuning choices at dynamic-link time. That's not a factor with -mtune=native running on the build host, but when building a binary to run on multiple systems, having libc's resolver function choose a memcpy implementation based on the CPU model lets it pick appropriate thresholds for using NT stores, and decide whether to use only 256-bit vectors even when AVX-512 is supported (e.g. on CPUs that pay a significant frequency penalty for executing any 512-bit vector instruction).
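
    That per-CPU dispatch is the GNU ifunc mechanism. A toy sketch, with hypothetical implementation names (glibc's real resolvers also pick NT-store thresholds and vector widths):

        #include <stddef.h>
        #include <string.h>

        typedef void *(*memcpy_fn)(void *, const void *, size_t);

        /* Hypothetical implementations; real ones would differ in
           vector width, unrolling, and NT-store thresholds. */
        static void *my_memcpy_avx512(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
        static void *my_memcpy_avx2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

        /* Resolver runs once at dynamic-link time and returns the
           implementation to bind my_memcpy to (ELF/glibc targets). */
        static memcpy_fn resolve_my_memcpy(void) {
            __builtin_cpu_init();
            if (__builtin_cpu_supports("avx512f"))
                return my_memcpy_avx512;
            return my_memcpy_avx2;
        }

        void *my_memcpy(void *d, const void *s, size_t n)
            __attribute__((ifunc("resolve_my_memcpy")));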