I've been playing around with `memcpy()` on 32-byte aligned data of various widths, using GCC 15.1 and Clang 20.1 with `-march=skylake-avx512 -mtune=skylake-avx512`. I noticed that GCC emits an actual call to `memcpy()` instead of multiple SIMD instructions once the length exceeds 512 bytes, while Clang's threshold is 256 bytes.

Why don't both compilers use SIMD all the way? Is it heuristically bad to have too many SIMD instructions in a row? If so, what's the downside?
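For reference, a minimal sketch of the kind of code I'm testing (buffer size and function names are just for illustration):

```c
#include <string.h>

/* Compiled with: gcc -O3 -march=skylake-avx512 -mtune=skylake-avx512 */
typedef struct { _Alignas(32) unsigned char buf[1024]; } Block;

void copy_512(Block *dst, const Block *src) {
    memcpy(dst->buf, src->buf, 512);   /* GCC 15.1: inlined SIMD loads/stores */
}

void copy_513(Block *dst, const Block *src) {
    memcpy(dst->buf, src->buf, 513);   /* GCC 15.1: emits a call to memcpy */
}
```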
Libc's `memcpy` itself uses SIMD instructions. In glibc, the big-copy loop is unrolled by 8 vectors and uses NT stores (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#759), or by 4 vectors for medium-sized copies (https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#481). That file is `#include`d with macros defined for 64-byte (512-bit) ZMM vectors.
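As a rough C sketch of what that big-copy strategy looks like (glibc's real code is hand-written asm; this is unrolled by 4 rather than 8 for brevity):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch only: assumes dst is 64-byte aligned and len is a multiple
 * of 256; the real code peels off the unaligned head and tail. */
static void big_copy_nt(char *dst, const char *src, size_t len) {
    for (size_t i = 0; i < len; i += 256) {
        __m512i v0 = _mm512_loadu_si512(src + i);
        __m512i v1 = _mm512_loadu_si512(src + i + 64);
        __m512i v2 = _mm512_loadu_si512(src + i + 128);
        __m512i v3 = _mm512_loadu_si512(src + i + 192);
        _mm512_stream_si512((void *)(dst + i),       v0);  /* NT store: bypasses cache */
        _mm512_stream_si512((void *)(dst + i + 64),  v1);
        _mm512_stream_si512((void *)(dst + i + 128), v2);
        _mm512_stream_si512((void *)(dst + i + 192), v3);
    }
    _mm_sfence();   /* NT stores are weakly ordered; fence before returning */
}
```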
The upside of inlining is avoiding the call/ret overhead, which is small relative to the total cost of a large copy. It also avoids branching to sort out the copy size when it's a compile-time constant, setting up the args, and clobbering call-clobbered registers.
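As a concrete illustration (my example, not from the question): a small constant-size copy inlines down to just the data movement.

```c
#include <string.h>

/* Compilers typically inline this as a couple of vector loads/stores
 * (e.g. two vmovups ymm pairs, or one zmm pair depending on tuning),
 * with no call, no size check, and no extra register clobbers. */
void copy64(void *dst, const void *src) {
    memcpy(dst, src, 64);
}
```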
A long run of SIMD instructions isn't any worse than a big block of any other kind of instructions. The downsides of inlining are I-cache pressure and larger binaries.
Another downside is that inlined code can't take advantage of future ISA extensions: e.g. if `rep movsb` becomes even faster than AVX-512 instructions on some future CPU, or if an alignment-required version of it that goes fast is introduced, binaries with inlined `memcpy` will still be using 512-bit vectors. That's fine for small copies, where avoiding call/ret overhead still wins.
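For instance, here's a sketch of an ERMS-style kernel (illustrative, not glibc's actual routine) that a dispatched libc `memcpy` could switch to on such a CPU, while inlined vector copies in existing binaries could not:

```c
#include <stddef.h>

/* rep movsb copies RCX bytes from [RSI] to [RDI]; on CPUs with
 * ERMS/FSRM it can beat a software vector loop for some sizes. */
static void *memcpy_erms(void *dst, const void *src, size_t len) {
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(len)
                      :: "memory");
    return ret;
}
```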
Another downside of inlining is that it leaves no room for tuning choices at dynamic-link time. That's not a factor with `-mtune=native` code running on the build host, but when building a binary to run on multiple systems, having libc's resolver function choose a `memcpy` implementation based on the CPU model lets it pick appropriate thresholds for using NT stores, and decide whether to use only 256-bit vectors even when AVX-512 is supported (e.g. on CPUs that pay a significant frequency penalty for using 512-bit vectors at all).
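A sketch of that mechanism using GCC's `ifunc` attribute (the variant names and the dispatch condition are made up here; glibc's real resolver checks many more feature bits and tuning knobs):

```c
#include <stddef.h>
#include <string.h>

/* Stand-in bodies: real variants would be separately tuned copy loops. */
static void *memcpy_avx2_impl(void *d, const void *s, size_t n) {
    return memcpy(d, s, n);
}
static void *memcpy_avx512_impl(void *d, const void *s, size_t n) {
    return memcpy(d, s, n);
}

/* The resolver runs once, at dynamic-link time; its return value is
 * patched into the GOT entry for my_memcpy. */
static void *(*resolve_memcpy(void))(void *, const void *, size_t) {
    __builtin_cpu_init();
    /* A real resolver might avoid the 512-bit path on CPUs with a
     * big frequency penalty, even though avx512f is supported. */
    if (__builtin_cpu_supports("avx512f"))
        return memcpy_avx512_impl;
    return memcpy_avx2_impl;
}

void *my_memcpy(void *dst, const void *src, size_t len)
    __attribute__((ifunc("resolve_memcpy")));
```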