c++ profiling avx perf avx2

perf report shows this function "__memset_avx2_unaligned_erms" has overhead. does this mean memory is unaligned?


I am trying to profile my C++ code using the perf tool. The implementation contains SSE/AVX/AVX2 instructions, and the code is compiled with the -O3 -mavx2 -march=native flags. I believe __memset_avx2_unaligned_erms is a libc implementation of memset, and perf shows that this function has considerable overhead. The function name indicates that memory is unaligned, but in the code I am explicitly aligning the memory using the GCC built-in attribute __attribute__((aligned(x))). What might be the reason for this function having significant overhead, and why is the unaligned version called even though the memory is explicitly aligned?

I have attached the sample perf report as a picture.


Solution

  • No, it doesn't. It means the memset strategy chosen by glibc on that hardware is one that doesn't try to avoid unaligned accesses entirely, in the small-size cases. (glibc selects a memset implementation at dynamic linker symbol resolution time, so it gets runtime dispatching with no extra overhead after the first call.)
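
    glibc does this with the GNU IFUNC mechanism; as a rough sketch of the same idea, here is a function-pointer version that emulates the one-time CPU-feature dispatch (all names here are illustrative stand-ins, not glibc's internals):

```cpp
#include <cstring>

// Two hypothetical implementations, standing in for the real
// memset variants glibc chooses between.
static const char* memset_avx2_variant() { return "avx2_unaligned_erms"; }
static const char* memset_sse2_variant() { return "sse2"; }

// Resolver: glibc's real resolver runs once, when the dynamic linker
// binds the memset symbol. Here we emulate that with a function pointer
// initialized once at startup using GCC/Clang CPU-detection builtins.
static const char* (*resolve_memset())() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? memset_avx2_variant
                                          : memset_sse2_variant;
}

// Bound once; every later call goes straight through the pointer,
// with no per-call feature check.
static const char* (*chosen_memset)() = resolve_memset();
```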

    If your buffer is in fact aligned and the size is a multiple of the vector width, all the accesses will be aligned and there's essentially no overhead. (Using vmovdqu with a pointer that happens to be aligned at runtime is exactly equivalent to vmovdqa on all CPUs that support AVX.)

    For large buffers, it still aligns the pointer before the main loop in case it isn't aligned, at the cost of a couple extra instructions vs. an implementation that only worked for 32-byte aligned pointers. (But it looks like it uses rep stosb without aligning the pointer, if it's going to rep stosb at all.)

    gcc+glibc doesn't have a special version of memset that's only called with aligned pointers. (Or multiple special versions for different alignment guarantees). GLIBC's AVX2-unaligned implementation works nicely for both aligned and unaligned inputs.


    It's defined in glibc/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S, which defines a couple macros (like defining the vector size as 32) and then #includes "memset-vec-unaligned-erms.S".

    The comment in the source code says:

    /* memset is implemented as:
       1. Use overlapping store to avoid branch.
       2. If size is less than VEC, use integer register stores.
       3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
       4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
       5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
          4 VEC stores and store 4 * VEC at a time until done.  */
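
    The "overlapping store" trick in step 1 can be sketched in C++, scaled down to 8-byte integer stores so it builds without AVX (the function name and size bucket are illustrative): for any 8 <= n <= 16, one store at the start plus one at the end covers the whole range, with no branching on the exact length inside that bucket.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy memset for the 8..16-byte bucket, mirroring glibc's overlapping-
// store strategy with 8-byte stores instead of 32-byte vectors.
static void toy_memset_8_16(unsigned char* p, unsigned char c, std::size_t n) {
    std::uint64_t v = 0x0101010101010101ULL * c;  // broadcast byte to all 8 lanes
    std::memcpy(p, &v, 8);                        // bytes [0, 8)
    std::memcpy(p + n - 8, &v, 8);                // bytes [n-8, n), may overlap
}
```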
    

    The actual alignment before the main loop is done after some vmovdqu vector stores (which have no penalty if used on data that is in fact aligned: https://agner.org/optimize/):

    L(loop_start):
        leaq        (VEC_SIZE * 4)(%rdi), %rcx   # rcx = input pointer + 4*VEC_SIZE
        VMOVU        %VEC(0), (%rdi)            # store the first vector
        andq        $-(VEC_SIZE * 4), %rcx      # align the pointer
        ...  some more vector stores
        ...  and stuff, including storing the last few vectors I think
        addq        %rdi, %rdx                  # size += start, giving an end-pointer
        andq        $-(VEC_SIZE * 4), %rdx      # align the end-pointer
    
    L(loop):                                       # THE MAIN LOOP
        VMOVA        %VEC(0), (%rcx)               # vmovdqa = alignment required
        VMOVA        %VEC(0), VEC_SIZE(%rcx)
        VMOVA        %VEC(0), (VEC_SIZE * 2)(%rcx)
        VMOVA        %VEC(0), (VEC_SIZE * 3)(%rcx)
        addq        $(VEC_SIZE * 4), %rcx
        cmpq        %rcx, %rdx
        jne        L(loop)
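
    In C++ terms, the leaq + andq pair computes the aligned loop-start address like this (a sketch with VEC_SIZE = 32, as in the AVX2 build):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t VEC_SIZE = 32;

// Mirrors `leaq (VEC_SIZE * 4)(%rdi), %rcx` followed by
// `andq $-(VEC_SIZE * 4), %rcx`: step past the first stores, then clear
// the low bits. -(4*VEC_SIZE) in two's complement is the same mask as
// ~(4*VEC_SIZE - 1), so this rounds the address down to a 128-byte
// boundary.
constexpr std::uintptr_t align_loop_start(std::uintptr_t p) {
    return (p + VEC_SIZE * 4) & ~(VEC_SIZE * 4 - 1);
}
```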
    

    So with VEC_SIZE = 32, it aligns the pointer by 128. This is overkill; cache lines are 64 bytes, and really just aligning to the vector width should be fine.

    It also has a threshold for using rep stosb, if enabled, when the buffer size is larger than 2kiB, on CPUs with ERMSB (Enhanced REP MOVSB/STOSB).
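
    As a hedged sketch of that rep stosb path: on x86-64, `rep stosb` fills rcx bytes at rdi with the byte in al, and ERMSB-capable CPUs make it fast via microcode. The threshold logic itself lives in glibc and is tuned per CPU; this only shows the instruction's use (GCC/Clang inline asm, with a portable fallback).

```cpp
#include <cstddef>
#include <cstring>

// Fill n bytes at dst with byte c using rep stosb on x86, like glibc's
// ERMSB path for large buffers. Function name is illustrative.
static void fill_rep_stosb(void* dst, int c, std::size_t n) {
#if defined(__x86_64__) || defined(__i386__)
    asm volatile("rep stosb"
                 : "+D"(dst), "+c"(n)   // rdi/edi = dest, rcx/ecx = count
                 : "a"(c)               // al = fill byte
                 : "memory");
#else
    std::memset(dst, c, n);             // portable fallback
#endif
}
```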