Tags: c, assembly, optimization, x86-64, memcmp

What prevents the compiler from optimizing a hand-written memcmp()?


Given:

#include <stdbool.h>
#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}

The compiler can optimize it into:

test_data:
    cmpl    $1684234849, (%rdi)
    sete    %al
    ret

which is nice.

But if I use my own memcmp() (not from <string.h>), the compiler can't optimize it into a single cmpl instruction. Instead, it does this:

#include <stdbool.h>
#include <stddef.h>

static int memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;

    for (i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }

    return 0;
}

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
test_data:
    cmpb    $97, (%rdi)
    jne     .L5
    cmpb    $98, 1(%rdi)
    jne     .L5
    cmpb    $99, 2(%rdi)
    jne     .L5
    cmpb    $100, 3(%rdi)
    sete    %al
    ret
.L5:
    xorl    %eax, %eax
    ret

Link: https://godbolt.org/z/Kfhchr45a


Solution

  • Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip count can't be computed before the first iteration (in the abstract machine), so GCC and Clang won't even try to use pcmpeqb/pmovmskb, which is the efficient way to memcmp large inputs on x86-64.

    Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization).
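    For reference, the pcmpeqb/pmovmskb pattern looks like this with SSE2 intrinsics (a sketch; equal16 is my name for it, and a real memcmp would loop over the buffers and branch on the mask every 64 to 128 bytes rather than compare one chunk):

```c
#include <emmintrin.h>   /* SSE2, baseline on x86-64 */
#include <stdbool.h>

/* Compare one 16-byte chunk for equality. pcmpeqb produces 0xFF in each
 * byte lane that matches; pmovmskb packs those into a 16-bit mask. */
static bool equal16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned loads */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);                /* pcmpeqb */
    return _mm_movemask_epi8(eq) == 0xFFFF;             /* pmovmskb: all 16 equal */
}
```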

    See also "how to auto vectorization array comparison function" - writing it as a loop that counts mismatches and always touches all elements can enable auto-vectorization (or an OR reduction instead of a sum reduction). But that won't help for a small fixed size like 4 bytes, and removing the early-out entirely is a disaster for a 1 GiB memcmp with a difference in the first byte. (A good SIMD memcmp, like glibc's SSE2/AVX2 versions, would branch maybe every 64 to 128 bytes on a vector compare result.)
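    The reduction-style loop described above might look like this (a sketch; bytes_equal is my name for it, and note it only answers equal/not-equal, not the ordering that memcmp's return value encodes):

```c
#include <stdbool.h>
#include <stddef.h>

/* Branchless equality check: always reads all n bytes and ORs together
 * the XOR of each byte pair, so the loop has no data-dependent early-out
 * and GCC/Clang can auto-vectorize it. Trade-offs from the text apply:
 * no early-out (bad for huge buffers that differ early) and no ordering
 * result. */
static bool bytes_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    unsigned char diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= p1[i] ^ p2[i];   /* OR reduction: nonzero iff any byte differs */
    return diff == 0;
}
```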


    Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy; GCC and clang can turn simple copy loops into call memcpy or call memset, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy or else your function turns into a call to itself... Pick a different name for experiments.)
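    Since fixed-size memcpy is recognized, one way to get the single-cmpl code from hand-rolled source is to go through memcpy into an integer (a sketch; test_data4 is my name, and the single-instruction result is what GCC/Clang typically produce at -O2 on x86-64, not something guaranteed by the standard):

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size memcpy into a uint32_t is recognized and lowered to a plain
 * (possibly unaligned) 4-byte load, so the whole function can compile to
 * a single dword compare, like the library-memcmp version. The caller
 * must still guarantee 4 readable bytes, same as with real memcmp. */
static bool test_data4(const void *data)
{
    uint32_t word, key;
    memcpy(&word, data, 4);    /* becomes one 4-byte load */
    memcpy(&key, "abcd", 4);   /* becomes the constant 0x64636261 */
    return word == key;
}
```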

    Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a memcmp idiom in GCC's internals. I think idiom recognition would be necessary to get good code for small sizes, like this case where we want a single dword cmp.


    Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data points to a 1-byte buffer holding 'z', your manual loop has well-defined behaviour in the abstract machine. Reading all 4 bytes would be past the end of the buffer.

    But memcmp is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for an early-out or for pointer alignment. ("Can std::memcmp read any bytes past the first difference?" - yes, unlike your loop.)

    (In x86-64 asm, going past the end could run into an unmapped page, resulting in a segfault and violating the as-if rule. "Is it safe to read past the end of a buffer within the same page on x86 and x64?" - yes, but only within the same page. This is something you can work around with aligned loads for strlen and strchr, but it's a bigger obstacle for vectorizing strcmp with differently-aligned pointers.)

    Instead of comparing two unknown pointers from function args, I changed your test_data to compare pointers to two global arrays, char foo[4], bar[4];, which the compiler knows for certain are both readable (Godbolt). But that didn't help, so still no.