Given:
#include <stdbool.h>
#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
The compiler can optimize it into:
test_data:
        cmpl    $1684234849, (%rdi)
        sete    %al
        ret
which is nice.
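(For reference: 1684234849 is just the 4 bytes of "abcd" loaded as one little-endian dword, 0x64636261. A quick sketch, not part of the original code, to check that:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint32_t v;
    memcpy(&v, "abcd", 4);                              /* the same dword the cmpl compares against */
    printf("0x%08x = %u\n", (unsigned)v, (unsigned)v);  /* 0x64636261 = 1684234849 on little-endian x86-64 */
    return 0;
}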
But if I use my own memcmp() (not from <string.h>), the compiler can't optimize it into a single cmpl instruction. Instead, it does this:
#include <stdbool.h>
#include <stddef.h>

static int memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;

    for (i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }
    return 0;
}

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
test_data:
        cmpb    $97, (%rdi)
        jne     .L5
        cmpb    $98, 1(%rdi)
        jne     .L5
        cmpb    $99, 2(%rdi)
        jne     .L5
        cmpb    $100, 3(%rdi)
        sete    %al
        ret
.L5:
        xorl    %eax, %eax
        ret
Link: https://godbolt.org/z/Kfhchr45a
Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip-count can't be computed before the first iteration (in the abstract machine), so GCC and clang wouldn't even try to use pcmpeqb/pmovmskb for large sizes. (Which is the efficient way to memcmp on x86-64 for large inputs.)
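To illustrate what those instructions do (a rough sketch with SSE2 intrinsics, my example rather than actual compiler or glibc output, assuming both pointers have at least 16 readable bytes):

#include <emmintrin.h>      /* SSE2 intrinsics */
#include <stdbool.h>

static bool chunks_equal_16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);        /* pcmpeqb: 0xFF in each byte lane that matches */
    return _mm_movemask_epi8(eq) == 0xFFFF;     /* pmovmskb: all 16 lanes matched */
}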
Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization).
See also how to auto vectorization array comparison function - writing it as a loop that counts mismatches and always touches all elements can give auto-vectorization (or an OR reduction instead of a sum reduction); see the sketch below. But that won't help for a small fixed size like 4 bytes. And removing the early-out entirely is a disaster for a 1 GiB memcmp with a difference in the first byte. (A good SIMD memcmp like glibc's SSE2/AVX2 versions would branch maybe every 64 to 128 bytes on vector compare results.)
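A minimal sketch of that touch-all-elements / OR-reduction idea (my own example, not a drop-in memcmp replacement: no early-out, and it only reports equal vs. not equal), which GCC and Clang can auto-vectorize because the trip count doesn't depend on the data:

#include <stdbool.h>
#include <stddef.h>

static bool bytes_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    unsigned char diff = 0;

    for (size_t i = 0; i < n; i++)
        diff |= p1[i] ^ p2[i];      /* OR reduction: stays 0 only if every byte matches */
    return diff == 0;
}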
Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy; GCC and clang can turn simple copy loops into call memcpy or call memset, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy or else your function turns into a call to itself... Pick a different name for experiments.)
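For example (my sketch, not code from the question), GCC and Clang recognize this kind of loop when optimizing and can turn the whole thing into a memcpy call or an inline expansion of it:

#include <stddef.h>

void copy_bytes(char *restrict dst, const char *restrict src, size_t n)
{
    /* With restrict proving no overlap, GCC/Clang typically compile this
     * whole loop to a tail-call to memcpy at -O2/-O3. */
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}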
Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a memcmp idiom in GCC's internals. I think that would be necessary to get good code for small problem sizes, like this case where we want a single dword cmp.
Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data points to a 1-byte buffer holding 'z', your manual loop has well-defined behaviour in the abstract machine; reading all 4 bytes would go past the end of the buffer.
But memcmp is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for an early-out or for pointer alignment. (Can std::memcmp read any bytes past the first difference? Yes, unlike your loop.)
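A concrete illustration (my own sketch; starts_equal is a hypothetical stand-in for the hand-written loop, not a name from the question): this program is well-defined with the early-out loop, but calling the library memcmp(buf, "abcd", 4) here would be UB:

#include <stdbool.h>
#include <stddef.h>

static bool starts_equal(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;

    for (size_t i = 0; i < n; i++)
        if (p1[i] != p2[i])
            return false;           /* early-out: later bytes are never read */
    return true;
}

int main(void)
{
    char buf[1] = { 'z' };
    /* Well-defined: the loop stops at buf[0] because 'z' != 'a'.
     * memcmp(buf, "abcd", 4) would be UB: all 4 bytes would have to be readable. */
    return starts_equal(buf, "abcd", 4);
}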
(In x86-64 asm, going past the end could run into an unmapped page, resulting in a segfault, violating the as-if rule. Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, but only within the same page. This is something you can work around with aligned loads for strlen and strchr, but it's a bigger obstacle for vectorizing strcmp with differently-aligned pointers.)
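A sketch of that aligned-load trick (my example, not glibc's code): a 16-byte load from a 16-byte-aligned address can't cross a page boundary, so it can't fault if the byte you actually care about is readable. This is only safe at the asm/hardware level; in ISO C the read outside the object is formally out of bounds, which is why real implementations do it in hand-written asm, and why tools like ASan flag it:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

size_t strlen_sse2_sketch(const char *s)
{
    /* Round s down to 16-byte alignment; this aligned load stays within
     * the same page as s[0], so it can't fault. */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    const __m128i zero = _mm_setzero_si128();

    __m128i v = _mm_load_si128((const __m128i *)p);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
    mask >>= (s - p);                           /* ignore bytes before the start of the string */
    if (mask)
        return (size_t)__builtin_ctz(mask);     /* GCC/Clang builtin: index of first zero byte */

    for (;;) {                                  /* later chunks: aligned, so never page-crossing */
        p += 16;
        v = _mm_load_si128((const __m128i *)p);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask)
            return (size_t)(p - s) + __builtin_ctz(mask);
    }
}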
Instead of comparing two unknown pointers from function args, I changed your test_data caller to pass pointers to two global arrays (char foo[4], bar[4];), which the compiler knows for certain are both readable (Godbolt). But that didn't help, so still no.
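For reference, the experiment looked roughly like this (my reconstruction, not the exact Godbolt code; bytes_match4 is a hypothetical name standing in for the question's custom memcmp):

#include <stdbool.h>
#include <stddef.h>

char foo[4], bar[4];        /* globals: the compiler can see all 4 bytes of each are readable */

static bool bytes_match4(const unsigned char *p1, const unsigned char *p2)
{
    for (size_t i = 0; i < 4; i++)
        if (p1[i] != p2[i])
            return false;
    return true;
}

bool test_globals(void)
{
    /* Even with fully-visible global arrays, GCC/Clang still emit the
     * byte-by-byte compare instead of a single 4-byte cmp. */
    return bytes_match4((const unsigned char *)foo, (const unsigned char *)bar);
}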