c++optimizationcomparisonx86-64memcmp

Test of 8 subsequent bytes isn't translated into a single compare instruction


Motivated by this question, I compared three different functions for checking if 8 bytes pointed to by the argument are zeros (note that in the original question, characters are compared with '0', not 0):

bool f1(const char *ptr)
{    
  for (int i = 0; i < 8; i++)
    if (ptr[i])
      return false;
  return true;
}

bool f2(const char *ptr)
{  
  bool res = true;
  for (int i = 0; i < 8; i++)
    res &= (ptr[i] == 0);
  return res;
}

bool f3(const char *ptr)
{  
  static const char tmp[8]{};
  return !std::memcmp(ptr, tmp, 8);
}

Though I would expect the same assembly outcome with enabled optimizations, only the memcmp version was translated into a single cmp instruction on x64. Both f1 and f2 were translated into either a winded or unwinded loop. Moreover, this holds for all GCC, Clang, and Intel compilers with -O3.

Is there any reason why f1 and f2 cannot be optimized into a single compare instruction? It seem to be a pretty straightforward optimization to me.

Live demo: https://godbolt.org/z/j48366


Solution

  • First of all, f1 stops reading at the first non-zero byte, so there are cases where it won't fault if you pass it a pointer to a shorter object near the end of a page, and the next page is unmapped. Unconditionally reading 8 bytes can fault in cases where f1 doesn't encounter UB, as @bruno points out. (Is it safe to read past the end of a buffer within the same page on x86 and x64?). The compiler doesn't know that you're never going to use it this way; it has to make code that works for every possible non-UB case for any hypothetical caller.

    You can fix that by making the function arg const char ptr[static 8] (but that's a C99 feature, not C++) to guarantee that it's safe to touch all 8 bytes even if the C abstract machine wouldn't. Then the compiler can safely invent reads. (A pointer to a struct {char buf[8]}; would also work, but wouldn't be strict-aliasing safe if the actual pointed-to object wasn't that.)


    GCC and clang can't auto-vectorize loops whose trip-count isn't known before the first iteration. So that rules out all search loops like f1, even if made it check a static array of known size or something. (ICC can vectorize some search loops like a naive strlen implementation, though.)

    Your f2 could have been optimized the same as f3, to a qword cmp, without overcoming that major compiler-internals limitations because it always does 8 iterations. In fact, current nightly builds of clang do optimize f2, thanks @Tharwen for spotting that.

    Recognizing loop patterns is not that simple, and takes compile time to look for. IDK how valuable this optimization would be in practice; that's what compiler devs need trade off against when considering writing more code to look for such patterns. (Maintenance cost of code, and compile-time cost.)

    The value depends on how much real world code actually has patterns like this, as well as how big a saving it is when you find it. In this case it's a very nice saving, so it's not crazy for clang to look for it, especially if they have the infrastructure to turn a loop over 8 bytes into an 8-byte integer operation in general.


    In practice, just use memcmp if that's what you want; apparently most compilers don't spend time looking for patterns like f2. Modern compilers do reliably inline it, especially for x86-64 where unaligned loads are known to be safe and efficient in asm.

    Or use memcpy to do an aliasing-safe unaligned load and compare that, if you think your compiler is more likely to have a builtin memcpy than memcmp.

    Or in GNU C++, use a typedef to express unaligned may-alias loads:

    bool f4(const char *ptr) {
       typedef uint64_t aliasing_unaligned_u64 __attribute__((aligned(1), may_alias));
        auto val = *(const aliasing_unaligned_u64*)ptr;
        return val != 0;
    }
    

    Compiles on Godbolt with GCC10 -O3:

    f4(char const*):
            cmp     QWORD PTR [rdi], 0
            setne   al
            ret
    

    Casting to uint64_t* would potentially violate alignof(uint64_t), and probably violate the strict-aliasing rule unless the actual object pointed to by the char* was compatible with uint64_t.

    And yes, alignment does matter on x86-64 because the ABI allows compilers to make assumptions based on it. A faulting movaps or other problems can happen with real compilers in corner cases.