cperformanceoptimizationmemcmp

Why is memcmp so much faster than a for loop check?


Why is memcmp(a, b, size) so much faster than:

for(i = 0; i < nelements; i++) {
    if a[i] != b[i] return 0;
}
return 1;

Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp over the loop.


Solution

  • memcmp is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.

    As a "builtin"

    GCC supports memcmp (as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp will be recognized as __builtin_memcmp. Instead of emitting a call to the memcmp library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.

    On x86, this leverages the use of the cmpsb instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp does).

    Given the following code:

    int test(const void* s1, const void* s2, int count)
    {
        return memcmp(s1, s2, count) == 0;
    }
    

    gcc version 3.4.4 on Cygwin generates the following assembly:

    ; (prologue)
    mov     esi, [ebp+arg_0]    ; Move first pointer to esi
    mov     edi, [ebp+arg_4]    ; Move second pointer to edi
    mov     ecx, [ebp+arg_8]    ; Move length to ecx
    
    cld                         ; Clear DF, the direction flag, so comparisons happen
                                ; at increasing addresses
    cmp     ecx, ecx            ; Special case: If length parameter to memcmp is
                                ; zero, don't compare any bytes.
    repe cmpsb                  ; Compare bytes at DS:ESI and ES:EDI, setting flags
                                ; Repeat this while equal ZF is set
    setz    al                  ; Set al (return value) to 1 if ZF is still set
                                ; (all bytes were equal).
    ; (epilogue) 
    

    Reference:

    As a library function

    Highly-optimized versions of memcmp exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.

    In Glibc, there are versions of memcmp for x86_64 that can take advantage of the following instruction set extensions:

    The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S:

    ENTRY(memcmp)
        .type   memcmp, @gnu_indirect_function
        LOAD_RTLD_GLOBAL_RO_RDX
        HAS_CPU_FEATURE (SSSE3)
        jnz 2f
        leaq    __memcmp_sse2(%rip), %rax
        ret 
    
    2:  HAS_CPU_FEATURE (SSE4_1)
        jz  3f  
        leaq    __memcmp_sse4_1(%rip), %rax
        ret 
    
    3:  leaq    __memcmp_ssse3(%rip), %rax
        ret 
    
    END(memcmp)
    

    In the Linux kernel

    Linux does not seem to have an optimized version of memcmp for x86_64, but it does for memcpy, in arch/x86/lib/memcpy_64.S. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.