c++ performance assembly optimization strlen

Fast generic strlen() implementation that can accept an arbitrary terminator character


template <char terminator = '\0'>
size_t strlen(const char *str)
{
    const char *char_ptr;
    const unsigned long int *longword_ptr;
    unsigned long int longword, himagic, lomagic;

    for (char_ptr = str; ((unsigned long int) char_ptr 
             & (sizeof (longword) - 1)) != 0; ++char_ptr)
       if (*char_ptr == '\0')
           return char_ptr - str;

    longword_ptr = (unsigned long int *) char_ptr;

    himagic = 0x80808080L;
    lomagic = 0x01010101L;

    for (;;)
    { 
        longword = *longword_ptr++;

        if (((longword - lomagic) & himagic) != 0)
        {
            const char *cp = (const char *) (longword_ptr - 1);

            if (cp[0] == 0)
                return cp - str;
            if (cp[1] == 0)
                return cp - str + 1;
            if (cp[2] == 0)
                return cp - str + 2;
            if (cp[3] == 0)
                return cp - str + 3;
        }
    }
}

The above is glibc's strlen() code. It relies on a trick for determining whether a word has a zero byte to make it fast.

However, I wish to make the function work with any terminating character, not just '\0', using a template. Is it possible to do something similar?


Solution

  • Use std::memchr to take advantage of libc's hand-written asm

    It returns a pointer to the found byte, so you can get the length by subtracting. It returns NULL on not-found, but you said you can assume there will be a match, so we don't need to check except as a debug assert.

    Even better, use rawmemchr if you can assume GNU functions are available, so you don't even have to pass a length. (Or not, since glibc 2.37 deprecates it.)

    #include <cstring>
    
    size_t lenx(const char *p, int c, size_t len)
    {
        const void *match = std::memchr(p, c, len);  // old C functions take char in an int
        return static_cast<const char*>(match) - p;
    }
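    To get the templated interface the question asks for, a thin wrapper over std::memchr is enough, since the terminator is a compile-time constant you can just pass through (strlen_t is a hypothetical name, and it assumes a match exists, as above):

```cpp
#include <cstring>
#include <cstddef>

// Hypothetical sketch: the question's template parameter just becomes the
// byte passed to std::memchr.  Assumes the terminator is present in the
// first `len` bytes.
template <char terminator = '\0'>
size_t strlen_t(const char *p, size_t len)
{
    const void *match = std::memchr(p, terminator, len);
    return static_cast<const char *>(match) - p;
}
```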
    

    Any decent libc implementation for a modern mainstream CPU will have a fast memchr that checks multiple bytes at once, often hand-written in asm. It's very similar to an strlen implementation, but with a length-based exit condition in the unrolled loop, not just the match-finding loop-exit condition.

    memchr is somewhat cheaper than strchr, which also has to check every byte for being a potential 0, an amount of work that doesn't go down with unrolling and SIMD. If data isn't hot in L1 cache, though, a good strchr can typically still keep up with available memory bandwidth on most CPUs for most ISAs. Checking for 0s is also a correctness problem for arrays that contain a 0 byte before the byte you're looking for.

    If available, it will even use SIMD instructions to check 16 or 32 bytes at once. A pure C bithack (with strict-aliasing UB) like the one you found is only used in portable fallback code in real C libraries (Why does glibc's strlen need to be so complicated to run quickly? explains this and has some links to glibc's asm implementations), or on targets where it compiles to asm as good as could be written by hand (e.g. MIPS for glibc). (But being wrapped up in a library function, the strict-aliasing UB is dealt with by some means, perhaps as simple as just not being able to inline into other code that accesses that data a different way. If you wanted to do it yourself, you'd want a typedef with something like GNU C __attribute__((may_alias)). See the link earlier in this paragraph.)

    You certainly don't want a bithack that's only going to check 4 bytes at a time, especially if unsigned long is an 8-byte type on a 64-bit CPU!
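    That said, if you do want the bithack itself to honor the template parameter, the standard generalization is to XOR each word with the terminator repeated in every byte, so bytes equal to the terminator become zero, and then apply the zero-byte test. A hypothetical 8-byte sketch (names are mine, not glibc's; memcpy sidesteps the strict-aliasing UB of the original pointer cast) might look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the question's templated strlen, generalized to any
// terminator.  Uses the exact zero-byte test (w - lo) & ~w & hi, which has no
// false positives, unlike the (w - lo) & hi form in the question's code.
template <char terminator = '\0'>
std::size_t strlen_generic(const char *str)
{
    constexpr std::uint64_t lo = 0x0101010101010101ULL;
    constexpr std::uint64_t hi = 0x8080808080808080ULL;
    // Terminator repeated in every byte of a 64-bit word.
    constexpr std::uint64_t repeated =
        lo * static_cast<unsigned char>(terminator);

    const char *p = str;
    // Scan byte-by-byte until p is 8-byte aligned.
    for (; reinterpret_cast<std::uintptr_t>(p) % sizeof(std::uint64_t) != 0; ++p)
        if (*p == terminator)
            return p - str;

    for (;;) {
        std::uint64_t w;
        std::memcpy(&w, p, sizeof w);   // aligned load, no aliasing UB
        w ^= repeated;                  // terminator bytes are now zero
        if ((w - lo) & ~w & hi) {       // true iff some byte of w is zero
            for (std::size_t i = 0; i < sizeof w; ++i)
                if (p[i] == terminator)
                    return (p - str) + i;
        }
        p += sizeof w;
    }
}
```

    This is still only a portable-fallback-quality sketch: a real SIMD memchr will beat it.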


    If you don't know the buffer length, use len = -1 in C11 / C++17

    Use rawmemchr if available, otherwise use memchr(ptr, c, -1).
    That's equivalent to passing SIZE_MAX.
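    For example (len_unbounded is a hypothetical name; this assumes a match is guaranteed and a C11 / C++17 library):

```cpp
#include <cstring>
#include <cstdint>
#include <cstddef>

// Sketch assuming the character is guaranteed to be present: pass SIZE_MAX as
// the length.  The standard guarantees memchr stops at the first match, so
// the huge length is never actually traversed.
size_t len_unbounded(const char *p, int c)
{
    const void *match = std::memchr(p, c, SIZE_MAX);
    return static_cast<const char *>(match) - p;
}
```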

    See Is it legal to call memchr with a too-long length, if you know the character will be found before reaching the end of the valid region?

    It's guaranteed not to read past the match, or at least to behave as if it didn't, i.e. not faulting. So not reading into the next page just like an optimized strlen, and probably for performance reasons not reading into the next cache line. (At least since C++17 / C11, according to cppreference, but real implementations have almost certainly been safe to use this way for longer, at least for performance reasons.)

    The ISO C++ standard itself defers to the C standard for <cstring> functions; C++17 and later defer to C11, which added this requirement that C99 didn't have. (I also don't know if there are real-world implementations that violate that standard; I'd guess not and that it was more a matter of documenting a guarantee that real implementations were already doing.)

    The POSIX man page for memchr guarantees stopping on a match; I don't know how far back this guarantee goes for POSIX systems.

    Implementations shall behave as if they read the memory byte by byte from the beginning of the bytes pointed to by s and stop at the first occurrence of c (if it is found in the initial n bytes).

    Without a guarantee like this, it's hypothetically possible for an implementation to just use unaligned loads starting with the address you pass it, as long as it's far enough from the ptr[size-1] end of the buffer you told it about. That's unlikely for performance reasons, though.


    rawmemchr()

    If you're on a GNU system, glibc has rawmemchr which assumes there will be a match, not taking a size limit. So it can loop exactly like strlen does, not having a 2nd exit condition based on either length or checking every byte for a 0 as well as the given character.

    Fun fact: AArch64 glibc implements it as memchr(ptr, c, -1), or as strlen if the character happens to be 0. On other ISAs, it might actually duplicate the memchr code but without checking against the end of a buffer.

    Glibc 2.37 deprecates it, so it's apparently not a good idea for new code to switch to it now.