c++performancebenchmarkingstring-concatenationabseil

Surprising Benchmark Result


After watching Titus Winters' "Live at Head" talk, where he mentions that StrCat() is one of people's favorite features, I decided to try and implement something similar to see if I could beat std::string::append (or operator+, which I figured uses append internally) in terms of runtime performance. My reasoning was that a strcat() function implemented as a variadic template would be able to determine the combined size of all its string-like arguments and make a single allocation to store the final result, rather than having to constantly reallocate in the case of operator+, which has no knowledge of the overarching context in which it's called.

However, when I compared my custom implementation against operator+ on quick-bench, I found that my strcat() implementation is about 4x slower than operator+ on recent versions of both clang and gcc, compiled with -std=c++17 -O3. I've included the quick-bench code below for reference.

Does anyone know what could be causing the slowdown here?

#include <cstring>
#include <iostream>
#include <string>

// Get the size of string-like args
int getsize(const std::string& s) { return s.size(); }
int getsize(const char* s) { return strlen(s); }
template <typename S>
int strcat_size(const S& s) {
  return getsize(s);
}
template <typename S, typename... Strings>
int strcat_size(const S& first, Strings... rest) {
  if (sizeof...(Strings) == 0) {
    return 0;
  } else {
    return getsize(first) + strcat_size(rest...);
  }
}

// Populate a pre-allocated string with content from another string-like object
template <typename S>
void strcat_fill(std::string& res, const S& first) {
  res += first;
}
template <typename S, typename... Strings>
void strcat_fill(std::string& res, const S& first, Strings... rest) {
  res += first;
  strcat_fill(res, rest...);
}

template <typename S, typename... Strings>
std::string strcat(const S& first, Strings... rest) {
  int totalsize = strcat_size(first, rest...);

  std::string res;
  res.reserve(totalsize);

  strcat_fill(res, first, rest...);

  return res;
}

const char* s1 = "Hello World! ";
std::string s2 = "Here is a string to concatenate. ";
std::string s3 = "Here is a longer string to concatenate that avoids small string optimization";
const char* s4 = "How about some more strings? ";
std::string s5 = "And more strings? ";
std::string s6 = "And even more strings to use!";

static void strcat_bench(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    std::string s = strcat(s1, s2, s3, s4, s5, s6);

    benchmark::DoNotOptimize(s);
  }
}
BENCHMARK(strcat_bench);

static void append_bench(benchmark::State& state) {
  for (auto _ : state) {
    std::string s = s1 + s2 + s3 + s4 + s5 + s6;

    benchmark::DoNotOptimize(s);
  }
}
BENCHMARK(append_bench);

Solution

  • That's because of passing arguments by value.

    I changed the code to use fold-expressions instead (which looks a lot cleaner)
    and got rid of unnecessary copies (Strings... rest should've been a reference).

    int getsize(const std::string& s) { return s.size(); }
    int getsize(const char* s) { return strlen(s); }
    
    template <typename ...P>
    std::string strcat(const P &... params)
    {
      std::string res;
      res.reserve((getsize(params) + ...));
      (res += ... += params);
      return res;
    }
    

    This solution beats append by approximately 30%.


    There seems to be no difference between passing by const references and doing perfect forwarding in this case. It makes sense, because std::string += won't move it's arguments even if they're rvalues.


    If you don't have access to the new fancy fold-expressions but still want the performance, use 'dummy array' trick instead (which seems to have exactly same performance in this case).

    template <typename ...P>
    std::string strcat(const P &... params)
    {
      using dummy_array = int[]; // This is necessary because `int[]{blah}` doesn't compile.
      std::string res;
      std::size_t size = 0;
      dummy_array{(void(size += getsize(params)), 0)..., 0};
      res.reserve(size);
      dummy_array{(void(res += params), 0)..., 0};
      return res;
    }