I've created a very simple benchmark to illustrate short string optimization and ran it on quick-bench.com. The benchmark works well for comparing an SSO-disabled and an SSO-enabled string class, and the results are very consistent with both GCC and Clang. However, I realized that when I disable optimizations, the reported times are around 4 times faster than those observed with optimizations enabled (-O2 or -O3), both with GCC and Clang.
The benchmark is here: http://quick-bench.com/DX2G2AdxUb7sGPE-zLRa41-MCk0.
Any idea what may cause the unoptimized benchmark to run 4 times faster?
Unfortunately, I can't see the generated assembly, so I don't know where the problem is (the "Record disassembly" box is checked but has no effect in my runs). Also, when I run the benchmark locally with Google Benchmark, the results are as expected, i.e., the optimized benchmark runs faster.
I also tried to compare both variants in Compiler Explorer, and the unoptimized one seemingly executes many more instructions: https://godbolt.org/z/I4a171.
So, as discussed in the comments, the issue is that quick-bench.com does not show absolute times for the benchmarked code, but rather times relative to the time a no-op benchmark takes. The no-op benchmark can be found in the quick-bench.com source files:
```cpp
static void Noop(benchmark::State& state) {
  for (auto _ : state) benchmark::DoNotOptimize(0);
}
```
All benchmarks of a run are compiled together, so the optimization flags apply to the no-op benchmark as well.
Reproducing and comparing the no-op benchmark for different optimization levels, one can see that there is about a 6- to 7-fold speedup from the -O0 to the -O1 version. When comparing benchmark runs done with different optimization flags, this factor in the baseline must be taken into account. The 4x speedup observed in the question's benchmark is therefore more than compensated for, and the behavior is really as one would expect: in absolute terms, the optimized code is still faster.
One main difference in the compilation of the no-op between -O0 and -O1 is that at -O0 there are some assertions and other additional branches in the Google Benchmark code that are optimized out at higher optimization levels.
Additionally, at -O0 each iteration of the loop loads parts of state from memory into registers, modifies them, and stores them back multiple times, e.g. for decrementing the loop counter and for the conditionals on the loop counter, while the -O1 version keeps state in registers, making memory loads/stores inside the loop unnecessary. The former is much slower, taking at least a few cycles per iteration for the necessary store-to-load forwarding and/or reloads from memory.