I was recently reading this, which states:
Don’t assume that complicated code is necessarily faster than simple code.
The code is copied as follows:
Example, good
// clear expression of intent, fast execution
vector<uint8_t> v(100000);
for (auto& c : v)
    c = ~c;
Example, bad
// intended to be faster, but is often slower
vector<uint8_t> v(100000);
for (size_t i = 0; i < v.size(); i += sizeof(uint64_t)) {
    uint64_t& quad_word = *reinterpret_cast<uint64_t*>(&v[i]);
    quad_word = ~quad_word;
}
I am not sure what the purpose of the bad example is. Why is it intended to be faster, and why is it in fact often slower?
Compilers attempt to vectorize loops like this, performing multiple operations at once using SIMD instructions (SSE/AVX on x86-64, other instruction sets on other platforms).
By vectorizing manually with uint64_t, you force the compiler to perform 8 byte operations at once. That looks like an improvement in a debug build, but it may prevent the compiler from performing even more operations at once (for example, 16 bytes per step with SSE2, 32 with AVX2, or 64 with AVX-512), which is why the manual version is often slower in a release build.