I am trying to learn some C++ optimizations and I have tried using _mm_prefetch for summing an array. My benchmark code is:
#include <benchmark/benchmark.h>
#include <vector>
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
#elif defined(_MSC_VER)
#include <xmmintrin.h>
#define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
#else
#define PREFETCH(addr, hint)
#endif
class PrefetchBenchmark : public benchmark::Fixture {
public:
    static constexpr size_t data_size = 1 << 20;

    void SetUp(const benchmark::State& state) override {
        data.resize(data_size, 1);
    }

    void TearDown(const benchmark::State& state) override {
    }

    std::vector<int> data;
};
BENCHMARK_F(PrefetchBenchmark, NoPrefetch)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        for (const auto& i : data) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_F(PrefetchBenchmark, WithPrefetch)(benchmark::State& state) {
    int prefetch_distance = 10;
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < data.size(); i++) {
            if (i + prefetch_distance < data.size()) {
                PREFETCH(&data[i + prefetch_distance], 3);
            }
            sum += data[i];
        }
        benchmark::DoNotOptimize(sum);
    }
}
However, the benchmark runs consistently slower with the prefetch:
PrefetchBenchmark/NoPrefetch 348484 ns 344905 ns 1948
PrefetchBenchmark/WithPrefetch 595119 ns 585938 ns 1120
Why is this, and how could I make a test which gets a performance increase from using _mm_prefetch?
My git repo with the full benchmark example is here.
First, your code introduces a needless branch, which very likely slows things down and can be avoided:
/* Original */
for (int i = 0; i < data.size(); i++) {
    if (i + prefetch_distance < data.size()) {
        PREFETCH(&data[i + prefetch_distance], 3);
    }
    sum += data[i];
}

/* Updated code */
std::size_t i;
for (i = 0; i + prefetch_distance < data.size(); i++) {
    PREFETCH(&data[i + prefetch_distance], 3);
    sum += data[i];
}
for ( ; i < data.size(); i++)
    sum += data[i];
Now, looking at the code with no branching: the root cause of the problem seems to be the compiler's inability to vectorize the loop with SIMD instructions when __builtin_prefetch() is used in its body. The NoPrefetch code is vectorized, but not explicitly prefetched. The WithPrefetch code is explicitly prefetched, but not vectorized. The slowdown from missed vectorization is much more severe than the speedup from explicit prefetching.
One interesting GCC bug report sheds some light on the issue: Bug 114061 - GCC fails vectorization when using __builtin_prefetch. At least for GCC, it seems like the compiler assumes that your __builtin_prefetch(&data[i + x]) clobbers memory and performs a function call (it makes sense to avoid vectorization in that case), even though the call is to a builtin that acts as a no-op on the surrounding data.
GCC 15 should have a fix in place to overcome this limitation and allow the builtin without disrupting vectorization. However, from what I can see on Godbolt.org, even though GCC 16 trunk does vectorize the loop, it completely ignores the prefetching, leaving it entirely out of the loop. So it still seems broken to me.
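To see the effect in isolation, here is a minimal sketch you can paste into Godbolt (the function name and the distance of 64 are arbitrary choices of mine, not from the original benchmark): with the prefetch line present, affected GCC versions emit a scalar loop; removing it brings back the vectorized one.
#include <cstddef>

// Minimal reproducer sketch: compile with g++ -O3 -march=skylake and compare
// the generated asm with and without the prefetch line.
long sum_with_prefetch(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        // Comment this line out to restore auto-vectorization on affected GCCs.
        // Prefetching past the end of the array near the tail is harmless:
        // prefetch instructions do not fault on invalid addresses.
        __builtin_prefetch(&data[i + 64], 0, 3);
        sum += data[i];
    }
    return sum;
}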
So how should you "fix" this?
One very important observation, as pointed out by Peter Cordes, is that you are prefetching more than needed: you only need to prefetch at most once per 64-byte cache line (16 ints, assuming 4-byte int), but you are doing so on every single iteration of the loop, as sketched below.
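As a minimal sketch of that idea (kIntsPerLine is a name introduced here for illustration; it assumes 4-byte ints and 64-byte cache lines, with size, data, sum and prefetch_distance as in the benchmarks further down):
// Issue one prefetch per 64-byte cache line rather than one per element.
constexpr std::size_t kIntsPerLine = 64 / sizeof(int); // 16 ints per line
std::size_t i = 0;
for (; i + prefetch_distance < size; i += kIntsPerLine) {
    PREFETCH(&data[i + prefetch_distance], 3);
    for (std::size_t j = i; j < i + kIntsPerLine; j++)
        sum += data[j];
}
for (; i < size; ++i) // leftover tail, no prefetch needed
    sum += data[i];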
Software prefetching can be beneficial, but it is in general a difficult optimization task involving lots of empirical trial and error, mostly because hardware prefetching is nowadays already very good. See the answer to "Can I read a CPU x86 flag to determine if prefetched data has arrived in the L1 cache?". Prefetch too early (large distance) and the data will be evicted by the time you want to use it; prefetch too late (small distance) and it becomes a no-op that wastes CPU cycles, as the hardware prefetch already happened.
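As a rough back-of-the-envelope for picking a starting distance (the numbers below are assumptions for illustration, not measurements): if a line takes about 100 ns to arrive from DRAM and the loop consumes one 64-byte line in about 4 ns, you want to run at least 100 / 4 = 25 lines ahead, i.e. a few hundred ints:
// Illustrative starting-point calculation; both timings are assumed, so the
// distance still needs empirical tuning on the target machine.
constexpr double kDramLatencyNs = 100.0; // assumed DRAM load latency
constexpr double kNsPerLine     = 4.0;   // assumed loop time per 64-byte line
constexpr std::size_t kLinesAhead   =
    static_cast<std::size_t>(kDramLatencyNs / kNsPerLine);  // ~25 lines
constexpr std::size_t kDistanceInts = kLinesAhead * (64 / sizeof(int)); // ~400 ints
This is consistent with the sweep below, where distances of a few hundred elements perform best.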
Since it seems like you cannot rely on the compiler to vectorize correctly with your explicit SW prefetching in the way, you have two main options: keep the scalar loop but unroll it by hand so that you only issue one prefetch per cache line, or vectorize the loop manually with SIMD intrinsics so that the missed auto-vectorization no longer matters. Correctly taking cache-line size into account and unrolling 16 iterations of the loop at once, we can see how things go and tune the right prefetch distance.
The scenarios to test (both with and without manual prefetch) are: the plain scalar loop (left to the compiler's auto-vectorizer), a manually vectorized SSE loop, and a manually vectorized AVX2 loop.
Here's some code to benchmark, testing a range of prefetch distances that are powers of two:
// g++ -march=skylake -O3 -DNDEBUG test.cc -o test -lbenchmark
#include <vector>
#include <random>
#include <algorithm> // std::generate, std::max
#include <limits>    // std::numeric_limits
#include <immintrin.h>
#include <benchmark/benchmark.h>
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
#elif defined(_MSC_VER)
#include <xmmintrin.h> // For _mm_prefetch in MSVC
#define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
#else
#define PREFETCH(addr, hint) // No prefetch support for this compiler
#endif
// Sum 8 consecutive ints starting at data[i] using 128-bit vectors.
// Note: _mm_hadd_epi32 is SSSE3, not plain SSE2.
static inline long simd_sum8(const std::vector<int> &data, const size_t i) {
    __m128i vec1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i]));
    __m128i vec2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i + 4]));
    __m128i sum_vec = _mm_add_epi32(vec1, vec2); // 4 pairwise sums
    sum_vec = _mm_hadd_epi32(sum_vec, sum_vec);  // horizontal reduction...
    sum_vec = _mm_hadd_epi32(sum_vec, sum_vec);  // ...down to a single lane
    return _mm_cvtsi128_si32(sum_vec);
}
// Sum 16 consecutive ints starting at data[i] using 256-bit AVX2 vectors.
static inline int simd_sum16(const std::vector<int> &data, const size_t i) {
    __m256i vec1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i]));
    __m256i vec2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i + 8]));
    __m256i sum_vec = _mm256_add_epi32(vec1, vec2);        // 8 pairwise sums
    __m128i sum_lo = _mm256_castsi256_si128(sum_vec);      // low 128 bits
    __m128i sum_hi = _mm256_extracti128_si256(sum_vec, 1); // high 128 bits
    __m128i sum128 = _mm_add_epi32(sum_lo, sum_hi);
    sum128 = _mm_hadd_epi32(sum128, sum128); // horizontal reduction...
    sum128 = _mm_hadd_epi32(sum128, sum128); // ...down to a single lane
    return _mm_cvtsi128_si32(sum128);
}
class PrefetchBenchmark : public benchmark::Fixture {
public:
    static constexpr size_t data_size = 1 << 20;

    void SetUp(const benchmark::State& state) override {
        // https://stackoverflow.com/q/21516575/3889449
        std::random_device rnd_device;
        std::mt19937 mersenne_engine {rnd_device()};
        std::uniform_int_distribution<int> dist(
            std::numeric_limits<int>::min(),
            std::numeric_limits<int>::max()
        );
        auto gen = [&](){ return dist(mersenne_engine); };

        // Resize the vector and fill with random values
        data.resize(data_size, 1);
        std::generate(data.begin(), data.end(), gen);
    }

    std::vector<int> data;
};
BENCHMARK_DEFINE_F(PrefetchBenchmark, Normal)(benchmark::State& state) {
    for (auto _ : state) {
        const size_t size = data.size();
        long sum = 0;
        for (size_t i = 0; i < size; i++)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_DEFINE_F(PrefetchBenchmark, NormalWithPrefetch)(benchmark::State& state) {
    const size_t prefetch_distance = state.range(0);
    for (auto _ : state) {
        const size_t size = data.size();
        size_t i = 0;
        long sum = 0;
        for (; i + std::max(prefetch_distance, std::size_t(16)) < size; i += 16) {
            PREFETCH(&data[i + prefetch_distance], 3);
            for (size_t j = i; j < i + 16; j++)
                sum += data[j];
        }
        for (; i < size; ++i)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualSSE2)(benchmark::State& state) {
    for (auto _ : state) {
        const size_t size = data.size();
        size_t i = 0;
        long sum = 0;
        for (; i + 16 <= size; i += 16) {
            sum += simd_sum8(data, i);
            sum += simd_sum8(data, i + 8);
        }
        for (; i < size; ++i)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualSSE2WithPrefetch)(benchmark::State& state) {
    const int prefetch_distance = state.range(0);
    for (auto _ : state) {
        const size_t size = data.size();
        size_t i = 0;
        long sum = 0;
        for (; i + prefetch_distance < size; i += 16) {
            PREFETCH(&data[i + prefetch_distance], 3);
            sum += simd_sum8(data, i);
            sum += simd_sum8(data, i + 8);
        }
        for (; i < size; ++i)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualAVX2)(benchmark::State& state) {
    for (auto _ : state) {
        const size_t size = data.size();
        size_t i = 0;
        long sum = 0;
        for (; i + 16 <= size; i += 16)
            sum += simd_sum16(data, i);
        for (; i < size; ++i)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualAVX2WithPrefetch)(benchmark::State& state) {
    const int prefetch_distance = state.range(0);
    for (auto _ : state) {
        const size_t size = data.size();
        size_t i = 0;
        long sum = 0;
        for (; i + prefetch_distance < size; i += 16) {
            PREFETCH(&data[i + prefetch_distance], 3);
            sum += simd_sum16(data, i);
        }
        for (; i < size; ++i)
            sum += data[i];
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_REGISTER_F(PrefetchBenchmark, Normal);
BENCHMARK_REGISTER_F(PrefetchBenchmark, NormalWithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualSSE2);
BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualSSE2WithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualAVX2);
BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualAVX2WithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
BENCHMARK_MAIN();
The corresponding output on my machine (i9-10900) is:
Benchmark Time CPU Iterations
----------------------------------------------------------------------------------------
PrefetchBenchmark/Normal 98994 ns 98991 ns 6957
PrefetchBenchmark/NormalWithPrefetch/16 56664 ns 56663 ns 12007
PrefetchBenchmark/NormalWithPrefetch/32 56356 ns 56354 ns 12606
PrefetchBenchmark/NormalWithPrefetch/64 55560 ns 55555 ns 12184
PrefetchBenchmark/NormalWithPrefetch/128 54374 ns 54372 ns 12901
PrefetchBenchmark/NormalWithPrefetch/256 54371 ns 54369 ns 12562
PrefetchBenchmark/NormalWithPrefetch/512 54084 ns 54082 ns 13334
PrefetchBenchmark/NormalWithPrefetch/1024 54257 ns 54253 ns 13092
PrefetchBenchmark/ManualSSE2 152568 ns 152561 ns 4475
PrefetchBenchmark/ManualSSE2WithPrefetch/16 158408 ns 158404 ns 4533
PrefetchBenchmark/ManualSSE2WithPrefetch/32 146926 ns 146921 ns 4815
PrefetchBenchmark/ManualSSE2WithPrefetch/64 142630 ns 142627 ns 4978
PrefetchBenchmark/ManualSSE2WithPrefetch/128 132613 ns 132608 ns 5543
PrefetchBenchmark/ManualSSE2WithPrefetch/256 133385 ns 133380 ns 5547
PrefetchBenchmark/ManualSSE2WithPrefetch/512 126122 ns 126115 ns 5444
PrefetchBenchmark/ManualSSE2WithPrefetch/1024 128566 ns 128562 ns 5480
PrefetchBenchmark/ManualAVX2 129828 ns 129823 ns 5414
PrefetchBenchmark/ManualAVX2WithPrefetch/16 128373 ns 128370 ns 5429
PrefetchBenchmark/ManualAVX2WithPrefetch/32 110661 ns 110656 ns 6355
PrefetchBenchmark/ManualAVX2WithPrefetch/64 101423 ns 101418 ns 6957
PrefetchBenchmark/ManualAVX2WithPrefetch/128 96401 ns 96397 ns 7640
PrefetchBenchmark/ManualAVX2WithPrefetch/256 94296 ns 94292 ns 7306
PrefetchBenchmark/ManualAVX2WithPrefetch/512 93809 ns 93807 ns 7375
PrefetchBenchmark/ManualAVX2WithPrefetch/1024 94009 ns 94007 ns 7450
So, at the end of the day, it seems like the compiler outsmarted us: the nested loop with manual prefetch but no manual SIMD vectorization runs the fastest:
const size_t size = data.size();
size_t i = 0;
long sum = 0;
for (; i + std::max(prefetch_distance, std::size_t(16)) < size; i += 16) {
    PREFETCH(&data[i + prefetch_distance], 3);
    for (size_t j = i; j < i + 16; j++)
        sum += data[j];
}
for (; i < size; ++i)
    sum += data[i];
In any case, as we can see from the timing reports, higher prefetch distances seem to perform better up to a certain cutoff: a distance of 1024 performs about as well as (if not worse than) a distance of 512 in all cases (for reference, 512 ints is 2 KiB, i.e. 32 cache lines ahead). Similarly, a distance of 16 performs about as well as no manual SW prefetch at all for the SSE2 and AVX2 cases, indicating that we aren't doing better than HW prefetch there.