I am trying to learn some C++ optimizations and I have tried using _mm_prefetch
for summing an array. The benchmark test for my code is:
#include <benchmark/benchmark.h>
#include <vector>
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
#elif defined(_MSC_VER)
#include <xmmintrin.h>
#define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
#else
#define PREFETCH(addr, hint)
#endif
class PrefetchBenchmark : public benchmark::Fixture {
public:
    static constexpr size_t data_size = 1 << 20;
    void SetUp(const benchmark::State& state) override {
        data.resize(data_size, 1);
    }
    void TearDown(const benchmark::State& state) override {
    }
    std::vector<int> data;
};
BENCHMARK_F(PrefetchBenchmark, NoPrefetch)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        for (const auto& i : data) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_F(PrefetchBenchmark, WithPrefetch)(benchmark::State& state) {
    int prefetch_distance = 10;
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < data.size(); i++) {
            if (i + prefetch_distance < data.size()) {
                PREFETCH(&data[i + prefetch_distance], 3);
            }
            sum += data[i];
        }
        benchmark::DoNotOptimize(sum);
    }
}
However, the benchmark runs consistently slower with the prefetch:
PrefetchBenchmark/NoPrefetch 348484 ns 344905 ns 1948
PrefetchBenchmark/WithPrefetch 595119 ns 585938 ns 1120
Why is this, and how could I make a test which gets a performance increase from using _mm_prefetch?
My git repo with the full benchmark example is here
First, your code is introducing a needless branch, which very likely slows things down and can be avoided:
/* Original */
for (int i = 0; i < data.size(); i++) {
    if (i + prefetch_distance < data.size()) {
        PREFETCH(&data[i + prefetch_distance], 3);
    }
    sum += data[i];
}
/* Updated code */
const int size = static_cast<int>(data.size()); // avoid signed/unsigned comparison
int i;
for (i = 0; i < size - prefetch_distance; i++) {
    PREFETCH(&data[i + prefetch_distance], 3);
    sum += data[i];
}
for ( ; i < size; i++)
    sum += data[i];
Now, looking at the code with no branching: the root cause of the problem seems to be the compiler's inability to properly vectorize the loop with SIMD instructions when __builtin_prefetch() is used in its body. The NoPrefetch code is vectorized, but not explicitly prefetched. The WithPrefetch code is explicitly prefetched, but not vectorized. The slowdown from missed vectorization is much more severe than the speedup from explicit prefetching.
An interesting GCC bug report sheds some light on the issue: Bug 114061 - GCC fails vectorization when using __builtin_prefetch. At least for GCC, it seems the compiler assumes that your __builtin_prefetch(&data[i + x]) clobbers memory and treats it as a function call (it makes sense to avoid vectorization in that case), even though the call is to a builtin that acts as a no-op.
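To see the effect in isolation, here is a minimal sketch of my own (not taken from the bug report) that you can paste into Godbolt and compile with e.g. gcc -O3: on affected GCC versions, sum_plain gets vectorized (look for paddd/vpaddd in the assembly) while sum_prefetch falls back to scalar adds. Note that __builtin_prefetch is GCC/Clang-specific.

```cpp
#include <cstddef>

// Plain sum: the compiler vectorizes this loop.
long sum_plain(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Same loop plus a prefetch: on affected GCC versions the
// __builtin_prefetch call disables vectorization.
long sum_prefetch(const int* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        __builtin_prefetch(&data[i + 64], 0, 3);
        sum += data[i];
    }
    return sum;
}
```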
GCC 15 should have a fix in place to overcome this limitation and allow the builtin without disrupting vectorization. However, from what I can see on Godbolt.org, even though GCC 16 trunk does vectorize the loop, it completely ignores the prefetching, leaving it entirely out of the loop. So it still seems broken to me.
So how should you "fix" this, if needed at all? Well, you unfortunately cannot rely on the compiler to vectorize correctly with your explicit prefetching. Moreover, even though one particular compiler may be smart enough to handle this (it seems none of the major ones are), it seems unlikely that all compilers will get it right for a multi-platform build. This means that, as you already noticed, you will have to apply vectorization manually and throw in a prefetch as needed.
Yes, software prefetching can be beneficial, but it is in general a difficult optimization task, with lots of empirical trial and error, mostly because hardware prefetching is nowadays already very good. See the answer to "Can I read a CPU x86 flag to determine if prefetched data has arrived in the L1 cache?". SW prefetch too early (large distance) and the data will be evicted by the time you want to use it; SW prefetch too late (small distance) and it becomes a no-op, as the HW prefetch has already happened.
In the example code below, with prefetch_distance set to a small value like 8, performance decreases instead of increasing, because you are prefetching data that is already in cache:
PrefetchBenchmark/ManualSIMDWithPrefetch 151742 ns 151738 ns 4599
PrefetchBenchmark/ManualSIMD 145401 ns 145401 ns 4790
This makes sense, as one cache line already covers 64 / 4 = 16 contiguous int values (assuming 32-bit int), and hardware prefetching is likely already loading more than one cache line ahead. Modern x86 CPUs are already pretty good at hardware prefetching for simple memory access patterns such as sequential access.
Increasing prefetch_distance to a larger value such as 64 or 128, you may start to see an improvement (as I do on my x86 Skylake machine):
PrefetchBenchmark/ManualSIMDWithPrefetch 137408 ns 137407 ns 5022
PrefetchBenchmark/ManualSIMD 146430 ns 146429 ns 4803
Example code:
#include <emmintrin.h> // SSE2 intrinsics (__m128i, _mm_loadu_si128, _mm_add_epi32)

// Sums 8 contiguous ints starting at data[i];
// the caller must ensure that i + 8 <= data.size().
static inline long simd_sum8(const std::vector<int> &data, const int i) {
    __m128i vec1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i]));
    __m128i vec2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i + 4]));
    __m128i sum_vec = _mm_add_epi32(vec1, vec2);
    int temp[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(temp), sum_vec);
    return temp[0] + temp[1] + temp[2] + temp[3];
}
BENCHMARK_F(PrefetchBenchmark, ManualSIMDWithPrefetch)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        int i = 0;
        int size = static_cast<int>(data.size());
        // TODO: benchmark and tune this!
        constexpr int prefetch_distance = 128;
        for (; i <= size - prefetch_distance; i += 8) {
            // data.data() + ... avoids indexing one past the end with operator[]
            PREFETCH(data.data() + i + prefetch_distance, 3);
            sum += simd_sum8(data, i);
        }
        for (; i < size; ++i) {
            sum += data[i];
        }
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_F(PrefetchBenchmark, ManualSIMD)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        int i = 0;
        int size = static_cast<int>(data.size());
        // data_size is a multiple of 8, so no scalar remainder loop is needed
        for (; i < size; i += 8)
            sum += simd_sum8(data, i);
        benchmark::DoNotOptimize(sum);
    }
}