Tags: c++, x86

Why is my benchmark using _mm_prefetch slower?


I am trying to learn some C++ optimizations and I have tried using _mm_prefetch for summing an array. My benchmark code is:

#include <benchmark/benchmark.h>
#include <vector>


#if defined(__GNUC__) || defined(__clang__)
    #define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
#elif defined(_MSC_VER)
    #include <xmmintrin.h>
    #define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
#else
    #define PREFETCH(addr, hint)
#endif


class PrefetchBenchmark : public benchmark::Fixture {
public:
    static constexpr size_t data_size = 1 << 20;

    void SetUp(const benchmark::State& state) override {
        data.resize(data_size, 1);
    }

    void TearDown(const benchmark::State& state) override {

    }

    std::vector<int> data;
};


BENCHMARK_F(PrefetchBenchmark, NoPrefetch)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        for (const auto& i : data) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
}


BENCHMARK_F(PrefetchBenchmark, WithPrefetch)(benchmark::State& state) {
    int prefetch_distance = 10;
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < data.size(); i++) {
            if (i + prefetch_distance < data.size()) {
                PREFETCH(&data[i + prefetch_distance], 3);
            }
            sum += data[i];
        }
        benchmark::DoNotOptimize(sum);
    }
}

However, the benchmark consistently runs slower with the prefetch:

PrefetchBenchmark/NoPrefetch       348484 ns       344905 ns         1948
PrefetchBenchmark/WithPrefetch     595119 ns       585938 ns         1120

Why is this, and how could I write a test that gets a performance increase from using _mm_prefetch?

My git repo with the full benchmark example is here


Solution

  • First, your code introduces a needless branch inside the hot loop, which very likely slows things down and can be avoided:

    /* Original */
    for (int i = 0; i < data.size(); i++) {
        if (i + prefetch_distance < data.size()) {
            PREFETCH(&data[i + prefetch_distance], 3);
        }
        sum += data[i];
    }
    
    /* Updated code: peel off the last prefetch_distance iterations so the
       main loop needs no bounds check before prefetching */
    int i;
    for (i = 0; i < data.size() - prefetch_distance; i++) {
        PREFETCH(&data[i + prefetch_distance], 3);
        sum += data[i];
    }
    for ( ; i < data.size(); i++)   // tail: no prefetch needed
        sum += data[i];
    

    Now, looking at the code with no branching: the root cause of the problem seems to be the compiler's inability to vectorize the loop with SIMD instructions when __builtin_prefetch() appears in its body. The NoPrefetch code is vectorized, but not explicitly prefetched. The WithPrefetch code is explicitly prefetched, but not vectorized. The slowdown from the missed vectorization is much more severe than the speedup from the explicit prefetching.

    An interesting GCC bug report sheds some light on the issue: Bug 114061 - GCC fails vectorization when using __builtin_prefetch. At least for GCC, the compiler seems to treat __builtin_prefetch(&data[i + x]) as an ordinary function call that clobbers memory (in which case skipping vectorization would make sense), even though the builtin has no visible side effects.

    GCC 15 should have a fix in place to overcome this limitation and allow the builtin without disrupting vectorization. However, from what I can see on Godbolt.org, even though GCC 16 trunk does vectorize the loop, it completely ignores the prefetching, leaving it entirely out of the loop. So it still seems broken to me.
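
    If you want to see this in isolation, here is a minimal standalone sketch (the function names are mine, not from the repo) that you can paste into Compiler Explorer and build with -O3: the plain loop gets vectorized, while GCC keeps the loop containing the builtin scalar.

    // Minimal sketch for inspecting codegen (names are mine): sum_plain is
    // vectorized at -O3, while GCC leaves sum_prefetch scalar because of the
    // __builtin_prefetch call in the loop body.
    long sum_plain(const int *data, long n) {
        long sum = 0;
        for (long i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }

    long sum_prefetch(const int *data, long n) {
        long sum = 0;
        long i = 0;
        for (; i + 16 < n; ++i) {
            __builtin_prefetch(&data[i + 16], 0, 3);  // read prefetch, keep in all cache levels
            sum += data[i];
        }
        for (; i < n; ++i)  // scalar tail, nothing left to prefetch
            sum += data[i];
        return sum;
    }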


    So how should you "fix" this, if it needs fixing at all? Unfortunately, you cannot rely on the compiler to vectorize correctly in the presence of your explicit prefetching. Even if one particular compiler were smart enough to handle it (it seems none of the major ones currently are), it is unlikely that all compilers would get it right for a multi-platform build. This means that, as you already noticed, you have to apply the vectorization manually and throw in a prefetch as needed.

    Yes, software prefetching can be beneficial, but it is in general a difficult optimization task, with lots of empirical trial and error, mostly because hardware prefetching is nowadays already very good. See the answer to "Can I read a CPU x86 flag to determine if prefetched data has arrived in the L1 cache?". Prefetch too early (large distance) and the data will have been evicted by the time you want to use it. Prefetch too late (small distance) and it becomes a no-op, because the hardware prefetcher has already brought the line in. (A sketch of how to sweep the distance empirically is shown after the example code below.)

    In the example code below, with prefetch_distance set to a small value like 8, performance decreases instead of increasing, because you are prefetching data that is already in cache:

    PrefetchBenchmark/ManualSIMDWithPrefetch     151742 ns       151738 ns         4599
    PrefetchBenchmark/ManualSIMD                 145401 ns       145401 ns         4790
    

    This makes sense: one cache line already covers 64 / 4 = 16 contiguous int values (assuming 32-bit int), and the hardware prefetcher is most likely already loading more than one cache line ahead. Modern x86 CPUs are already pretty good at hardware prefetching for simple memory access patterns such as sequential access.

    Increasing prefetch_distance to a larger value such as 64 or 128, you may start to see an improvement (I do on my x86 Skylake machine):

    PrefetchBenchmark/ManualSIMDWithPrefetch     137408 ns       137407 ns         5022
    PrefetchBenchmark/ManualSIMD                 146430 ns       146429 ns         4803
    

    Example code:

    // Sums 8 consecutive ints starting at data[i] with two 128-bit loads
    // (SSE2 intrinsics, declared in <emmintrin.h> / <immintrin.h>).
    static inline long simd_sum8(const std::vector<int> &data, const int i) {
        __m128i vec1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i]));
        __m128i vec2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i + 4]));
        __m128i sum_vec = _mm_add_epi32(vec1, vec2);
    
        int temp[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(temp), sum_vec);
        return temp[0] + temp[1] + temp[2] + temp[3];
    }
    
    BENCHMARK_F(PrefetchBenchmark, ManualSIMDWithPrefetch)(benchmark::State& state) {
        for (auto _ : state) {
            long sum = 0;
            int i = 0;
            int size = static_cast<int>(data.size());
    
            // TODO: benchmark and tune this! (see the sweep sketch below)
            constexpr int prefetch_distance = 128;

            for (; i <= size - prefetch_distance; i += 8) {
                PREFETCH(&data[i + prefetch_distance], 3);
                sum += simd_sum8(data, i);
            }
    
            for (; i < size; ++i) {
                sum += data[i];
            }
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_F(PrefetchBenchmark, ManualSIMD)(benchmark::State& state) {
        for (auto _ : state) {
            long sum = 0;
            int i = 0;
            int size = static_cast<int>(data.size());
    
            // data_size (1 << 20) is a multiple of 8, so no scalar tail is needed here
            for (; i < size; i += 8)
                sum += simd_sum8(data, i);
    
            benchmark::DoNotOptimize(sum);
        }
    }
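
    As a closing note, rather than hard-coding prefetch_distance, you can let Google Benchmark sweep it for you and pick the best value empirically on your target machine. Below is a sketch of that idea (the benchmark name and the argument values are mine), reusing the fixture and simd_sum8 from above; each Arg() shows up as its own run, e.g. ManualSIMDSweep/128.

    // Sketch (not from the repo): sweep the prefetch distance as a benchmark argument.
    BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualSIMDSweep)(benchmark::State& state) {
        const int prefetch_distance = static_cast<int>(state.range(0));
        for (auto _ : state) {
            long sum = 0;
            int i = 0;
            int size = static_cast<int>(data.size());

            for (; i + prefetch_distance < size; i += 8) {
                PREFETCH(&data[i + prefetch_distance], 3);
                sum += simd_sum8(data, i);
            }
            for (; i < size; ++i)  // scalar tail
                sum += data[i];

            benchmark::DoNotOptimize(sum);
        }
    }
    // One run per distance; compare them against the ManualSIMD baseline.
    BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualSIMDSweep)
        ->Arg(8)->Arg(32)->Arg(64)->Arg(128)->Arg(256)->Arg(512);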