Tags: c++, x86, cpu-architecture, prefetch

Why is my benchmark using _mm_prefetch slower?


I am trying to learn some C++ optimizations and I have tried using _mm_prefetch for summing an array. My benchmark code is:

#include <benchmark/benchmark.h>
#include <vector>


#if defined(__GNUC__) || defined(__clang__)
    #define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
#elif defined(_MSC_VER)
    #include <xmmintrin.h>
    #define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
#else
    #define PREFETCH(addr, hint)
#endif


class PrefetchBenchmark : public benchmark::Fixture {
public:
    static constexpr size_t data_size = 1 << 20;

    void SetUp(const benchmark::State& state) override {
        data.resize(data_size, 1);
    }

    void TearDown(const benchmark::State& state) override {

    }

    std::vector<int> data;
};


BENCHMARK_F(PrefetchBenchmark, NoPrefetch)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = 0;
        for (const auto& i : data) {
            sum += i;
        }
        benchmark::DoNotOptimize(sum);
    }
}


BENCHMARK_F(PrefetchBenchmark, WithPrefetch)(benchmark::State& state) {
    int prefetch_distance = 10;
    for (auto _ : state) {
        long sum = 0;
        for (int i = 0; i < data.size(); i++) {
            if (i + prefetch_distance < data.size()) {
                PREFETCH(&data[i + prefetch_distance], 3);
            }
            sum += data[i];
        }
        benchmark::DoNotOptimize(sum);
    }
}

However, the benchmark runs consistently slower with the prefetch:

PrefetchBenchmark/NoPrefetch       348484 ns       344905 ns         1948
PrefetchBenchmark/WithPrefetch     595119 ns       585938 ns         1120

Why is this, and how could I write a test that gets a performance increase from using _mm_prefetch?

My Git repo with the full benchmark example is here.


Solution

  • First, your code is introducing a needless branch, which very likely slows things down and can be avoided:

    /* Original */
    for (int i = 0; i < data.size(); i++) {
        if (i + prefetch_distance < data.size()) {
            PREFETCH(&data[i + prefetch_distance], 3);
        }
        sum += data[i];
    }
    
    /* Updated code */
    std::size_t i;
    for (i = 0; i + prefetch_distance < data.size(); i++) {
        PREFETCH(&data[i + prefetch_distance], 3);
        sum += data[i];
    }
    for ( ; i < data.size(); i++)
        sum += data[i];
    

    Now, looking at the code with no branching: the root cause of the problem seems to be the compiler's inability to vectorize the loop with SIMD instructions when __builtin_prefetch() is used in its body. The NoPrefetch code is vectorized, but not explicitly prefetched. The WithPrefetch code is explicitly prefetched, but not vectorized. The slowdown from missed vectorization is much more severe than the speedup from explicit prefetching.

    One interesting GCC bug report sheds some light on the issue: Bug 114061 - GCC fails vectorization when using __builtin_prefetch. At least for GCC, the compiler treats your __builtin_prefetch(&data[i + x]) as a function call that clobbers memory (in which case skipping vectorization makes sense), even though the call is to a builtin that is effectively a no-op as far as program semantics are concerned.

    GCC 15 should have a fix in place to overcome this limitation and allow the builtin without disrupting vectorization. However, from what I can see on Godbolt.org, even though GCC 16 trunk does vectorize the loop, it completely ignores the prefetching, leaving it entirely out of the loop. So it still seems broken to me.
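
    For reference (my addition, not from the original post), here is a minimal sketch you can feed to GCC to observe the effect yourself. The -fopt-info-vec-optimized and -fopt-info-vec-missed flags are standard GCC options that report which loops were (or were not) vectorized:

    // g++ -O3 -march=native -fopt-info-vec-optimized -c sum.cc  # lists vectorized loops
    // g++ -O3 -march=native -fopt-info-vec-missed -c sum.cc     # lists missed ones
    #include <cstddef>

    long sum_plain(const int *data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i < n; i++)  // reported as vectorized
            sum += data[i];
        return sum;
    }

    long sum_prefetch(const int *data, std::size_t n) {
        long sum = 0;
        for (std::size_t i = 0; i + 16 < n; i++) {
            __builtin_prefetch(&data[i + 16], 0, 3);  // pre-fix GCC: this loop is left scalar
            sum += data[i];
        }
        return sum;
    }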


    So how should you "fix" this?

    One very important observation, as pointed out by Peter Cordes, is that you are prefetching more than needed: you only need to prefetch at most once every 64 bytes (the cache line size), but you are doing so on every single iteration of the loop.
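
    To make "once per cache line" concrete, here is a small sketch (my addition; it assumes the usual 64-byte x86 line and a 4-byte int; the feature-test macro and constant are standard C++17, though the value is implementation-defined):

    #include <cstddef>
    #include <new>

    constexpr std::size_t kLineBytes =
    #ifdef __cpp_lib_hardware_interference_size
        std::hardware_destructive_interference_size;  // implementation-defined hint
    #else
        64;  // assumed: typical x86 cache-line size
    #endif
    constexpr std::size_t kIntsPerLine = kLineBytes / sizeof(int);  // typically 16

    One PREFETCH per kIntsPerLine elements covers the whole line, which is exactly what the stride-16 loops below implement.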

    Software prefetching can be beneficial, but it is in general a difficult optimization task involving a lot of empirical trial and error, mostly because hardware prefetching is nowadays already very good. See the answer to "Can I read a CPU x86 flag to determine if prefetched data has arrived in the L1 cache?". SW prefetch too early (large distance) and the data will be evicted by the time you want to use it; SW prefetch too late (small distance) and it becomes a no-op wasting CPU cycles, as the HW prefetch already happened.
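
    As a rough starting point for the trial and error (my addition; the latency and throughput numbers below are illustrative assumptions, not measurements), the distance should roughly cover the memory latency:

    #include <cstdio>

    int main() {
        // Heuristic: distance_in_lines ~= memory_latency / time_to_consume_one_line.
        // Illustrative assumptions (not measured): ~80 ns DRAM latency,
        // ~2.5 ns to sum one 64-byte cache line.
        const double latency_ns  = 80.0;
        const double ns_per_line = 2.5;
        const double lines_ahead = latency_ns / ns_per_line;  // ~32 lines
        std::printf("prefetch ~%.0f lines (~%.0f ints) ahead\n",
                    lines_ahead, lines_ahead * 16.0);         // ~512 ints
        return 0;
    }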

    Since it seems like you cannot rely on the compiler to vectorize correctly with your explicit SW prefetching in the way, you have two main options:

    1. Manually unroll with a factor of 16 (16 ints × 4 bytes = 64 bytes, one cache line) using SIMD;
    2. Perform two nested loops: one outer loop with a stride of 16 and one inner loop over the 16 values. SW prefetch can be applied in the outer loop, and the compiler should be able to unroll and vectorize the inner loop.

    Correctly taking the cache-line size into account and unrolling 16 iterations of the loop at once, we can see how things go and tune the prefetch distance.

    The scenarios to test (both with and without manual prefetch) are:

    1. Normal loop, no manual SIMD vectorization (let the compiler do its magic);
    2. Manual SSE2 vectorization (sum 8 elements at a time, twice per iteration);
    3. Manual AVX2 vectorization (sum 16 elements at once).

    Here's some code to benchmark, testing with a range of prefetch distances that are powers of two:

    // g++ -march=skylake -O3 -DNDEBUG test.cc -o test -lbenchmark
    
    #include <vector>
    #include <random>
    
    #include <immintrin.h>
    #include <benchmark/benchmark.h>
    
    #if defined(__GNUC__) || defined(__clang__)
        #define PREFETCH(addr, hint) __builtin_prefetch(addr, 0, hint)
    #elif defined(_MSC_VER)
        #include <xmmintrin.h>  // For _mm_prefetch in MSVC
        #define PREFETCH(addr, hint) _mm_prefetch(reinterpret_cast<const char*>(addr), hint)
    #else
        #define PREFETCH(addr, hint)  // No prefetch support for this compiler
    #endif
    
    // Sum 8 consecutive ints starting at data[i] (note: _mm_hadd_epi32 is SSSE3, not plain SSE2)
    static inline long simd_sum8(const std::vector<int> &data, const size_t i) {
        __m128i vec1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i]));
        __m128i vec2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&data[i + 4]));
    
        __m128i sum_vec = _mm_add_epi32(vec1, vec2);
        sum_vec = _mm_hadd_epi32(sum_vec, sum_vec);
        sum_vec = _mm_hadd_epi32(sum_vec, sum_vec);
        return _mm_cvtsi128_si32(sum_vec);
    }
    
    // Sum 16 consecutive ints starting at data[i] using AVX2
    static inline int simd_sum16(const std::vector<int> &data, const size_t i) {
        __m256i vec1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i]));
        __m256i vec2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i + 8]));
    
        __m256i sum_vec = _mm256_add_epi32(vec1, vec2);
        __m128i sum_lo = _mm256_castsi256_si128(sum_vec);
        __m128i sum_hi = _mm256_extracti128_si256(sum_vec, 1);
    
        __m128i sum128 = _mm_add_epi32(sum_lo, sum_hi);
        sum128 = _mm_hadd_epi32(sum128, sum128);
        sum128 = _mm_hadd_epi32(sum128, sum128);
        return _mm_cvtsi128_si32(sum128);
    }
    
    class PrefetchBenchmark : public benchmark::Fixture {
    public:
        static constexpr size_t data_size = 1 << 20;
    
        void SetUp(const benchmark::State& state) override {
            // https://stackoverflow.com/q/21516575/3889449
            std::random_device rnd_device;
            std::mt19937 mersenne_engine {rnd_device()};
            std::uniform_int_distribution<int> dist(
                std::numeric_limits<int>::min(),
                std::numeric_limits<int>::max()
            );
            auto gen = [&](){ return dist(mersenne_engine); };
    
            // Resize the vector and fill with random values
            data.resize(data_size, 1);
            std::generate(data.begin(), data.end(), gen);
        }
    
        std::vector<int> data;
    };
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, Normal)(benchmark::State& state) {
        for (auto _ : state) {
            const size_t size = data.size();
            long sum = 0;
    
            for (size_t i = 0; i < size; i++)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, NormalWithPrefetch)(benchmark::State& state) {
        const size_t prefetch_distance = state.range(0);
    
        for (auto _ : state) {
            const size_t size = data.size();
            size_t i = 0;
            long sum = 0;
    
            for (; i + std::max(prefetch_distance, std::size_t(16)) < size; i += 16) {
                PREFETCH(&data[i + prefetch_distance], 3);
    
                for (size_t j = i; j < i + 16; j++)
                sum += data[j];
            }
    
            for (; i < size; ++i)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualSSE2)(benchmark::State& state) {
        for (auto _ : state) {
            const size_t size = data.size();
            size_t i = 0;
            long sum = 0;
    
        for (; i + 16 <= size; i += 16) {  // avoid reading past the end if size isn't a multiple of 16
                sum += simd_sum8(data, i);
                sum += simd_sum8(data, i + 8);
            }
    
            for (; i < size; ++i)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualSSE2WithPrefetch)(benchmark::State& state) {
        const int prefetch_distance = state.range(0);
    
        for (auto _ : state) {
            const size_t size = data.size();
            size_t i = 0;
            long sum = 0;
    
            for (; i + prefetch_distance < size; i += 16) {
                PREFETCH(&data[i + prefetch_distance], 3);
                sum += simd_sum8(data, i);
                sum += simd_sum8(data, i + 8);
            }
    
            for (; i < size; ++i)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualAVX2)(benchmark::State& state) {
        for (auto _ : state) {
            const size_t size = data.size();
            size_t i = 0;
            long sum = 0;
    
        for (; i + 16 <= size; i += 16)  // avoid reading past the end if size isn't a multiple of 16
                sum += simd_sum16(data, i);
    
            for (; i < size; ++i)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_DEFINE_F(PrefetchBenchmark, ManualAVX2WithPrefetch)(benchmark::State& state) {
        const int prefetch_distance = state.range(0);
    
        for (auto _ : state) {
            size_t size = data.size();
            size_t i = 0;
            long sum = 0;
    
            for (; i + prefetch_distance < size; i += 16) {
                PREFETCH(&data[i + prefetch_distance], 3);
                sum += simd_sum16(data, i);
            }
    
            for (; i < size; ++i)
                sum += data[i];
    
            benchmark::DoNotOptimize(sum);
        }
    }
    
    BENCHMARK_REGISTER_F(PrefetchBenchmark, Normal);
    BENCHMARK_REGISTER_F(PrefetchBenchmark, NormalWithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
    BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualSSE2);
    BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualSSE2WithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
    BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualAVX2);
    BENCHMARK_REGISTER_F(PrefetchBenchmark, ManualAVX2WithPrefetch)->RangeMultiplier(2)->Range(16, 1024);
    
    BENCHMARK_MAIN();
    

    The corresponding output on my machine (i9-10900) is:

    Benchmark                                              Time             CPU   Iterations
    ----------------------------------------------------------------------------------------
    PrefetchBenchmark/Normal                           98994 ns        98991 ns         6957
    PrefetchBenchmark/NormalWithPrefetch/16            56664 ns        56663 ns        12007
    PrefetchBenchmark/NormalWithPrefetch/32            56356 ns        56354 ns        12606
    PrefetchBenchmark/NormalWithPrefetch/64            55560 ns        55555 ns        12184
    PrefetchBenchmark/NormalWithPrefetch/128           54374 ns        54372 ns        12901
    PrefetchBenchmark/NormalWithPrefetch/256           54371 ns        54369 ns        12562
    PrefetchBenchmark/NormalWithPrefetch/512           54084 ns        54082 ns        13334
    PrefetchBenchmark/NormalWithPrefetch/1024          54257 ns        54253 ns        13092
    PrefetchBenchmark/ManualSSE2                      152568 ns       152561 ns         4475
    PrefetchBenchmark/ManualSSE2WithPrefetch/16       158408 ns       158404 ns         4533
    PrefetchBenchmark/ManualSSE2WithPrefetch/32       146926 ns       146921 ns         4815
    PrefetchBenchmark/ManualSSE2WithPrefetch/64       142630 ns       142627 ns         4978
    PrefetchBenchmark/ManualSSE2WithPrefetch/128      132613 ns       132608 ns         5543
    PrefetchBenchmark/ManualSSE2WithPrefetch/256      133385 ns       133380 ns         5547
    PrefetchBenchmark/ManualSSE2WithPrefetch/512      126122 ns       126115 ns         5444
    PrefetchBenchmark/ManualSSE2WithPrefetch/1024     128566 ns       128562 ns         5480
    PrefetchBenchmark/ManualAVX2                      129828 ns       129823 ns         5414
    PrefetchBenchmark/ManualAVX2WithPrefetch/16       128373 ns       128370 ns         5429
    PrefetchBenchmark/ManualAVX2WithPrefetch/32       110661 ns       110656 ns         6355
    PrefetchBenchmark/ManualAVX2WithPrefetch/64       101423 ns       101418 ns         6957
    PrefetchBenchmark/ManualAVX2WithPrefetch/128       96401 ns        96397 ns         7640
    PrefetchBenchmark/ManualAVX2WithPrefetch/256       94296 ns        94292 ns         7306
    PrefetchBenchmark/ManualAVX2WithPrefetch/512       93809 ns        93807 ns         7375
    PrefetchBenchmark/ManualAVX2WithPrefetch/1024      94009 ns        94007 ns         7450
    

    So at the end of the day, it seems like the compiler outsmarted us: the nested loop with manual prefetch but no manual SIMD vectorization runs the fastest:

    const size_t size = data.size();
    size_t i = 0;
    long sum = 0;
    
    for (; i + std::max(prefetch_distance, std::size_t(16)) < size; i += 16) {
        PREFETCH(&data[i + prefetch_distance], 3);
    
        for (size_t j = i; j < i + 16; j++)
            sum += data[j];
    }
    
    for (; i < size; ++i)
        sum += data[i];
    

    In any case, as we can see from the timing reports, higher prefetch distances seem to perform better up to a certain cutoff: a distance of 1024 performs about as well as (if not worse than) a distance of 512 in all cases. Similarly, a distance of 16 performs about as well as no manual SW prefetch at all in the SSE2 and AVX2 cases, indicating that we aren't doing better than the HW prefetcher there.
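
    For intuition, here is how those distances translate into bytes and cache lines (my addition, assuming a 4-byte int and 64-byte lines):

    #include <cstdio>

    int main() {
        // Convert a prefetch distance (in ints) into bytes and cache lines,
        // assuming sizeof(int) == 4 and 64-byte cache lines (typical x86).
        for (int d : {16, 32, 64, 128, 256, 512, 1024}) {
            std::printf("distance %4d ints = %5d bytes = %2d lines ahead\n",
                        d, d * 4, d * 4 / 64);
        }
        return 0;
    }

    A distance of 16, for instance, is only a single cache line ahead, well within what the hardware prefetcher already covers on its own.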