Tags: performance, x86, arm, prefetch

When should we use prefetch?


Some CPUs and compilers supply prefetch instructions or intrinsics, e.g. __builtin_prefetch in GCC. There is a note about it in GCC's documentation, but it's too brief to be of much help to me.

I want to know: in practice, when should we use prefetch? Are there any examples?


Solution

  • This question isn't really about compilers, since they merely provide a hook to insert prefetch instructions into your assembly code / binary. Different compilers may provide different intrinsic formats, but you can ignore all of these and (carefully) add the instructions directly in assembly code.

    Now the real question seems to be "when are prefetches useful?", and the answer is: in any scenario where you're bound on memory latency and the access pattern isn't regular and predictable enough for the HW prefetcher to capture it (i.e., not organized in a stream or strides), or when you suspect there are too many different streams for the HW to track simultaneously.
    Most compilers only very seldom insert prefetches for you, so it's basically up to you to play with your code and benchmark whether prefetches can be useful.
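    As a point of reference, here is a minimal sketch of how the GCC intrinsic is typically used. The linked-list layout and node count are made up purely for illustration; pointer chasing is the classic irregular pattern that the HW prefetcher cannot predict:

    ```c
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical node type; traversal order is data-dependent,
     * so the HW prefetcher cannot guess the next address. */
    struct node {
        struct node *next;
        long value;
    };

    static long sum_list(struct node *n) {
        long sum = 0;
        while (n) {
            /* Hint the next node into cache while working on this one.
             * Args: address, rw (0 = read), temporal locality (0..3).
             * Real code would keep a pointer several nodes ahead to give
             * the load more lead time; one-ahead just shows the intrinsic. */
            if (n->next)
                __builtin_prefetch(n->next, 0, 1);
            sum += n->value;
            n = n->next;
        }
        return sum;
    }

    int main(void) {
        enum { COUNT = 1000 };
        struct node *nodes = malloc(COUNT * sizeof *nodes);
        for (int i = 0; i < COUNT; ++i) {
            nodes[i].value = i;
            nodes[i].next = (i + 1 < COUNT) ? &nodes[i + 1] : NULL;
        }
        printf("sum=%ld\n", sum_list(&nodes[0]));   /* 0+1+...+999 */
        free(nodes);
        return 0;
    }
    ```

    Whether this actually helps depends on how much independent work exists between the prefetch and the use; benchmark before keeping it.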

    The link by @Mysticial shows a nice example, but here's a more straightforward one that I think can't be caught by the HW prefetcher:

    #include <stdio.h>
    #include <sys/timeb.h>
    #include <emmintrin.h>
    
    #define N 4096
    #define REP 200
    #define ELEM int
    
    /* static: a 64 MB array would overflow a default-sized stack */
    static ELEM __attribute__((aligned(4096))) a[N][N];
    
    int main() {
        int i, j, k, b;
        const int blksize = 64 / sizeof(ELEM);
        for (i = 0; i < N; ++i) {
            for (j = 0; j < N; ++j) {
                a[i][j] = 1;
            }
        }
        unsigned long long int sum = 0;
        struct timeb start, end;
        unsigned long long delta;
    
        ftime(&start);
        for (k = 0; k < REP; ++k) {
            for (i = 0; i < N; ++i) {
                for (j = 0; j < N; j++) {
                    sum += a[i][j];
                }
            }
        }
        ftime(&end);
        delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
        printf("Prefetching off: N=%d, sum=%llu, time=%llu\n", N, sum, delta);
    
        ftime(&start);
        sum = 0;
        for (k = 0; k < REP; ++k) {
            for (i = 0; i < N; ++i) {
                for (j = 0; j < N; j += blksize) {
                    for (b = 0; b < blksize; ++b) {
                        sum += a[i][j+b];
                    }
                    /* prefetch the same column in the next row (next page) */
                    _mm_prefetch((const char *)&a[i+1][j], _MM_HINT_T2);
                }
            }
        }
        ftime(&end);
        delta = (end.time * 1000 + end.millitm) - (start.time * 1000 + start.millitm);
        printf("Prefetching on:  N=%d, sum=%llu, time=%llu\n", N, sum, delta);
        return 0;
    }
    

    What I do here is traverse each matrix line (enjoying the HW prefetcher's help with the consecutive accesses), but prefetch ahead the element with the same column index from the next line, which resides in a different page (and which the HW prefetcher should be hard pressed to catch). I sum the data only so that it's not optimized away; the important thing is that I basically just loop over a matrix. The pattern should have been straightforward and simple to detect, and yet I still get a speedup.

    Built with gcc 4.8.1 -O3, it gives me an almost 20% boost on an Intel Xeon X5670:

    Prefetching off: N=4096, sum=3355443200, time=1839
    Prefetching on:  N=4096, sum=3355443200, time=1502
    

    Note that the speedup persists even though I made the control flow more complicated (an extra loop-nesting level); the branch predictor should easily catch the pattern of that short block-size loop, and this arrangement saves the execution of unneeded prefetches.

    Note that Ivy Bridge and onward should have a "next-page prefetcher", so the HW may be able to mitigate this on those CPUs (if anyone has one available and cares to try, I'd be happy to know). In that case I'd modify the benchmark to sum every second line (with the prefetch looking ahead two lines every time); that should confuse the hell out of the HW prefetchers.
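    That modification could look something like the following sketch (not benchmarked on an Ivy Bridge here; N is reduced and the timing and repetition loops are dropped just to keep the example small):

    ```c
    #include <stdio.h>
    #include <emmintrin.h>

    #define N 1024          /* reduced from the benchmark's 4096 for brevity */
    #define ELEM int

    static ELEM __attribute__((aligned(4096))) a[N][N];

    int main(void) {
        const int blksize = 64 / sizeof(ELEM);
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                a[i][j] = 1;

        unsigned long long sum = 0;
        /* Visit every second row and prefetch two rows ahead, so even a
         * next-page prefetcher (which looks one page ahead) should miss. */
        for (int i = 0; i < N; i += 2) {
            for (int j = 0; j < N; j += blksize) {
                for (int b = 0; b < blksize; ++b)
                    sum += a[i][j + b];
                if (i + 2 < N)
                    _mm_prefetch((const char *)&a[i + 2][j], _MM_HINT_T2);
            }
        }
        printf("sum=%llu\n", sum);   /* 512 rows x 1024 ones */
        return 0;
    }
    ```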

    Skylake results

    Here are some results from a Skylake i7-6700HQ, running at 2.6 GHz (no turbo), compiled with gcc:

    Compile flags: -O3 -march=native

    Prefetching off: N=4096, sum=28147495993344000, time=896
    Prefetching on:  N=4096, sum=28147495993344000, time=1222
    Prefetching off: N=4096, sum=28147495993344000, time=886
    Prefetching on:  N=4096, sum=28147495993344000, time=1291
    Prefetching off: N=4096, sum=28147495993344000, time=890
    Prefetching on:  N=4096, sum=28147495993344000, time=1234
    Prefetching off: N=4096, sum=28147495993344000, time=848
    Prefetching on:  N=4096, sum=28147495993344000, time=1220
    Prefetching off: N=4096, sum=28147495993344000, time=852
    Prefetching on:  N=4096, sum=28147495993344000, time=1253
    

    Compile flags: -O2 -march=native

    Prefetching off: N=4096, sum=28147495993344000, time=1955
    Prefetching on:  N=4096, sum=28147495993344000, time=1813
    Prefetching off: N=4096, sum=28147495993344000, time=1956
    Prefetching on:  N=4096, sum=28147495993344000, time=1814
    Prefetching off: N=4096, sum=28147495993344000, time=1955
    Prefetching on:  N=4096, sum=28147495993344000, time=1811
    Prefetching off: N=4096, sum=28147495993344000, time=1961
    Prefetching on:  N=4096, sum=28147495993344000, time=1811
    Prefetching off: N=4096, sum=28147495993344000, time=1965
    Prefetching on:  N=4096, sum=28147495993344000, time=1814
    

    So using prefetch is either about 40% slower or 8% faster, depending on whether you use -O3 or -O2 respectively, for this particular example. The big slowdown at -O3 is actually due to a code-generation quirk: at -O3 the loop without prefetch is vectorized, but the extra complexity of the prefetch variant prevents vectorization, at least on my version of gcc.

    So the -O2 results are probably more apples-to-apples, and the benefit is about half (8% speedup vs 16%) of what we saw on Leeor's Westmere. It's still worth noting that you have to be careful not to change code generation in a way that causes a big slowdown.

    This test probably isn't ideal in that going int by int implies a lot of CPU overhead rather than stressing the memory subsystem (that's why vectorization helped so much).
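    For instance, an SSE2 variant of the summation (a sketch only, not the benchmark as run above) processes four ints per instruction, cutting the per-element loop overhead so that more of the time goes to the memory accesses:

    ```c
    #include <stdio.h>
    #include <emmintrin.h>

    #define N 1024   /* reduced from 4096 to keep the example small */

    static int a[N][N];

    int main(void) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                a[i][j] = 1;

        /* Accumulate four 32-bit lanes at once with SSE2; loop
         * overhead per element drops to roughly a quarter. */
        __m128i vsum = _mm_setzero_si128();
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; j += 4)
                vsum = _mm_add_epi32(vsum,
                        _mm_loadu_si128((const __m128i *)&a[i][j]));

        /* Horizontal reduction of the four lanes. */
        int tmp[4];
        _mm_storeu_si128((__m128i *)tmp, vsum);
        long long sum = (long long)tmp[0] + tmp[1] + tmp[2] + tmp[3];
        printf("sum=%lld\n", sum);   /* 1024 x 1024 ones */
        return 0;
    }
    ```

    With wide loads like these, the working set is consumed faster and the benchmark becomes more genuinely latency-bound, which is where prefetch hints matter.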