I am trying to vectorize a loop, computing dot product of a large float vectors. I am computing it in parallel, utilizing the fact that CPU has large amount of XMM registers, like this:
__m128* A, B;
__m128 dot0, dot1, dot2, dot3 = _mm_set_ps1(0);
for(size_t i=0; i<1048576;i+=4) {
dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]);
dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]);
dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]);
dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]);
}
... // add dots, then shuffle/hadd result.
I heard that using prefetch instructions could help speedup things, as it could fetch further data "in background", while doing muls and adds on a data that is in cache. However i failed to find examples and explanations on how to use _mm_prefetch(), when, with what addresses, and what hits. Could you assist on this?
The short answer that probably works for perfectly linear streaming loops like yours is probably: don't use them at all, let the hardware prefetchers do the work.
Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...
Basically you call _mm_prefetch()
on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache0, but the prefetch intrinsic, which under the covers is emitting specific prefetch instructions, has some advantages which make it suitable for prefetching.
It works at cache-line granularity1: you only need to issue one prefetch for each cache line: more is just a waste. That means that in general, you should try to unroll your loop enough so that you can issue only one prefetch per cache line. In the case of 16-byte __m128
values, that means unroll at least by 4 (which you've done, so you are good there).
Then simple prefetch each of your access streams by some PF_DIST
distance ahead of the current calculation, something like:
for(size_t i=0; i<1048576;i+=4) {
dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]);
dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]);
dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]);
dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]);
_mm_prefetch(A + i + PF_A_DIST, HINT_A);
_mm_prefetch(B + i + PF_B_DIST, HINT_B);
}
Here PF_[A|B]_DIST
is the distance to prefetch ahead of the current iteration and HINT_
is the temporal hint to use. Rather than try to calculate the right distance value from first principles, I would simply determine good values of PF_[A|B]_DIST
experimentally4. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that only prefetching one of the two streams is ideal.
It is very important that the ideal PF_DIST
depends on the hardware configuration. Not just on the CPU model, but also on the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are a targeting, as much as possible. If you target a variety of hardware, you can test on all the hardware and hopefully find a value that's good on all of them, or even consider compile-time or runtime dispatching depending on CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?
You can use the same approach to find the best HINT
since the search space is small (only 4 values to try) - but here you should be aware than the difference between the different hints (particularly _MM_HINT_NTA
) might only show as a performance difference in code that runs after this loop, since they affect how much data unrelated to this kernel remain in the cache.
You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still there are some additional, more hardcode things you could try or consider:
PF_DIST
window), and at the end of the loop you'll prefetch additional and PF_DIST
beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page faults which may affect performance. You can fix both by special intro and outro loops to handle these cases.I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently to more recent hardware (and especially to Intel hardware if that's what you're using) - but the process of improvement is key and the author is an expert in the field.
0 In the case of software prefetch, you can also select some other level of cache, using the temporal hint.
1 It may actually work at something like 2-cache-line granularity depending on how it interacts with the adjacent cache line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.
3 There is some indication that even with perfect streaming loads, and despite the presence of "next page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe because software prefetch serves as a stronger hint that "Yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.
4 Note that the "units" of the PF_DIST
value is sizeof(__mm128)
, i.e., 16 bytes due to the way I calculated the address.