[SOLVED] How to properly use prefetch instructions?

The short answer that probably works for perfectly linear streaming loops like yours is probably: don't use them at all, let the hardware prefetchers do the work.

Still, it's possible that you can speed things up with software prefetching, and here is the theory and some detail if you want to try...

Basically you call _mm_prefetch() on an address you'll need at some point in the future. It is similar in some respects to loading a value from memory and doing nothing with it: both bring the line into the L1 cache⁰, but the prefetch intrinsic, which under the covers is emitting specific prefetch instructions, has some advantages which make it suitable for prefetching.

It works at cache-line granularity¹: you only need to issue one prefetch for each cache line: more is just a waste. That means that in general, you should try to unroll your loop enough so that you can issue only one prefetch per cache line. In the case of 16-byte __m128 values, that means unroll at least by 4 (which you've done, so you are good there).

Then simple prefetch each of your access streams by some PF_DIST distance ahead of the current calculation, something like:

for(size_t i=0; i<1048576;i+=4) {
    dot0 = _mm_add_ps( dot0, _mm_mul_ps( A[i+0], B[i+0]);
    dot1 = _mm_add_ps( dot1, _mm_mul_ps( A[i+1], B[i+1]);
    dot2 = _mm_add_ps( dot2, _mm_mul_ps( A[i+2], B[i+2]);
    dot3 = _mm_add_ps( dot3, _mm_mul_ps( A[i+3], B[i+3]);
    _mm_prefetch(A + i + PF_A_DIST, HINT_A);
    _mm_prefetch(B + i + PF_B_DIST, HINT_B);
}

Here PF_[A|B]_DIST is the distance to prefetch ahead of the current iteration and HINT_ is the temporal hint to use. Rather than try to calculate the right distance value from first principles, I would simply determine good values of PF_[A|B]_DIST experimentally⁴. To reduce the search space, you can start by setting them both equal, since logically a similar distance is likely to be ideal. You might find that only prefetching one of the two streams is ideal.

It is very important that the ideal PF_DIST depends on the hardware configuration. Not just on the CPU model, but also on the memory configuration, including details such as the snooping mode for multi-socket systems. For example, the best value could be wildly different on client and server chips of the same CPU family. So you should run your tuning experiment on the actual hardware you are a targeting, as much as possible. If you target a variety of hardware, you can test on all the hardware and hopefully find a value that's good on all of them, or even consider compile-time or runtime dispatching depending on CPU type (not always enough, as above) or based on a runtime test. Now just relying on hardware prefetching is starting to sound a lot better, isn't it?

You can use the same approach to find the best HINT since the search space is small (only 4 values to try) - but here you should be aware than the difference between the different hints (particularly _MM_HINT_NTA) might only show as a performance difference in code that runs after this loop, since they affect how much data unrelated to this kernel remain in the cache.

You might also find that this prefetching doesn't help at all, since your access patterns are perfectly linear and likely to be handled well by the L2 stream prefetchers. Still there are some additional, more hardcode things you could try or consider:

You might investigate whether prefetching only at the start of 4K page boundaries helps³. This will complicate your loop structure: you'll probably need a nested loop to separate the "near edge of page" and "deep inside the page" cases in order to only issue the prefetches near page boundaries. You'll also want to make your input arrays page-aligned too, or else it gets even more complicated.
You can try disabling some/all of the hardware prefetchers. This is usually terrible for overall performance, but on a highly tuned load with software prefetching, you might see better performance by eliminating interference from hardware prefetching. Selecting disabling prefetching also gives you an important a key tool to help understand what's going on, even if you ultimately leave all the prefetchers enabled.
Make sure you are using huge pages, since for large contiguous blocks like this they are idea.
There are problems with prefetching at the beginning and end of your main calculation loop: at the start, you'll miss prefetching all data at the start of each array (within the initial PF_DIST window), and at the end of the loop you'll prefetch additional and PF_DIST beyond the end of your array. At best these waste fetch and instruction bandwidth, but they may also cause (ultimately discarded) page faults which may affect performance. You can fix both by special intro and outro loops to handle these cases.

I also highly recommend the 5-part blog post Optimizing AMD Opteron Memory Bandwidth, which describes optimizing a problem very similar to yours, and which covers prefetching in some detail (it gave a large boost). Now this is totally different hardware (AMD Opteron) which likely behaves differently to more recent hardware (and especially to Intel hardware if that's what you're using) - but the process of improvement is key and the author is an expert in the field.

⁰ In the case of software prefetch, you can also select some other level of cache, using the temporal hint.

¹ It may actually work at something like 2-cache-line granularity depending on how it interacts with the adjacent cache line prefetcher(s). In this case, you may be able to get away with issuing half the number of prefetches: one every 128 bytes.

³ There is some indication that even with perfect streaming loads, and despite the presence of "next page prefetchers" in modern Intel hardware, page boundaries are still a barrier to hardware prefetching that can be partially alleviated by software prefetching. Maybe because software prefetch serves as a stronger hint that "Yes, I'm going to read into this page", or because software prefetch works at the virtual address level and necessarily involves the translation machinery, while L2 prefetching works at the physical level.

⁴ Note that the "units" of the PF_DIST value is sizeof(__mm128), i.e., 16 bytes due to the way I calculated the address.