The Intel compiler generates the following prefetch instruction within a loop that accesses an array through the a_ptr pointer:
400e93: 62 d1 78 08 18 4c 24 vprefetch0 [r12+0x80]
If I manually change (by hex-editing the executable) this to non-temporal prefetching:
400e93: 62 d1 78 08 18 44 24 vprefetchnta [r12+0x80]
the loop runs almost 1.5 times faster (!!!). However, I would prefer the compiler to generate non-temporal prefetching for me. I thought that
#pragma prefetch a_ptr:_MM_HINT_NTA
before the loop should do the trick, but it does not; it generates the very same instructions as without the pragma. Why does icpc
ignore this pragma? How can I force it to generate non-temporal prefetching?
The optimization report does not say anything useful, as far as I can see:
LOOP BEGIN at test-mic.cpp(56,5)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between b_ptr line 64 and b_ptr line 65
remark #15346: vector dependence: assumed FLOW dependence between b_ptr line 65 and b_ptr line 64
remark #25018: Total number of lines prefetched=2
remark #25019: Number of spatial prefetches=2, dist=29
remark #25021: Number of initial-value prefetches=2
remark #25139: Using second-level distance 2 for prefetching spatial memory reference [ test-mic.cpp(61,50) ]
remark #25015: Estimate of max trip count of loop=1048576
LOOP END
This is a known issue. The BKM (best known method) is to use the explicit values 0, 1, 2, 3 for the hints (t0, t1, t2, nta) in the prefetch directives/pragmas, and NOT the _MM_HINT_* enum.
This is because the _MM_HINT_* constants in the header files map differently:
/* constants to use with _mm_prefetch (extracted from *mmintrin.h) */
#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA 0 <--maps here
#define _MM_HINT_ENTA 4
#define _MM_HINT_ET0 5
#define _MM_HINT_ET1 6
#define _MM_HINT_ET2 7
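In this case _MM_HINT_NTA expands to 0, which the pragma reads as t0, so the compiler emits vprefetch0 again. In addition, the Intel headers and the gcc headers use different enum values, which is also troublesome. So the hint enums are to be used only with the _mm_prefetch intrinsics, NOT with the prefetch directives.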
For this example, you should be able to use: #pragma prefetch a_ptr:3
However, that syntax is not usable at present because of a defect: the compiler cannot properly connect the a_ptr load memory reference inside the loop with the expression in the prefetch directive. As a temporary workaround, use the following syntax:
#pragma prefetch *:3
Note: the asterisk means the directive applies to ALL memory references inside the loop. In this loop, b_ptr cannot be prefetched by the compiler anyway, since it is not a linear address expression. So here the "*" effectively applies only to a_ptr, and it leads to vprefetchnta (on both KNC and KNL).
The defect will be fixed in a future release.