Why does the Intel compiler ignore the non-temporal prefetch pragma directive for Intel MIC?

The Intel compiler generates the following prefetch instruction within a loop that accesses an array through the a_ptr pointer:

400e93:       62 d1 78 08 18 4c 24    vprefetch0 [r12+0x80]

If I manually change (by hex-editing the executable) this to non-temporal prefetching:

400e93:       62 d1 78 08 18 44 24    vprefetchnta [r12+0x80]

the loop runs almost 1.5 times faster (!!!). However, I would prefer the compiler to generate non-temporal prefetching for me. I thought that

#pragma prefetch a_ptr:_MM_HINT_NTA

before the loop should do the trick, but it actually does not; it generates the very same instructions as without the pragma. Why does icpc ignore this pragma? How can I force it to generate non-temporal prefetching?

The optimization report does not say anything useful, as far as I can see:

LOOP BEGIN at test-mic.cpp(56,5)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between b_ptr line 64 and b_ptr line 65
   remark #15346: vector dependence: assumed FLOW dependence between b_ptr line 65 and b_ptr line 64
   remark #25018: Total number of lines prefetched=2
   remark #25019: Number of spatial prefetches=2, dist=29
   remark #25021: Number of initial-value prefetches=2
   remark #25139: Using second-level distance 2 for prefetching spatial memory reference   [ test-mic.cpp(61,50) ]
   remark #25015: Estimate of max trip count of loop=1048576
LOOP END

Solution

This is a known issue. The BKM (best known method) is to use the explicit values 0, 1, 2, 3 for the hints (t0, t1, t2, nta) in the prefetch directives/pragmas, and NOT the _MM_HINT enum.

This is because the _MM_HINT enum in the header files maps differently:

/* constants to use with _mm_prefetch  (extracted from *mmintrin.h) */
#define _MM_HINT_T0 1
#define _MM_HINT_T1 2
#define _MM_HINT_T2 3
#define _MM_HINT_NTA    0    /* <-- maps here: the directive reads 0 as t0 */
#define _MM_HINT_ENTA   4
#define _MM_HINT_ET0    5
#define _MM_HINT_ET1    6
#define _MM_HINT_ET2    7

Plus, the Intel headers and gcc headers use different enum values, which is also troublesome. So the hint enums are to be used only with the _mm_prefetch intrinsics, NOT in the prefetch directives.
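For contrast, here is a minimal sketch of the one place where the enum is the right thing to pass: the _mm_prefetch intrinsic. The function name sum_with_nta_prefetch and the prefetch distance of 32 elements are hypothetical, chosen only for illustration and not tuned. The sketch assumes an x86 target whose <immintrin.h> provides _mm_prefetch; on other targets the prefetch is simply compiled out.

```cpp
#include <cstddef>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_prefetch and the _MM_HINT_* constants
#define HAVE_MM_PREFETCH 1
#endif

// Hypothetical helper: sums a[0..n) and requests a non-temporal prefetch
// a fixed distance ahead of the element currently being read.
long long sum_with_nta_prefetch(const int* a, std::size_t n)
{
    const std::size_t dist = 32;  // assumed distance in elements, not tuned
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
#ifdef HAVE_MM_PREFETCH
        // With the intrinsic, the header's own enum is the correct argument.
        if (i + dist < n)
            _mm_prefetch(reinterpret_cast<const char*>(a + i + dist), _MM_HINT_NTA);
#endif
        sum += a[i];
    }
    return sum;
}
```

Because the constant comes from the same header as the intrinsic, the differing numeric values between vendors do not matter here; they only become a problem when the number is fed to the prefetch directive.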

For this example, you should be able to use: #pragma prefetch a_ptr:3

However, that syntax is not currently usable because of a defect: the compiler is unable to properly connect the a_ptr load memory reference inside the loop with the expression in the prefetch directive. As a temporary workaround, use the following syntax:

#pragma prefetch *:3

Note: The asterisk means the directive applies to ALL memory references inside the loop. In this loop, b_ptr cannot be prefetched by the compiler anyway, since it is not a linear address expression. So the "*" effectively applies only to a_ptr here, and leads to vprefetchnta (on both KNC and KNL).
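To show where the directive goes, here is a minimal sketch of a loop with the same shape as the one described: a_ptr is read linearly, while b_ptr is indexed through a computed, non-linear address. The question's actual loop body is not shown, so the function count_buckets and its body are hypothetical stand-ins. The pragma is wrapped in an __INTEL_COMPILER check, on the assumption that other compilers do not know this directive.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the loop in the question: a_ptr is walked
// linearly (prefetchable), b_ptr is addressed indirectly (not prefetchable).
void count_buckets(const std::uint32_t* a_ptr, std::uint32_t* b_ptr,
                   std::size_t n, std::size_t buckets)
{
    // Hint 3 means nta in the directive's own numbering (0,1,2,3 = t0,t1,t2,nta).
    // It is deliberately a literal, not _MM_HINT_NTA.
    // The asterisk applies the hint to every memory reference in the loop.
#if defined(__INTEL_COMPILER)
#pragma prefetch *:3
#endif
    for (std::size_t i = 0; i < n; ++i) {
        b_ptr[a_ptr[i] % buckets] += 1;
    }
}
```

Whether the generated code really contains vprefetchnta is best confirmed by disassembling the loop, as was done in the question.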

The defect will be fixed in a future release.