Tags: c++, intel, compiler-optimization, knights-landing

Adding the "-march=native" Intel compiler flag to the compilation line leads to a floating-point exception on KNL


I have a code which I run on an Intel Xeon Phi Knights Landing (KNL) 7210 (64 cores) processor (it is a PC, running in native mode), compiled with the Intel C++ compiler (icpc) version 17.0.4. I also run the same code on an Intel Core i7, where the icpc version is 17.0.1. To be precise, I compile the code on the machine I run it on (compiled on the i7 and run on the i7, likewise for the KNL); I never build the binary on one machine and bring it to the other. The loops are parallelized and vectorized using OpenMP. For best performance I use the Intel compiler flags:

-DCMAKE_CXX_COMPILER="-march=native -mtune=native -ipo16 -fp-model fast=2 -O3 -qopt-report=5 -mcmodel=large"

On the i7 everything works well. But on the KNL the code works without -march=native, and adding this option makes the program throw a floating point exception immediately. Compiling with "-march=native" as the only flag gives the same result. Under gdb, the fault points at the line pp+=alpha/rd in this piece of code:

...

the code above this loop runs in a single thread

double K1 = 0.0, P = 0.0;

#pragma omp parallel for reduction(+:P_x,P_y,P_z,K1,P)
for (int i = 0; i < N; ++i)
{
  P_x += p[i].vx*p[i].m;
  P_y += p[i].vy*p[i].m;
  P_z += p[i].vz*p[i].m;
  K1  += p[i].vx*p[i].vx + p[i].vy*p[i].vy + p[i].vz*p[i].vz;
  float pp = 0.0f;
#pragma simd reduction(+:pp)
  for (int j = 0; j < N; ++j) if (i != j)
  {
    float rd = sqrt((p[i].x-p[j].x)*(p[i].x-p[j].x)
                  + (p[i].y-p[j].y)*(p[i].y-p[j].y)
                  + (p[i].z-p[j].z)*(p[i].z-p[j].z));
    pp += alpha/rd;   // gdb points here when SIGFPE is raised
  }
  P += pp;
}
...

Particle p[N]; is an array of particles, where Particle is a structure of floats and N is the maximum number of particles.

If I remove the flag -march=native, or replace it with -march=knl or with -march=core-avx2, everything works OK. This flag is doing something bad to the program, but I don't know what.

I found on the Internet (https://software.intel.com/en-us/articles/porting-applications-from-knights-corner-to-knights-landing, https://math-linux.com/linux/tip-of-the-day/article/intel-compilation-for-mic-architecture-knl-knights-landing) that one should use the flag -xMIC-AVX512. I tried this flag, and also -axMIC-AVX512, but they give the same error.

So, what I wanted to ask is:

  1. Why do -march=native and -xMIC-AVX512 not work while -march=knl works? Is -xMIC-AVX512 included in the -march=native flag on KNL?

  2. May I replace the flag -march=native with -march=knl when I run the code on KNL (on the i7 everything works)? Are they equivalent?

  3. Is the set of flags above optimal for best performance with the Intel compiler?

As Peter Cordes suggested, here is the assembler output at the point where the program receives the floating point exception in GDB: 1) the output of (gdb) disas:

Program received signal SIGFPE, Arithmetic exception.
0x000000000040e3cc in randomizeBodies() ()
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5- 
16.el7.x86_64 libstdc++-4.8.5-16.el7.x86_64
(gdb) disas
Dump of assembler code for function _Z15randomizeBodiesv:
0x000000000040da70 <+0>:    push   %rbp
0x000000000040da71 <+1>:    mov    %rsp,%rbp
0x000000000040da74 <+4>:    and    $0xffffffffffffffc0,%rsp
0x000000000040da78 <+8>:    sub    $0x100,%rsp
0x000000000040da7f <+15>:   vpxor  %xmm0,%xmm0,%xmm0
0x000000000040da83 <+19>:   vmovups %xmm0,(%rsp)
0x000000000040da88 <+24>:   vxorpd %xmm5,%xmm5,%xmm5
0x000000000040da8c <+28>:   vmovq  %xmm0,0x10(%rsp)
0x000000000040da92 <+34>:   mov    $0x77359400,%ecx
0x000000000040da97 <+39>:   xor    %eax,%eax
0x000000000040da99 <+41>:   movabs $0x5deece66d,%rdx
0x000000000040daa3 <+51>:   mov    %ecx,%ecx
0x000000000040daa5 <+53>:   imul   %rdx,%rcx
0x000000000040daa9 <+57>:   add    $0xb,%rcx
0x000000000040daad <+61>:   mov    %ecx,0x9a3b00(,%rax,8)
0x000000000040dab4 <+68>:   mov    %ecx,%esi
0x000000000040dab6 <+70>:   imul   %rdx,%rsi
0x000000000040daba <+74>:   add    $0xb,%rsi
0x000000000040dabe <+78>:   mov    %esi,0x9e3d00(,%rax,8)
0x000000000040dac5 <+85>:   mov    %esi,%edi
0x000000000040dac7 <+87>:   imul   %rdx,%rdi
0x000000000040dacb <+91>:   add    $0xb,%rdi
0x000000000040dacf <+95>:   mov    %edi,0xa23f00(,%rax,8)
0x000000000040dad6 <+102>:  mov    %edi,%r8d
0x000000000040dad9 <+105>:  imul   %rdx,%r8
0x000000000040dadd <+109>:  add    $0xb,%r8
0x000000000040dae1 <+113>:  mov    %r8d,0xa64100(,%rax,8)
0x000000000040dae9 <+121>:  mov    %r8d,%r9d
0x000000000040daec <+124>:  imul   %rdx,%r9
0x000000000040daf0 <+128>:  add    $0xb,%r9
0x000000000040daf4 <+132>:  mov    %r9d,0xaa4300(,%rax,8)
0x000000000040dafc <+140>:  mov    %r9d,%r10d
0x000000000040daff <+143>:  imul   %rdx,%r10
0x000000000040db03 <+147>:  add    $0xb,%r10
0x000000000040db07 <+151>:  mov    %r10d,0x9a3b04(,%rax,8)
0x000000000040db0f <+159>:  mov    %r10d,%r11d
0x000000000040db12 <+162>:  imul   %rdx,%r11
0x000000000040db16 <+166>:  add    $0xb,%r11
0x000000000040db1a <+170>:  mov    %r11d,0x9e3d04(,%rax,8)
0x000000000040db22 <+178>:  mov    %r11d,%ecx
0x000000000040db25 <+181>:  imul   %rdx,%rcx
0x000000000040db29 <+185>:  add    $0xb,%rcx
0x000000000040db2d <+189>:  mov    %ecx,0xa23f04(,%rax,8) 

2) the output of p $mxcsr:

(gdb) p $mxcsr
$1 = [ ZE PE DAZ DM PM FZ ]

3) the output of p $ymm0.v8_float:

$2 = {3, 3, 3, 3, 3, 3, 3, 3}

4) the output of p $zmm0.v16_float:

(gdb) p $zmm0.v16_float
$3 = {3 <repeats 16 times>}

I should also mention that to detect floating point exceptions I used the standard

#include <cstdio>
#include <cstdlib>
#include <csignal>
#include <cfenv>   // feenableexcept is a glibc extension declared here

void handler(int sig)
{
  printf("Floating Point Exception\n");
  exit(0);
}
...
int main(int argc, char **argv)
{
  feenableexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW | FE_UNDERFLOW);
  signal(SIGFPE, handler);
  ...
}

I should stress that I was already using feenableexcept when I got this error. I have used it since the beginning of debugging, because the code used to have floating point exceptions that we had to find and fix.


Solution

  • You were using feenableexcept to unmask some FP exceptions, so optimizations that create invalid temporary results will crash your program.

    Intel's compiler with -fp-model fast=2, like gcc's -ffast-math, assumes that FP exceptions are masked, so it may raise FE_INVALID in some SIMD elements of temporary calculations, as long as everything works out in the end (e.g. a blend to fix up elements where recip-sqrt went wrong). I'd assume that's what's going on here.

    If you post disassembly of the actual instruction that faulted (instead of a bunch of integer multiplies at the very start of that function), we can figure out exactly what optimization caused what invalid temporary, but in general you need to use less aggressive FP options when compiling builds that turn on FP exceptions.


    According to Intel's documentation:

    -fp-model fast[=1|2] or /fp:fast[=1|2]

    Floating-point exception semantics are disabled by default and they cannot be enabled because you cannot specify fast and except together in the same compilation. To enable exception semantics, you must explicitly specify another keyword (see other keyword descriptions for details).

    You need to use -fp-model except if you want the compiler to respect the fact that FP exceptions are a visible side-effect. This is not on by default.

    If you're going to call functions that modify the FP environment, ISO C says you should use #pragma STDC FENV_ACCESS ON, and that without that, modifications to the FP environment aren't "meaningful". "Otherwise the implementation is free to assume that floating-point control modes are always the default ones and that floating-point status flags are never tested or modified." I'm not sure if enabling exceptions really counts. Probably not important, as long as you're doing it once at program startup, otherwise it would matter whether or not a computation happens before or after enabling exceptions.


    Similarly for gcc, -ffast-math includes -fno-trapping-math, which promises the compiler that FP instructions won't raise SIGFPE, just silently set sticky status bits in the MXCSR and produce NaN (invalid), +-Infinity (overflow), or 0.0 (underflow).