c++ linker icc xeon-phi avx512

Overriding function calls from SVML


The Xeon Phi Knights Landing cores have a fast exp2 instruction, vexp2pd (intrinsic _mm512_exp2a23_pd). The Intel C++ compiler can vectorize the exp function using the Short Vector Math Library (SVML), which ships with the compiler. Specifically, it calls the function __svml_exp8.

However, when I step through the code in a debugger, I don't see __svml_exp8 using the vexp2pd instruction. It is a complicated function with many FMA operations. I understand that vexp2pd is less accurate than exp, but if I use -fp-model fast=1 (the default) or -fp-model fast=2, I expect the compiler to use this instruction, and it does not.

I have two questions.

  1. Is there a way to get the compiler to use vexp2pd?
  2. How do I safely override the call to __svml_exp8?

As for the second question, this is what I have done so far.

//exp(x) = exp2(log2(e)*x)  
extern "C" __m512d __svml_exp8(__m512d x) {        
    return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}

Is this safe? Is there a better solution, e.g. one that inlines the function? In the test code below, this is about 3 times faster than without the override.

//https://godbolt.org/g/adI11c
//icpc -O3 -xMIC-AVX512 foo.cpp
#include <math.h>
#include <stdio.h>
#include <x86intrin.h>

extern "C" __m512d __svml_exp8(__m512d x) {
  //exp(x) = exp2(log2(e)*x)  
  return _mm512_exp2a23_pd(_mm512_mul_pd(_mm512_set1_pd(M_LOG2E), x));
}

void foo(double * __restrict x, double * __restrict y) {
  __assume_aligned(x, 64);
  __assume_aligned(y, 64);
  for(int i=0; i<1024; i++) y[i] = exp(x[i]);
}

int main(void) {
  double x[1024], y[1024];
  for(int i=0; i<1024; i++) x[i] = 1.0*i;
  for(int r=0; r<1000000; r++) foo(x,y);
  double sum=0;
  //for(int i=0; i<1024; i++) sum+=y[i];
  for(int i=0; i<8; i++) printf("%f ", y[i]); puts("");
  //printf("%lf",sum);
}

Solution

  • ICC will generate vexp2pd, but only under heavily relaxed math requirements, as specified by the targeted -fimf* switches.

    #include <math.h>
    
    void vfoo(int n, double * a, double * r)
    {
        int i;
        #pragma simd
        for ( i = 0; i < n; i++ )
        {
            r[i] = exp(a[i]);
        }
    }
    

    For example, compiling with -xMIC-AVX512 -fimf-domain-exclusion=1 -fimf-accuracy-bits=22 produces this loop:

    ..B1.12:
            vmovups   (%rsi,%rax,8), %zmm0
            vmulpd    .L_2il0floatpacket.2(%rip){1to8}, %zmm0, %zmm1
            vexp2pd   %zmm1, %zmm2
            vmovupd   %zmm2, (%rcx,%rax,8)
            addq      $8, %rax
            cmpq      %r8, %rax
            jb        ..B1.12
    

    Please be sure to understand the accuracy implications: not only is the end result accurate to only about 22 bits, but vexp2pd also flushes any denormalized results to zero, irrespective of the FTZ/DAZ bits set in the MXCSR.
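To get a rough feel for these two caveats without KNL hardware, here is a minimal portable sketch in plain C++ (no intrinsics). It does not emulate vexp2pd. It only computes the relative error bound that corresponds to a given number of accurate bits, and checks whether the double-precision exp(x) for a given input is a denormal, which is the kind of result described above as being flushed to zero. Both helper names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative error bound for a result that is
// accurate to `bits` bits, i.e. 2^-bits.
double rel_error_bound(int bits) {
    assert(bits > 0);
    return std::ldexp(1.0, -bits);
}

// Hypothetical helper: true if the library's double-precision exp(x)
// is a nonzero denormal (subnormal) number. For doubles this happens
// for x below roughly -708.4, until the result underflows to zero.
bool exp_result_is_denormal(double x) {
    return std::fpclassify(std::exp(x)) == FP_SUBNORMAL;
}
```

With these helpers, rel_error_bound(22) is 2^-22 (about 2.4e-7), compared with 2^-53 for a correctly rounded double, and inputs such as x = -710 fall in the range where a flush-to-zero implementation and the standard exp would disagree.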

    To the second question, "How do I safely override the call to __svml_exp8?": your approach is generally not safe. SVML routines are internal to the Intel compiler and rely on custom calling conventions, so a generic routine with the same name can potentially clobber more registers than the library routine would, and you may end up with a hard-to-debug ABI mismatch.

    A better way to provide your own vector functions is to use #pragma omp declare simd (see https://software.intel.com/en-us/node/524514), and possibly the vector_variant attribute if you prefer coding with intrinsics (see https://software.intel.com/en-us/node/523350). Just don't try to override standard math names, or you'll get an error.
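As a minimal sketch of the #pragma omp declare simd route, assuming a compiler with OpenMP SIMD support enabled (without it the pragmas are simply ignored and the code runs as scalar): the function name my_exp and the driver apply_my_exp are hypothetical, chosen to avoid the standard math names. The body uses std::exp2 as a portable stand-in for exp(x) = exp2(log2(e)*x); whether a given compiler maps it to vexp2pd still depends on the target and accuracy switches discussed above.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical user-defined replacement for exp. The pragma asks the
// compiler to also generate vector variants of this function.
// Identity used: exp(x) = exp2(log2(e) * x)
#pragma omp declare simd notinbranch
double my_exp(double x) {
    const double log2e = 1.4426950408889634;  // log2(e)
    return std::exp2(log2e * x);
}

// Hypothetical driver: the simd loop may call a vector variant of
// my_exp if the compiler decides to vectorize it.
void apply_my_exp(const double* in, double* out, int n) {
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        out[i] = my_exp(in[i]);
    }
}
```

Because my_exp is an ordinary function with the standard calling convention, the compiler generates and calls the vector variants itself, so there is no hand-written symbol that has to match SVML's internal ABI.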