tensorflow2.0 illegal-instruction

Illegal instruction. 0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*) while loading libtensorflow_cc.so


Part of my application uses TensorFlow to load a model. The application code is compiled against TensorFlow 2.3 using devtoolset-7. When I run the application binary, it crashes while loading libtensorflow_cc.so with the following stack trace:

Illegal instruction.
0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*)


Program received signal SIGILL, Illegal instruction.
0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*) ()
   from /lib64/libtensorflow_cc.so.2
Missing separate debuginfos, use: debuginfo-install controller-1.0.0-20201014_19_13_07.x86_64
(gdb) bt
#0  0x00007ffff3712210 in nsync::nsync_mu_init(nsync::nsync_mu_s_*) ()
   from /lib64/libtensorflow_cc.so.2
#1  0x00007fffea72df4e in tensorflow::monitoring::Gauge<bool, 0>::Gauge(tensorflow::monitoring::MetricDef<(tensorflow::monitoring::MetricKind)0, bool, 0> const&) ()
   from /lib64/libtensorflow_cc.so.2
#2  0x00007fffea72e1f4 in tensorflow::monitoring::Gauge<bool, 0>* tensorflow::monitoring::Gauge<bool, 0>::New<char const (&) [39], char const (&) [38]>(char const (&) [39], char const (&) [38]) ()
   from /lib64/libtensorflow_cc.so.2
#3  0x00007fffea3d0f7d in _GLOBAL__sub_I_context.cc () from /lib64/libtensorflow_cc.so.2
#4  0x00007ffff7dea9b3 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#5  0x00007ffff7ddc17a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#6  0x0000000000000002 in ?? ()

The flags from /proc/cpuinfo are

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear spec_ctrl intel_stibp arch_capabilities

Can anyone help me understand what is causing this?


Solution

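  • TensorFlow makes heavy use of AVX instructions on x86 platforms. If the binary was compiled with AVX-512 (that is, it uses the zmm registers), it runs only on hardware that supports AVX-512; on any other CPU the first such instruction raises SIGILL. The flags posted in the question list avx and avx2 but no avx512* entries. As requested in the comments, check the instruction set as follows:

    1. Run objdump -M intel -S /usr/lib64/libtensorflow.so.2 | grep -i zmm to see whether the library contains instructions that use zmm registers.
    2. Run print $pc in GDB at the point of the crash to isolate the faulting instruction (x/i $pc disassembles it).

    Note: per the asker's update, moving from Broadwell (no AVX-512) to Skylake (AVX-512) solved the issue.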
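
The two checks above can be wrapped in a small shell sketch. The helper names has_avx512 and count_zmm are hypothetical, introduced here only for illustration. The sketch assumes a Linux host with grep, tr and objdump available, and the library path in the usage comments is the one from the backtrace. It compares what the CPU advertises with what the library's disassembly contains.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of TensorFlow or binutils) for checking
# whether a host can run a library built with AVX-512.

# has_avx512 FLAGS
# Succeeds (exit status 0) if the given /proc/cpuinfo-style flags string
# contains at least one flag starting with "avx512" (avx512f, avx512dq, ...).
has_avx512() {
    printf '%s\n' "$1" | tr -s ' \t' '\n' | grep -q '^avx512'
}

# count_zmm
# Reads disassembly text on stdin and prints the number of lines that
# reference a zmm register. Prints 0 when there are none.
count_zmm() {
    grep -ciE 'zmm[0-9]+' || true
}

# Usage on the affected machine (left as comments so that sourcing this
# file has no side effects):
#
#   has_avx512 "$(grep -m1 '^flags' /proc/cpuinfo)" \
#       && echo "CPU advertises AVX-512" \
#       || echo "CPU does not advertise AVX-512"
#
#   objdump -d -M intel /lib64/libtensorflow_cc.so.2 | count_zmm
#
# In GDB, after the SIGILL is reported, disassemble the faulting instruction:
#
#   (gdb) x/i $pc
```

A non-zero count from count_zmm together with a failing has_avx512 would be consistent with the diagnosis above, although a library can also contain AVX-512 code paths that are only selected at run time, so the instruction shown by x/i $pc is the more direct evidence.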